Upload files to S3 from Remote Server
Stream data to S3 from a remote server
Scope
How to send your backend a URL to a file that is then uploaded to S3 without first having to download the whole file to disk or memory. This guide uses Python but the same technique can be used with other languages as well.
To keep the guide short, testing will not be covered.
Assumptions/Not covered/Prerequisites
This guide assumes:
- you can (somewhat) read Python with type hinting
- that you have already set up your backend and can connect to AWS S3 using Python and Boto3
The examples in this guide:
- have the imports omitted (they can be found in the last example)
- use the widely used Python HTTP library requests to fetch the file from the remote server
- are written as synchronous code since Boto3 is synchronous
Steps
The actual code flow when copying a file to S3 is the following 3 steps:
- Open up a stream connection to the file we want to download
- Pass the data stream to Boto3 to upload it to S3
- Do a quick check on the stream so we get what we want
Step 1: Open up a stream
There are two ways you can do a "get request". There is the more familiar "regular get":
response = requests.get(url)
and there is the "stream get":
response = requests.get(url, stream=True)
Simplified, there are a few differences between a stream get and a regular get. With a regular get, the whole response body is downloaded before the call returns. With a stream get, only the response headers are fetched, and downloading the actual data/content is deferred until you try to access it.
With a stream get, the body can then be read in chunks as it arrives from the remote server. It is possible to either read all the data at once (not a big difference from a regular get) or to read one chunk at a time. We can avoid loading the whole file into memory by first getting a chunk, sending that chunk to S3, and then getting the next chunk. (This will be handled automatically by Boto3.)
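To make the chunk-by-chunk idea concrete, here is a minimal sketch that uses only the standard library. The names copy_in_chunks and CHUNK_SIZE are made up for illustration, and this is not what Boto3 does internally. It only shows that reading a fixed-size chunk, handing it on, and then reading the next one keeps memory use bounded by the chunk size. With requests, stream.raw.read(n) or stream.iter_content(chunk_size=n) would play the role of source.read.

```python
import io
from typing import BinaryIO

CHUNK_SIZE = 8 * 1024  # assumed chunk size, chosen only for illustration


def copy_in_chunks(
    source: BinaryIO,
    sink: BinaryIO,
    chunk_size: int = CHUNK_SIZE
) -> int:
    """Copy source to sink one chunk at a time and return the bytes copied."""
    total = 0
    while True:
        # At most chunk_size bytes are held in memory at any time.
        chunk = source.read(chunk_size)
        if not chunk:  # an empty read means the stream is exhausted
            break
        sink.write(chunk)
        total += len(chunk)
    return total


# In-memory stand-ins for the remote file and the upload target.
remote_file = io.BytesIO(b"x" * 20_000)
destination = io.BytesIO()
copied = copy_in_chunks(remote_file, destination)
```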
Let's now write the function that actually opens up a stream and passes it on to another function:
def upload_from_url(
    url: str,
    content_type: str,
    object_key: str
) -> None:
    # stream=True defers the body download until the stream is read
    with requests.get(url, stream=True) as stream:
        logger.info(f"Streaming {url} to S3")
        upload_file_from_stream(
            stream=stream.raw,
            object_key=object_key,
            content_type=content_type
        )
As seen above, you have to pass a URL (obviously), but you also pass content_type and object_key. content_type is passed to prevent S3 from falling back to its default content type, binary/octet-stream. object_key is just the identifier for the file in S3 (normally referred to as just Key in the Boto3 documentation).
I will leave it up to you how you get the content_type and the object_key. For the content_type, you could take it directly from the stream. Just remove content_type from the signature of upload_from_url and call upload_file_from_stream like this:
upload_file_from_stream(
    stream=stream.raw,
    object_key=object_key,
    content_type=stream.headers["Content-Type"]
)
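If you want the backend to work out both values on its own, one possible approach is sketched below. The helper names object_key_from_url and content_type_for, the prefix parameter, and the fallback order are assumptions for illustration, not something the rest of this guide depends on. The sketch takes the last path segment of the URL as the key, and for the content type it prefers the response header, then a guess from the file extension, then S3's own default.

```python
import mimetypes
import posixpath
from typing import Mapping, Optional
from urllib.parse import unquote, urlsplit

DEFAULT_CONTENT_TYPE = "binary/octet-stream"  # the type S3 falls back to


def object_key_from_url(url: str, prefix: str = "") -> str:
    """Use the last path segment of the URL as the S3 object key."""
    path = unquote(urlsplit(url).path)  # drops the query string, decodes %20 etc.
    filename = posixpath.basename(path)
    if not filename:
        raise ValueError(f"Cannot derive an object key from {url!r}")
    return f"{prefix}{filename}"


def content_type_for(headers: Mapping[str, str], url: str) -> str:
    """Prefer the Content-Type header, then guess from the URL, then default."""
    header_value: Optional[str] = headers.get("Content-Type")
    if header_value:
        return header_value
    guessed, _encoding = mimetypes.guess_type(urlsplit(url).path)
    return guessed or DEFAULT_CONTENT_TYPE
```

Inside upload_from_url you could then call content_type_for(stream.headers, url), which also avoids a KeyError when the remote server sends no Content-Type header at all.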
Step 2: Pass the data stream to Boto3 to upload it to S3
The method upload_fileobj on an S3 client takes a file-like object, which a raw stream is, so you can upload the stream as-is.
def upload_file_from_stream(
    stream: Any,
    object_key: str,
    content_type: str
) -> None:
    s3_client = get_s3_client()
    s3_client.upload_fileobj(
        stream,
        AWS_S3_BUCKET,
        object_key,
        ExtraArgs={"ContentType": content_type}
    )

# where get_s3_client is the following:
_s3_client = None

def get_s3_client() -> s3.Client:
    global _s3_client
    if _s3_client is None:
        _s3_client = boto3.client("s3")
    return _s3_client
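One thing to be aware of when uploading stream.raw as-is: it hands over the bytes exactly as they came over the wire. If the remote server applied a Content-Encoding such as gzip, the compressed bytes are what end up in S3. A commonly used workaround is to set decode_content = True on the raw urllib3 response before it is read. The helper name raw_stream_for_upload below is made up for this sketch, and whether you need it at all depends on the servers you fetch from.

```python
from typing import Any


def raw_stream_for_upload(response: Any) -> Any:
    """Return response.raw, configured to undo gzip/deflate compression.

    `response` is assumed to be a requests.Response opened with stream=True.
    """
    # urllib3 consults this attribute whenever read() is called without an
    # explicit decode_content argument.
    response.raw.decode_content = True
    return response.raw
```

In upload_from_url you would then pass stream=raw_stream_for_upload(stream) instead of stream=stream.raw.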
Step 3: Do a quick check on the stream so we get what we want
The above code works as it should, and it worked excellently for me for a while. However, it overlooks a very important thing: it will think that ALL file transfers succeed as long as it gets a body in the response. I discovered this by having "a few" 1.1 kB JPEG files in my S3 bucket that I could not open in my image editor. The problem was that the remote server hosting the images I wanted to transfer replied with a 403 together with an XML body containing more details.
One might first think that a solution would be for the client requesting the transfer to also send the expected content type. One could then compare the expected content type with the actual content type and throw an error if they differ. That works for images, video, etc., but it will not work if you ever want to transfer an XML file or a JSON file (since the error message on failure will most likely be in one of those two formats).
What we have to do instead is check the status code of the response from the remote server. That is done quite easily: it is just stream.status_code. Let's implement this in our code.
def upload_from_url(url: str, object_key: str) -> None:
    try:
        with requests.get(url, stream=True) as stream:
            logger.info(f"Streaming {url} to S3")
            # Anything other than 200 OK means the body is not the file we asked for
            if stream.status_code != requests.codes.ok:
                raise RemoteServerError(
                    f"Failed to stream {url}. "
                    f"200 OK response required from remote server "
                    f"but received: {stream.status_code}"
                )
            upload_file_from_stream(
                stream=stream.raw,
                object_key=object_key,
                content_type=stream.headers["Content-Type"]
            )
    except RemoteServerError as exc:
        logger.error(
            f"upload_from_url for URL {url} "
            f"failed with error: {exc}"
        )
        raise
An exception will now be raised if the remote server responds with anything other than a 200 OK response.
Here, RemoteServerError is just a simple
class RemoteServerError(Exception):
    pass
but you might want to raise an error that is better suited for your application.
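As one example of what "better suited" could look like, here is a sketch of a variant that keeps the URL and status code as attributes, so the caller can decide what to do without parsing the message. The constructor signature, the is_client_error property, and the ensure_ok helper are all assumptions for illustration. If you do not need a custom exception at all, requests also offers Response.raise_for_status(), which raises requests.HTTPError for 4xx and 5xx responses.

```python
class RemoteServerError(Exception):
    """Raised when the remote server does not answer with 200 OK."""

    def __init__(self, url: str, status_code: int) -> None:
        self.url = url
        self.status_code = status_code
        super().__init__(
            f"Failed to stream {url}. "
            f"200 OK response required from remote server "
            f"but received: {status_code}"
        )

    @property
    def is_client_error(self) -> bool:
        """True for 4xx responses, such as the 403 described earlier."""
        return 400 <= self.status_code < 500


def ensure_ok(url: str, status_code: int) -> None:
    """Raise RemoteServerError unless the status code is exactly 200."""
    if status_code != 200:
        raise RemoteServerError(url, status_code)
```

The status check in upload_from_url would then shrink to ensure_ok(url, stream.status_code).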
Complete code
import logging
from typing import Any

import boto3
import requests
from boto3_type_annotations import s3

logger = logging.getLogger(__name__)

# Placeholder, not part of the original setup: set this to your own bucket name.
AWS_S3_BUCKET = "your-bucket-name"

_s3_client = None


def get_s3_client() -> s3.Client:
    # Create the client once and reuse it for later uploads
    global _s3_client
    if _s3_client is None:
        _s3_client = boto3.client("s3")
    return _s3_client


class RemoteServerError(Exception):
    pass


def upload_from_url(url: str, object_key: str) -> None:
    try:
        # stream=True defers the body download until the stream is read
        with requests.get(url, stream=True) as stream:
            logger.info(f"Streaming {url} to S3")
            # Anything other than 200 OK means the body is not the file we asked for
            if stream.status_code != requests.codes.ok:
                raise RemoteServerError(
                    f"Failed to stream {url}. "
                    f"200 OK response required from remote server "
                    f"but received: {stream.status_code}"
                )
            upload_file_from_stream(
                stream=stream.raw,
                object_key=object_key,
                content_type=stream.headers["Content-Type"]
            )
    except RemoteServerError as exc:
        logger.error(
            f"upload_from_url for URL {url} "
            f"failed with error: {exc}"
        )
        raise


def upload_file_from_stream(
    stream: Any,
    object_key: str,
    content_type: str
) -> None:
    s3_client = get_s3_client()
    # upload_fileobj reads from the file-like stream and uploads it to S3
    s3_client.upload_fileobj(
        stream,
        AWS_S3_BUCKET,
        object_key,
        ExtraArgs={"ContentType": content_type}
    )