Zip files from Amazon S3 without saving them locally

I recently had an engineering requirement to build a Lambda function that would grab an arbitrary number of files stored in S3, compress them into a single ZIP archive, then upload the archive into another S3 bucket. The archive would eventually be handed to a user via a presigned URL.

It would have been simple enough to implement if we could just download all the files into the Lambda function, zip them up, then upload the result. However, the number of files (and their total size) was unknown and completely arbitrary. Lambda functions only come with 512 MB of ephemeral storage. You can configure up to 10 GB per function, but the extra storage costs add up, and besides, there's still no guarantee we'd stay under 10 GB.
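
As an aside, ephemeral storage is a per-function setting. If we did want to throw scratch space at the problem instead, the knob looks something like this with the AWS CDK. This is just a sketch; the runtime, handler, and asset path below are placeholders.

import { Size, Stack } from 'aws-cdk-lib'
import * as lambda from 'aws-cdk-lib/aws-lambda'

// :: ---

declare const stack: Stack

// :: `ephemeralStorageSize` bumps the function's /tmp from the
//    default 512 MB up to a maximum of 10 GiB.
new lambda.Function(stack, 'ZipFunction', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler', // :: placeholder
  code: lambda.Code.fromAsset('dist'), // :: placeholder asset path
  ephemeralStorageSize: Size.gibibytes(10),
})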


Let's not download the files then

It turns out that we can use a stream to upload content to Amazon S3 with the AWS SDK (at least with v2 of the SDK, which is what we're using here).
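
As a quick, standalone sketch (the bucket and key below are just placeholders), upload() happily takes a readable stream as its Body:

import stream from 'node:stream'
import * as AWS from 'aws-sdk'

// :: ---

const s3 = new AWS.S3()

// :: Any readable stream works as the body. Here we use a
//    PassThrough we can write into ourselves.
const body = new stream.PassThrough()

const upload = s3
  .upload({
    Bucket: 'some-bucket', // :: placeholder
    Key: 'hello.txt', // :: placeholder
    Body: body,
    ContentType: 'text/plain',
  })
  .promise()

body.end('hello from a stream')

await upload

It has to be upload() specifically: it runs a managed (multipart) upload under the hood, so it doesn't need to know the content length up front the way putObject() does.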

This can be very useful. Using something like archiver, we can stream files directly into an archive. If we then pipe that stream straight into the S3 upload call, we're effectively uploading a ZIP file without ever having to write it to local storage.

First, let's create the archival stream. This is simple enough to do:

import archiver from 'archiver'

// :: ---

const archiveStream = archiver('zip')

archiveStream.on('error', (error) => {
  console.error('Archival encountered an error:', error)
  throw error
})

We'll then set up a passthrough stream into an S3 upload. The idea here is that whatever data is put into the stream will be passed through directly to the output --- in this case, an object in an S3 bucket.

Before anything else though, we have to make sure that our S3 client keeps connections open long enough for us to finish uploading. Remember that we're opening up what's essentially a pipe into an S3 object file, and will be dumping data into it continuously until we're ready to finalize / close the stream.

import * as AWS from 'aws-sdk'

// :: ---

const s3 = new AWS.S3({
  httpOptions: {
    timeout: 60 * 10 * 1000, // :: 10 minutes
  },
})

With the client sorted out, we can set up the passthrough stream and the upload itself:

import stream from 'node:stream'

// :: ---

declare const TARGET_BUCKET_NAME: string
declare const ARCHIVE_KEY: string

// :: ---

const passthrough = new stream.PassThrough()

// :: We wrap this in a promise so we have something to await.
const uploadTask = new Promise<[string, string]>((resolve, reject) => {
  s3.upload(
    {
      Bucket: TARGET_BUCKET_NAME,
      Key: ARCHIVE_KEY,

      Body: passthrough,
      ContentType: 'application/zip',
    },

    // :: This callback fires when the upload finishes (that is,
    //    once the passthrough stream is closed and everything has
    //    been flushed to S3), or when it fails. We resolve the
    //    promise here, so whatever awaits this is notified.
    (error) => {
      if (error) {
        reject(error)
        return
      }

      console.log('Zip uploaded.')
      resolve([TARGET_BUCKET_NAME, ARCHIVE_KEY])
    }
  )
})

Finally, we pipe the archive stream into the passthrough stream. Anything we put into the archive stream eventually finds its way through to the upload.

archiveStream.pipe(passthrough)

With that out of the way, all we need to do now is actually put the files into the archive stream. Let's talk about that now.

Throw the files in

If we already know which files in S3 we want to include in the archive, then we can just use the AWS SDK to grab each file's contents and put them into the stream. Thankfully, the getObject operation of the AWS SDK gives us the object's body as a Buffer, which we can append directly to our archive stream.

import path from 'node:path'

// :: ---

declare const SOURCE_BUCKET_NAME: string
declare const OBJECT_KEYS: string[]

// :: ---

for (const key of OBJECT_KEYS) {
  const params = { Bucket: SOURCE_BUCKET_NAME, Key: key }
  const response = await s3.getObject(params).promise()

  // :: `response.Body` comes back as a Buffer in Node.js, which
  //    archiver accepts directly. Each entry needs a name, so we
  //    reuse the basename of the object key.
  archiveStream.append(response.Body as Buffer, { name: path.basename(key) })
}

// :: When all the files have been added, then we can
//    finalize the archive stream. This eventually closes
//    the stream, and subsequently closes the passthrough
//    stream we created in the upload task.
archiveStream.finalize()
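
As an aside, getObject().promise() buffers each file into memory in full. That's fine for small files, but for larger ones we could hand archiver a stream instead via createReadStream(). Here's a sketch of that variation, reusing the same client and constants as above:

for (const key of OBJECT_KEYS) {
  const params = { Bucket: SOURCE_BUCKET_NAME, Key: key }

  // :: A readable stream of the object body, so the whole file
  //    never has to sit in memory at once.
  const objectStream = s3.getObject(params).createReadStream()

  archiveStream.append(objectStream, { name: path.basename(key) })
}

archiveStream.finalize()

Archiver drains appended entries one at a time, so with a very long list of keys you'd want to be a bit more deliberate about how many of these requests you keep open at once.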

To finish things off, we can await our uploadTask from earlier somewhere in the application, and get the resulting bucket name and object key when it resolves.

// :: Remember that this task resolves only when the
//    passthrough stream closes, and the passthrough
//    stream closes only when the archive stream
//    (that is piping into it) closes too.
//
//    When we finalize the archive stream, pretty much
//    everything else collapses, and this resolves.
const [BUCKET_NAME, OBJECT_KEY] = await uploadTask

// :: Now we can do whatever we want with this.
//    How about we generate a presigned URL?
const params = {
  Bucket: BUCKET_NAME,
  Key: OBJECT_KEY,
  Expires: 60 * 60 * 24, // :: 24 hours
}
const url = await s3.getSignedUrlPromise('getObject', params)
console.log(url)
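
For completeness, here's roughly how the pieces might fit together inside an actual Lambda handler. This is a sketch rather than the exact function I shipped: the event shape (source bucket, target bucket, archive key, object keys) is entirely made up, and error handling is kept to a minimum.

import path from 'node:path'
import stream from 'node:stream'
import * as AWS from 'aws-sdk'
import archiver from 'archiver'

// :: ---

const s3 = new AWS.S3({
  httpOptions: {
    timeout: 60 * 10 * 1000, // :: 10 minutes
  },
})

// :: Hypothetical event shape; adjust to whatever actually
//    triggers your function.
type ZipRequest = {
  sourceBucket: string
  targetBucket: string
  archiveKey: string
  objectKeys: string[]
}

export const handler = async (event: ZipRequest) => {
  const archiveStream = archiver('zip')
  archiveStream.on('error', (error) => {
    console.error('Archival encountered an error:', error)
    throw error
  })

  const passthrough = new stream.PassThrough()

  // :: Resolves (or rejects) once S3 has the finished archive.
  const uploadTask = new Promise<void>((resolve, reject) => {
    s3.upload(
      {
        Bucket: event.targetBucket,
        Key: event.archiveKey,
        Body: passthrough,
        ContentType: 'application/zip',
      },
      (error) => (error ? reject(error) : resolve())
    )
  })

  archiveStream.pipe(passthrough)

  for (const key of event.objectKeys) {
    const response = await s3
      .getObject({ Bucket: event.sourceBucket, Key: key })
      .promise()

    archiveStream.append(response.Body as Buffer, {
      name: path.basename(key),
    })
  }

  archiveStream.finalize()
  await uploadTask

  // :: Hand back a presigned URL to the finished archive.
  const url = await s3.getSignedUrlPromise('getObject', {
    Bucket: event.targetBucket,
    Key: event.archiveKey,
    Expires: 60 * 60 * 24, // :: 24 hours
  })

  return { url }
}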

And there you have it. We just archived files stored in an S3 bucket and put the resulting ZIP into another S3 bucket, without ever having to save the files locally first.