How to Back Up Your Files Using S3 and Glacier

I don't think I have to convince you that creating regular backups of your valuable files is a good idea. I maintain complete backups of my hard drive both on-site and in the cloud.

Sometimes, though, I'd just like to periodically sync some specific directories to the cloud for easy retrieval. It'd also be nice to be able to archive large amounts of data without worrying about the cost. Both goals are made possible by the convenience of S3 and the low cost of Glacier.

In this tutorial, I'll show you an easy way of creating backups in S3 and Glacier. We'll use the sync command provided by the AWS CLI and versioned buckets to make incremental backups. If you don't have the AWS CLI installed, make sure to install it before moving on.
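If you're not sure whether the CLI is ready to go, here's a quick way to check the installation and set up your credentials. These are standard AWS CLI commands, not specific to this tutorial:

# Print the installed AWS CLI version
aws --version

# Set up your access key, secret key, and default region interactively
aws configure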

Important Note: If your files are important to you, this shouldn't be your only backup. However, AWS S3 and Glacier make cloud backups very convenient, ridiculously cheap, and easy to recover, so they're a good addition to your backup options.

Syncing Local Files to S3

The first part of this tutorial involves creating an S3 bucket and syncing a local directory to it. Here are the steps:

Create a bucket for your backup files

aws s3 mb s3://my.backup.bucket
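If you want to confirm the bucket was created (and see which region it landed in), two quick checks:

# Confirm the new bucket shows up in your bucket list
aws s3 ls

# Optionally, check which region it ended up in
aws s3api get-bucket-location --bucket my.backup.bucket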

Enable versioning

This is an important step. We want to enable versioning for our backup bucket so that nothing is ever truly deleted. We can recover old versions of files, and even files that have been deleted locally, which is kind of the point of having backups.

aws s3api put-bucket-versioning \
--bucket my.backup.bucket \
--versioning-configuration Status=Enabled
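You can verify that the change took effect by reading the versioning state back; it should report a Status of Enabled:

# Should print a versioning status of "Enabled"
aws s3api get-bucket-versioning --bucket my.backup.bucket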

Create a backup

Now that we have a bucket, we can store something in it. The easiest way of doing that is the sync command in the AWS CLI. Note that we're using the --delete flag here. That doesn't mean that locally deleted files are actually deleted from the bucket as well. Instead, since we're using versioning, those files are merely marked as deleted. You can still recover them by downloading older versions of the same files from S3.

The --sse flag tells S3 to encrypt the files. The official documentation has lots of details on the different encryption options.

aws s3 sync --sse --delete path/to/my/local/dir s3://my.backup.bucket
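If you'd like to see what sync is about to do before it touches the bucket, the --dryrun flag prints the planned uploads and deletions without performing them:

# Preview the operations without actually running them
aws s3 sync --dryrun --sse --delete path/to/my/local/dir s3://my.backup.bucket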

Retrieving the latest backup from S3

Backups are worthless unless you can recover them. Running sync against a versioned bucket simply returns the latest version of every file, excluding the ones that are marked as deleted.

aws s3 sync s3://my.backup.bucket path/to/local/dir
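If you only need a single deleted or overwritten file back, you don't have to sync the whole bucket. Here's a minimal sketch, where path/to/file.txt and the version ID are placeholders you'd replace with your own values:

# List every stored version (and delete marker) of one file
aws s3api list-object-versions \
  --bucket my.backup.bucket \
  --prefix path/to/file.txt

# Download one specific version by its version ID
aws s3api get-object \
  --bucket my.backup.bucket \
  --key path/to/file.txt \
  --version-id EXAMPLE_VERSION_ID \
  recovered-file.txt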

Restoring an old backup from S3

S3 doesn't provide an API for point-in-time recovery. In other words, you can't tell S3 to return, say, the versions of objects that were current on May 1st last year.

It is possible, however, to list all versions and find the correct version by comparing the timestamps. s3-pit-recovery, a library I wrote, does exactly that:

  1. Install via npm: npm install -g s3-pit-recovery.

  2. Restore to a point in time. After running the command below, pit.my.backup.bucket will look exactly like my.backup.bucket did on May 15th, 2018.

    s3-pit-restore \
    --bucket my.backup.bucket \
    --destinationBucket pit.my.backup.bucket \
    --time '2018-05-15T10:38:00.614Z'

If some of your object versions reside in Glacier, s3-pit-recovery gives you the option to restore them to S3.

Save on Storage Costs by Using Storage Classes

Since this method never deletes anything from the bucket, the amount of data can grow pretty large over time. A great way to save money on storage is to take advantage of the different storage classes in S3. The storage classes we'll use are:

  • STANDARD: This is the default S3 storage class where all objects start their life.

  • STANDARD_IA: IA stands for Infrequent Access. This storage class offers cheaper storage than the standard class, the trade-off being lower availability and a higher cost of retrieval. Hence, as the name suggests, you should use this storage class if you know you're going to access your data infrequently.

  • GLACIER: Glacier is a long-term storage service that trades speed of retrieval for low prices. Storage in Glacier is very cheap. At the time of writing, Amazon charges $0.004 per gigabyte per month. The trade-off is that retrieving records from Glacier can take hours, and you can't simply download your files from Glacier: you first have to restore them to S3 (a sketch of what that looks like follows this list). Simply put, Glacier is an archival service. Don't use it for files you want to be able to restore quickly in an emergency.
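To give you an idea of the extra step Glacier involves, here's a rough sketch of a manual restore; the object key, the 7-day availability window, and the Standard retrieval tier are just example values:

# Ask S3 to make a temporary copy of an archived object available for 7 days
# (the restore itself can take several hours to complete)
aws s3api restore-object \
  --bucket my.backup.bucket \
  --key path/to/archived/file.txt \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'

# Check whether the restore has finished (look at the Restore field)
aws s3api head-object \
  --bucket my.backup.bucket \
  --key path/to/archived/file.txt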

Remember that the standard S3 storage class is already very cheap. If you don't have massive amounts of data, storage classes make very little difference. However, if you do have lots of data, transferring infrequently accessed files to STANDARD_IA can cut your S3 storage costs roughly in half.

In summary, when you're deciding which storage class you are going to use, the question you should ask yourself is "How often am I going to access this data?". If you have to access the data frequently, you should use STANDARD. If you almost never have to access the data, you should use GLACIER. STANDARD_IA sits between those two.

You can, of course, get scientific and analyze your data to figure out the optimal storage class for each object. AWS even has a tool for this. Here are some quick guidelines, though, that I think are quite sensible:

  • Use STANDARD_IA for files you are storing for emergency recovery. In other words, data you only need in the unlikely event that things go horribly wrong and you have to restore to a point in time when everything was still hunky-dory.

  • Use GLACIER for files you don't expect to ever need, but you still want to archive them for whatever reason.

  • Use STANDARD for everything else.

Since in this tutorial we are dealing with backups, STANDARD_IA is a pretty obvious choice. Files that are really old can be further transferred to Glacier. What "really old" means exactly, I leave up to you to decide. However, in the next section, I give an example setup where the time to wait before transferring objects to Glacier is one year.

The AWS documentation has more detail on the differences between the storage classes.

Storage Classes: Implementation

To transition objects to different storage classes, we take advantage of a feature of S3 buckets called lifecycle rules. Lifecycle rules are a way to tell S3 how long it should hold on to data before transferring it to a different storage class, or even deleting it (not recommended). All storage class transfers are thus automated and we don't have to worry about them beyond creating the lifecycle configuration and applying it to our bucket.

First, let's create a new file and name it lifecycle.json. Populate the file with the JSON below and save it.

{
    "Rules": [
        {
            "ID": "Move all files to STANDARD_IA",
            "Status": "Enabled",
            "Filter": {
                "Prefix": ""
            },
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                }
            ],
            "NoncurrentVersionTransitions": [
                {
                    "NoncurrentDays": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "NoncurrentDays": 365,
                    "StorageClass": "GLACIER"
                }
            ]
        }
    ]
}

The configuration above moves the current version of every file to STANDARD_IA 30 days after upload (Transitions). Once a file is overwritten or deleted locally, the old version moves to STANDARD_IA 30 days after becoming noncurrent (S3 requires objects to be at least 30 days old, or 30 days noncurrent, before they can transition to STANDARD_IA) and then on to Glacier 365 days after becoming noncurrent (NoncurrentVersionTransitions).

If you want to get rid of old versions, you can add a NoncurrentVersionExpiration action to your rule. Expired versions will be permanently deleted and cannot be recovered, so please be careful with this rule!

{
    "NoncurrentVersionExpiration": {
        "NoncurrentDays": 365
    }
}

See AWS documentation to learn more about lifecycle rules.

Finally, let's apply the lifecycle configuration to our S3 bucket.

aws s3api put-bucket-lifecycle-configuration \
--bucket my.backup.bucket \
--lifecycle-configuration file://lifecycle.json
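To double-check that the rules are in place, you can read the configuration back from the bucket:

# Print the lifecycle rules currently attached to the bucket
aws s3api get-bucket-lifecycle-configuration --bucket my.backup.bucket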

Now you have a bucket that's all set up for backups. The only thing left to do is scheduling the backups.

Create a Cron Job for Your Backups

If you're on a Mac or Linux, you can use crontab. Open your crontab in an editor.

crontab -e

Choose a schedule for your backups. The most important thing here is to choose a time when your computer is likely to be switched on. The following crontab entry runs the sync command once every five hours. You can, of course, choose a more frequent backup schedule to play it safe.

0 */5 * * * /usr/local/bin/aws s3 sync --sse --delete /path/to/local/dir s3://my.backup.bucket

A nice thing about the sync command is that it doesn't copy any files to S3 unless they have changed. This means you can afford to have it run fairly frequently as most invocations won't result in many file transfers.

Note that cron runs jobs with a minimal environment, so your usual PATH isn't available and the aws command may not be found by name. That's why the entry above uses the full path to aws; you can find yours by running which aws in your terminal.
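Because cron jobs run silently in the background, it can also be handy to append the command's output to a log file you can check later. One way to do it, assuming a log file in your home directory:

# Same backup job, but with output appended to a log file for troubleshooting
0 */5 * * * /usr/local/bin/aws s3 sync --sse --delete /path/to/local/dir s3://my.backup.bucket >> $HOME/s3-backup.log 2>&1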

Conclusion

Backing files up to S3 is so easy and cheap there's almost no reason not to do it. Please don't rely on one service though. I keep multiple backups of my important files, both locally and in the cloud, and I recommend you do the same.

That concludes the tutorial. Please leave a comment and let me know if you found it useful.