Friday, May 23, 2008

Incremental backups to Amazon S3

Based on this great blog post by Tim McCormack, I managed to write some scripts that back up files to Amazon S3. The files are encrypted with GnuPG and synced to S3 using duplicity, a Python-based tool built on librsync.

Here's what I did in order to get all this going on a CentOS 5.1 server running Python 2.5.

1) Signed up for Amazon S3 and got the AWS_ACCESS_KEY_ID and the AWS_SECRET_ACCESS_KEY.

2) Downloaded and installed the following packages: boto, GnuPGInterface, librsync, duplicity. All of them except librsync are Python-based, so they can be installed via 'python setup.py install'. For librsync you need to use './configure; make; make install'.
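
For what it's worth, here's the generic install sequence (the archive names are placeholders; substitute whatever versions you downloaded):

# boto, GnuPGInterface and duplicity are Python packages
tar xzf <package>.tar.gz
cd <package>
python setup.py install

# librsync is a C library and follows the usual autotools routine
tar xzf librsync-<version>.tar.gz
cd librsync-<version>
./configure
make
make install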

3) Generated a GPG key pair using "gpg --gen-key". Made a note of the hex fingerprint of the key (you can list the fingerprints of your keys via "gpg --fingerprint").
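
The two gpg commands in question (key generation is interactive and prompts for a name, email address and passphrase):

gpg --gen-key
gpg --fingerprint

The fingerprint shows up in the listing as a "Key fingerprint =" line; drop the spaces and you have the hex string to plug into the backup script below.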

4) Wrote a simple boto-based Python script to create and list S3 buckets (the equivalent of directories in S3 parlance). Note that boto uses SSL, so your Python installation needs to have SSL enabled.
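
One quick way to verify SSL support (this relies on socket.ssl, which pre-2.6 Pythons only define when built against OpenSSL):

python -c "import socket; socket.ssl" && echo "SSL support OK"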

Here's how the script looks:

#!/usr/bin/env python

ACCESS_KEY_ID = 'theaccesskeyid'
SECRET_ACCESS_KEY = 'thesecretaccesskey'

from boto.s3.connection import S3Connection

conn = S3Connection(ACCESS_KEY_ID, SECRET_ACCESS_KEY)

buckets = [
    'mybuckets_myserver_mysqldump',
    'mybuckets_myserver_full',
]

# create the buckets if they don't exist yet
for bucket in buckets:
    conn.create_bucket(bucket)

# list all buckets owned by this account
rs = conn.get_all_buckets()
print 'Bucket listing:'
for b in rs:
    print b.name

5) Wrote a bash script (heavily influenced by Tim McCormack's post) that runs duplicity and backs up the root partition of my Linux server (minus some directories) to S3. The nice thing about duplicity is that it uses the rsync algorithm, so it only transfers the diffs over the wire. Here's how my script looks:

export myEncryptionKeyFingerprint=somehexnumber
export mySigningKeyFingerprint=somehexnumber
export AWS_ACCESS_KEY_ID=accesskeyid
export AWS_SECRET_ACCESS_KEY=secretaccesskey
export PASSPHRASE=mypassphrase

/usr/local/bin/duplicity --encrypt-key=$myEncryptionKeyFingerprint \
    --sign-key=$mySigningKeyFingerprint --exclude=/sys --exclude=/dev \
    --exclude=/proc --exclude=/tmp --exclude=/mnt --exclude=/media / \
    s3+http://mybuckets_myserver_full

export AWS_ACCESS_KEY_ID=
export AWS_SECRET_ACCESS_KEY=
export PASSPHRASE=

NOTE: duplicity will interactively prompt you for your GPG key's passphrase unless an environment variable called PASSPHRASE contains it. Since I wanted to run this script as a cron job, I went the less secure route of keeping the passphrase in cleartext inside the script. YMMV.
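
For illustration, the cron entry for a nightly run could look something like this (the script path and schedule are made up):

# run the S3 backup script every night at 2 AM
0 2 * * * /usr/local/bin/s3_backup.sh >> /var/log/s3_backup.log 2>&1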

That's about it. Running the script produces an output such as this:

--------------[ Backup Statistics ]--------------
StartTime 1211482825.55 (Thu May 22 12:00:25 2008)
EndTime 1211488426.17 (Thu May 22 13:33:46 2008)
ElapsedTime 5600.62 (1 hour 33 minutes 20.62 seconds)
SourceFiles 174531
SourceFileSize 5080402735 (4.73 GB)
NewFiles 174531
NewFileSize 5080402735 (4.73 GB)
DeletedFiles 0
ChangedFiles 0
ChangedFileSize 0 (0 bytes)
ChangedDeltaSize 0 (0 bytes)
DeltaEntries 174531
RawDeltaSize 1200920038 (1.12 GB)
TotalDestinationSizeChange 2702953170 (2.52 GB)
Errors 0
-------------------------------------------------
The first time you run the script it will take a while, but subsequent runs will only back up the files that were changed since the last run. For example, my second run transferred only 19.3 MB:

--------------[ Backup Statistics ]--------------
StartTime 1211529638.99 (Fri May 23 01:00:38 2008)
EndTime 1211529784.18 (Fri May 23 01:03:04 2008)
ElapsedTime 145.19 (2 minutes 25.19 seconds)
SourceFiles 174522
SourceFileSize 5084478500 (4.74 GB)
NewFiles 64
NewFileSize 2280357 (2.17 MB)
DeletedFiles 28
ChangedFiles 418
ChangedFileSize 217974696 (208 MB)
ChangedDeltaSize 0 (0 bytes)
DeltaEntries 510
RawDeltaSize 2465010 (2.35 MB)
TotalDestinationSizeChange 20211663 (19.3 MB)
Errors 0

-------------------------------------------------
To restore files from S3, you use duplicity and specify the source as s3+http://mybuckets_myserver_full and the destination as a local directory.
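
Here's a sketch of a restore (the local target directory and the sample file path are made up; the AWS_* and PASSPHRASE variables need to be set just as in the backup script):

# restore the entire backup into /tmp/restore
/usr/local/bin/duplicity s3+http://mybuckets_myserver_full /tmp/restore

# or restore a single file via --file-to-restore (the path is relative to the backup root)
/usr/local/bin/duplicity --file-to-restore etc/hosts s3+http://mybuckets_myserver_full /tmp/hosts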

Thanks to Tim McCormack for his detailed blog post; it made things so much easier than digging up all this info via Google-fu.

2 comments:

Noah Gift said...

Awesome post Grig. This could be very valuable for a lot of companies, and small businesses.

I was also wondering why everyone wants Python with SSL, and now I know why.

Unknown said...

being a total noob I'm not sure exactly how to use the two files or how to modify them. Any help will be greatly appreciated.

Thank you
