Tuesday, May 24, 2011

Setting up RAID 0 across ephemeral drives on EC2 instances (and surviving reboots!)

I've been experimenting with setting up RAID 0 across ephemeral drives on EC2 instances. The initial setup, be it with mdadm and lvm, or directly with lvm, is not that hard -- what has proven challenging is surviving reboots. Unless you perform certain tricks, your EC2 instance will be blissfully unaware of its new setup after a reboot. What's more, if you try to mount the new striped volume at boot time by adding it to /etc/fstab, chances are you won't even be able to ssh into the instance anymore. It happened to me many times while experimenting, hence this blog post.

Update: I realize I didn't go into details about the use case of this type of setup. This is useful if you don't want to incur EBS performance and reliability penalties, and yet you have a data set that is larger than the 400 GB offered by an individual ephemeral drive. Of course, if your instance dies, so do the ephemeral drives (after all they are named like this for a reason...) -- so make sure you have a good backup/disaster recovery strategy for the data you store there!

In the following, I will assume you want to set up RAID 0 across the four ephemeral drives that come with an EC2 m1.xlarge instance, and which are exposed as devices /dev/sdb through /dev/sde. By default, /dev/sdb is mounted as /mnt, while the other drives aren't mounted. 

I also assume you want to create 1 volume group encompassing the RAID 0 array, and within that volume group you want to create 2 logical volumes with associated XFS file systems, and also 1 logical volume for swap.

Step 1 - unmount /dev/sdb

# umount /dev/sdb

(also comment out the entry corresponding to /dev/sdb in /etc/fstab)
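
If you'd rather script that fstab edit than do it by hand, here is a minimal sketch. The helper name comment_out_device is my own invention. It assumes GNU sed and that the entry starts with the literal device name rather than a LABEL= or UUID= reference.

```shell
#!/bin/sh
# Hypothetical helper: comment out the active fstab entry for one device.
# Usage (as root): comment_out_device /dev/sdb /etc/fstab
comment_out_device() {
    dev="$1"
    fstab="$2"
    # Keep a backup next to the original before editing in place.
    cp "$fstab" "$fstab.bak"
    # Prefix '#' to lines that begin with the device name followed by
    # whitespace, so /dev/sdb matches but /dev/sdb1 does not.
    sed -i "s|^\($dev[[:space:]]\)|#\1|" "$fstab"
}
```

Running it against a copy of /etc/fstab first is a cheap way to confirm that only the intended line changes.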

Step 2 - install lvm2 and mdadm

For an unattended install of these packages (slightly complicated by the fact that mdadm also needs postfix), I do:

# DEBIAN_FRONTEND=noninteractive apt-get -y install mdadm lvm2

Step 3 - manually load the dm-mod module

# modprobe dm-mod

(this seems to be a bug in devmapper in Ubuntu)

If you want to set up RAID 0 via lvm directly, you can skip steps 4 and 5. From what I've read, you get better performance if you do the RAID 0 setup with mdadm. Also, if you need any other RAID level, you need to use mdadm.

Step 4 - configure RAID 0 array via mdadm

# mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

Verify:

# mdadm --detail /dev/md0
/dev/md0:
        Version : 00.90
  Creation Time : Mon May 23 22:35:20 2011
     Raid Level : raid0
     Array Size : 1761463296 (1679.86 GiB 1803.74 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon May 23 22:35:20 2011
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

     Chunk Size : 256K

           UUID : 03f63ee3:607fb777:f9441841:42247c4d (local to host adb08lvm)
         Events : 0.1

    Number   Major   Minor   RaidDevice State
       0       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
       3       8       64        3      active sync   /dev/sde

Step 5 - set the read-ahead to 65536 sectors (32 MB) for better performance

# blockdev --setra 65536 /dev/md0
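
A note on units: blockdev --setra takes a count of 512-byte sectors, not a size in bytes. The small sketch below converts in both directions so you can see what a given value amounts to. The function names are mine, for illustration only.

```shell
#!/bin/sh
# blockdev --setra / --getra work in 512-byte sectors.
SECTOR_BYTES=512

# Convert a read-ahead value in sectors to KiB.
sectors_to_kib() { echo $(( $1 * SECTOR_BYTES / 1024 )); }

# Convert a desired read-ahead in KiB to the sector count blockdev expects.
kib_to_sectors() { echo $(( $1 * 1024 / SECTOR_BYTES )); }

sectors_to_kib 65536   # prints 32768 (KiB), i.e. 32 MiB
kib_to_sectors 64      # prints 128 (sectors), i.e. 64 KiB
```

So the command above sets a 32 MiB read-ahead; a 64 KB read-ahead would be --setra 128. You can read the current value back with blockdev --getra /dev/md0.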

Step 6 - create physical volume from the RAID 0 array

# pvcreate /dev/md0

(if you didn't want to use mdadm, you would call pvcreate against each of the /dev/sdb through /dev/sde devices)

Step 7 - create volume group called vg0 spanning the RAID 0 array

# vgcreate vg0 /dev/md0

(if you didn't want to use mdadm, you would run vgcreate and specify the 4 devices /dev/sdb through /dev/sde)

Verify:

# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "vg0" using metadata type lvm2

# pvscan
  PV /dev/md0   VG vg0   lvm2 [1.64 TiB / 679.86 GiB free]
  Total: 1 [1.64 TiB] / in use: 1 [1.64 TiB] / in no VG: 0 [0   ]

Step 8 - create 3 logical volumes within the vg0 volume group

Each local drive is 400 GB, so the total size for the volume group is 1.6 TB. I'll create 2 logical volumes at 500 GB each, and a 10 GB logical volume for swap.

# lvcreate --name data1 --size 500G vg0
# lvcreate --name data2 --size 500G vg0
# lvcreate --name swap --size 10G vg0

Verify:

# lvscan
  ACTIVE            '/dev/vg0/data1' [500.00 GiB] inherit
  ACTIVE            '/dev/vg0/data2' [500.00 GiB] inherit
  ACTIVE            '/dev/vg0/swap' [10.00 GiB] inherit
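
For the lvm-only variant mentioned in steps 6 and 7, the striping has to be requested when the logical volumes are created, otherwise lvm allocates them linearly across the drives. Here is a sketch of what that could look like. The stripe size of 256 KB is an assumption on my part, chosen to mirror the mdadm chunk size from step 4. The function only prints the commands so you can review them first.

```shell
#!/bin/sh
# Sketch of the lvm-only variant (no mdadm). Nothing is executed here:
# the function prints the commands; pipe its output to sh as root to run them.
DEVICES="/dev/sdb /dev/sdc /dev/sdd /dev/sde"

lvm_stripe_commands() {
    # Count the devices by loading them into the positional parameters.
    set -- $DEVICES
    ndev=$#
    echo "pvcreate $DEVICES"
    echo "vgcreate vg0 $DEVICES"
    # --stripes spreads each logical volume across all physical volumes;
    # --stripesize is given in KB (256 is an assumed value, see above).
    for lv in data1:500G data2:500G swap:10G; do
        echo "lvcreate --name ${lv%%:*} --size ${lv##*:} --stripes $ndev --stripesize 256 vg0"
    done
}

lvm_stripe_commands
```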

Step 9 - create XFS file systems and mount them

We'll create XFS file systems for the data1 and data2 logical volumes. The names of the devices used for mkfs are the ones displayed via the lvscan command above. Then we'll mount the 2 file systems as /data1 and /data2.

# mkfs.xfs /dev/vg0/data1
# mkfs.xfs /dev/vg0/data2
# mkdir /data1
# mkdir /data2
# mount -t xfs -o noatime /dev/vg0/data1 /data1
# mount -t xfs -o noatime /dev/vg0/data2 /data2

Step 10 - create and enable the swap volume

# mkswap /dev/vg0/swap
# swapon /dev/vg0/swap
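
A quick way to confirm that the mounts are really in place is to look them up by mount point in /proc/mounts. I check by mount point rather than by device because the kernel typically lists logical volumes under their /dev/mapper names. The helper name is_mounted is hypothetical.

```shell
#!/bin/sh
# is_mounted MOUNTPOINT [MOUNTS_FILE]
# Succeeds if something is mounted at MOUNTPOINT. MOUNTS_FILE defaults to
# /proc/mounts; the second argument exists only to make testing easy.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }' "${2:-/proc/mounts}"
}

for mp in /data1 /data2; do
    if is_mounted "$mp"; then
        echo "$mp: mounted"
    else
        echo "$mp: NOT mounted"
    fi
done
```

For the swap volume, swapon -s (or cat /proc/swaps) lists the active swap devices.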

At this point, you should have a fully functional setup. The slight problem is that if you add the newly created file systems to /etc/fstab and reboot, you may not be able to ssh back into your instance -- at least that's what happened to me. I was able to ping the IP of the instance, but ssh would fail.

I finally redid the whole thing on a new instance (I created the RAID 0 directly with lvm, bypassing the mdadm step), but didn't add the file systems to /etc/fstab. After rebooting and running lvscan, I noticed that the logical volumes I had created were all marked as 'inactive':

# lvscan
  inactive            '/dev/vg0/data1' [500.00 GiB] inherit
  inactive            '/dev/vg0/data2' [500.00 GiB] inherit
  inactive            '/dev/vg0/swap' [10.00 GiB] inherit

This was after I ran 'modprobe dm-mod' manually, otherwise the lvscan command would complain:

  /proc/misc: No entry for device-mapper found
  Is device-mapper driver missing from kernel?
  Failure to communicate with kernel device-mapper driver.

A Google search revealed this thread which offered a solution: run 'lvchange -ay' against each logical volume so that the volume becomes active. Only after doing this was I able to see the logical volumes and mount them.

So I added these lines to /etc/rc.local:

/sbin/modprobe dm-mod
/sbin/lvscan
/sbin/lvchange -ay /dev/vg0/data1
/sbin/lvchange -ay /dev/vg0/data2
/sbin/lvchange -ay /dev/vg0/swap
/bin/mount -t xfs -o noatime /dev/vg0/data1  /data1
/bin/mount -t xfs -o noatime /dev/vg0/data2  /data2
/sbin/swapon /dev/vg0/swap

After a reboot, everything was working as expected. Note that I am doing the mounting of the file systems and the enabling of the swap within the rc.local script, and not via /etc/fstab. If you try to do it in fstab, it is too early in the boot sequence, so the logical volumes will be inactive and the mount will fail, with the dire consequence that you won't be able to ssh back into your instance (at least in my case).
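
The rc.local lines above could also be written a bit more defensively. The sketch below is not what I ran; it is one possible variant. It uses vgchange -ay to activate every logical volume in vg0 in a single call, and it reports a failed step instead of stopping, which matters if your rc.local runs under sh -e. The DRY_RUN switch and the function names are my own additions. It covers only the lvm part, so with mdadm the array still has to be assembled first.

```shell
#!/bin/sh
# Sketch of an alternative rc.local body. DRY_RUN defaults to 1, so the
# script only prints what it would do; set DRY_RUN=0 to execute (as root).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        # Report a failure but keep going, so one bad step cannot abort
        # the rest of the boot script.
        "$@" || echo "rc.local: '$*' failed" >&2
    fi
}

bring_up_storage() {
    run /sbin/modprobe dm-mod
    # Activate all logical volumes in the volume group at once.
    run /sbin/vgchange -ay vg0
    for name in data1 data2; do
        run /bin/mount -t xfs -o noatime "/dev/vg0/$name" "/$name"
    done
    run /sbin/swapon /dev/vg0/swap
}

bring_up_storage
```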

This was still not enough when creating the RAID 0 array with mdadm. When I used mdadm, even when adding the lines above to /etc/rc.local, the /dev/md0 device was not there after the reboot, so the mount would still fail. The thread I mentioned above does discuss this case at some point, and I also found a Server Fault thread on this topic. The solution in my case was to modify the mdadm configuration file /etc/mdadm/mdadm.conf and:

a) change the DEVICE variable to point to my 4 devices:

DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde

b) add an ARRAY variable containing the UUID of /dev/md0 (which you can get via 'mdadm --detail /dev/md0'):

ARRAY /dev/md0 level=raid0 num-devices=4 UUID=03f63ee3:607fb777:f9441841:42247c4d

This change, together with the custom lines in /etc/rc.local, finally enabled me to have a functional RAID 0 array and functional file systems and swap across the ephemeral drives in my EC2 instance.
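
Rather than copying the UUID by hand, the ARRAY line from (b) can be derived from the 'mdadm --detail' output. The sketch below assumes the output layout shown in step 4 (field names and positions may differ with other mdadm or metadata versions), and the function name is hypothetical. Note that 'mdadm --detail --scan' also prints ready-made ARRAY lines, though not necessarily in exactly this form.

```shell
#!/bin/sh
# array_line_from_detail DEVICE
# Reads `mdadm --detail` output on stdin and prints a matching ARRAY line
# for mdadm.conf. Assumes the field layout shown in step 4.
array_line_from_detail() {
    awk -v dev="$1" '
        $1 == "Raid" && $2 == "Level"   { level = $4 }
        $1 == "Raid" && $2 == "Devices" { ndev = $4 }
        $1 == "UUID"                    { uuid = $3 }
        END {
            printf "ARRAY %s level=%s num-devices=%s UUID=%s\n", dev, level, ndev, uuid
        }
    '
}

# Usage (as root):
#   mdadm --detail /dev/md0 | array_line_from_detail /dev/md0
```

Check the printed line before appending it to /etc/mdadm/mdadm.conf.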

I hope this will be useful to somebody out there and will spare them some of the head-against-the-wall moments that I had to go through.

6 comments:

Don said...

What's the use case for this sort of thing? As I'm sure you know, if your instance becomes permanently unreachable or fails, your data is toast.

Grig Gheorghiu said...

Don -- see my update in the initial post. Thanks for the comment!

Amelia @ IT Management said...

This is informative, thanks.

But one quick question: Why would you setup RAID 0? Isn't EC2 fast enough? How about RAID 5? Is your guide about speed over reliability?

Dan Pasette said...

For Step 5, you say you want to set the block size to 64KB, but the blockdev cmd posted is actually setting the "read ahead" to 65,536 512 byte sectors (32MB), which might be what you wish, but is misleading.

Unknown said...

I adapted these instructions for Ubuntu 12.04. Ephemeral storage disappears if the instance is stopped anyway so I mostly just use this for the "tmpdir" for mySQL. Load on the server went way down, it was getting hammered every time someone did a SORT on a 3+ million row table. https://dl.dropbox.com/u/6943630/raid0_ec2_Ubuntu_1204.txt

André said...

How sure are you that the ephemeral drives are individual physical drives? The performance will suffer a lot with RAID 0 or higher if they are virtual.