Update: I realize I didn't go into details about the use case for this type of setup. It is useful if you don't want to incur the EBS performance and reliability penalties, and yet you have a data set larger than the 400 GB offered by an individual ephemeral drive. Of course, if your instance dies, so do the ephemeral drives (after all, they are named that way for a reason...) -- so make sure you have a good backup/disaster recovery strategy for the data you store there!
In the following, I will assume you want to set up RAID 0 across the four ephemeral drives that come with an EC2 m1.xlarge instance, and which are exposed as devices /dev/sdb through /dev/sde. By default, /dev/sdb is mounted as /mnt, while the other drives aren't mounted.
I also assume you want to create 1 volume group encompassing the RAID 0 array, and within that volume group you want to create 2 logical volumes with associated XFS file systems, and also 1 logical volume for swap.
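Before you start, it doesn't hurt to double-check how the ephemeral drives are mapped on your particular instance. One way to do it (assuming the EC2 metadata service is reachable, which it normally is from within the instance) is to query the block device mapping:
# curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/
# curl -s http://169.254.169.254/latest/meta-data/block-device-mapping/ephemeral0
The first command lists the mapping names (ami, ephemeral0, ephemeral1, etc.), and the second one shows which device a given mapping corresponds to (sdb for the setup described here).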
Step 1 - unmount /dev/sdb
# umount /dev/sdb
(also comment out the entry corresponding to /dev/sdb in /etc/fstab)
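If you prefer to script that edit, and assuming the fstab entry for /mnt starts with /dev/sdb, something along these lines should work:
# sed -i 's|^/dev/sdb|#/dev/sdb|' /etc/fstab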
Step 2 - install lvm2 and mdadm
For an unattended install of these packages (slightly complicated by the fact that mdadm also needs postfix), I do:
# DEBIAN_FRONTEND=noninteractive apt-get -y install mdadm lvm2
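An alternative, if you'd rather not rely on the noninteractive frontend, is to preseed the postfix question that mdadm drags in (the 'No configuration' value assumes you don't actually want postfix set up for outgoing mail):
# echo "postfix postfix/main_mailer_type select No configuration" | debconf-set-selections
# apt-get -y install mdadm lvm2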
Step 3 - manually load the dm-mod module
# modprobe dm-mod
(the device-mapper module isn't loaded automatically here, which seems to be a bug in devmapper on Ubuntu)
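If you'd rather not load the module by hand (or from rc.local, as I end up doing below), you can also list it in /etc/modules so it gets loaded at every boot:
# echo dm-mod >> /etc/modules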
If you want to set up RAID 0 via lvm directly, you can skip steps 4 and 5. From what I've read, you get better performance if you do the RAID 0 setup with mdadm. Also, if you need any other RAID level, you need to use mdadm.
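For reference, the lvm-only path is roughly this: run pvcreate and vgcreate against the raw devices, then tell lvcreate to stripe across all 4 physical volumes (--stripesize is in KB, so 256 matches the 256 KB chunk I use with mdadm below; the names and sizes are just the ones from this post):
# pvcreate /dev/sdb /dev/sdc /dev/sdd /dev/sde
# vgcreate vg0 /dev/sdb /dev/sdc /dev/sdd /dev/sde
# lvcreate --name data1 --size 500G --stripes 4 --stripesize 256 vg0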
Step 4 - configure RAID 0 array via mdadm
# mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde
Verify:
# mdadm --detail /dev/md0
/dev/md0:
Version : 00.90
Creation Time : Mon May 23 22:35:20 2011
Raid Level : raid0
Array Size : 1761463296 (1679.86 GiB 1803.74 GB)
Raid Devices : 4
Total Devices : 4
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Mon May 23 22:35:20 2011
State : clean
Active Devices : 4
Working Devices : 4
Failed Devices : 0
Spare Devices : 0
Chunk Size : 256K
UUID : 03f63ee3:607fb777:f9441841:42247c4d (local to host adb08lvm)
Events : 0.1
    Number   Major   Minor   RaidDevice   State
       0        8      16         0       active sync   /dev/sdb
       1        8      32         1       active sync   /dev/sdc
       2        8      48         2       active sync   /dev/sdd
       3        8      64         3       active sync   /dev/sde
Step 5 - increase the read-ahead on the array for better performance
# blockdev --setra 65536 /dev/md0
(note that this sets the read-ahead to 65,536 sectors of 512 bytes each, i.e. 32 MB -- it does not change the block size)
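You can check the value currently in effect with:
# blockdev --getra /dev/md0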
Step 6 - create physical volume from the RAID 0 array
# pvcreate /dev/md0
(if you didn't want to use mdadm, you would call pvcreate against each of the /dev/sdb through /dev/sde devices)
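To double-check the physical volume before moving on:
# pvdisplay /dev/md0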
Step 7 - create volume group called vg0 spanning the RAID 0 array
# vgcreate vg0 /dev/md0
(if you didn't want to use mdadm, you would run vgcreate and specify the 4 devices /dev/sdb through /dev/sde)
Verify:
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vg0" using metadata type lvm2
# pvscan
PV /dev/md0 VG vg0 lvm2 [1.64 TiB / 679.86 GiB free]
Total: 1 [1.64 TiB] / in use: 1 [1.64 TiB] / in no VG: 0 [0 ]
Step 8 - create 3 logical volumes within the vg0 volume group
Each local drive is 400 GB, so the total size for the volume group is 1.6 TB. I'll create 2 logical volumes at 500 GB each, and a 10 GB logical volume for swap.
# lvcreate --name data1 --size 500G vg0
# lvcreate --name data2 --size 500G vg0
# lvcreate --name swap --size 10G vg0
Verify:
# lvscan
ACTIVE '/dev/vg0/data1' [500.00 GiB] inherit
ACTIVE '/dev/vg0/data2' [500.00 GiB] inherit
ACTIVE '/dev/vg0/swap' [10.00 GiB] inherit
Step 9 - create XFS file systems and mount them
We'll create XFS file systems for the data1 and data2 logical volumes. The names of the devices used for mkfs are the ones displayed via the lvscan command above. Then we'll mount the 2 file systems as /data1 and /data2.
# mkfs.xfs /dev/vg0/data1
# mkfs.xfs /dev/vg0/data2
# mkdir /data1
# mkdir /data2
# mount -t xfs -o noatime /dev/vg0/data1 /data1
# mount -t xfs -o noatime /dev/vg0/data2 /data2
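Optionally, you can tell mkfs.xfs about the underlying RAID geometry via the stripe unit and stripe width options when creating the file systems, e.g. instead of the plain mkfs.xfs calls above -- this is a sketch matching the 256 KB chunk and 4 drives used here (newer mkfs.xfs versions often detect the geometry on their own, so treat it as a tuning knob rather than a requirement):
# mkfs.xfs -d su=256k,sw=4 /dev/vg0/data1
# mkfs.xfs -d su=256k,sw=4 /dev/vg0/data2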
Step 10 - create and enable swap partition
# mkswap /dev/vg0/swap
# swapon /dev/vg0/swap
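Verify:
# swapon -s
# free -m
(the swap logical volume should show up in both)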
At this point, you should have a fully functional setup. The slight problem is that if you add the newly created file systems to /etc/fstab and reboot, you may not be able to ssh back into your instance -- at least that's what happened to me. I was able to ping the IP of the instance, but ssh would fail.
I finally redid the whole thing on a new instance (I created the RAID 0 directly with lvm, bypassing the mdadm step), but didn't add the file systems to /etc/fstab. After rebooting and running lvscan, I noticed that the logical volumes I had created were all marked as 'inactive':
# lvscan
inactive '/dev/vg0/data1' [500.00 GiB] inherit
inactive '/dev/vg0/data2' [500.00 GiB] inherit
inactive '/dev/vg0/swap' [10.00 GiB] inherit
This was after I ran 'modprobe dm-mod' manually; otherwise, the lvscan command would complain:
/proc/misc: No entry for device-mapper found
Is device-mapper driver missing from kernel?
Failure to communicate with kernel device-mapper driver.
A Google search revealed a thread which offered a solution: run 'lvchange -ay' against each logical volume so that the volume becomes active. Only after doing this was I able to see the logical volumes and mount them.
So I added these lines to /etc/rc.local:
/sbin/modprobe dm-mod
/sbin/lvscan
/sbin/lvchange -ay /dev/vg0/data1
/sbin/lvchange -ay /dev/vg0/data2
/sbin/lvchange -ay /dev/vg0/swap
/bin/mount -t xfs -o noatime /dev/vg0/data1 /data1
/bin/mount -t xfs -o noatime /dev/vg0/data2 /data2
/sbin/swapon /dev/vg0/swap
After a reboot, everything was working as expected. Note that I mount the file systems and enable the swap from the rc.local script, not via /etc/fstab. If you try to do it in fstab, it is too early in the boot sequence: the logical volumes are still inactive, the mount fails, and the dire consequence (at least in my case) is that you can't ssh back into the instance.
This was still not enough when the RAID 0 array was created with mdadm: even with the lines above in /etc/rc.local, the /dev/md0 device was not there after the reboot, so the mount would still fail. The thread I mentioned above does discuss this case at some point, and I also found a Server Fault thread on the topic. The solution in my case was to modify the mdadm configuration file /etc/mdadm/mdadm.conf and:
a) change the DEVICE variable to point to my 4 devices:
DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde
b) add an ARRAY variable containing the UUID of /dev/md0 (which you can get via 'mdadm --detail /dev/md0'):
ARRAY /dev/md0 level=raid0 num-devices=4 UUID=03f63ee3:607fb777:f9441841:42247c4d
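Rather than typing the ARRAY line by hand, you can let mdadm generate it and append it to the config file (just double-check afterwards that you don't end up with duplicate ARRAY lines):
# mdadm --detail --scan >> /etc/mdadm/mdadm.conf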
This change, together with the custom lines in /etc/rc.local, finally enabled me to have a functional RAID 0 array and functional file systems and swap across the ephemeral drives in my EC2 instance.
I hope this will be useful to somebody out there and will spare you some of the head-against-the-wall moments I had to go through....
6 comments:
What's the use case for this sort of thing? As I'm sure you know, if your instance becomes permanently unreachable or fails, your data is toast.
Don -- see my update in the initial post. Thanks for the comment!
This is informative, thanks.
But one quick question: why would you set up RAID 0? Isn't EC2 fast enough? How about RAID 5? Is your guide about speed over reliability?
For Step 5, you say you want to set the block size to 64KB, but the blockdev cmd posted is actually setting the "read ahead" to 65,536 512 byte sectors (32MB), which might be what you wish, but is misleading.
I adapted these instructions for Ubuntu 12.04. Ephemeral storage disappears if the instance is stopped anyway, so I mostly just use this for the "tmpdir" for MySQL. Load on the server went way down; it was getting hammered every time someone did a SORT on a 3+ million row table. https://dl.dropbox.com/u/6943630/raid0_ec2_Ubuntu_1204.txt
How sure are you that the ephemeral drives are individual physical drives? Performance will suffer a lot with RAID 0 or higher if they are virtual.