Tuesday, December 21, 2010

Using libcloud to manage instances across multiple cloud providers

More and more organizations are moving to ‘the cloud’ these days. In most cases, using ‘the cloud’ means buying compute and storage capacity from a public cloud vendor such as Amazon, Rackspace, GoGrid, Linode, etc. I believe that the next step in cloud usage will be deploying instances across multiple cloud providers, mainly for high availability, but also for performance reasons (for example if a specific provider has a presence in a geographical region closer to your user base).

All cloud vendors offer APIs for accessing their services -- if they don’t, they’re not a genuine cloud vendor in my book at least. The onus is on you as a system administrator to learn how to use these APIs, which can vary wildly from one provider to another. Enter libcloud, a Python-based package that offers a unified interface to various cloud provider APIs. The list of supported vendors is impressive, and more are added all the time. Libcloud was started by Cloudkick but has since migrated to the Apache Foundation as an Incubator Project.

One thing to note is that libcloud goes for breadth at the expense of depth, in that it only supports a subset of the available provider APIs -- things such as creating, rebooting, destroying an instance, and listing all instances. If you need to go in-depth with a given provider’s API, you need to use other libraries that cover all or at least a large portion of the functionality exposed by the API. Examples of such libraries are boto for Amazon EC2 and python-cloudservers for Rackspace.

Introducing libcloud

The current stable version of libcloud is 0.4.0. You can install it from PyPI via

# easy_install apache-libcloud

The main concepts of libcloud are providers, drivers, images, sizes and locations.

A provider is a cloud vendor such as Amazon EC2 and Rackspace. Note that currently each EC2 region (US East, US West, EU West, Asia-Pacific Southeast) is exposed as a different provider, although they may be unified in the future.
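As a quick illustration, each region gets its own driver. The constant names for the non-default regions below are assumptions on my part -- check libcloud/types.py in your installed version:


from libcloud.types import Provider
from libcloud.providers import get_driver

# Provider.EC2 corresponds to the US East region; the other regions are
# exposed as separate provider constants (names assumed, verify in your version)
us_east_driver = get_driver(Provider.EC2)
us_west_driver = get_driver(Provider.EC2_US_WEST)
eu_west_driver = get_driver(Provider.EC2_EU_WEST)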

The common operations supported by libcloud are exposed for each provider through a driver. If you want to add another provider, you need to create a new driver and implement the interface common to all providers (in the Python code, this is done by subclassing a base NodeDriver class and overriding/adding methods appropriately, according to the specific needs of the provider).
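To give an idea of what this looks like, here is a minimal sketch of a driver for a hypothetical provider (method bodies omitted, and I'm assuming the base class lives in libcloud.base, as it does in the 0.4.x series); the real drivers under libcloud/drivers/ are of course considerably more involved:


from libcloud.base import NodeDriver

class MyCloudNodeDriver(NodeDriver):
    """Sketch of a driver for a hypothetical 'MyCloud' provider."""
    name = 'MyCloud'

    def list_nodes(self):
        # call the provider's API and convert the response into Node objects
        raise NotImplementedError

    def create_node(self, **kwargs):
        # provision a new instance and return it as a Node object
        raise NotImplementedError

    def reboot_node(self, node):
        raise NotImplementedError

    def destroy_node(self, node):
        raise NotImplementedError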

Images are provider-dependent, and generally represent the OS flavors available for deployment for a given provider. In EC2-speak, they are equivalent to an AMI.

Sizes are provider-dependent, and represent the amount of compute, storage and network capacity that a given instance will use when deployed. The more capacity, the more you pay and the happier the provider.

Locations correspond to geographical data center locations available for a given provider; however, they are not very well represented in libcloud. For example, in the case of Amazon EC2, they currently map to EC2 regions rather than EC2 availability zones. However, this will change in the near future (as I will describe below, proper EC2 availability zone management is being implemented). As another example, Rackspace is represented in libcloud as a single location, listed currently as DFW1; however, your instances will get deployed at a data center determined at your Rackspace account creation time (thanks to Paul Querna for clarifying this aspect).

Managing instances with libcloud

Getting a connection to a provider via a driver

In libcloud, all interactions with a given cloud provider happen through a connection obtained via the driver for that provider. Here is the canonical code snippet for that, taking EC2 as an example:


from libcloud.types import Provider
from libcloud.providers import get_driver
EC2_ACCESS_ID = 'MY ACCESS ID'
EC2_SECRET_KEY = 'MY SECRET KEY'
EC2Driver = get_driver(Provider.EC2)
conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)


For Rackspace, the code looks like this:


USER = 'MyUser'
API_KEY = 'MyApiKey'
Driver = get_driver(Provider.RACKSPACE)
conn = Driver(USER, API_KEY)


Getting a list of images available for a provider

Once you get a connection, you can call a variety of informational methods on it, for example list_images, which returns a list of NodeImage objects. Be prepared for this call to take a while, especially against Amazon EC2, whose US East region currently returns no fewer than 6,982 images. Here is a code snippet that prints the number of available images, and the first 5 images returned in the list:


EC2Driver = get_driver(Provider.EC2)
conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)
images = conn.list_images()
print len(images)
print images[:5]
6982
[<NodeImage: id=aki-00806369, name=karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091001-test-04.manifest.xml, driver=Amazon EC2 (us-east-1) ...>, <NodeImage: id=aki-00896a69, name=karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091002-test-04.manifest.xml, driver=Amazon EC2 (us-east-1) ...>, <NodeImage: id=aki-008b6869, name=redhat-cloud/RHEL-5-Server/5.4/x86_64/kernels/kernel-2.6.18-164.x86_64.manifest.xml, driver=Amazon EC2 (us-east-1) ...>, <NodeImage: id=aki-00f41769, name=karmic-kernel-zul/ubuntu-kernel-2.6.31-301-ec2-i386-20091012-test-06.manifest.xml, driver=Amazon EC2 (us-east-1) ...>, <NodeImage: id=aki-010be668, name=ubuntu-kernels-milestone-us/ubuntu-lucid-i386-linux-image-2.6.32-301-ec2-v-2.6.32-301.4-kernel.img.manifest.xml, driver=Amazon EC2 (us-east-1) ...>]


Here is the output of the same code running against the Rackspace driver:


23
[<NodeImage: id=58, name=Windows Server 2008 R2 x64 - MSSQL2K8R2, driver=Rackspace ...>, <NodeImage: id=71, name=Fedora 14, driver=Rackspace ...>, <NodeImage: id=29, name=Windows Server 2003 R2 SP2 x86, driver=Rackspace ...>, <NodeImage: id=40, name=Oracle EL Server Release 5 Update 4, driver=Rackspace ...>, <NodeImage: id=23, name=Windows Server 2003 R2 SP2 x64, driver=Rackspace ...>]


Note that a NodeImage object for a given provider may carry provider-specific information, stored in most cases in a member variable called ‘extra’. It pays to inspect the NodeImage objects by printing their __dict__ member variable. Here is an example for EC2:


print images[0].__dict__
{'extra': {}, 'driver': <libcloud.drivers.ec2.EC2NodeDriver object at 0xb7eebfec>, 'id': 'aki-00806369', 'name': 'karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091001-test-04.manifest.xml'}


In this case, the NodeImage object has an id, a name and a driver, with no ‘extra’ information.

Same code running against Rackspace, with similar information being returned:


print images[0].__dict__
{'extra': {'serverId': None}, 'driver': <libcloud.drivers.rackspace.RackspaceNodeDriver object at 0x88b506c>, 'id': '4', 'name': 'Debian 5.0 (lenny)'}


Getting a list of sizes available for a provider

When you call list_sizes on a connection to a provider, you retrieve a list of NodeSize objects representing the available sizes for that provider.

Amazon EC2 example:


EC2Driver = get_driver(Provider.EC2)
conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)
sizes = conn.list_sizes()
print len(sizes)
print sizes[:5]
print sizes[0].__dict__
9
[<NodeSize: id=m1.large, name=Large Instance, ram=7680 disk=850 bandwidth=None price=.38 driver=Amazon EC2 (us-east-1) ...>, <NodeSize: id=c1.xlarge, name=High-CPU Extra Large Instance, ram=7680 disk=1690 bandwidth=None price=.76 driver=Amazon EC2 (us-east-1) ...>, <NodeSize: id=m1.small, name=Small Instance, ram=1740 disk=160 bandwidth=None price=.095 driver=Amazon EC2 (us-east-1) ...>, <NodeSize: id=c1.medium, name=High-CPU Medium Instance, ram=1740 disk=350 bandwidth=None price=.19 driver=Amazon EC2 (us-east-1) ...>, <NodeSize: id=m1.xlarge, name=Extra Large Instance, ram=15360 disk=1690 bandwidth=None price=.76 driver=Amazon EC2 (us-east-1) ...>]
{'name': 'Large Instance', 'price': '.38', 'ram': 7680, 'driver': <libcloud.drivers.ec2.EC2NodeDriver object at 0xb7f49fec>, 'bandwidth': None, 'disk': 850, 'id': 'm1.large'}


Same code running against Rackspace:


7
[<NodeSize: id=1, name=256 server, ram=256 disk=10 bandwidth=None price=.015 driver=Rackspace ...>, <NodeSize: id=2, name=512 server, ram=512 disk=20 bandwidth=None price=.030 driver=Rackspace ...>, <NodeSize: id=3, name=1GB server, ram=1024 disk=40 bandwidth=None price=.060 driver=Rackspace ...>, <NodeSize: id=4, name=2GB server, ram=2048 disk=80 bandwidth=None price=.120 driver=Rackspace ...>, <NodeSize: id=5, name=4GB server, ram=4096 disk=160 bandwidth=None price=.240 driver=Rackspace ...>]
{'name': '256 server', 'price': '.015', 'ram': 256, 'driver': <libcloud.drivers.rackspace.RackspaceNodeDriver object at 0x841506c>, 'bandwidth': None, 'disk': 10, 'id': '1'}


Getting a list of locations available for a provider

As I mentioned before, locations are somewhat ambiguous currently in libcloud.

For example, when you call list_locations on a connection to the EC2 provider (which represents the EC2 US East region), you get information about the region and not about the availability zones (AZs) included in that region:


EC2Driver = get_driver(Provider.EC2)
conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)
print conn.list_locations()
[<NodeLocation: id=0, name=Amazon US N. Virginia, country=US, driver=Amazon EC2 (us-east-1)>]


However, there is a patch sent by Tomaž Muraus to the libcloud mailing list which adds support for EC2 availability zones. For example, the US East region has 4 AZs: us-east-1a, us-east-1b, us-east-1c, us-east-1d. These AZs should be represented by libcloud locations, and indeed the code with the patch applied shows just that:


print conn.list_locations()
[<EC2NodeLocation: id=0, name=Amazon US N. Virginia, country=US, availability_zone=us-east-1a driver=Amazon EC2 (us-east-1)>, <EC2NodeLocation: id=1, name=Amazon US N. Virginia, country=US, availability_zone=us-east-1b driver=Amazon EC2 (us-east-1)>, <EC2NodeLocation: id=2, name=Amazon US N. Virginia, country=US, availability_zone=us-east-1c driver=Amazon EC2 (us-east-1)>, <EC2NodeLocation: id=3, name=Amazon US N. Virginia, country=US, availability_zone=us-east-1d driver=Amazon EC2 (us-east-1)>]


Hopefully the patch will make it soon into the libcloud github repository, and then into the next libcloud release.

(Update 02/24/11: The patch did make it into the latest libcloud release, which is 0.4.2 at this time.)

If you run list_locations on a Rackspace connection, you get back DFW1, even though your instances may actually get deployed at a different data center. Hopefully this too will be fixed soon in libcloud:


Driver = get_driver(Provider.RACKSPACE)
conn = Driver(USER, API_KEY)
print conn.list_locations()
[<NodeLocation: id=0, name=Rackspace DFW1, country=US, driver=Rackspace>]


Launching an instance

The API call for launching an instance with libcloud is create_node. It has 3 required parameters: a name for your new instance, a NodeImage and a NodeSize. You can also specify a NodeLocation (if you don’t, the default location for that provider will be used).
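In its simplest form, the call looks like this (a sketch -- image and size are the NodeImage and NodeSize objects you selected, and conn is a provider connection obtained as shown earlier):


node = conn.create_node(name='mynode', image=image, size=size)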

EC2 node creation example

A given provider driver may accept other parameters to the create_node call. For example, EC2 accepts an ex_keyname argument for specifying the EC2 key you want to use when creating the instance.

Note that to create a node, you have to know which image and which size you want to use for it. This is where the code snippets shown above for retrieving the images and sizes available for a given provider come in handy: you can either retrieve the full list and iterate through it until you find the desired image and size (by name or by id), or you can construct NodeImage and NodeSize objects from scratch, based on the desired id.
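For the first approach, a minimal sketch that scans the image list for a known id (the AMI id below is just an illustration) would be:


desired_ami = 'ami-014da868'
image = None
for img in conn.list_images():
    if img.id == desired_ami:
        image = img
        break


The examples that follow take the second approach and construct the objects directly.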

Example of a NodeImage object for EC2 corresponding to a specific AMI:


# in libcloud 0.4.x, NodeImage can be imported from libcloud.base
from libcloud.base import NodeImage

image = NodeImage(id="ami-014da868", name="", driver="")


Example of a NodeSize object for EC2 corresponding to an m1.small instance size:


# NodeSize is also available from libcloud.base
from libcloud.base import NodeSize

size = NodeSize(id="m1.small", name="", ram=None, disk=None, bandwidth=None, price=None, driver="")


Note that in both examples the only parameter that needs to be set is the id; however, all the other parameters need to be present in the call, even if they are set to None or the empty string.

In the case of EC2, for the instance to be actually usable via ssh, you also need to pass the ex_keyname parameter and set it to a keypair name that exists in your EC2 account for that region. Libcloud provides a way to create or import a keypair programmatically. Here is a code snippet that creates a keypair via the ex_create_keypair call (specific to the libcloud EC2 driver), then saves the private key in a file in /root/.ssh on the machine running the code:


import os
import sys

keyname = sys.argv[1]
resp = conn.ex_create_keypair(name=keyname)
key_material = resp.get('keyMaterial')
if not key_material:
    sys.exit(1)
private_key = '/root/.ssh/%s.pem' % keyname
f = open(private_key, 'w')
f.write(key_material + '\n')
f.close()
os.chmod(private_key, 0600)


You can also pass the name of an EC2 security group to create_node via the ex_securitygroup parameter. Libcloud also allows you to create security groups programmatically by means of the ex_create_security_group method specific to the libcloud EC2 driver.
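For example, you could create a group and then reference it by name at node creation time. This is just a sketch -- I'm assuming ex_create_security_group takes a name and a description, so double-check the EC2 driver code:


# EC2 driver-specific call; the (name, description) signature is an assumption
conn.ex_create_security_group('web', 'security group for web servers')

# reference the group by name when launching the instance
node = conn.create_node(name='test1', image=image, size=size,
                        ex_keyname=keyname, ex_securitygroup='web')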

Now, armed with the NodeImage and NodeSize objects constructed above, as well as the keypair name, we can launch an instance in EC2:


node = conn.create_node(name='test1', image=image, size=size, ex_keyname=keyname)


Note that we didn’t specify any location, so we have no control over the availability zone where the instance will be created. With Tomaž’s patch we can actually get a location corresponding to our desired availability zone, then launch the instance in that zone. Here is an example for us-east-1b:


locations = conn.list_locations()
for location in locations:
    if location.availability_zone.name == 'us-east-1b':
        break
node = conn.create_node(name='tst', image=image, size=size, location=location, ex_keyname=keyname)


Once the node is created, you can call the list_nodes method on the connection object and inspect the current status of the node, along with other information about that node. In EC2, a new instance is initially shown with a status of ‘pending’. Once the status changes to ‘running’, you can ssh into that instance using the private key created above.
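A simple (if naive) way to wait for that transition is to poll list_nodes until the status stored in the node's ‘extra’ dictionary becomes ‘running’. Here is a sketch:


import time

def wait_for_running(conn, node_id, poll_interval=15):
    # poll list_nodes until the given node reports a 'running' status
    while True:
        for n in conn.list_nodes():
            if n.id == node_id and n.extra.get('status') == 'running':
                return n
        time.sleep(poll_interval)

node = wait_for_running(conn, node.id)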

Printing node.__dict__ for a newly created instance shows it with ‘pending’ status:


{'name': 'i-f692ae9b', 'extra': {'status': 'pending', 'productcode': [], 'groups': None, 'instanceId': 'i-f692ae9b', 'dns_name': '', 'launchdatetime': '2010-12-14T20:25:22.000Z', 'imageId': 'ami-014da868', 'kernelid': None, 'keyname': 'k1', 'availability': 'us-east-1d', 'launchindex': '0', 'ramdiskid': None, 'private_dns': '', 'instancetype': 'm1.small'}, 'driver': <libcloud.drivers.ec2.EC2NodeDriver object at 0x9e088ec>, 'public_ip': [''], 'state': 3, 'private_ip': [''], 'id': 'i-f692ae9b', 'uuid': '76fcd974aab6f50092e5a637d6edbac140d7542c'}


Printing node.__dict__ a few minutes after the instance was launched shows the instance with ‘running’ status:


{'name': 'i-f692ae9b', 'extra': {'status': 'running', 'productcode': [], 'groups': ['default'], 'instanceId': 'i-f692ae9b', 'dns_name': 'ec2-184-72-92-114.compute-1.amazonaws.com', 'launchdatetime': '2010-12-14T20:25:22.000Z', 'imageId': 'ami-014da868', 'kernelid': None, 'keyname': 'k1', 'availability': 'us-east-1d', 'launchindex': '0', 'ramdiskid': None, 'private_dns': 'domU-12-31-39-04-65-11.compute-1.internal', 'instancetype': 'm1.small'}, 'driver': <libcloud.drivers.ec2.EC2NodeDriver object at 0x93f42cc>, 'public_ip': ['ec2-184-72-92-114.compute-1.amazonaws.com'], 'state': 0, 'private_ip': ['domU-12-31-39-04-65-11.compute-1.internal'], 'id': 'i-f692ae9b', 'uuid': '76fcd974aab6f50092e5a637d6edbac140d7542c'}


Note also that the ‘extra’ member variable of the node object shows a wealth of information specific to EC2 -- things such as security group, AMI id, kernel id, availability zone, private and public DNS names, etc. Another interesting thing to note is that the name member variable of the node object is now set to the EC2 instance id, thus guaranteeing uniqueness of names across EC2 node objects.
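For example, once the node is running you can pull individual items out of ‘extra’ (using keys visible in the output above):


print node.extra['dns_name']      # public DNS name
print node.extra['availability']  # availability zone
print node.extra['keyname']       # EC2 keypair used at launch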

At this point (assuming the machine where you run the libcloud code is allowed ssh access into the default EC2 security group) you should be able to ssh into the newly created instance using the private key corresponding to the keypair you used to create the instance. In my case, I used the k1.pem private file created via ex_create_keypair and I ssh-ed into the private IP address of the new instance, because I was already on an EC2 instance in the same availability zone:

# ssh -i ~/.ssh/k1.pem domU-12-31-39-04-65-11.compute-1.internal


Rackspace node creation example

Here is another example of calling create_node, this time using Rackspace as the provider. Before I ran this code, I had already called list_images and list_sizes on the Rackspace connection object, so I know that I want the NodeImage with id 71 (which happens to be Fedora 14) and the NodeSize with id 1 (the smallest one). The code snippet below will create the node using the image and the size I specify, with a name that I also specify (this name needs to be different for each call to create_node):


Driver = get_driver(Provider.RACKSPACE)
conn = Driver(USER, API_KEY)
images = conn.list_images()
for image in images:
    if image.id == '71':
        break
sizes = conn.list_sizes()
for size in sizes:
    if size.id == '1':
        break
node = conn.create_node(name='testrackspace', image=image, size=size)
print node.__dict__


The code prints out:


{'name': 'testrackspace', 'extra': {'metadata': {}, 'password': 'testrackspaceO1jk6O5jV', 'flavorId': '1', 'hostId': '9bff080afbd3bec3ca140048311049f9', 'imageId': '71'}, 'driver': <libcloud.drivers.rackspace.RackspaceNodeDriver object at 0x877c3ec>, 'public_ip': ['184.106.187.226'], 'state': 3, 'private_ip': ['10.180.67.242'], 'id': '497741', 'uuid': '1fbf7c3fde339af9fa901af6bf0b73d4d10472bb'}


Note that the name variable of the node object was set to the name we specified in the create_node call. You don't log in to a Rackspace node with a key (at least not initially); instead, you're given a password you can use to log in as root to the public IP that is also returned in the node information:

# ssh root@184.106.187.226
root@184.106.187.226's password:
[root@testrackspace ~]#


Rebooting and destroying instances

Once you have a list of nodes in a given provider, it’s easy to iterate through the list and choose a given node based on its unique name -- which as we’ve seen is the instance id for EC2 and the hostname for Rackspace. Once you identify a node, you can call destroy_node or reboot_node on the connection object to terminate or reboot that node.

Here is a code snippet that performs a destroy_node operation for an EC2 instance with a specific instance id:


EC2Driver = get_driver(Provider.EC2)
conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)
nodes = conn.list_nodes()
for node in nodes:
    if node.name == 'i-66724d0b':
        conn.destroy_node(node)


Here is another code snippet that performs a reboot_node operation for a Rackspace node with a specific hostname:


Driver = get_driver(Provider.RACKSPACE)
conn = Driver(USER, API_KEY)
nodes = conn.list_nodes()
for node in nodes:
    if node.name == 'testrackspace':
        conn.reboot_node(node)


The Overmind project

I would be remiss if I didn’t mention a new but very promising project started by Miquel Torres: Overmind. The goal of Overmind is to be a complete server provisioning and configuration management system. For the server provisioning portion, Overmind uses libcloud, while also offering a Django-based Web interface for managing providers and nodes. EC2 and Rackspace are supported currently, but it should be easy to add new providers. If you are interested in trying out Overmind and contributing code or tests, please send a message to the overmind-dev mailing list. Next versions of Overmind aim to add configuration management capabilities using Opscode Chef.


Friday, December 10, 2010

A Fabric script for striping EBS volumes

Here's a short Fabric script which might be useful to people who need to stripe EBS volumes in Amazon EC2. Striping is recommended if you want to improve the I/O of your EBS-based volumes. Keep in mind, though, that striping offers no redundancy: with RAID0, if one of the member EBS volumes goes AWOL or suffers performance issues, the whole array is affected. In any case, here's the Fabric script:

import commands
from fabric.api import *

# Globals

env.project='EBSSTRIPING'
env.user = 'myuser'

DEVICES = [
    "/dev/sdd",
    "/dev/sde",
    "/dev/sdf",
    "/dev/sdg",
]

VOL_SIZE = 400 # GB

# Tasks

def install():
    install_packages()
    create_raid0()
    create_lvm()
    mkfs_mount_lvm()

def install_packages():
    run('DEBIAN_FRONTEND=noninteractive apt-get -y install mdadm')
    run('apt-get -y install lvm2')
    run('modprobe dm-mod')
    
def create_raid0():
    # derive --raid-devices from the DEVICES list so the two always stay in sync
    cmd = 'mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=%d ' % len(DEVICES)
    for device in DEVICES:
        cmd += '%s ' % device
    run(cmd)
    run('blockdev --setra 65536 /dev/md0')

def create_lvm():
    run('pvcreate /dev/md0')
    run('vgcreate vgm0 /dev/md0')
    run('lvcreate --name lvm0 --size %dG vgm0' % VOL_SIZE)

def mkfs_mount_lvm():
    run('mkfs.xfs /dev/vgm0/lvm0')
    run('mkdir -p /mnt/lvm0')
    run('echo "/dev/vgm0/lvm0 /mnt/lvm0 xfs defaults 0 0" >> /etc/fstab')
    run('mount /mnt/lvm0')

A few things to note:

  • I assume that you already created and attached 4 EBS volumes to your instance with device names /dev/sdd through /dev/sdg; if your device names or volume count are different, modify the DEVICES list appropriately
  • The size of your target RAID0 volume is set in the VOL_SIZE variable
  • The helper functions are pretty self-explanatory: 
    1. we use mdadm to create a RAID0 device called /dev/md0 with a 256 KB chunk size; we also increase the read-ahead of the device via the blockdev --setra call
    2. we create a physical LVM volume on /dev/md0
    3. we create a volume group called vgm0 on /dev/md0
    4. we create a logical LVM volume called lvm0 of size VOL_SIZE, inside the vgm0 group
    5. we format the logical volume as XFS, then we mount it and also modify /etc/fstab
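To run the whole sequence against an instance, invoke fab with the install task, along these lines (the hostname is just an example):

fab -f fabfile.py -H ec2-184-72-92-114.compute-1.amazonaws.com install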
That's it. Hopefully it will be useful to somebody out there.
