<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-9238405</id><updated>2012-02-03T09:30:47.486-08:00</updated><category term='openvpn'/><category term='buildbot trac'/><category term='monitoring'/><category term='ubuntu'/><category term='cloud'/><category term='agile'/><category term='performance load httperf openwebload'/><category term='humor'/><title type='text'>Agile Testing</title><subtitle type='html'>Thoughts on testing and systems infrastructure with an agile, mostly Pythonic, twist.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><link rel='next' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default?start-index=101&amp;max-results=100'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><generator version='7.00' 
uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>464</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-9238405.post-221047676093181510</id><published>2012-01-24T09:03:00.000-08:00</published><updated>2012-01-24T09:03:01.484-08:00</updated><title type='text'>An ode to running a database on bare metal</title><content type='html'>&lt;br /&gt;No, my muse is not quite strong enough to inspire me to write an ode, but I still want to emphasize a few points about the goodness of running a database on bare metal.&lt;br /&gt;&lt;br /&gt;At Evite, we use sharded MySQL for our production database. We designed the current architecture in 2009, when NoSQL was still very much in its infancy, so MySQL seemed a solid choice, a technology that we could at least understand. As I explained&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2011/04/lessons-learned-from-deploying.html"&gt;elsewhere&lt;/a&gt;, we do use MySQL in an almost non-relational way, and we sharded from the get-go, with the idea that it's better to scale horizontally than vertically.&lt;br /&gt;&lt;br /&gt;We initially launched with the database hosted at a data center on a few Dell PE2970 servers, each with 16 GB of RAM and 2 quad-core CPUs. Each server was running 2 MySQL instances. We didn't get a chance to dark launch, but the initial load testing we did showed that we should be OK. However, there is nothing like production traffic to really stress test your infrastructure, and we soon realized that we had an insufficient number of servers for the peak traffic we were expecting towards the end of the year.&lt;br /&gt;&lt;br /&gt;We decided to scale horizontally in EC2, with one MySQL instance per m1.xlarge EC2 instance. 
At the time we also engaged Percona and they helped us fine-tune our Percona XtraDB MySQL configuration so we could get the most out of the m1.xlarge horsepower. We managed to scale sufficiently for our high season in 2010, although we had plenty of pain points. We chose to use EBS volumes for our database files, because at the time EBS still gave people the illusion of stability and durability. We were very soon confronted with severe performance issues, manifested as very high CPU I/O wait times, which were sometimes so high as to make the instance useless.&lt;br /&gt;&lt;br /&gt;I described in a&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2011/04/lessons-learned-from-deploying.html"&gt;previous post&lt;/a&gt;&amp;nbsp;how proficient we became at failing over from a master that went AWOL to a slave. Our issues with EBS volumes were compounded by the fact that our database access pattern is very write-intensive, and a shared medium such as EBS was far from ideal. Our devops team was constantly on the alert, and it seemed like we were always rebuilding instances and recovering from EC2 instance failures, although the end-user experience was not affected.&lt;br /&gt;&lt;br /&gt;Long story short, we decided to bring the database back in-house, at the data center, on 'real' bare-metal servers. No virtualization, thanks. The whole process went relatively smoothly. One important point I want to make here is that we already had a year's worth of hard numbers at that point regarding the access patterns to our database, IOPS, MySQL query types, etc. 
That made it easy to do proper capacity planning this time, in the presence of production traffic.&lt;br /&gt;&lt;br /&gt;We started by buying 2 Dell&amp;nbsp;&lt;a href="http://www.dell.com/us/business/p/poweredge-c2100/pd"&gt;C2100&lt;/a&gt;&amp;nbsp;servers, monster machines, with dual Intel Xeon X5650 processors (for a total of 12 cores, or 24 hardware threads), 144 GB RAM, and 12 x 1 TB hard disks out of which we prepared a 6 TB RAID 10 volume which we further divided into LVM logical volumes for specific types of MySQL files.&lt;br /&gt;&lt;br /&gt;We put 2 MySQL instances on each server, and we engaged&amp;nbsp;&lt;a href="http://www.percona.com/"&gt;Percona&lt;/a&gt;&amp;nbsp;again to help us fine-tune the configuration, this time including not only MySQL, but also the hardware and the OS. They were super helpful to us, as usual. Here are only some of the things they recommended, which we implemented:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;set vm.swappiness kernel setting to 0 in /etc/sysctl.conf&lt;/li&gt;&lt;li&gt;set InnoDB flush method to O_DIRECT because we can rely on the RAID controller to do the caching (we also mounted XFS with the nobarrier option in conjunction with this change)&lt;/li&gt;&lt;li&gt;disable MySQL query cache, which uses a global mutex that can cause performance issues when used on a multi-core server&lt;/li&gt;&lt;li&gt;various other optimizations which were dependent on our setup, things like tweaking MySQL configuration options such as key_buffer_size and innodb_io_capacity&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;One important MySQL configuration option that we had to tweak was&amp;nbsp;innodb_buffer_pool_size. If we set it too high, the server could start swapping. If we set it too low, the disk I/O on the server could become a bottleneck. Since we had 144 GB of RAM and we were running 2 MySQL instances per server, we decided to give each instance 60 GB of RAM. 
This proved to strike a good balance.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Once the fine-tuning was done, we directed production traffic away from 4 EC2 m1.xlarge instances to 2 x 2 MySQL instances, with each pair running on a C2100. We then sat back and wallowed for a while in the goodness of the I/O numbers we were observing. Basically, the servers were barely working. This is how life should be.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We soon migrated all of our MySQL masters back into the data center. We left the slaves running in EC2 (still one m1.xlarge slave per MySQL master instance), but we changed them from being EBS-backed to using the&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2011/05/setting-up-raid-0-across-ephemeral.html"&gt;local ephemeral disk in RAID 0 with LVM&lt;/a&gt;. We look at EC2 in this case as a secondary data center, used only in emergency situations.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One thing that bit us in our bare-metal setup was... a bare-metal issue around the LSI MegaRAID controllers. I already blogged about the problems we had with the&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2011/09/slow-database-check-raid-battery.html"&gt;battery relearning cycle&lt;/a&gt;, and with&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2011/10/more-fun-with-lsi-megaraid-controllers.html"&gt;decreased performance in the presence of bad drives&lt;/a&gt;. But these things were easy to fix (again thanks to our friends at Percona for diagnosing these issues correctly in the first place...)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I am happy to report that we went through our high season for 2011 without a glitch in this setup. Our devops team slept much better at night too! One nice thing about having EC2 as a 'secondary data center' is that if need be, we can scale out horizontally by launching more EC2 instances. 
In fact, we doubled the number of MySQL slave instances for the duration of our high season, with the thought that if we need to, we can double the number of shards at the application layer, and thus scale horizontally that way. Fortunately we didn't have to do any of that, but we were able to -- a strategy which would otherwise be hard to pull off if we didn't have any cloud presence, unless we bought a lot of extra capacity at the data center.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This brings me to one of the points I want to make in this post:&amp;nbsp;&lt;b&gt;it is a very valuable strategy to be able to use the cloud to roll out a new architecture (which you designed from the get-go however to be horizontally scalable) and to gauge its performance in the presence of real production traffic&lt;/b&gt;. You will get less than optimal performance per instance (because of virtualization vs. real hardware), but since you can scale horizontally, you should be able to sustain the desired level of traffic for your application. You will get hard numbers that will help you do capacity planning and you will be able to bring the database infrastructure back to real hardware if you so wish, like we did. 
Note that&amp;nbsp;&lt;a href="http://gigaom.com/2010/06/08/how-zynga-survived-farmville/"&gt;Zynga has a similar strategy&lt;/a&gt;&amp;nbsp;-- they roll out new games in EC2 and once they get a handle on how much traffic a game has, they bring it back into the data center (although it looks like they still use a private cloud and not bare metal).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another point I want to make is that&amp;nbsp;&lt;b&gt;the cloud is not ready yet for write-intensive transactional databases&lt;/b&gt;, mainly because of the very poor I/O performance that you get on virtual instances in the cloud (compounded by shared network storage such as EBS).&amp;nbsp;&lt;a href="http://perfcap.blogspot.com/"&gt;Adrian Cockcroft&lt;/a&gt;&amp;nbsp;will reply that Netflix is doing just fine and they're exclusively in EC2. I hope they are doing just fine, and I hope his devops team is getting some good sleep at night, but I'm not sure. I need to perhaps qualify my point and say that the cloud is not ready for traditional transactional databases such as MySQL and PostgreSQL, which require manual sharding to be horizontally scalable. If I had to look at redesigning our database architecture today, I'd definitely try out HBase, Riak and maybe Cassandra. The promise there at least is that adding a new node to the cluster in these technologies is much less painful than in the manual sharding and scaling scenario. This still doesn't guarantee that you won't end up paying for a lot of instances to compensate for poor individual I/O per instance. 
Maybe a cloud vendor like Joyent with their&amp;nbsp;&lt;a href="http://www.joyent.com/products/smartmachines/"&gt;SmartMachines&lt;/a&gt;&amp;nbsp;will make a difference in this area (in fact, it is on our TODO list to test out their &lt;a href="http://wiki.joyent.com/display/smart/Joyent+Percona+SmartMachine"&gt;Percona SmartMachine&lt;/a&gt;).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Note however that there's something to be said about using good ol' RDBMS technologies. Ryan Mack says this in a&amp;nbsp;&lt;a href="http://www.facebook.com/note.php?note_id=10150468255628920"&gt;Facebook Engineering post&lt;/a&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;i&gt;"After a few discussions we decided to build on four of our core technologies:&amp;nbsp;&lt;a href="https://www.facebook.com/MySQLatFacebook"&gt;MySQL/InnoDB&lt;/a&gt;&amp;nbsp;for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching.&amp;nbsp;&lt;b&gt;We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits&lt;/b&gt;."&lt;/i&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The emphasis on the last sentence is mine. It's the operational aspect of a new architecture that will kill you first. With a well understood architecture, at least you have a chance to tame it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Yet another point I'd like to make is:&amp;nbsp;&lt;b&gt;do not base your disaster recovery strategy in EC2 around EBS volumes&lt;/b&gt;, especially if you have a write-intensive database. It's not worth the performance loss, and most of all it's not worth the severe and unpredictable fluctuation in performance. 
It works much better in our experience to turn the ephemeral disks of an m1.xlarge EC2 instance into a RAID 0 array and put LVM on top of that, and use it for storing the various MySQL file types. We are then able to do LVM snapshots of that volume, and upload the snapshots to S3. To build a new slave, we can restore the snapshot from S3, then catch up the replication with the master. Works fine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;There you have it. An ode in prose to running your database on bare metal. Try it, you may sleep better at night!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-221047676093181510?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/221047676093181510/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=221047676093181510' title='10 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/221047676093181510'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/221047676093181510'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2012/01/ode-to-running-database-on-bare-metal.html' title='An ode to running a database on bare metal'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>10</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1670023335589569145</id><published>2012-01-05T11:21:00.000-08:00</published><updated>2012-01-05T11:21:06.854-08:00</updated><title type='text'>Graphing, alerting and mission control with Graphite and Nagios</title><content type='html'>&lt;br /&gt;&lt;div style="background-color: transparent;"&gt;&lt;span id="internal-source-marker_0.4769676486030221"&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;We’ve been using &lt;/span&gt;&lt;a href="http://graphite.wikidot.com/documentation" style="font-weight: bold;"&gt;&lt;span style="background-color: transparent; color: #000099; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;Graphite&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt; more and more for graphing of OS- and application-related metrics (here are some old-ish notes of mine on &lt;/span&gt;&lt;a href="http://agiletesting.blogspot.com/2011/04/installing-and-configuring-graphite.html" style="font-weight: bold;"&gt;&lt;span style="background-color: transparent; color: #000099; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;installing and configuring Graphite&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;.) 
We measure and graph variables as diverse as:&lt;/span&gt;&lt;ul style="font-weight: bold;"&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;relative and absolute state of charge of the LSI MegaRAID controller battery (why? because we’ve been burned by &lt;/span&gt;&lt;a href="http://agiletesting.blogspot.com/2011/09/slow-database-check-raid-battery.html"&gt;&lt;span style="background-color: transparent; color: #000099; vertical-align: baseline; white-space: pre-wrap;"&gt;battery issues&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt; &lt;/span&gt;&lt;a href="http://agiletesting.blogspot.com/2011/09/slow-database-check-raid-battery.html"&gt;&lt;span style="background-color: transparent; color: #000099; vertical-align: baseline; white-space: pre-wrap;"&gt;before&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;)&lt;/span&gt;&lt;/li&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;database server I/O wait time (critical for EC2 instances which are notorious for their poor I/O performance; at this point we only run MySQL slaves in EC2, and we do not &lt;/span&gt;&lt;span style="background-color: transparent; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;repeat DO NOT use EBS volumes&lt;/span&gt;&lt;span style="background-color: 
transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt; for DB servers, instead we &lt;/span&gt;&lt;a href="http://agiletesting.blogspot.com/2011/05/setting-up-raid-0-across-ephemeral.html"&gt;&lt;span style="background-color: transparent; color: #000099; vertical-align: baseline; white-space: pre-wrap;"&gt;stripe the local disks into a RAID 0 array with LVM&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;)&lt;/span&gt;&lt;/li&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;memcached stats such as current connections, get hits and misses, delete hits and misses, evictions&lt;/span&gt;&lt;/li&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;percentage and absolute number of HTTP return codes as served by nginx and haproxy&lt;/span&gt;&lt;/li&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;count of messages in various queues (our own and Amazon SQS)&lt;/span&gt;&lt;/li&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: 
transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;count of outgoing mail messages&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;We do have a large Munin install base, so we found &lt;/span&gt;&lt;a href="http://twitter.com/#!/adamhjk" style="font-weight: bold;"&gt;&lt;span style="background-color: transparent; color: #000099; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;Adam Jacob&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;’s &lt;/span&gt;&lt;a href="https://github.com/adamhjk/munin-graphite/blob/master/munin-graphite.rb" style="font-weight: bold;"&gt;&lt;span style="background-color: transparent; color: #000099; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;munin-graphite.rb&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt; script very useful in sending all data captured by Munin to Graphite.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: 
transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Why Graphite and not Munin or Ganglia? Mainly because it’s so easy to send arbitrarily named metrics to Graphite, but also because we can capture measurements at 1 second granularity (although this is possible with some tweaking with RRD-based tools as well).&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;On the Graphite server side, we set up different retention policies depending on the type of data we capture. For example, for app server logs (nginx and haproxy) we have the following retention policy specified in /opt/graphite/conf/storage-schemas.conf:&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small; font-weight: bold;"&gt;&lt;span style="background-color: transparent; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;[appserver]&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;pattern = ^appserver\.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: 
pre-wrap;"&gt;retentions = 60s:30d,10m:180d,60m:2y&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;This tells Graphite we want to keep data aggregated at 60 second intervals for 30 days, 10 minute data for 6 months and hourly data for 2 years.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;The main mechanism we use for sending data to Graphite is tailing various log files at different intervals, parsing the entries in order to extract the metrics we’re interested in, and sending those metrics to Graphite by the tried-and-true method called ‘just open a socket’.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;For example, we tail the nginx access log file via a log tracker script written in Python (and run as a daemon with supervisor), and we extract values such as the timestamp, the request 
URL, the HTTP return code, bytes sent and the request time in milliseconds. The default interval for collecting these values is 1 minute. For HTTP return codes, we group the codes such as 2xx, 3xx, 4xx, 5xx together, so we can report on each type of return code. We aggregate the values per collection interval, then send the counts to Graphite, named something like appserver.app01.500.reqs, which represents the HTTP 500 error count on server app01.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;A more elegant way would be to use a tool such as &lt;/span&gt;&lt;a href="https://github.com/etsy/logster" style="font-weight: bold;"&gt;&lt;span style="background-color: transparent; color: #000099; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;logster&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt; to capture various log entries, but we haven’t had the time to write logster plugins for the 2 main services we’re interested in, nginx and haproxy. 
Our solution is deemed temporary, but as we all know, there’s nothing more permanent than a temporary solution.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;For some more unusual metrics that we measure ourselves, such as LSI MegaRAID battery charge state, we run a shell script in an infinite loop and produce a value every second, then we send it to Graphite. To obtain the value we run something that resembles line noise:&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;$ MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 | grep -i "Relative State of Charge" | awk '{print $5}'&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;(thanks to my colleague Marco Garcia for coming up with this)&lt;/span&gt;&lt;br /&gt;&lt;span 
style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Once the data is captured in Graphite, we can do several things with it:&lt;/span&gt;&lt;ul style="font-weight: bold;"&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;visualize it using the Graphite dashboards&lt;/span&gt;&lt;/li&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;alert on it using custom Nagios plugins&lt;/span&gt;&lt;/li&gt;&lt;li style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; list-style-type: disc; text-decoration: none; vertical-align: baseline;"&gt;&lt;span style="background-color: transparent; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;capture it in our own more compact dashboards, so we can have multiple graphs displayed on one ‘mission control’ page&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: 
transparent; font-family: Arial; font-size: 15px; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Nagios plugins&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;My colleague Marco Garcia wrote a Nagios plugin in Python to alert on HTTP 500 errors. To obtain the data from Graphite, he queries a special ‘rawData’ URL of this form:&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;http://graphite.yourdomain.com/render/?from=-5minutes&amp;amp;until=-1minutes&amp;amp;target=asPercent(sum(appserver.app01.500.*.reqs),sum(appserver.app01.*.*.reqs))&amp;amp;rawData&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; 
font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;which returns something like&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;span style="background-color: transparent; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;asPercent(sumSeries(appserver.app01.500.*.reqs),sumSeries(appserver.app01.*.*.reqs)),1325778360,1325778600,60|value1,value2,value3,value4&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;where the 4 values represent the 4 data points, one per minute, from 5 minutes ago to 1 minute ago. 
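A plugin might split such a response along the following lines. This is only a minimal sketch in Python, assuming a single-target response in the format shown above; the function name parse_raw_data and the treatment of the literal string "None" as a missing data point are illustrative assumptions, not Marco's actual code:

```python
def parse_raw_data(line):
    """Split one Graphite rawData line into (target, start, end, step, values).

    Hypothetical helper: assumes the 'target,start,end,step|v1,v2,...' layout
    shown above, with 'None' standing in for a missing data point.
    """
    # Everything before the last '|' is the header, everything after is the data.
    header, _, payload = line.strip().rpartition("|")
    # The target expression itself contains commas, so split from the right.
    target, start, end, step = header.rsplit(",", 3)
    values = [None if v == "None" else float(v) for v in payload.split(",")]
    return target, int(start), int(end), int(step), values
```

Splitting from the right keeps the commas inside the asPercent(...) expression intact, and the number of values should equal (end - start) / step, which is 4 for the example above.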
Each data point is the percentage of HTTP 500 errors calculated against the total number of HTTP requests.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;The script then compares the values returned with a warning threshold (0.5) and a critical threshold (1.0). If all values are greater than the respective threshold, we issue a warning or a critical Nagios alert.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Mission control dashboard&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;My colleague Josh Frederick came up with the idea of presenting all relevant graphs based on data captured in Graphite in a single page which he dubbed ‘mission control’. 
This enables us to make correlations at a glance between things such as increased I/O wait on DB servers and spikes in HTTP 500 errors.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;To generate a given graph, Josh uses some JavaScript magic to come up with a URL such as:&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;http://graphite.yourdomain.com/render/?width=500&amp;amp;height=200&amp;amp;hideLegend=1&amp;amp;from=-60minutes&amp;amp;until=-0minutes&amp;amp;bgcolor=FFFFFF&amp;amp;fgcolor=000000&amp;amp;areaMode=stacked&amp;amp;title=500s%20as%20%25&amp;amp;target=asPercent(sum(appserver.app*.500.*.reqs),%20sum(appserver.app*.*.*.reqs))&amp;amp;_ts=1325783046438&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: 
pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;which queries Graphite for the percentage of HTTP 500 errors from all HTTP requests across all app servers for the last 60 minutes. The resulting graph looks like this:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;b&gt;&lt;img height="200px;" src="https://lh6.googleusercontent.com/OdcV3XW_d20gFlrHpx-rY0Q45BVWtlEuv2biJjw0P1KJ3D1uenr6oq4eHVCu964VRfKGdIn9lYdQ0sizPBaJtHI-44OPiJcsAXMoawxq_2fde5ursZQ" width="500px;" /&gt;&lt;/b&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Our mission control page currently has 29 such graphs.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: 
baseline; white-space: pre-wrap;"&gt;We also have (courtesy of Josh again) a different set of charts based on the &lt;/span&gt;&lt;a href="http://code.google.com/p/google-visualization-python/" style="font-weight: bold;"&gt;&lt;span style="background-color: transparent; color: #000099; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;Google Visualization Python API&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;. We get the data from Graphite in CSV format (by adding format=csv to the Graphite query URLs), then we display the data using the Google Visualization API.&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;If you don’t want to roll your own JS-based dashboard talking to Graphite, there’s a tool called &lt;/span&gt;&lt;a href="https://github.com/potch/statsdash" style="font-weight: bold;"&gt;&lt;span style="background-color: transparent; color: #000099; font-family: Arial; font-size: 15px; font-weight: normal; vertical-align: baseline; white-space: pre-wrap;"&gt;statsdash&lt;/span&gt;&lt;/a&gt;&lt;span style="background-color: transparent; font-family: Arial; font-size: 15px; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt; that may be of help.&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/9238405-1670023335589569145?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1670023335589569145/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1670023335589569145' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1670023335589569145'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1670023335589569145'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2012/01/graphing-alerting-and-mission-control.html' title='Graphing, alerting and mission control with Graphite and Nagios'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-6684308450060771316</id><published>2011-12-22T13:02:00.000-08:00</published><updated>2011-12-22T13:02:10.010-08:00</updated><title type='text'>Load balancing and SSL in EC2</title><content type='html'>Here is another &lt;a href="http://sysadvent.blogspot.com/2011/12/day-22-load-balancing-solutions-on-ec2.html"&gt;post&lt;/a&gt; I wrote for this year's &lt;a href="http://sysadvent.blogspot.com/"&gt;Sysadvent&lt;/a&gt; blog. It briefly mentions some ways you can do load balancing in EC2, and focuses on how to upload SSL certificates to an Elastic Load Balancer using command-line tools. 
Any comments appreciated!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-6684308450060771316?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/6684308450060771316/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=6684308450060771316' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/6684308450060771316'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/6684308450060771316'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/12/load-balancing-and-ssl-in-ec2.html' title='Load balancing and SSL in EC2'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1095247236879638401</id><published>2011-12-12T09:18:00.000-08:00</published><updated>2011-12-12T09:18:05.059-08:00</updated><title type='text'>Analyzing logs with Pig and Elastic MapReduce</title><content type='html'>This is a&lt;a href="http://sysadvent.blogspot.com/2011/12/day-10-analyzing-logs-with-pig-and.html"&gt; blog post&lt;/a&gt; I wrote for this year's &lt;a href="http://sysadvent.blogspot.com/"&gt;Sysadvent&lt;/a&gt; series. If you're not familiar with the Sysadvent blog, you should be, if you are at all interested in system administration/devops topics. 
It is maintained by the indefatigable &lt;a href="http://www.semicomplete.com/"&gt;Jordan Sissel&lt;/a&gt;, and it contains posts contributed by various authors, 25 posts per year, from Dec 1st through Dec 25th. The posts cover a variety of topics, and to give you a taste, here are the articles so far this year:&lt;br /&gt;&lt;br /&gt;Day 1: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-1-dont-bash-your-process-outputs.html"&gt;Don't bash your process outputs&lt;/a&gt;" by Phil Hollenback&lt;br /&gt;Day 2: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-2-strategies-for-java-deployment.html"&gt;Strategies for Java deployment&lt;/a&gt;" by Kris Buytaert&lt;br /&gt;Day 3: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-3-share-skills-and-permissions-with.html"&gt;Share skills and permissions with code&lt;/a&gt;" by Jordan Sissel&lt;br /&gt;Day 4: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-4-guide-to-packaging-systems.html"&gt;A guide to packaging systems&lt;/a&gt;" by Jordan Sissel&lt;br /&gt;Day 5: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-5-tracking-requests-with-request.html"&gt;Tracking requests with Request Tracker&lt;/a&gt;" by Christopher Webber&lt;br /&gt;Day 6: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-6-always-be-hacking.html"&gt;Always be hacking&lt;/a&gt;" by John Vincent&lt;br /&gt;Day 7: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-7-change-and-proximity-of.html"&gt;Change and proximity of communication&lt;/a&gt;" by Aaron Nichols&lt;br /&gt;Day 8: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-8-running-services-with-systemd.html"&gt;Running services with systemd&lt;/a&gt;" by Jordan Sissel&lt;br /&gt;Day 9: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-9-data-in-shell.html"&gt;Data in the shell&lt;/a&gt;" by Jordan Sissel&lt;br /&gt;Day 10: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-10-analyzing-logs-with-pig-and.html"&gt;Analyzing logs with Pig and 
Elastic MapReduce&lt;/a&gt;" by yours truly&lt;br /&gt;Day 11: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-11-simple-disk-based-server-backups.html"&gt;Simple disk-based server backups with rsnapshot&lt;/a&gt;" by Phil Hollenback&lt;br /&gt;Day 12: "&lt;a href="http://sysadvent.blogspot.com/2011/12/day-12-reverse-engineer-servers-with.html"&gt;Reverse-engineer servers with Blueprint&lt;/a&gt;" by Richard Crowley&lt;br /&gt;&lt;br /&gt;Jordan needs more articles for this year, so if you have something to contribute, please propose it on the &lt;a href="http://groups.google.com/group/sysadvent"&gt;mailing list&lt;/a&gt;.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1095247236879638401?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1095247236879638401/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1095247236879638401' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1095247236879638401'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1095247236879638401'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/12/analyzing-logs-with-pig-and-elastic.html' title='Analyzing logs with Pig and Elastic MapReduce'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-6441107211319853379</id><published>2011-12-07T11:13:00.001-08:00</published><updated>2011-12-07T11:44:26.791-08:00</updated><title type='text'>Crowd Mood - an indicator of health for products/projects</title><content type='html'>I thought I just coined a new term -- &lt;b&gt;Crowd Mood --&lt;/b&gt;&amp;nbsp;but a quick Google search revealed a 2009 paper on "Crowd Behavior at Mass Gatherings: A&amp;nbsp;Literature Review" (&lt;a href="http://pdm.medicine.wisc.edu/Volume_24/issue_1/zeitz.pdf"&gt;PDF&lt;/a&gt;) which says:&lt;br /&gt;&lt;br /&gt;&lt;i&gt;In the mass-gathering literature, the use of terms&amp;nbsp;“crowd behavior”, “crowd type”, “crowd management”, and “crowd mood” are&amp;nbsp;used in variable contexts. More practically, the term “crowd mood” has become&amp;nbsp;&lt;/i&gt;&lt;i&gt;an accepted measure of probable crowd behavior outcomes.&amp;nbsp;This is particularly true in the context of crowds during protests/riots, where attempts have&amp;nbsp;been made to identify factors that lead to a change of mood that may underpin more violent behavior.&lt;/i&gt;&lt;br /&gt;&lt;br /&gt;Instead of protests and riots, the crowd behavior I'm referring to is the reaction of users of software products or projects. I think the overall Crowd Mood of these users is a good indicator of the health of those products/projects. I may state the obvious here, and maybe it's been done already, but I'm not aware of large-scale studies that try to correlate the success or failure of a particular software project with the mood of its users, as expressed in messages to mailing lists, in blog posts, and of course Twitter.&lt;br /&gt;&lt;br /&gt;I'm aware of&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Sentiment_analysis"&gt;Sentiment Analysis&lt;/a&gt;&amp;nbsp;and I know there are companies who offer this service by mining Twitter. 
But a more rigorous study would include other data sources. I have in mind something similar to this study by Kalev Leetaru: "&lt;a href="http://www.uic.edu/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/3663/3040"&gt;Culturomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space&lt;/a&gt;". This study mined information primarily from the archives of the&amp;nbsp;"Summary of World Broadcasts (SWB)" global news monitoring service. It analyzed the tone or mood of the news regarding a particular event/person/place, and it established correlations between the sentiment it mined (for example negative news regarding Egypt's then-president Mubarak) and events that happened shortly afterwards, such as the Arab Spring of 2011. The study actually talks not only about correlation, but also about forecasting events based on current news tone.&lt;br /&gt;&lt;br /&gt;I believe similar studies mining the Crowd Mood would be beneficial to any large-scale software product or project. For example, Canonical would be well-advised to conduct such a study in order to determine whether their decision to drop GNOME in favor of Unity was good or not (my feeling? it was BAD! -- and I think the Crowd Mood surrounding this decision would confirm it).&lt;br /&gt;&lt;br /&gt;Another example: Python 3 and the decision not to continue releasing Python 2 past 2.7. Good or bad? I say BAD! (see also Armin Ronacher's recent &lt;a href="http://lucumr.pocoo.org/2011/12/7/thoughts-on-python3/"&gt;blog post&lt;/a&gt; on the subject, which brilliantly expresses the issues around this decision).&lt;br /&gt;&lt;br /&gt;Yet one more example: the recent changes in the Google UI, especially Gmail. BAD!&lt;br /&gt;&lt;br /&gt;These examples have a common theme in my opinion: the unilateral decision by a company or project to make non-backwards-compatible changes without really consulting its users. 
I know Apple can pull this off, but they're the exception, not the rule. The attitude of "trust us, we know what's good for you" leads to failure in the long run.&lt;br /&gt;&lt;br /&gt;Somebody should build a product (or even better, an Open Source project) around Crowd Mood analysis. Maybe based on Google's Prediction API, which has &lt;a href="http://e.google.com/apis/predict/docs/sentiment_analysis.html"&gt;functionality for sentiment analysis&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-6441107211319853379?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/6441107211319853379/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=6441107211319853379' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/6441107211319853379'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/6441107211319853379'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/12/crowd-mood-indicator-of-health-for.html' title='Crowd Mood - an indicator of health for products/projects'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-4949175369966639603</id><published>2011-11-17T12:56:00.001-08:00</published><updated>2011-11-17T14:43:45.353-08:00</updated><title type='text'>Seven years of blogging in less than 500 words</title><content 
type='html'>&lt;br /&gt;&lt;div style="margin-bottom: 0in;"&gt;In a couple of days, this blog will be 7 years old. Hard to believe so much time has passed since my first&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2004/11/hello-world.html"&gt;Hello World&lt;/a&gt; post. &lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;I keep reading that blogging is on the wane, and it's true, mainly because of the popularity of Twitter. But I strongly believe that blogging is still important, and that more people should do it. For me, it's a way to give back to the community. I can't even remember how many times I found solutions to my technical problems by reading a blog post. I personally try to post something on my blog every single time I solve an issue that I've struggled with. If you post documentation to a company wiki (assuming it's not confidential), I urge you to try to also blog publicly about it – think of it as public documentation that can help both you and others. &lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;Blogging is also a great way to further your career. Back in September 2008 I blogged about &lt;a href="http://agiletesting.blogspot.com/2008/09/experiences-with-amazon-ec2-and-ebs.html"&gt;my experiences with EC2&lt;/a&gt;. I had launched an m1.small instance and I had started to play with it. Lo and behold, I got a job offer based on that post. I accepted the offer, and that job allowed me to greatly enhance my experience with cloud computing. This brings me to a more general point: if you want to work on something you know nothing about, start small on your own, persevere, and yes, blog about it! 
In my experience, you'll invariably find yourself in a position to be paid to work on it.&lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;It's also pretty interesting for me to look at my blog posts and to recognize in them the evolution of my own career. I started to blog when I was working as a tester. I was passionate about &lt;a href="http://agiletesting.blogspot.com/2004/11/writing-fitnesse-tests-in-python.html"&gt;automated testing&lt;/a&gt;, &lt;a href="http://agiletesting.blogspot.com/2005/01/python-unit-testing-part-3-pytest-tool.html"&gt;Python&lt;/a&gt; and &lt;a href="http://agiletesting.blogspot.com/2006/04/should-acceptance-tests-be-included-in.html"&gt;agile processes and tools&lt;/a&gt;. This led to digging more into automation tools, then into &lt;a href="http://agiletesting.blogspot.com/2010/03/automated-deployments-with-fabric-tips.html"&gt;automated deployments&lt;/a&gt; and &lt;a href="http://agiletesting.blogspot.com/2010/07/working-with-chef-cookbooks-and-roles.html"&gt;configuration management&lt;/a&gt; in the context of &lt;a href="http://agiletesting.blogspot.com/2010/11/how-to-whip-your-infrastructure-into.html"&gt;system architecture&lt;/a&gt;. This led into &lt;a href="http://agiletesting.blogspot.com/2011/03/working-in-multi-cloud-with-libcloud.html"&gt;cloud computing&lt;/a&gt;, and these days this is leading into &lt;a href="http://agiletesting.blogspot.com/2011/11/experiences-with-amazon-elastic.html"&gt;Big Data analytics&lt;/a&gt;. Who knows what's next? Whatever it is, I'm sure it will be exciting, because I make an effort to make it so!&lt;br /&gt;&lt;br /&gt;I think it's also important to recognize that learning a tool or a technique is necessary but not sufficient: at the end of the day it's what you do with it that matters. For me the progression has been testing-&amp;gt;automation-&amp;gt;cloud computing-&amp;gt;data analytics-&amp;gt;??? 
(I actually started my career as a programmer, but here I include only the period coinciding with my blogging years.)&lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;This may be self-help-kind-of-corny, but a quote which is attributed (maybe &lt;a href="http://www.goethesociety.org/pages/quotescom.html"&gt;wrongly&lt;/a&gt;) to Goethe really applies to everybody:&lt;/div&gt;&lt;div style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;“Whatever you can do, or dream you can do, begin it. Boldness has genius, power, and magic in it. Begin it now.”&lt;br /&gt;&lt;div style="margin-bottom: 0in;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-4949175369966639603?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/4949175369966639603/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=4949175369966639603' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/4949175369966639603'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/4949175369966639603'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/11/seven-years-of-blogging-in-less-than.html' title='Seven years of blogging in less than 500 words'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-5094233647771441612</id><published>2011-11-09T10:57:00.000-08:00</published><updated>2011-11-09T10:57:53.850-08:00</updated><title type='text'>Troubleshooting memory allocation errors in Elastic MapReduce</title><content type='html'>Yesterday we ran into an issue with some Hive scripts running within an Amazon Elastic MapReduce cluster. Here's the error we got:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre&gt;Caused by: java.io.IOException: Spill failed&lt;br /&gt; at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:877)&lt;br /&gt; at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:474)&lt;br /&gt; at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:289)&lt;br /&gt; ... 11 more&lt;br /&gt;Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory&lt;br /&gt; at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)&lt;br /&gt; at org.apache.hadoop.util.Shell.runCommand(Shell.java:176)&lt;br /&gt; at org.apache.hadoop.util.Shell.run(Shell.java:161)&lt;br /&gt; at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)&lt;br /&gt; at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)&lt;br /&gt; at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)&lt;br /&gt; at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)&lt;br /&gt; at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1238)&lt;br /&gt; at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:703)&lt;br /&gt; at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1190)&lt;br /&gt;Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate 
memory&lt;br /&gt; at java.lang.UNIXProcess.&amp;lt;init&amp;gt;(UNIXProcess.java:148)&lt;br /&gt; at java.lang.ProcessImpl.start(ProcessImpl.java:65)&lt;br /&gt; at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;Googling around for&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: monospace; white-space: pre;"&gt;&lt;b&gt;java.io.IOException: java.io.IOException: error=12&lt;/b&gt;, Cannot allocate memory&lt;/span&gt; shows that this is a common problem. See this AWS Developer Forums &lt;a href="https://forums.aws.amazon.com/thread.jspa?threadID=49024"&gt;thread&lt;/a&gt;, this Hadoop core-user mailing list &lt;a href="http://old.nabble.com/Cannot-run-program-%22bash%22:-java.io.IOException:-error=12,-Cannot-allocate-memory-td19891450.html"&gt;thread&lt;/a&gt;, and this &lt;a href="http://tech.groups.yahoo.com/group/bixo-dev/message/801"&gt;explanation&lt;/a&gt; by Ken Krugler from Bixo Labs.&lt;br /&gt;&lt;br /&gt;Basically, it boils down to the fact that when Java tries to fork a new process (in this case a bash shell), Linux will try to allocate as much memory as the current Java process, even though not all that memory will be required. There are several workarounds (read in particular the AWS Forum thread), but a solution that worked for us was to simply add swap space to the Elastic MapReduce slave nodes.&lt;br /&gt;&lt;br /&gt;You can ssh into a slave node from the EMR master node by using the same private key you used when launching the EMR cluster, and by targeting the internal IP address of the slave node. In our case, the slaves are m1.xlarge instances, and they have 4 local disks (/dev/sdb through /dev/sde) mounted as /mnt, /mnt1, /mnt2 and /mnt3, with 414 GB available on each file system. 
I ran this simple script via sudo on each slave to add 4 swap files of 1 GB each, one on each of the 4 local disks.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;$ cat make_swap.sh&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#!/bin/bash&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;SWAPFILES='&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;/mnt/swapfile1&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;/mnt1/swapfile1&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;/mnt2/swapfile1&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;/mnt3/swapfile1&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;'&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;for SWAPFILE in $SWAPFILES; do&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;dd if=/dev/zero of=$SWAPFILE bs=1024 count=1048576&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 
'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;mkswap $SWAPFILE&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;swapon $SWAPFILE&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;echo "$SWAPFILE swap swap defaults 0 0" &amp;gt;&amp;gt; /etc/fstab&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;done&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This solved our issue. No more failed Map tasks, no more failed Reduce tasks. Maybe this will be of use to some other frantic admins out there (like I was yesterday) who are not sure how to troubleshoot the intimidating Hadoop errors they're facing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-5094233647771441612?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/5094233647771441612/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=5094233647771441612' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5094233647771441612'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5094233647771441612'/><link rel='alternate' type='text/html' 
href='http://agiletesting.blogspot.com/2011/11/troubleshooting-memory-allocation.html' title='Troubleshooting memory allocation errors in Elastic MapReduce'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2558118378473105113</id><published>2011-11-04T13:55:00.000-07:00</published><updated>2011-11-04T13:55:09.872-07:00</updated><title type='text'>Experiences with Amazon Elastic MapReduce</title><content type='html'>&lt;br /&gt;We started to use AWS Elastic MapReduce (EMR) in earnest a short time ago, with the help of&amp;nbsp;&lt;a href="http://twitter.com/#!/lusciouspear"&gt;Bradford Stephens&amp;nbsp;&lt;/a&gt;from&amp;nbsp;&lt;a href="http://drawntoscale.com/"&gt;Drawn to Scale&lt;/a&gt;. We needed somebody to jumpstart our data analytics processes and workflow, and Bradford's help was invaluable. At some point we'll probably build our own Hadoop cluster either in EC2 or in-house, but for now EMR is doing the job just fine.&lt;br /&gt;&lt;br /&gt;We started with an EMR cluster containing the master + 5 slave nodes, all m1.xlarge. We still have that cluster up and running, but in the meantime I've also experimented with launching clusters, running our data analytics processes on them, then shutting them down -- which is the 'elastic' type of workflow that takes full advantage of the pay-per-hour model of EMR.&lt;br /&gt;&lt;br /&gt;Before I go into the details of launching and managing EMR clusters, here's the general workflow that we follow for our data analytics processes on a nightly basis:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;We gather data from various sources such as production databases, ad impression reports, mail server logs, and other third-party sources. 
All this data is in CSV format, mostly comma-separated or tab-separated. Each CSV file is timestamped with YYYY-MM-DD and corresponds to a 'table' or 'entity' that we want to analyse later.&lt;/li&gt;&lt;li&gt;We gzip all CSV files and upload them to various S3 buckets.&lt;/li&gt;&lt;li&gt;On the EMR Hadoop master node, we copy the csv.gz files we need from S3 into HDFS.&lt;/li&gt;&lt;li&gt;We create Hive tables, one table for each type of 'entity'.&lt;/li&gt;&lt;li&gt;We run Hive queries against these tables and save the results in HDFS.&lt;/li&gt;&lt;li&gt;We export the results of the Hive queries to MySQL so we can further analyse them when we need to, and so we can visualize them in a dashboard.&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;I think this is a fairly common workflow for doing data analytics with Hadoop. It can definitely be optimized. Currently we create the Hive tables from scratch in step 4. Ideally we'll want to save them to S3, and only append new data to them, but the append operation seems to only exist in Hive 0.8 which is not yet available in EMR. But as suboptimal as it is, even if it takes a few hours each night, this process allows us to run queries that were simply impossible to execute outside of Hadoop.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here are some experiments I've done with using EMR in its truly 'elastic' mode.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Installing the EMR Ruby CLI&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Eventually we'll use something like&amp;nbsp;&lt;a href="https://github.com/boto/boto"&gt;boto&lt;/a&gt;'s EMR bindings to manage our EMR clusters, but for quick experimentation I preferred a purely command-line tool, and the Ruby-based elastic-mapreduce tool seemed to be the only one available. 
To install, download the zip file from&amp;nbsp;&lt;a href="http://aws.amazon.com/developertools/Elastic-MapReduce/2264"&gt;here&lt;/a&gt;, then unzip it somewhere on an EC2 instance where you can store your AWS credentials (a management-type instance usually). I installed in /opt/emr on one of our EC2 instances. At this point it's also a good idea to become familiar with the&amp;nbsp;&lt;a href="http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/"&gt;EMR Developer Guide&lt;/a&gt;, which has examples of various elastic-mapreduce use cases. I also found a good&amp;nbsp;&lt;a href="https://github.com/tc/elastic-mapreduce-ruby"&gt;README&lt;/a&gt;&amp;nbsp;on GitHub.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Next, create a credentials.json file containing some information about your AWS credentials and the keypair that will be used when launching the EMR cluster. The format of this JSON file is:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; "access-id": "YOUR_AWS_ACCESS_ID",&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; "private-key": "YOUR_AWS_SECRET_KEY",&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; "key-pair": "YOUR_EC2_KEYPAIR_NAME",&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; "key-pair-file": "PATH_TO_PRIVATE_SSH_KEY_CORRESPONDING_TO_KEYPAIR",&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" 
style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; "region": "us-east-1",&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; "log-uri": "s3://somebucket.yourcompany.com/logs"&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Launching an EMR cluster&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span id="internal-source-marker_0.10861712298355997" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./elastic-mapreduce -c /opt/emr/credentials.json --create --name "test1" --alive --num-instances 3 --master-instance-type m1.small --slave-instance-type m1.small --hadoop-version 0.20 --hive-interactive --hive-versions 0.7.1 &lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;The command line above launches an EMR cluster called test1 with 3 instances (1 Hadoop master and 2 slave nodes), installs Hadoop 0.20 and Hive 0.7.1 on it, then keeps the cluster up and running (because we specified --alive). For experimentation purposes I recommend using m1.small instances.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This command returns a jobflow ID, which you'll 
need for all other commands that reference this specific cluster.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Getting information about a specific jobflow&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./elastic-mapreduce --describe --jobflow JOBFLOWID&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This command returns a JSON document containing a wealth of information about the EMR cluster. Here's an&amp;nbsp;&lt;a href="http://./elastic-mapreduce%20--describe%20--jobflow%20j-"&gt;example output&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Listing the state of a jobflow&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#&amp;nbsp;./elastic-mapreduce --list --jobflow JOBFLOWID&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;JOBFLOWID&amp;nbsp; &amp;nbsp; &amp;nbsp;WAITING &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;ec2-A-B-C-D.compute-1.amazonaws.com &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; test1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp;COMPLETED &amp;nbsp; &amp;nbsp; &amp;nbsp;Setup Hive &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This command is useful when you're trying to ascertain whether the initial configuration of the cluster is done. 
The state of the cluster immediately after you launch it will change from STARTING to RUNNING (during which step Hadoop and Hive are installed) and finally to WAITING.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Enabling and disabling termination protection for a jobflow&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;If you want to make sure that the jobflow won't be terminated, you can turn the termination protection on (it's off by default):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span id="internal-source-marker_0.10861712298355997" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./elastic-mapreduce --set-termination-protection true --jobflow JOBFLOWID&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;To disable the termination protection, set it to false.&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div 
style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;b&gt;Adding core nodes to the EMR cluster&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;There are 2 types of Hadoop slave nodes: core (which contribute to the HDFS cluster) and task (which run Hadoop tasks but are not part of the HDFS cluster). If you want to add core nodes, you can use this command:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;div style="background-color: transparent;"&gt;&lt;span id="internal-source-marker_0.10861712298355997" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./elastic-mapreduce --modify-instance-group CORE --instance-count NEW_COUNT&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: 
none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;where NEW_COUNT is the new overall count of core nodes that you are targeting (so for example if you had 5 core nodes and you wanted to add 2 more, the NEW_COUNT will be 7).&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Note that you can only request an increased NEW_COUNT for core nodes, never a decreased count.&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Also note that if one or more slave nodes are misbehaving (more on that in a bit), it's better to terminate them (via ElasticFox or the AWS console for example) than to add new core 
nodes. When you terminate them, the EMR jobflow will automatically launch extra nodes so that the core node count is kept the same.&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;b&gt;Accessing the Hadoop master node and working with the HDFS cluster&lt;/b&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;You can ssh into the Hadoop master by using the private key you specified in credentials.json.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; 
text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ssh -i /path/to/private_ssh_key hadoop@public_ip_or_dns_of_master&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;Or you can use:&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="white-space: pre-wrap;"&gt;# ./elastic-mapreduce --jobflow JOBFLOWID --ssh&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;Once you're logged in as user hadoop on the master node, you can run various 
HDFS commands such as:&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;creating an HDFS directory: &lt;/span&gt;&lt;span class="Apple-style-span" style="white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;$ hadoop fs -mkdir /user/hadoop/mydir&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;copying files from S3 into HDFS: &lt;/span&gt;&lt;span class="Apple-style-span" style="white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;$ hadoop fs -cp s3://somebucket.yourcompany.com/dir1/data*gz /user/hadoop/mydir&lt;/span&gt;&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;listing files in an HDFS directory: &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New'; font-size: 12px; white-space: pre-wrap;"&gt;$ hadoop fs -ls /user/hadoop/mydir&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;deleting files in an HDFS directory: &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New'; font-size: 12px; white-space: pre-wrap;"&gt;$ hadoop fs -rm /user/hadoop/*.gz&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span" style="font-family: Arial;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;copying files from HDFS to a local file system on the master 
node: &lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New'; font-size: 12px; white-space: pre-wrap;"&gt;$ hadoop fs -copyToLocal /user/hadoop/mydir/*.gz /home/hadoop/somedir&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;There is also a very useful admin-type command which allows you to see the state of the slave nodes, and the HDFS file system usage. Here's an example which I ran on a cluster with 3 slave nodes (I added the 3rd node at a later time, which is why its DFS usage is less than the usage on the first 2 nodes):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;$ hadoop dfsadmin -report&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Configured Capacity: 5319863697408 (4.84 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Present Capacity: 5052263763968 (4.6 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Remaining: 4952298123264 (4.5 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used: 99965640704 (93.1 GB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used%: 1.98%&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Under 
replicated blocks: 3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Blocks with corrupt replicas: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Missing blocks: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;-------------------------------------------------&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Datanodes available: 3 (3 total, 0 dead)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Name: 10.68.125.161:9200&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Decommission Status : Normal&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Configured Capacity: 1773287899136 (1.61 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used: 47647641600 (44.38 GB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Non DFS Used: 89327800320 (83.19 
GB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Remaining: 1636312457216(1.49 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used%: 2.69%&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Remaining%: 92.28%&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Last contact: Fri Nov 04 20:31:31 UTC 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Name: 10.191.121.76:9200&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Decommission Status : Normal&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Configured Capacity: 1773287899136 (1.61 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used: 49191796736 (45.81 GB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Non DFS Used: 89329020928 (83.19 GB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Remaining: 1634767081472(1.49 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used%: 2.77%&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Remaining%: 92.19%&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Last contact: Fri Nov 04 20:31:29 UTC 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Name: 10.68.105.36:9200&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Decommission Status : Normal&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Configured Capacity: 1773287899136 (1.61 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used: 3126202368 (2.91 GB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Non DFS Used: 88943112192 (82.83 GB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier 
New', Courier, monospace; font-size: x-small;"&gt;DFS Remaining: 1681218584576(1.53 TB)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Used%: 0.18%&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;DFS Remaining%: 94.81%&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Last contact: Fri Nov 04 20:31:30 UTC 2011&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When there's something wrong with any slave data node, e.g. it can't be contacted by the master, then the number of dead nodes will be non-zero, and the 'Last contact' date for that slave will be off compared to the healthy nodes.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another useful admin command is&amp;nbsp;&lt;/div&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;$ hadoop fsck /&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;which should report HEALTHY:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Status: HEALTHY&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Total size:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;99648130222 B&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Total dirs:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;76&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace; font-size: x-small;"&gt;&amp;nbsp;Total files:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;5044 (Files currently being written: 1)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Total blocks (validated):&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;5215 (avg. block size 19107982 B) (Total open file blocks (not validated): 1)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Minimally replicated blocks:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;5215 (100.0 %)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Over-replicated blocks:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;0 (0.0 %)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Under-replicated blocks:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;3 (0.057526365 %)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Mis-replicated blocks:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;  &lt;/span&gt;0 (0.0 %)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Default replication factor:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt; &lt;/span&gt;1&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Average block replication:&lt;span class="Apple-tab-span" 
style="white-space: pre;"&gt; &lt;/span&gt;1.0011505&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Corrupt blocks:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;  &lt;/span&gt;0&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Missing replicas:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;  &lt;/span&gt;21 (0.4022218 %)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Number of data-nodes:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;  &lt;/span&gt;3&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp;Number of racks:&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;  &lt;/span&gt;1&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;The filesystem under path '/' is HEALTHY&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Terminating a jobflow&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As you would expect, the command is:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#&amp;nbsp;./elastic-mapreduce --terminate --jobflow JOBFLOWID&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;b&gt;Inspecting Hive logs&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After running a Hive query on the Hadoop master node, you can inspect the logs created in this directory (this is for Hive 0.7.1):&amp;nbsp;&lt;span class="Apple-style-span" style="white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;/mnt/var/lib/hive_07_1/tmp/history&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Putting it all together&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I put many of these pieces together in one script (see gist&amp;nbsp;&lt;a href="https://gist.github.com/1340429"&gt;here&lt;/a&gt;) that I run every night from the EC2 management instance, and that does the following:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;Launches an EMR cluster with the desired instance types for the master node and the slave nodes, and with the desired instance count (this count includes the master node); the command line also specifies that Hadoop 0.20 and Hive 0.7.1 need to be installed, and it keeps the cluster up and running via the --alive option.&lt;/li&gt;&lt;li&gt;Waits in a loop until the state of the jobflow is WAITING, sleeping 10 seconds between checks.&lt;/li&gt;&lt;li&gt;Retrieves the public DNS name of the Hadoop master from the JSON description of the new jobflow.&lt;/li&gt;&lt;li&gt;Copies a local directory containing the scripts that will be run on the master from the management node to the Hadoop master.&lt;/li&gt;&lt;li&gt;Runs a script (called run_hive_scripts.sh) on the Hadoop master via ssh; this script does all the further Hadoop and Hive processing, which includes&lt;/li&gt;&lt;ul&gt;&lt;li&gt;creation of HDFS directories&lt;/li&gt;&lt;li&gt;copying of csv.gz files from S3 into 
HDFS&lt;/li&gt;&lt;li&gt;creation of Hive files&lt;/li&gt;&lt;li&gt;running of Hive queries&lt;/li&gt;&lt;li&gt;saving of the Hive query output to a local directory on the Hadoop master&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Retrieves the Hive output files by scp-ing them back from the Hadoop master&lt;/li&gt;&lt;li&gt;Terminates the EMR cluster&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;At this point, the Hive output files are processed and the data is inserted into a MySQL instance for further analysis and visualization.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That's about it for now. We have a lot more work to do before we declare ourselves satisfied with the state of our data analytics platform, and I'll blog more about it as soon as we cross more things off our todo list.&lt;/div&gt;&lt;br /&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2558118378473105113/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2558118378473105113' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2558118378473105113'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2558118378473105113'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/11/experiences-with-amazon-elastic.html' title='Experiences with Amazon Elastic MapReduce'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-420736299649520371</id><published>2011-10-28T12:12:00.000-07:00</published><updated>2011-10-28T12:22:23.886-07:00</updated><title type='text'>More fun with the LSI MegaRAID controllers</title><content type='html'>As I mentioned in a &lt;a href="http://agiletesting.blogspot.com/2011/09/slow-database-check-raid-battery.html"&gt;previous post&lt;/a&gt;, we've had some issues with the LSI MegaRAID controllers on our Dell C2100 database servers. Previously we noticed periodic slowdowns of the databases related to decreased I/O throughput. It turned out it was the LSI RAID battery going through its relearning cycle.&lt;br /&gt;&lt;br /&gt;Last night we got paged again because of increased load on one of the Dell C2100s. The load average went up to 25, when typically it's between 1 and 2. It turned out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be OK even with a bad drive, but the drive didn't go completely offline, so the controller kept trying to service it and failing. This reduced the I/O throughput on the server and slowed down our database.&lt;br /&gt;&lt;br /&gt;For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this &lt;a href="http://agiletesting.blogspot.com/2011/09/slow-database-check-raid-battery.html"&gt;post&lt;/a&gt; for how to install them).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Check RAID event log for errors&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;This will save all RAID-related events since the last reboot to events.log. 
In our case, we noticed these lines:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;===========&lt;br /&gt;Device ID: 11&lt;br /&gt;Enclosure Index: 12&lt;br /&gt;Slot Number: 11&lt;br /&gt;Error: 3&lt;br /&gt;&lt;br /&gt;seqNum: 0x00001654&lt;br /&gt;Time: Fri Oct 28 04:45:26 2011&lt;br /&gt;&lt;br /&gt;Code: 0x00000071&lt;br /&gt;Class: 0&lt;br /&gt;Locale: 0x02&lt;br /&gt;Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path&lt;br /&gt;5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02&lt;br /&gt;Event Data:&lt;br /&gt;===========&lt;br /&gt;Device ID: 11&lt;br /&gt;Enclosure Index: 12&lt;br /&gt;Slot Number: 11&lt;br /&gt;CDB Length: 10&lt;br /&gt;CDB Data:&lt;br /&gt;002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000&lt;br /&gt;0000 0000 Sense Length: 18&lt;br /&gt;Sense Data:&lt;br /&gt;0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002&lt;br /&gt;0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000&lt;br /&gt;0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000&lt;br /&gt;0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000&lt;br /&gt;0000 0000 0000 0000 0000 0000 0000 0000&lt;br /&gt;&lt;br /&gt;seqNum: 0x00001655&lt;br /&gt;&lt;br /&gt;Time: Fri Oct 28 04:45:26 2011&lt;br /&gt;&lt;br /&gt;Code: 0x00000060&lt;br /&gt;Class: 1&lt;br /&gt;Locale: 0x02&lt;br /&gt;Event Description: Predictive failure: PD 0b(e0x0c/s11)&lt;br /&gt;Event Data:&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Check state of a particular physical drive&lt;/b&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You need the &lt;b&gt;enclosure index&lt;/b&gt; (12 in the output above) and the &lt;b&gt;slot number&lt;/b&gt; (11 in the output above, where it happens to equal the device ID). 
Then you run:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In our case, we saw that Media Error Count was non-zero.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Mark physical drive offline&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We decided to mark the faulty drive offline, to see if that would take load off the RAID controller and improve I/O throughput in the system.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#&amp;nbsp;./MegaCli64 -PDOffline -PhysDrv [12:11] -aALL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Indeed, the I/O wait time and the load average on the system dropped to normal levels.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Hot swap faulty drive&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We had a spare drive, which we hot-swapped in for the faulty drive. The new drive started to rebuild. 
You can see its status if you run:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -pdInfo -PhysDrv [12:11] -aALL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Enclosure Device ID: 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Device Id: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Sequence Number: 3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Media Error Count: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Other Error Count: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Predictive Failure Count: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Last Predictive Failure Event Seq Number: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;PD Type: SAS&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Raw Size: 931.512 GB [0x74706db0 Sectors]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Non Coerced Size: 931.012 GB [0x74606db0 Sectors]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Coerced Size: 931.0 GB [0x74600000 Sectors]&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Rebuild&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;SAS Address(0): 0x5000c500347da135&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;SAS Address(1): 0x0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Connected Port Number: 0(path0)&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Inquiry Data: SEAGATE ST31000424SS &amp;nbsp; &amp;nbsp;KS689WK4Z08J &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;FDE Capable: Not Capable&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;FDE Enable: Disable&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', 
Courier, monospace; font-size: x-small;"&gt;Secured: Unsecured&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Locked: Unlocked&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Foreign State: None&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Device Speed: Unknown&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Link Speed: Unknown&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Media Type: Hard Disk Device&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Exit Code: 0x00&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Check progress of the rebuilding process&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 
&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Exit Code: 0x00&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Check event log again&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Once the rebuilding is done, you can check the event log by specifying for example that you want to see the last 10 events:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In our case we saw something like this:&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-family: arial, sans-serif; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;# cat t1.log&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: 
x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x00001720&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 12:48:27 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x000000f9&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x01&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: VD 00/0 is now OPTIMAL&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Target Id: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x0000171f&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 12:48:27 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x00000051&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x01&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Target Id: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Previous state: 2&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;New state: 3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x0000171e&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 12:48:27 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x00000072&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Device ID: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Index: 
12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Previous state: 20&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;New state: 24&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x0000171d&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 12:48:27 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x00000064&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: Rebuild complete on PD 0d(e0x0c/s11)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" 
style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Device ID: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Index: 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x0000171c&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 12:48:27 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x00000063&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: Rebuild complete on VD 00/0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event 
Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Target Id: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x000016b7&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 08:55:42 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x00000072&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" 
style="font-family: 'Courier New', Courier, monospace;"&gt;Device ID: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Index: 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Previous state: 16&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;New state: 20&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x000016b6&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 08:55:42 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x0000006a&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: Rebuild automatically started on PD 
0d(e0x0c/s11)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Device ID: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Index: 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x000016b5&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 08:55:42 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x00000072&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Device ID: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Index: 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Previous state: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;New state: 16&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x000016b4&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 08:55:42 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 
0x000000f7&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Device ID: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Device ID: 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Index: 1&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SAS Address 1: 5000c500347da135&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;SAS Address 2: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;seqNum: 0x000016b3&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Time: Fri Oct 28 08:55:42 2011&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Code: 0x0000005b&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Class: 0&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Locale: 0x02&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Description: Inserted: PD 0d(e0x0c/s11)&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Event Data:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;===========&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Device ID: 13&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Enclosure Index: 12&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Slot Number: 11&lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: arial, sans-serif;"&gt;&lt;br /&gt;&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Here are some other useful troubleshooting commands:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Check firmware 
state across all physical drives&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -PDList -a0 | grep Firmware&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;In the normal case, all drives should be ONLINE. If any drive reports as FAILED you have a problem.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Check virtual drive state&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;You can also check the virtual drive state:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -LDInfo -L0 -aALL&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Adapter 0 -- Virtual Drive Information:&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Virtual Disk: 0 (Target Id: 0)&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Name:&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Size:5.455 TB&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;State: 
Degraded&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Stripe Size: 64 KB&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Number Of Drives per span:2&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Span Depth:6&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Access Policy: Read/Write&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Disk Cache Policy: Disk's Default&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Encryption Type: None&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Exit Code: 0x00&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;Here you can see that the state is degraded. 
Running the previous PDList command shows that one drive has failed:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -PDList -aALL | grep Firmware&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Failed&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware 
state: Online&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;Firmware state: Online&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N &amp;gt; 0, e.g.:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Previous state: 0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;New state: 16&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;span class="Apple-style-span" style="color: #222222; font-size: x-small;"&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;div&gt;Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Resources&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://kb.lsi.com/KnowledgebaseArticle16516.aspx"&gt;MegaCli command 
reference&lt;/a&gt; from LSI&lt;/li&gt;&lt;li&gt;&lt;a href="http://ronaldbradford.com/mysql-expert/cheatsheets/dell-perc-raid.htm"&gt;Perc RAID cheatsheet&lt;/a&gt; from ronaldbradford.com&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-420736299649520371?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/420736299649520371/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=420736299649520371' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/420736299649520371'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/420736299649520371'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/10/more-fun-with-lsi-megaraid-controllers.html' title='More fun with the LSI MegaRAID controllers'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-5015358467531699460</id><published>2011-10-06T08:18:00.000-07:00</published><updated>2011-10-06T08:18:30.808-07:00</updated><title type='text'>Good email sending practices</title><content type='html'>&lt;br /&gt;I'm not going to call these 'best practices' but I hope they'll be useful if you're looking for ways to improve your email sending capabilities so as to maximize the odds 
that a message intended for a given recipient actually reaches that recipient's inbox.&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Make sure your mail servers are not configured as open relays&lt;/li&gt;&lt;ul&gt;&lt;li&gt;This should go without saying, but it should still be your #1 concern&lt;/li&gt;&lt;li&gt;Use ACLs and only allow relaying from the IPs of your application servers&lt;/li&gt;&lt;li&gt;Check your servers by using a&amp;nbsp;&lt;a href="http://www.abuse.net/relay.html"&gt;mail relay testing&lt;/a&gt;&amp;nbsp;service&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Make sure you have reverse DNS entries for the IP addresses you're sending mail from&lt;/li&gt;&lt;ul&gt;&lt;li&gt;This is another one of the oldies but goldies that you should double-check&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Use DKIM&lt;/li&gt;&lt;ul&gt;&lt;li&gt;From the&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Dkim"&gt;Wikipedia entry&lt;/a&gt;:&amp;nbsp;&lt;i&gt;DomainKeys Identified Mail (DKIM) is a method for associating a domain name to an email message, thereby allowing a person, role, or organization to claim some responsibility for the message. The association is set up by means of a digital signature which can be validated by recipients. Responsibility is claimed by a signer —independently of the message's actual authors or recipients— by adding a DKIM-Signature: field to the message's header. 
The verifier recovers the signer's public key using the DNS, and then verifies that the signature matches the actual message's content.&lt;/i&gt;&lt;/li&gt;&lt;li&gt;Check out&amp;nbsp;&lt;a href="http://dkim.org/"&gt;dkim.org&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Check out my blog post on&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2010/03/dkim-setup-with-postfix-and-opendkim.htm"&gt;setting up OpenDKIM with postfix&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Use SPF&lt;/li&gt;&lt;ul&gt;&lt;li&gt;From the&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Sender_Policy_Framework"&gt;Wikipedia entry&lt;/a&gt;:&amp;nbsp;&lt;i&gt;Sender Policy Framework (SPF) is an email validation system designed to prevent email spam by detecting email spoofing, a common vulnerability, by verifying sender IP addresses. SPF allows administrators to specify which hosts are allowed to send mail from a given domain by creating a specific SPF record (or TXT record) in the Domain Name System (DNS). Mail exchangers use the DNS to check that mail from a given domain is being sent by a host sanctioned by that domain's administrators.&lt;/i&gt;&lt;/li&gt;&lt;li&gt;Some SPF-testing tools are available at&amp;nbsp;&lt;a href="http://www.openspf.org/Tools"&gt;this openspf.org page&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Use anti-virus/anti-spam filtering tools&amp;nbsp;as a precaution for guarding against sending out malicious content&lt;/li&gt;&lt;ul&gt;&lt;li&gt;a&amp;nbsp;&lt;a href="http://www.postfix.org/addon.html#content"&gt;list&lt;/a&gt;&amp;nbsp;of such tools for postfix&lt;/li&gt;&lt;li&gt;for sendmail there are 'milters' such as&amp;nbsp;&lt;a href="http://www.mimedefang.org/faq"&gt;MIMEDefang&lt;/a&gt;&amp;nbsp;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Monitor the mail queues on your mail servers&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Watch for out-of-ordinary spikes or drops which can mean that somebody might try to exploit your mail subsystem&lt;/li&gt;&lt;li&gt;Use tools such as nagios, munin and 
ganglia to monitor and alert on your mail queues&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Monitor the rate of your outgoing email&lt;/li&gt;&lt;ul&gt;&lt;li&gt;Again spikes or drops should alert you to suspicious activity or even errors in your application that can have an adverse effect on your mail subsystem&lt;/li&gt;&lt;li&gt;Check out my blog post on&amp;nbsp;&lt;a href="http://agiletesting.blogspot.com/2010/07/tracking-and-visualizing-mail-logs-with.htm"&gt;visualizing mail logs&lt;/a&gt;&amp;nbsp;although these days we are using Graphite to store and visualize these data points&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;If you have the budget,&amp;nbsp;use deliverability and reputation monitoring services such as&amp;nbsp;&lt;a href="http://returnpath.net/"&gt;ReturnPath&lt;/a&gt;&lt;/li&gt;&lt;ul&gt;&lt;li&gt;These services typically monitor the IP addresses of your mail servers and register them with various major ISPs&lt;/li&gt;&lt;li&gt;They can alert you when users at major ISPs are complaining about emails originating from you (most likely marking them as spam)&lt;/li&gt;&lt;li&gt;They can also whitelist your mail server IPs with some ISPs&lt;/li&gt;&lt;li&gt;They can provide lists of mailboxes they maintain at major ISPs and allow you to send email to those mailboxes so you can see the percentage of your messages that reach the inbox, or are missing, or go to a spam folder&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Again if you have the budget (it's not that expensive), use a service such as&amp;nbsp;&lt;a href="http://litmus.com/"&gt;Litmus&lt;/a&gt;&amp;nbsp;which shows you how your email messages look in various mail clients on a variety of operating systems and mobile devices&lt;/li&gt;&lt;li&gt;If you don't have the budget, at least check that your mail server IPs are not blacklisted by various RBL sites&lt;/li&gt;&lt;ul&gt;&lt;li&gt;You can use a multi-RBL check tool such as&amp;nbsp;&lt;a href="http://www.anti-abuse.org/multi-rbl-check/"&gt;this 
one&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Make sure the headers of your messages match the type of email you're sending&lt;/li&gt;&lt;ul&gt;&lt;li&gt;For example, if your messages are MIME multipart, make sure you mark them as such in the Content-Type headers. You should have a Content-Type header showing 'multipart/related' followed by other headers showing various types of content such as 'multipart/alternative', 'text/plain' etc.&lt;/li&gt;&lt;li&gt;This is usually done via an API; if you're using Python, this can be done with something like:&lt;/li&gt;&lt;pre class="code"&gt;from email.mime.multipart import MIMEMultipart&lt;br /&gt;from email.mime.text import MIMEText&lt;br /&gt;# Outer part is multipart/related; plaintext_body and html_body are assumed to be defined elsewhere&lt;br /&gt;message = MIMEMultipart("related")&lt;br /&gt;message_alternative = MIMEMultipart("alternative")&lt;br /&gt;message.attach(message_alternative)&lt;br /&gt;# Plain text goes first and HTML last, since the last alternative is the preferred one&lt;br /&gt;message_alternative.attach(MIMEText(plaintext_body.encode('utf-8'), "plain", 'utf-8'))&lt;br /&gt;message_alternative.attach(MIMEText(html_body.encode('utf-8'), 'html', 'utf-8'))&lt;br /&gt;&lt;/pre&gt;&lt;/ul&gt;&lt;ul&gt;&lt;br /&gt;&lt;/ul&gt;Perhaps the most important recommendation I have: unless you really know what you're doing, look into outsourcing your email sending needs to a third-party service, preferably one that offers an API:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Amazon offers the&amp;nbsp;&lt;a href="http://aws.amazon.com/ses/"&gt;Simple Email Service&lt;/a&gt;&amp;nbsp;(SES) as part of their AWS offerings&lt;/li&gt;&lt;li&gt;Mailgun has a simple&amp;nbsp;&lt;a href="http://documentation.mailgun.net/Documentation/DetailedDocsAndAPIReference#Getting_Started"&gt;API&lt;/a&gt;&lt;/li&gt;&lt;li&gt;Dyn Inc has recently entered the&amp;nbsp;&lt;a href="http://dyn.com/email/?via=topnav"&gt;email sending&lt;/a&gt;&amp;nbsp;arena too&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-5015358467531699460?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://agiletesting.blogspot.com/feeds/5015358467531699460/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=5015358467531699460' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5015358467531699460'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5015358467531699460'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/10/good-email-sending-practices.html' title='Good email sending practices'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1985706688830051522</id><published>2011-09-16T13:55:00.000-07:00</published><updated>2011-09-16T13:55:42.180-07:00</updated><title type='text'>Slow database? Check RAID battery!</title><content type='html'>&lt;b&gt;Executive Summary:&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If your Dell database servers get slow suddenly, and I/O seems sluggish, do yourself a favor and check if the RAID battery is currently going through its 'relearning' cycle. If this is so, then the Write-Back policy is disabled and Write-Through is enabled -- as a result writes become very slow compared to the standard operation.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Details:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This turns out to be a fairly well known problem with RAID controllers in Dell servers, specifically LSI controllers. The default mode of operation for the RAID battery is to periodically go through a so-called 'relearn cycle', where it discharges, then charges and recalibrates itself by finding the current charge. 
During this time, as I mentioned, Write-Back is disabled and Write-Through is enabled.&lt;br /&gt;&lt;br /&gt;For our MySQL servers, we have innodb_flush_log_at_trx_commit set to 1, which means that every commit is flushed to disk. As a consequence, Write-Through mode severely impacts the performance of writes to the database, because each flush has to wait for the physical disks instead of being acknowledged by the controller cache. A symptom is that CPU I/O wait is high, and the database gets sluggish. Pain all around.&lt;br /&gt;&lt;br /&gt;We started to experience this database&amp;nbsp;slowness on 3 database servers at almost the same time. Two of them were configured as slaves, and one as master. The symptoms included high CPU I/O wait, slow queries on the master, and replication lag on the slaves. Nothing pointed to something specific to MySQL. We opened an emergency ticket with &lt;a href="http://www.percona.com/"&gt;Percona&lt;/a&gt; and were fortunate to be assigned to &lt;a href="http://www.percona.com/about-us/our-team/aurimas-mikalauskas/"&gt;Aurimas Mikalauskas&lt;/a&gt;, a Percona principal consultant and a MySQL/RAID hardware guru. It took him less than a minute to correctly diagnose the issue based on these symptoms. Now that we knew what the issue was, some Google searches turned up other articles and blog posts talking about it. It turns out one of the &lt;a href="http://yo61.com/dell-drac-bbu-auto-learn-tests-kill-disk-performance.html"&gt;most frequently cited posts&lt;/a&gt; was written by &lt;a href="http://twitter.com/#!/robinbowes"&gt;Robin Bowes&lt;/a&gt;, my ex-coworker from RIS Technology/Reliam! 
It also turns out Percona engineers blogged about this issue extensively (see &lt;a href="http://www.mysqlperformanceblog.com/2011/02/10/battery-learning-still-problem-many-years-after/"&gt;this post&lt;/a&gt; which references other posts).&lt;br /&gt;&lt;br /&gt;In any case, for future reference, here is what we did on all the servers that have the LSI MegaRaid controller (these servers are Dell C2100s in our case):&lt;br /&gt;&lt;br /&gt;&lt;b&gt;1) Install MegaCli utilities&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I had a hard time finding these utilities, since the LSI support site doesn't seem to have them anymore. I found this &lt;a href="http://ftzdomino.blogspot.com/2009/03/some-useful-megacli-commands.html"&gt;blog post&lt;/a&gt; talking about a zip file containing the tools, then I googled the zip filename and I found an updated version on this &lt;a href="http://download.gocept.com/gentoo/mirror/distfiles/"&gt;Gentoo-related site&lt;/a&gt;. Then I followed the steps in the blog post above to extract the statically-linked binaries:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#&amp;nbsp;apt-get install rpm2cpio&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# mkdir megacli; cd megacli&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#&amp;nbsp;wget http://download.gocept.com/gentoo/mirror/distfiles/4.00.11_Linux_MegaCLI.zip&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#&amp;nbsp;unzip 4.00.11_Linux_MegaCLI.zip&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# unzip 
MegaCliLin.zip&amp;nbsp;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# rpm2cpio MegaCli-4.00.11-1.i386.rpm| cpio -idmv&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;At this point I had 2 statically-linked binaries called MegaCli and MegaCli64 in&amp;nbsp;megacli/opt/MegaRAID/MegaCli.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;2) Inspect event log for RAID controller&lt;/b&gt; to figure out what has been going on in that subsystem (this command was actually run by Aurimas during the troubleshooting he did):&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL&lt;br /&gt;# cat events.log&lt;br /&gt;...&lt;br /&gt;Event Description: Time established as 09/09/11 15:27:18; (48 seconds since power on)&lt;br /&gt;--&lt;br /&gt;Time: Fri Sep  9 16:27:36 2011&lt;br /&gt;Event Description: Battery temperature is normal&lt;br /&gt;--&lt;br /&gt;Time: Fri Sep  9 16:56:51 2011&lt;br /&gt;Event Description: Battery started charging&lt;br /&gt;--&lt;br /&gt;Time: Fri Sep  9 17:08:46 2011&lt;br /&gt;Event Description: Battery charge complete&lt;br /&gt;--&lt;br /&gt;Time: Sat Sep 10 19:54:16 2011&lt;br /&gt;Event Description: Battery relearn will start in 4 days&lt;br /&gt;--&lt;br /&gt;Time: Mon Sep 12 19:53:46 2011&lt;br /&gt;Event Description: Battery relearn will start in 2 day&lt;br /&gt;--&lt;br /&gt;Time: Tue Sep 13 19:54:36 2011&lt;br /&gt;Event Description: Battery relearn will start in 1 day&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 14:54:16 2011&lt;br /&gt;Event Description: Battery relearn will start in 5 hours&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 19:55:26 2011&lt;br /&gt;Event Description: Battery relearn pending: Battery is under charge&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 19:55:26 2011&lt;br /&gt;Event Description: Battery relearn 
started&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 19:55:29 2011&lt;br /&gt;Event Description: BBU disabled; changing WB virtual disks to WT, Forced WB VDs are not affected&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 19:55:29 2011&lt;br /&gt;Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=00,ap=0,dc=0] from [ID=00,dcp=01,ccp=01,ap=0,dc=0]&lt;br /&gt;Previous LD Properties&lt;br /&gt;Current Cache Policy: 1&lt;br /&gt;Default Cache Policy: 1&lt;br /&gt;New LD Properties&lt;br /&gt;Current Cache Policy: 0&lt;br /&gt;Default Cache Policy: 1&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 19:56:31 2011&lt;br /&gt;Event Description: Battery is discharging&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 19:56:31 2011&lt;br /&gt;Event Description: Battery relearn in progress&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 22:43:21 2011&lt;br /&gt;Event Description: Battery relearn completed&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 22:44:26 2011&lt;br /&gt;Event Description: Battery started charging&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 23:53:46 2011&lt;br /&gt;Event Description: BBU enabled; changing WT virtual disks to WB&lt;br /&gt;--&lt;br /&gt;Time: Wed Sep 14 23:53:46 2011&lt;br /&gt;Event Description: Policy change on VD 00/0 to [ID=00,dcp=01,ccp=01,ap=0,dc=0] from [ID=00,dcp=01,ccp=00,ap=0,dc=0]&lt;br /&gt;Previous LD Properties&lt;br /&gt;Current Cache Policy: 0&lt;br /&gt;Default Cache Policy: 1&lt;br /&gt;New LD Properties&lt;br /&gt;Current Cache Policy: 1&lt;br /&gt;Default Cache Policy: 1&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;So as you can see, the battery relearn started at 19:55:26, then 3 seconds later the Write-Back policy was changed to Write-Through, and it stayed like this until 23:53:46, when it was changed back to Write-Back. This shows that the I/O was impacted for 4 hours. 
Luckily for us it was outside of our high-traffic period for the day, but it was still painful.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;3) Disable autoLearnMode for the RAID battery&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This is so we don't have this type of surprise in the future. Automatic relearning is enabled by default. You can see its current setting if you run this command:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -AdpBbuCmd -GetBbuStatus -a0&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;We followed Robin's blog post (thanks, Robin!) and set autoLearnMode to 1, which, perhaps counterintuitively, is the value that disables the automatic relearn:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# echo "autoLearnMode=1" &amp;gt; tmp.txt&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;#&amp;nbsp;./MegaCli64 -AdpBbuCmd -SetBbuProperties -f tmp.txt -a0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;4) Force battery relearn cycle&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It is still recommended to run the battery relearn cycle manually from time to time, so we did it on all servers that are not yet in production. For the rest of the servers we'll do it at night, when traffic is lowest. In the future, we'll take maintenance windows every N months (where N is probably 6 or 12) and force the relearn cycle.&lt;br /&gt;&lt;br /&gt;Here's the command to force the relearn:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small;"&gt;# ./MegaCli64 -AdpBbuCmd -BbuLearn -a0&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;For reference, LSI has good documentation for the MegaCli utilities on one of their &lt;a href="http://kb.lsi.com/KnowledgebaseArticle16516.aspx"&gt;KB sites&lt;/a&gt;. 
Another good reference is this &lt;a href="http://ronaldbradford.com/mysql-expert/cheatsheets/dell-perc-raid.htm"&gt;Dell PERC cheatsheet&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;I hope this will be a good troubleshooting guide for people faced with mysterious I/O slowness. Thanks again to Aurimas from Percona for his help. These guys are awesome!&lt;br /&gt;&lt;br /&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1985706688830051522?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1985706688830051522/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1985706688830051522' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1985706688830051522'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1985706688830051522'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/09/slow-database-check-raid-battery.html' title='Slow database? Check RAID battery!'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-3565161603167884332</id><published>2011-09-08T15:56:00.000-07:00</published><updated>2011-09-08T15:56:05.497-07:00</updated><title type='text'>Setting up RAID1 with mdadm on a running Ubuntu server</title><content type='html'>For the last couple of weeks we've been working on setting up a bunch of Dell C2100 servers with Ubuntu 10.04. 
These servers come with 2 x 500 GB internal disks that can be set up in RAID1 with the on-board RAID controller. However, when we set that up during the Ubuntu install, we never managed to get back into the OS after the initial reboot, or in some cases GRUB refused to write to the array (I think it was when we tried 11.04 out of desperation). To make matters worse, even software RAID via mdadm failed, with the servers dropping to the initramfs BusyBox prompt after the initial reboot. My guess is that it all stems from GRUB not writing properly to /dev/md0 (the root partition RAID-ed during the install) and instead writing to /dev/sda1 and /dev/sdb1. So we decided to install the root partition and the swap space on /dev/sda and leave /dev/sdb alone.&lt;br /&gt;&lt;br /&gt;I started to look for articles on how to set up RAID1 post-install, when you have the OS installed on /dev/sda and you want to add /dev/sdb to a RAID1 setup with mdadm. After some fruitless searches, I finally hit the jackpot with &lt;a href="http://www.howtoforge.com/how-to-set-up-software-raid1-on-a-running-system-incl-grub2-configuration-ubuntu-10.04"&gt;this article&lt;/a&gt; on HowtoForge written by &lt;a href="http://twitter.com/#!/falko"&gt;Falko Timme&lt;/a&gt;. 
I really don't have much to add, just follow the instructions closely and it will work ;-) Kudos to Falko for a great resource.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-3565161603167884332?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/3565161603167884332/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=3565161603167884332' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3565161603167884332'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3565161603167884332'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/09/setting-up-raid1-with-mdadm-on-running.html' title='Setting up RAID1 with mdadm on a running Ubuntu server'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-3553433087246321236</id><published>2011-08-19T10:24:00.000-07:00</published><updated>2011-08-19T10:24:55.735-07:00</updated><title type='text'>New location for the Python Testing Tools Taxonomy</title><content type='html'>I was taken by surprise by Baiju Muthukadan's &lt;a href="http://baijum.blogspot.com/2011/08/new-location-for-python-testing-tools.html"&gt;announcement&lt;/a&gt; which I read on Planet Python -- the &lt;a href="http://pycheesecake.org/wiki/PythonTestingToolsTaxonomy"&gt;Python Testing Tools Taxonomy&lt;/a&gt; page which I started years ago has a new 
&lt;a href="http://wiki.python.org/moin/PythonTestingToolsTaxonomy"&gt;incarnation&lt;/a&gt; on the Python wiki. I think it's a good thing (although I still wish I had been notified as a courtesy). In any case, feel free to add more tools to the page!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-3553433087246321236?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/3553433087246321236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=3553433087246321236' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3553433087246321236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3553433087246321236'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/08/new-location-for-python-testing-tools.html' title='New location for the Python Testing Tools Taxonomy'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1831938801624868989</id><published>2011-08-17T12:42:00.000-07:00</published><updated>2011-08-17T12:42:17.613-07:00</updated><title type='text'>Anybody using lxc or OpenVZ in production?</title><content type='html'>I asked a similar question yesterday on Twitter ("Anybody using Linux Containers (lxc) in production, preferably with Ubuntu?") and it seemed to have struck a chord, because many people asked me to post the answers to this question, and many 
other people answered the question.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Both Linux Containers (or &lt;a href="http://lxc.sourceforge.net/"&gt;lxc&lt;/a&gt; as the project is known) and &lt;a href="http://wiki.openvz.org/Main_Page"&gt;OpenVZ&lt;/a&gt; are lightweight virtualization systems that operate at the operating system level, and as such can be attractive to people who are looking to split a big physical server into containers, while achieving resource isolation per container. I personally want to look into both primarily as a means to run several MySQL instances per physical server while ensuring better resource isolation, especially with regard to RAM.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In any case, I thought it would be interesting to post the replies I got on Twitter to my question.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;From &lt;a href="http://twitter.com/#!/AlTobey"&gt;AlTobey&lt;/a&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;"I'm using straight cgroups without namespaces in production. 
It's pretty nice for fine-grained scheduler control."&lt;br /&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="color: #444444; font-family: Arial, 'Helvetica Neue', sans-serif; font-size: 15px; line-height: 19px;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #444444; font-family: Arial, 'Helvetica Neue', sans-serif; font-size: 15px; line-height: 19px;"&gt;&lt;span class="Apple-style-span" style="color: black; font-family: 'Times New Roman'; font-size: small; line-height: normal;"&gt;From &lt;a href="http://twitter.com/#!/ohlol"&gt;ohlol&lt;/a&gt;:&lt;/span&gt;&lt;span class="Apple-style-span" style="color: black; font-family: 'Times New Roman'; font-size: small; line-height: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="color: black; font-family: 'Times New Roman'; font-size: small; line-height: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="color: black; font-family: 'Times New Roman'; font-size: small; line-height: normal;"&gt;"I just began using lxc. Have three hosts in it so far as a test run. Not doing NAT, just plain bridging right now."&lt;/span&gt;&lt;span class="Apple-style-span" style="color: black; font-family: 'Times New Roman'; font-size: small; line-height: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="color: #444444; font-family: Arial, 'Helvetica Neue', sans-serif; font-size: 15px; line-height: 19px;"&gt;&lt;span class="Apple-style-span" style="color: black; font-family: 'Times New Roman'; font-size: small; line-height: normal;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;From &lt;a href="http://twitter.com/#!/vvuksan"&gt;vvuksan&lt;/a&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;"I have been using it for about a week on my laptop to replace Vagrant/Virtualbox. 
Works great so far."&lt;br /&gt;&lt;br /&gt;"I just posted a short write up of how I use LXC on my laptop &lt;a href="http://t.co/CQXTPMv"&gt;http://t.co/CQXTPMv&lt;/a&gt;"&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/ohlol"&gt;ohlol&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"Have you tried lxc without libvirt? I found it to be a bit easier to deal with."&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/vvuksan"&gt;vvuksan&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"Yes that is a red herring. You do not need libvirt. I had it installed already so went with it by default."&lt;br /&gt;&lt;br /&gt;&lt;div&gt;"It just helps me not have to set up dnsmasq, iptables etc. :-) But you can certainly do away with it."&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/ohlol"&gt;ohlol&lt;/a&gt;:&lt;br /&gt;&lt;/div&gt;&lt;div&gt;"Have you tried doing an apt-get upgrade in lxc? What a PITA :)"&lt;/div&gt;&lt;div&gt;&lt;br /&gt;"btw, if you ever get to that point: &lt;a href="http://t.co/2WvYaeX"&gt;http://t.co/2WvYaeX&lt;/a&gt; helped get me to a working solution"&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/ichilton"&gt;ichilton&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"ive been using OpenVZ in production with Debian Stable (on the host and guests) for over a year with no problems...."&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/griggheo"&gt;griggheo&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"@ichilton you had to recompile the kernel for OpenVZ support in Debian right?"&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/ichilton"&gt;ichilton&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"I didn't, there was an OpenVZ kernel package but it was Lenny at the time and not upgraded yet - will have to check Squeeze."&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/ichilton"&gt;ichilton&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"@vvuksan interested why you did that originally and what the advantages are in hindsight?"&lt;br /&gt;&lt;br 
/&gt;From &lt;a href="http://twitter.com/vvuksan"&gt;vvuksan&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"Speed. The dev env needs 5-6 boxes running at the same time and with Vbox my laptop becomes really slow. With LXC not so much."&lt;br /&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/sstatik"&gt;sstatik&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"LXC should be considerably smoother in 11.10 for both 11.10/10.04 guests. I want to see laptop-based microclouds become common."&lt;br /&gt;&lt;/div&gt;&lt;div&gt;From &lt;a href="http://twitter.com/#!/mitchellh"&gt;mitchellh&lt;/a&gt;:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;"@sstatik @griggheo Laptop based microclouds are the future. We're just missing quality software to help manage it."&lt;/div&gt;&lt;div&gt;&lt;br /&gt;From &lt;a href="http://twitter.com/#!/heckj"&gt;heckj&lt;/a&gt;:&lt;br /&gt;&lt;br /&gt;"@sstatik @griggheo documentation and details getting better? its arcane to use in 11.04, and that is 1000x better than 10.x..."&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;So there you have it, a small snapshot of why and how people are using lxc/OpenVZ, especially on Ubuntu. 
I'll post my own experiences as I start playing with lxc and potentially OpenVZ.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1831938801624868989?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1831938801624868989/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1831938801624868989' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1831938801624868989'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1831938801624868989'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/08/anybody-using-lxc-or-openvz-in.html' title='Anybody using lxc or OpenVZ in production?'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-4585065632236974757</id><published>2011-07-27T14:58:00.000-07:00</published><updated>2011-07-27T14:59:14.640-07:00</updated><title type='text'>Processing mail logs with Elastic MapReduce and Pig</title><content type='html'>These are some notes I took while trying out Elastic MapReduce (EMR), and more specifically its Pig functionality, by processing sendmail mail logs. A big help was Eric Lubow's &lt;a href="http://eric.lubow.org/2011/hadoop/pig-queries-parsing-json-on-amazons-elastic-map-reduce-using-s3-data/"&gt;blog post&lt;/a&gt;&amp;nbsp;on EMR and Pig. 
Before I go into details, here's my general processing flow:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;N mail servers (running sendmail) send their mail logs to a central server running syslog-ng.&lt;/li&gt;&lt;li&gt;A process running on the central logging server tails the aggregated mail log (at 5-minute intervals), parses the lines it finds, extracts relevant information from each line, and saves the output in JSON format to a local file (actually there are 2 types of files generated, one for sender information and one for recipient information, corresponding to the 'from' and 'to' lines in the mail log -- see below).&lt;/li&gt;&lt;li&gt;Another process compresses the generated files in bzip2 format and uploads them to S3.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;I have 2 sets of files. The first set has names similar to "from-2011-07-12-20-58" and contains JSON records of the following form, one per line:&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;{"nrcpts": "1", "src": "info@example.com", "sendmailid": "p6D0r0u1006229", "relay": "app03.example.com", "classnumber": "0", "msgid": "WARQZCXAEMSSVWPPOOYZXRLQIKMFUY.155763@example.com", "pid": "6229", "month": "Jul", "time": "20:53:00", "day": "12", "mailserver": "mail5", "size": "57395"}&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The second set has names similar to "to-2011-07-12-20-58" and contains JSON records of the following form, one per line:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;{"sendmailid": "p6D0qwvm006395", "relay": "gmail-smtp-in.l.google.com.", "dest": "somebody@gmail.com", "pid": "6406", "stat": "Sent (OK 1310518380 pd12si6025606vcb.162)", "month": "Jul", "delay": "00:00:02", "time": "20:53:00", "xdelay": "00:00:02", "day": "12", "mailserver": "mail2"}&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For the initial EMR/Pig setup, I followed &lt;a 
href="http://aws.amazon.com/articles/2729"&gt;"Parsing Logs with Apache Pig and Elastic MapReduce"&lt;/a&gt;. It's fairly simple to end up with an EC2 instance running Hadoop and Pig that you can play with.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I then ssh-ed into the EMR master instance (note that it was still shown in 'Waiting' state in the EMR console, but once it got assigned an IP and internal name I was able to ssh into it).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In order for Pig to be able to process input in JSON format, you need to use Kevin Weil's &lt;a href="https://github.com/kevinweil/elephant-bird"&gt;elephant-bird&lt;/a&gt; library. I followed Eric Lubow's post to get that set up:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span id="internal-source-marker_0.26019668276421726" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ mkdir git &amp;amp;&amp;amp; mkdir pig-jars&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ cd git &amp;amp;&amp;amp; wget --no-check-certificate 
https://github.com/kevinweil/elephant-bird/tarball/eb1.2.1_with_jsonloader&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ tar xvfz eb1.2.1_with_jsonloader &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ cd kevinweil-elephant-bird-ecf8356/&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ cp lib/google-collect-1.0.jar 
~/pig-jars/&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ cp lib/json-simple-1.1.jar ~/pig-jars/&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ ant nonothing&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ cd build/classes/&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ jar -cf ../elephant-bird-1.2.1-SNAPSHOT.jar com&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ cp ../elephant-bird-1.2.1-SNAPSHOT.jar ~/pig-jars/&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;I then copied 3 elephant-bird jar files to S3 so I can register them every time I run Pig. 
I did that via the grunt command prompt:&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;$ pig -x local&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;div style="background-color: transparent; white-space: normal;"&gt;&lt;span id="internal-source-marker_0.26019668276421726" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; cp file:///home/hadoop/pig-jars/google-collect-1.0.jar s3://MY_S3_BUCKET/jars/pig/ 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; cp file:///home/hadoop/pig-jars/json-simple-1.1.jar s3://&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;MY_S3_BUCKET/jars/pig/&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; cp 
file:///home/hadoop/pig-jars/elephant-bird-1.2.1-SNAPSHOT.jar s3://&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;MY_S3_BUCKET/jars/pig/&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;At this point, I was ready to process some of the files I uploaded to S3.&amp;nbsp;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;I first tried processing a single file, using Pig's local mode (which doesn't involve HDFS). 
It turns out that Pig doesn't load compressed files correctly via elephant-bird when you run in local mode, so I tested this on an uncompressed file previously uploaded to S3:&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small; white-space: pre-wrap;"&gt;$ pig -x local&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;div style="background-color: transparent; white-space: normal;"&gt;&lt;span id="internal-source-marker_0.26019668276421726" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; REGISTER s3://&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;MY_S3_BUCKET/jars/pig/google-collect-1.0.jar; 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; REGISTER s3://&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;MY_S3_BUCKET/jars/pig/json-simple-1.1.jar; &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 
'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; REGISTER s3://MY_S3_BUCKET/jars/pig/elephant-bird-1.2.1-SNAPSHOT.jar;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; json = LOAD 's3://&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', 
Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;MY_S3_BUCKET/mail_logs/2011-07-12/to-2011-07-12-16-49' USING com.twitter.elephantbird.pig.load.JsonLoader();&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;Note that I used the JSON loader from the elephant-bird JAR file.&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: medium;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;I wanted to know the top 3 mail servers from the file I loaded&amp;nbsp;&lt;/span&gt;&lt;/span&gt;(this is again heavily inspired by Eric Lubow's example in his blog post)&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;:&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace; font-size: x-small; white-space: pre-wrap;"&gt;grunt&amp;gt; mailservers = FOREACH json GENERATE $0#'mailserver' AS mailserver;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: medium;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;div style="background-color: transparent; white-space: normal;"&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; 
font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; mailserver_count = FOREACH (GROUP mailservers BY $0) GENERATE $0, COUNT($1) AS cnt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; mailserver_sorted_count = LIMIT(ORDER mailserver_count BY cnt DESC) 3;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; DUMP mailserver_sorted_count;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I won't go into detail about the actual Pig operations I ran; I recommend going through some Pig Latin tutorials or buying the O'Reilly 'Programming Pig' book. 
Suffice it to say that I extracted the 'mailserver' JSON field, grouped the records by mail server, and counted the records in each group. Finally, I dumped the top 3 mail servers.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's a slightly more interesting exercise: finding the top 10 mail recipients by looking at all the to-* files uploaded to S3 (still uncompressed in this case):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; to = LOAD 's3://MY_S3_BUCKET/mail_logs/2011-07-13/to*' USING com.twitter.elephantbird.pig.load.JsonLoader(); &lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; to_emails = FOREACH to GENERATE $0#'dest' AS dest; 
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; to_count = FOREACH (GROUP to_emails BY $0) GENERATE $0, COUNT($1) AS cnt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: 
x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; to_sorted_count = LIMIT(ORDER to_count BY cnt DESC) 10;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; DUMP to_sorted_count;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;I tried the same processing steps on 
bzip2-compressed files using Pig's Hadoop mode (which you invoke by just running 'pig' and not 'pig -x local'). The files were loaded correctly this time, but the MapReduce phase failed with messages similar to this in&amp;nbsp;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt; /mnt/var/log/apps/pig.log:&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: 15px; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;br /&gt;&lt;div style="background-color: transparent; white-space: normal;"&gt;&lt;span id="internal-source-marker_0.26019668276421726" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;Pig Stack Trace&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;---------------&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br 
/&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;ERROR 6015: During execution, encountered a Hadoop error.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias to_sorted_count&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" 
style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.PigServer.openIterator(PigServer.java:482)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:546)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:165)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at 
org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:141)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.Main.main(Main.java:374)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: 
transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at java.lang.reflect.Method.invoke(Method.java:597)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: 
pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.hadoop.util.RunJar.main(RunJar.java:156)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 6015: During execution, encountered a Hadoop error.&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span 
class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:862)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:474)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:109)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:255)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:244)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:94)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at 
org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:363)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;at org.apache.hadoop.mapred.MapTask.run(MapTask.java:312)&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, 
monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;Caused by: java.io.IOException: Type mismatch in key from map: expected org.apache.pig.impl.io.NullableBytesWritable, recieved org.apache.pig.impl.io.NullableText&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-tab-span" style="white-space: pre;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;	&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;... 
9 more&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent; font-family: 'Times New Roman'; font-size: medium; white-space: normal;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: medium;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;A quick Google search revealed JIRA Pig ticket &lt;a href="https://issues.apache.org/jira/browse/PIG-919"&gt;#919&lt;/a&gt; which offered a workaround. &lt;/span&gt;&lt;span class="Apple-style-span" style="font-family: 'Times New Roman';"&gt;&lt;span class="Apple-style-span" style="font-size: small;"&gt;Basically this happens when a value coming out of a map is used in a group/cogroup/join. 
By default, the type of that value is bytearray, and you need to cast it to chararray to make things work (I confess I haven't dug into the nitty-gritty of this issue yet; I was just happy I made it work).&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;br /&gt;So all I had to do was modify a single line and cast the value used in the GROUP BY clause to chararray:&lt;br /&gt;&lt;br /&gt;&lt;div style="background-color: transparent;"&gt;&lt;span id="internal-source-marker_0.26019668276421726" style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;grunt&amp;gt; to_count = FOREACH (&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: bold; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;GROUP to_emails BY (chararray)$0&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style="background-color: transparent; color: black; font-style: normal; font-variant: normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;) GENERATE $0, COUNT($1) AS cnt; &amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span style="background-color: transparent; color: black; font-family: Arial; font-size: 11pt; font-style: normal; font-variant: 
normal; font-weight: normal; text-decoration: none; vertical-align: baseline; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: medium;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;At this point, I was able to watch Elastic MapReduce in action, although it ran slower than in local mode because I only had one m1.small instance. I'll try it next with several instances and hopefully see a near-linear improvement.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: medium;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div style="background-color: transparent;"&gt;&lt;span class="Apple-style-span" style="font-family: Arial; font-size: medium;"&gt;&lt;span class="Apple-style-span" style="font-size: 15px; white-space: pre-wrap;"&gt;That's it for now. This was just a toy example, but it got me started with EMR and Pig. 
Hopefully I'll follow up with more interesting log processing and analysis.&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-4585065632236974757?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/4585065632236974757/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=4585065632236974757' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/4585065632236974757'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/4585065632236974757'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/07/processing-mail-logs-with-elastic.html' title='Processing mail logs with Elastic MapReduce and Pig'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2315215206870074617</id><published>2011-07-22T10:10:00.000-07:00</published><updated>2011-07-22T10:11:26.440-07:00</updated><title type='text'>Results of a survey of the SoCal Piggies group</title><content type='html'>My colleague Warren Runk had the idea of putting together a survey to be sent to the mailing list of the SoCal Python Interest Group (aka SoCal Piggies), with the purpose of finding out which topics or activities would be most interesting to the members of the group in terms of future meetings. We had 10 topics in the survey, and people responded by choosing their top 5. 
We also had free-form response fields for 2 questions: "What do you like most about the meetings?" and "What meeting improvements are most important to you?".&lt;br /&gt;&lt;br /&gt;We had 26 responses. Here are the vote results for the 10 topics we proposed:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#1 (18 votes):&amp;nbsp;"Good practice, pitfall avoidance, and module introductions for beginners"&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#2 (17 votes): "5 minute lightning talks"&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;b&gt;&lt;br /&gt;#3 - #4 (15 votes): "Excellent code examples from established Python projects" &lt;/b&gt;&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;and&lt;/span&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;b&gt;&amp;nbsp;"New and upcoming Python open source projects"&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#5 (14 votes): "30 minute presentations"&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#6 (13 votes): "Ice breakers/new member introductions"&lt;/span&gt;&lt;/b&gt;&lt;/div&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#7 (12 votes): "Algorithm 
discussions and dissections"&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#8 (11 votes): "Good testing practices and pointers to new methods/tools"&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#9 (10 votes): "Moderated relevant/cutting edge general tech discussions"&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;#10 (9 votes): "Short small group programming exercises"&lt;/span&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It's pretty clear that people are most interested in good Python programming practices and practical examples of 'excellent' code from established projects. Presentations are popular too, with lightning talks edging out the longer 30-minute talks. A pretty good percentage of the people attending our meetings are beginners, so we're going to try to focus on making our meetings more beginner-friendly.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As far as what people like most about the meetings, here are a few things:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;"I love hearing about how Python is being used in multiple locations throughout large corporations. &amp;nbsp;It helps me to promote Python at every opportunity when I can say that Python is being used at Acme Corp for XYZ!"&lt;/li&gt;&lt;li&gt;"High level introductions to Python modules. 
Often this is not the main thrust of a &amp;nbsp;talk, but the speaker chose some module for a given task and that helps me expand my horizon."&lt;/li&gt;&lt;li&gt;"Becoming aware of how various companies use python, which libraries and tools are used most often, the opportunity to connect with members during breaks."&lt;/li&gt;&lt;li&gt;"I like being exposed to things I don't normally see at work of if I've seen them I get to see them from a different angle. &amp;nbsp;"&lt;/li&gt;&lt;li&gt;"I don't have other geeks at my office so I like having the chance to hang out and get to know other Python programmers."&lt;/li&gt;&lt;li&gt;...and many people expressed their satisfaction in seeing Raymond Hettinger's presentation at Disney Animation Studios (thanks to Paul Hildebrandt for putting that together!)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Here's what people said when asked about possible improvements:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;"More best practices and module intros."&lt;/li&gt;&lt;li&gt;"Keep the meetings loose, don't have too many controls. "&lt;/li&gt;&lt;li&gt;"In addition to the aptly proposed "ice-breakers / introductions" how can we current members more-actively welcome beginners?"&lt;/li&gt;&lt;li&gt;"Time (and some format) to discuss the issues brought up in the talks. Sometimes I think it'd be useful for the group to get more directly involved in vetting/providing critique for some of the decisions a speaker made. Controversial points made in talks are great, but sometimes I think everyone might benefit from a few other perspectives."&lt;/li&gt;&lt;li&gt;"Friendlier onboarding of new members would be great."&lt;/li&gt;&lt;li&gt;"Keeping the total noobs in mind"&lt;/li&gt;&lt;li&gt;"I would like introductions. 
I have met a couple people at each of the meetings that I have attended, but I would also like to know who else is there."&lt;/li&gt;&lt;li&gt;"I would like the opportunity to meet resourceful programmers and learn techniques and abilities that I can't pick up from youtube or online tutorials!"&lt;/li&gt;&lt;li&gt;"I think we should try to come up with and stick with a consistent format. I like the discussion-style presentation so long as it does not detract from the topic at hand. I think we need to make sure that people stick with shorter presentations, so that there is plenty of time for Q&amp;amp;A without the risk of running on too long. &amp;nbsp;30 minutes should really be 30 minutes! "&lt;/li&gt;&lt;li&gt;"It would be good to identify the difficulty/skill level of a presentation ahead of time so that beginners are not scared off or at least know what they're getting into. Perhaps we could try to always mix it up by warming up with a beginner/intermediate preso and follow up with an intermediate/advanced."&lt;/li&gt;&lt;/ul&gt;I think you can see a theme here -- friendliness and attention towards beginners is a wish that many people have. I believe in the past we tended to ignore this side in our meetings, so we definitely need to do a better job at it.&lt;br /&gt;&lt;br /&gt;We had a meeting last night where we discussed some of these topics. We tried to appoint point persons for given topics. These persons would be responsible for doing research on that topic (for example 'New and upcoming Python open source projects') and give a short presentation to the group at every meeting, while also looking for other group members to delegate this responsibility to in the future. I think this 'lieutenant' system will work well, but time will tell. My personal observation from the 7 years I've been organizing this group is that the hardest part is to get people to volunteer in any capacity, and most of all in presenting to the group. 
But this infusion of new ideas is very welcome, and I hope it will invigorate the participation in our group.&lt;br /&gt;&lt;br /&gt;I hope the results of this survey and the feedback we got will be useful to other Python user groups out there.&lt;br /&gt;&lt;br /&gt;I want to thank Warren Runk and Danny Greenfeld for their feedback, ideas and participation in making the SoCal Piggies Group better.&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2315215206870074617?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2315215206870074617/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2315215206870074617' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2315215206870074617'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2315215206870074617'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/07/results-of-survey-of-socal-piggies.html' title='Results of a survey of the SoCal Piggies group'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-7311029219198682753</id><published>2011-07-20T11:04:00.000-07:00</published><updated>2011-07-20T11:04:01.132-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='cloud'/><category scheme='http://www.blogger.com/atom/ns#' term='ubuntu'/><category 
scheme='http://www.blogger.com/atom/ns#' term='openvpn'/><title type='text'>Accessing the data center from the cloud with OpenVPN</title><content type='html'>This post was inspired by a recent exercise I went through at the prompting of my colleague Dan Mesh. The goal was to have Amazon EC2 instances connect securely to servers at a data center using OpenVPN.&lt;br /&gt;&lt;br /&gt;In this scenario, we have a server within the data center running OpenVPN in server mode. The server has a publicly accessible IP (via a firewall NAT) with port 1194 exposed via UDP. Cloud instances running OpenVPN in client mode connect to the server, have a route to an internal network within the data center pushed to them, and are then able to access servers on that internal network over a VPN tunnel.&lt;br /&gt;&lt;br /&gt;Here are some concrete details about the network topology that I'm going to discuss.&lt;br /&gt;&lt;br /&gt;Server A at the data center has an internal IP address of 10.10.10.10 and is part of the internal network 10.10.10.0/24. There is a NAT on the firewall mapping external IP X.Y.Z.W to the internal IP of server A. There is also a rule that allows UDP traffic on port 1194 to X.Y.Z.W.&lt;br /&gt;&lt;br /&gt;I have an EC2 instance from which I want to reach server B on the internal data center network, with IP 10.10.10.20.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install and configure OpenVPN on server A&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Since server A is running Ubuntu (10.04 to be exact), I used this very good &lt;a href="https://help.ubuntu.com/10.04/serverguide/C/openvpn.html"&gt;guide&lt;/a&gt;, with an important exception: I didn't want to configure the server in bridging mode; I preferred the simpler tunneling mode. In bridging mode, the internal network which server A is part of (10.10.10.0/24 in my case) is directly exposed to OpenVPN clients. In tunneling mode, a tunnel is created between clients and server A on a separate, dedicated network. 
I preferred the tunneling option because it doesn't require any modifications to the network setup of server A (no bridging interface required), and because it provides better security for my requirements (I can target individual servers on the internal network and configure them to be accessed via VPN). YMMV of course.&lt;br /&gt;&lt;br /&gt;For the initial installation and key creation for OpenVPN, I followed the &lt;a href="https://help.ubuntu.com/10.04/serverguide/C/openvpn.html"&gt;guide&lt;/a&gt;. When it came to configuring the OpenVPN server, I created these entries in /etc/openvpn/server.conf:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;server 172.16.0.0 255.255.255.0&lt;/b&gt;&lt;br /&gt;&lt;b&gt;push "route 10.10.10.0 255.255.255.0"&lt;/b&gt;&lt;br /&gt;&lt;b&gt;tls-auth ta.key 0&amp;nbsp;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The first directive specifies that the OpenVPN tunnel will be established on a new 172.16.0.0/24 network. The server will get the IP 172.16.0.1, while OpenVPN clients that connect to the server will get 172.16.0.6 etc.&lt;br /&gt;&lt;br /&gt;The second directive pushes a static route to the internal data center network 10.10.10.0/24 to all connected OpenVPN clients. This way each client will know how to get to machines on that internal network, without the need to create static routes manually on the client.&lt;br /&gt;&lt;br /&gt;The tls-auth directive provides extra security to help prevent DoS attacks and UDP port flooding.&lt;br /&gt;&lt;br /&gt;Note that I didn't have to include any bridging-related scripts or other information in server.conf.&lt;br /&gt;&lt;br /&gt;At this point, if you start the OpenVPN service on server A via 'service openvpn start', you should see an extra tun0 network interface when you run ifconfig. 
Something like this:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;tun0 &amp;nbsp; &amp;nbsp; &amp;nbsp;Link encap:UNSPEC &amp;nbsp;HWaddr 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 &amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;inet addr:172.16.0.1 &amp;nbsp;P-t-P:172.16.0.2 &amp;nbsp;Mask:255.255.255.255&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;UP POINTOPOINT RUNNING NOARP MULTICAST &amp;nbsp;MTU:1500 &amp;nbsp;Metric:1&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;RX packets:2 errors:0 dropped:0 overruns:0 frame:0&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;TX packets:2 errors:0 dropped:0 overruns:0 carrier:0&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;collisions:0 txqueuelen:100&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span 
class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;RX bytes:168 (168.0 B) &amp;nbsp;TX bytes:168 (168.0 B)&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Also, the routing information will now include the 172.16.0.0 network:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;# netstat -rn&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;Kernel IP routing table&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;Destination &amp;nbsp; &amp;nbsp; Gateway &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Genmask &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Flags &amp;nbsp; MSS Window &amp;nbsp;irtt Iface&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;172.16.0.2 &amp;nbsp; &amp;nbsp; &amp;nbsp;0.0.0.0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 255.255.255.255 UH &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 tun0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;172.16.0.0 &amp;nbsp; &amp;nbsp; &amp;nbsp;172.16.0.2 &amp;nbsp; &amp;nbsp; 
&amp;nbsp;255.255.255.0 &amp;nbsp; UG &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 tun0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;...etc&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Install and configure OpenVPN on clients&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here again I followed the Ubuntu OpenVPN guide. The steps are very simple:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) apt-get install openvpn&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;2) scp the following files (which were created on the server during the OpenVPN server install process above) from server A to the client, into the /etc/openvpn directory:&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;ca.crt&lt;/div&gt;&lt;div&gt;ta.key&lt;/div&gt;&lt;div&gt;client_hostname.crt&amp;nbsp;&lt;/div&gt;&lt;div&gt;client_hostname.key&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;3) Customize client.conf:&lt;br /&gt;&lt;br /&gt;# cp /usr/share/doc/openvpn/examples/sample-config-files/client.conf /etc/openvpn&lt;br /&gt;&lt;br /&gt;Edit client.conf and specify:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;remote X.Y.Z.W 1194 &amp;nbsp; &amp;nbsp;&lt;/b&gt; (where X.Y.Z.W is the external IP of server A)&lt;br /&gt;&lt;br /&gt;&lt;b&gt;cert client_hostname.crt&lt;/b&gt;&lt;br /&gt;&lt;b&gt;key&amp;nbsp;&lt;/b&gt;&lt;b&gt;client_hostname.key&lt;/b&gt;&lt;br /&gt;&lt;b&gt;tls-auth ta.key 1&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Now if you start the OpenVPN service on the client via 'service openvpn start', you should see a tun0 interface when you run ifconfig:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;tun0 &amp;nbsp; &amp;nbsp; &amp;nbsp;Link encap:UNSPEC &amp;nbsp;HWaddr 
00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 &amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;inet addr:172.16.0.6 &amp;nbsp;P-t-P:172.16.0.5 &amp;nbsp;Mask:255.255.255.255&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;UP POINTOPOINT RUNNING NOARP MULTICAST &amp;nbsp;MTU:1500 &amp;nbsp;Metric:1&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;RX packets:2 errors:0 dropped:0 overruns:0 frame:0&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;TX packets:2 errors:0 dropped:0 overruns:0 carrier:0&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;collisions:0 txqueuelen:100&amp;nbsp;&lt;/span&gt;&lt;/span&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;RX bytes:168 (168.0 B) &amp;nbsp;TX bytes:168 (168.0 B)&lt;/span&gt;&lt;/span&gt;&lt;br 
/&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You should also see routing information related to both the tunneling network 172.16.0.0/24 and to the internal data center network 10.10.10.0/24 (which was pushed from the server):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;# netstat -rn&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Kernel IP routing table&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;Destination &amp;nbsp; &amp;nbsp; Gateway &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Genmask &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Flags &amp;nbsp; MSS Window &amp;nbsp;irtt Iface&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;172.16.0.5 &amp;nbsp; &amp;nbsp; &amp;nbsp;0.0.0.0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; 255.255.255.255 UH &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 tun0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;172.16.0.1 &amp;nbsp; &amp;nbsp; &amp;nbsp;172.16.0.5 &amp;nbsp; &amp;nbsp; &amp;nbsp;255.255.255.255 UGH &amp;nbsp; &amp;nbsp; &amp;nbsp; 0 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 tun0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span 
class="Apple-style-span" style="font-size: x-small;"&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;10.10.10.0 &amp;nbsp; &amp;nbsp; &amp;nbsp;172.16.0.5 &amp;nbsp; &amp;nbsp; &amp;nbsp;255.255.255.0 &amp;nbsp; UG &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 tun0&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;...etc&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point, the client and server A should be able to ping each other on their 172.16 IP addresses. From the client you should be able to ping server A's IP 172.16.0.1, and from server A you should be able to ping the client's IP 172.16.0.6.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Create static route to tunneling network on server B and enable IP forwarding on server A&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Remember that the goal was for the client to access server B on the internal data center network, with IP address 10.10.10.20. For this to happen, I needed to add a static route on server B to the tunneling network 172.16.0.0/24, with server A's IP 10.10.10.10 as the gateway:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;#&amp;nbsp;route add -net 172.16.0.0/24 gw 10.10.10.10&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The final piece of the puzzle is to allow server A to act as a router at this point, by enabling IP forwarding (which is disabled by default). 
So on server A I did:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# sysctl -w net.ipv4.ip_forward=1&lt;/div&gt;&lt;div&gt;# echo "net.ipv4.ip_forward=1" &amp;gt;&amp;gt; /etc/sysctl.conf&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point, I was able to access server B from the client by using server B's 10.10.10.20 IP address.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We've just started to experiment with this setup, so I'm not yet sure if it's production ready. I wanted to jot down these things though because they weren't necessarily obvious, despite some decent blog posts and OpenVPN documentation. Hopefully they'll help somebody else out there too.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-7311029219198682753?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/7311029219198682753/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=7311029219198682753' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/7311029219198682753'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/7311029219198682753'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/07/accessing-data-center-from-cloud-with.html' title='Accessing the data center from the cloud with OpenVPN'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-4874830494914819166</id><published>2011-06-30T11:39:00.000-07:00</published><updated>2011-06-30T11:39:48.774-07:00</updated><title type='text'>A strategy for handling DNS in EC2 with Route 53</title><content type='html'>In my &lt;a href="http://agiletesting.blogspot.com/2011/06/managing-amazon-route-53-dns-with-boto.html"&gt;previous post&lt;/a&gt; I showed how to use the boto library to manage Route 53 DNS zones. Here I will show a strategy for handling DNS within an EC2 infrastructure using Route 53.&lt;br /&gt;&lt;br /&gt;Let's assume you have a registered domain name called mycompanycloud.com. You want all your EC2 instances to use that domain name to communicate with each other. Assume you launch a database instance that you want to refer to as db01.mycompanycloud.com. To do that, you add a CNAME record in the DNS zone for mycompanycloud.com and point it to the external AWS name assigned to that instance. 
For example:&lt;br /&gt;&lt;pre class="code"&gt;&lt;/pre&gt;&lt;pre class="code"&gt;# route53 add_record ZONEID db01.mycompanycloud.com CNAME ec2-51-10-11-89.compute-1.amazonaws.com 3600&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The advantage of this method is that DNS queries for db01.mycompanycloud.com from within EC2 will eventually resolve the CNAME to the internal IP address of the instance, while DNS queries from outside EC2 will resolve it to the external IP address -- which is in general exactly what you want.&lt;br /&gt;&lt;br /&gt;There's one more caveat: if you need the default DNS and search domain in /etc/resolv.conf to be mycompanycloud.com, you need to configure the DHCP client to use that domain, by adding this line to /etc/dhcp3/dhclient.conf:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;supersede domain-name "mycompanycloud.com ec2.internal compute-1.internal" ;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Then edit/overwrite /etc/resolv.conf and specify:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;nameserver 172.16.0.23&lt;br /&gt;domain mycompanycloud.com&lt;br /&gt;search mycompanycloud.com ec2.internal compute-1.internal&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The line in dhclient.conf will ensure that your custom resolv.conf file will be preserved across reboots -- which is not usually the case in EC2 with the default DHCP behavior (thanks to Gerald Chao for pointing out this solution to me).&lt;br /&gt;&lt;br /&gt;Of course, you should have all this in the Chef or Puppet recipes you use when you build out a new instance.&lt;br /&gt;&lt;br /&gt;I've been applying this strategy for a while and it works out really well, and it also allows me to not run and take care of my own BIND servers in EC2.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-4874830494914819166?l=agiletesting.blogspot.com' alt='' 
/&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/4874830494914819166/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=4874830494914819166' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/4874830494914819166'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/4874830494914819166'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/06/strategy-for-handling-dns-in-ec2-with.html' title='A strategy for handling DNS in EC2 with Route 53'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-3744939102352297023</id><published>2011-06-20T18:34:00.000-07:00</published><updated>2011-06-20T18:34:23.968-07:00</updated><title type='text'>Managing Amazon Route 53 DNS with boto</title><content type='html'>Here's a quick post that shows how to manage &lt;a href="http://aws.amazon.com/route53/"&gt;Amazon Route 53&lt;/a&gt; DNS zones and records using the ever-useful &lt;a href="https://github.com/boto/boto"&gt;boto&lt;/a&gt; library from &lt;a href="http://www.elastician.com/"&gt;Mitch Garnaat&lt;/a&gt;. Route 53 is a typical pay-as-you-go inexpensive AWS service which you can use to host your DNS zones. 
I wanted to play with it a bit, and some Google searches revealed two good blog posts: "&lt;a href="http://blog.coredumped.org/2011/02/boto-and-amazon-route53.html"&gt;Boto and Amazon Route53&lt;/a&gt;" by Chris Moyer and "&lt;a href="http://robballou.com/blog/2011/using-boto-to-manage-route-53/"&gt;Using boto to manage Route 53&lt;/a&gt;" by Rob Ballou. I want to thank those two guys for blogging about Route 53; their posts were a great help to me in figuring things out.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install boto&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;My machine is running Ubuntu 10.04 with Python 2.6. I ran 'easy_install boto', which installed&amp;nbsp;boto-2.0rc1. This also installs several utilities in /usr/local/bin; the one of interest to this article is /usr/local/bin/route53, which provides an easy command-line-oriented way of interacting with Route 53.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Create boto configuration file&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I created ~/.boto containing the Credentials section with the AWS access key and secret key:&lt;br /&gt;&lt;pre class="code"&gt;&lt;/pre&gt;&lt;pre class="code"&gt;# cat ~/.boto&lt;br /&gt;[Credentials]&lt;br /&gt;aws_access_key_id = "YOUR_ACCESS_KEY"&lt;br /&gt;aws_secret_access_key = "YOUR_SECRET_KEY"&lt;br /&gt;&lt;/pre&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Interact with Route 53 via the route53 utility&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;If you just run 'route53', the command will print the help text for its usage. 
For our purpose, we'll make sure there are no errors when we run:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;# route53 ls&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If you don't have any DNS zones already created, this will return nothing.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Create a new DNS zone with route53&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;We'll create a zone called 'mytestzone.com':&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;# route53 create mytestzone.com&lt;br /&gt;Pending, please add the following Name Servers:&lt;br /&gt; ns-674.awsdns-20.net&lt;br /&gt; ns-1285.awsdns-32.org&lt;br /&gt; ns-1986.awsdns-56.co.uk&lt;br /&gt; ns-3.awsdns-00.com&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;&lt;i&gt;Note that you will have to properly register 'mytestzone.com' with a registrar, then point the name server information at that registrar to the name servers returned when the Route 53 zone was created (in our case the 4 name servers above).&lt;/i&gt;&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;At this point, if you run '&lt;b&gt;route53 ls&lt;/b&gt;' again, you should see your newly created zone. You need to make note of the zone ID:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;root@m2:~# route53 ls&lt;br /&gt;================================================================================&lt;br /&gt;| ID:   MYZONEID&lt;br /&gt;| Name: mytestzone.com.&lt;br /&gt;| Ref:  my-ref-number&lt;br /&gt;================================================================================&lt;br /&gt;{}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You can also get the existing records from a given zone by running the '&lt;b&gt;route53 get&lt;/b&gt;' command, which takes the zone ID as an argument:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;# route53 get MYZONEID&lt;br /&gt;Name                                   Type  TTL                  Value(s)&lt;br /&gt;mytestzone.com.                        
NS    172800               ns-674.awsdns-20.net.,ns-1285.awsdns-32.org.,ns-1986.awsdns-56.co.uk.,ns-3.awsdns-00.com.&lt;br /&gt;mytestzone.com.                        SOA   900                  ns-674.awsdns-20.net. awsdns-hostmaster.amazon.com. 1 7200 900 1209600 86400&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;Adding and deleting DNS records using route53&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Let's add an A record to the zone we just created. The route53 utility provides an 'add_record' command which takes the zone ID as an argument, followed by the name, type, value and TTL of the new record, and an optional comment. The TTL is also optional, and defaults to 600 seconds if not specified. Here's how to add an A record with a TTL of 3600 seconds:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;/pre&gt;&lt;pre class="code"&gt;# route53 add_record MYZONEID test.mytestzone.com A SOME_IP_ADDRESS 3600&lt;br /&gt;{u'ChangeResourceRecordSetsResponse': {u'ChangeInfo': {u'Status': u'PENDING', u'SubmittedAt': u'2011-06-20T23:01:23.851Z', u'Id': u'/change/CJ2GH5O38HYKP0'}}}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Now if you run '&lt;b&gt;route53 get MYZONEID&lt;/b&gt;' you should see your newly added record.&lt;br /&gt;&lt;br /&gt;To delete a record, use the 'route53 del_record' command, which takes the same arguments as add_record. Here's how to delete the record we just added:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;# route53 del_record MYZONEID test.mytestzone.com. A SOME_IP_ADDRESS&lt;br /&gt;{u'ChangeResourceRecordSetsResponse': {u'ChangeInfo': {u'Status': u'PENDING', u'SubmittedAt': u'2011-06-21T01:14:35.343Z', u'Id': u'/change/C2B0EHROD8HEG8'}}}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;Managing Route 53 programmatically with boto&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;As useful as the route53 command-line utility is, sometimes you need to interact with the Route 53 service from within your program. 
Since this post is about boto, I'll show some Python code that uses the Route 53 functionality.&lt;br /&gt;&lt;br /&gt;Here's how you open a connection to the Route 53 service:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;from boto.route53.connection import Route53Connection&lt;br /&gt;conn = Route53Connection()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(this assumes you have the AWS credentials in the ~/.boto configuration file)&lt;br /&gt;&lt;br /&gt;Here's how you retrieve and walk through all your Route 53 DNS zones, selecting a zone by name:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;ROUTE53_ZONE_NAME = "mytestzone.com."&lt;br /&gt;&lt;br /&gt;zones = {}&lt;br /&gt;conn = Route53Connection()&lt;br /&gt;&lt;br /&gt;results = conn.get_all_hosted_zones()&lt;br /&gt;zones = results['ListHostedZonesResponse']['HostedZones']&lt;br /&gt;found = 0&lt;br /&gt;for zone in zones:&lt;br /&gt;    print zone&lt;br /&gt;    if zone['Name'] == ROUTE53_ZONE_NAME:&lt;br /&gt;        found = 1&lt;br /&gt;        break&lt;br /&gt;if not found:&lt;br /&gt;    print "No Route53 zone found for %s" % ROUTE53_ZONE_NAME&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(note that you need the ending period in the zone name that you're looking for, as in "mytestzone.com.")&lt;br /&gt;&lt;br /&gt;Here's how you add a CNAME record with a TTL of 60 seconds to an existing zone (assuming the 'zone' variable contains the zone you're looking for). 
You need to operate on the zone ID, which is the identifier following the text '/hostedzone/' in the 'Id' field of the variable 'zone'.&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;from boto.route53.record import ResourceRecordSets&lt;br /&gt;zone_id = zone['Id'].replace('/hostedzone/', '')&lt;br /&gt;changes = ResourceRecordSets(conn, zone_id)&lt;br /&gt;change = changes.add_change("CREATE", 'test2.%s' % ROUTE53_ZONE_NAME, "CNAME", 60)&lt;br /&gt;change.add_value("some_other_name")&lt;br /&gt;changes.commit()&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;To delete a record, you use the exact same code as above, but with "DELETE" instead of "CREATE".&lt;br /&gt;&lt;br /&gt;I leave other uses of the 'route53' utility and of the boto Route 53 API as an exercise to the reader.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-3744939102352297023?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/3744939102352297023/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=3744939102352297023' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3744939102352297023'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3744939102352297023'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/06/managing-amazon-route-53-dns-with-boto.html' title='Managing Amazon Route 53 DNS with boto'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-5083720582082812441</id><published>2011-06-01T10:03:00.000-07:00</published><updated>2011-06-01T13:15:32.571-07:00</updated><title type='text'>Technical books that influenced my career</title><content type='html'>Here's a list of 25 technical books that had a strong influence on my career, presented in a somewhat chronological order of my encounters with them:&lt;br /&gt;&lt;br /&gt;&lt;ol&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Computer-Programming-Volumes-1-4A-Boxed/dp/0321751043"&gt;The Art of Computer Programming&lt;/a&gt;", esp. vol. 3 "&lt;a href="http://www.amazon.com/Art-Computer-Programming-Sorting-Searching/dp/0201896850/"&gt;Sorting and Searching&lt;/a&gt;" - Donald Knuth&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Operating-Systems-Internals-Design-Principles/dp/013230998X"&gt;Operating Systems&lt;/a&gt;" - William Stallings&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Introduction-Algorithms-Thomas-H-Cormen/dp/0262033844/"&gt;Introduction to Algorithms&lt;/a&gt;" - Thomas Cormen et al.&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Programming-Language-2nd-Brian-Kernighan/dp/0131103628"&gt;The C Programming Language&lt;/a&gt;" - Brian Kernighan and Dennis Ritchie&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Programming-Windows%C2%AE-Fifth-Microsoft/dp/157231995X/"&gt;Programming Windows&lt;/a&gt;" - Charles Petzold&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Writing-Solid-Code-Microsoft-Programming/dp/1556155514"&gt;Writing Solid Code&lt;/a&gt;" - Steve Maguire&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Practice-Programming-Brian-W-Kernighan/dp/020161586X/"&gt;The Practice of Programming&lt;/a&gt;" - Brian Kernighan and Rob Pike&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Computer-Networks-Fifth-Approach-Networking/dp/0123850592/"&gt;Computer 
Networks - a Systems Approach&lt;/a&gt;" - Larry Peterson and Bruce Davie&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/TCP-Illustrated-Vol-Addison-Wesley-Professional/dp/0201633469/"&gt;TCP/IP Illustrated&lt;/a&gt;" - W. Richard Stevens&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Distributed-Systems-Concepts-Design-4th/dp/0321263545/"&gt;Distributed Systems - Concepts And Design&lt;/a&gt;" - George Coulouris et al.&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/DNS-BIND-5th-Cricket-Liu/dp/0596100574/"&gt;DNS and BIND&lt;/a&gt;" - Cricket Liu and Paul Albitz&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/UNIX-Linux-System-Administration-Handbook/dp/0131480057/"&gt;UNIX and Linux System Administration Handbook&lt;/a&gt;" - Evi Nemeth et al.&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Mythical-Man-Month-Software-Engineering-Anniversary/dp/0201835959/"&gt;The Mythical Man-Month&lt;/a&gt;" - Fred Brooks&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Programming-Perl-3rd-Larry-Wall/dp/0596000278"&gt;Programming Perl&lt;/a&gt;" - Larry Wall et al.&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Counter-Hack-Reloaded-Step---Step/dp/0131481045/"&gt;Counter Hack Reloaded: a Step-by-Step Guide to Computer Attacks and Effective Defenses&lt;/a&gt;" - Edward Skoudis and Tom Liston&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Programming-Python-Mark-Lutz/dp/0596158106"&gt;Programming Python&lt;/a&gt;" - Mark Lutz&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Lessons-Learned-Software-Testing-Kaner/dp/0471081124/"&gt;Lessons Learned in Software Testing&lt;/a&gt;" - Cem Kaner, James Bach, Bret Pettichord&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Refactoring-Improving-Design-Existing-Code/dp/0201485672/"&gt;Refactoring - Improving the Design of Existing Code&lt;/a&gt;" - Martin Fowler&lt;/li&gt;&lt;li&gt;"&lt;a 
href="http://www.amazon.com/Pragmatic-Programmer-Journeyman-Master/dp/020161622X/"&gt;The Pragmatic Programmer&lt;/a&gt;" - Andrew Hunt and David Thomas&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Becoming-Technical-Leader-Problem-Solving-Approach/dp/0932633021"&gt;Becoming a Technical Leader&lt;/a&gt;" - Gerald Weinberg&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Extreme-Programming-Explained-Embrace-Change/dp/0201616416/"&gt;Extreme Programming Explained&lt;/a&gt;" - Kent Beck&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Programming-Amazon-Web-Services-SimpleDB/dp/0596515812/"&gt;Programming Amazon Web Services&lt;/a&gt;" - James Murty&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Building-Scalable-Web-Sites-Applications/dp/0596102356/"&gt;Building Scalable Web Sites&lt;/a&gt;" - Cal Henderson&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Restful-Web-Services-Leonard-Richardson/dp/0596529260"&gt;RESTful Web Services&lt;/a&gt;" - Leonard Richardson, Sam Ruby&lt;/li&gt;&lt;li&gt;"&lt;a href="http://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/0596518579/"&gt;The Art of Capacity Planning&lt;/a&gt;" - John Allspaw&lt;/li&gt;&lt;/ol&gt;&lt;div&gt;What is your list?&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-5083720582082812441?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/5083720582082812441/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=5083720582082812441' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5083720582082812441'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9238405/posts/default/5083720582082812441'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/06/technical-books-that-influenced-my.html' title='Technical books that influenced my career'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2732183998365137373</id><published>2011-05-24T11:49:00.000-07:00</published><updated>2011-05-24T12:57:31.500-07:00</updated><title type='text'>Setting up RAID 0 across ephemeral drives on EC2 instances (and surviving reboots!)</title><content type='html'>I've been experimenting with setting up RAID 0 across ephemeral drives on EC2 instances. The initial setup, be it with mdadm and lvm, or directly with lvm, is not that hard -- what has proven challenging is surviving reboots. Unless you perform certain tricks, your EC2 instance will be blissfully unaware of its new setup after a reboot. What's more, if you try to mount the new striped volume at boot time by adding it to /etc/fstab, chances are you won't even be able to ssh into the instance anymore. It happened to me many times while experimenting, hence this blog post.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Update&lt;/b&gt;: &lt;i&gt;I realize I didn't go into details about the use case of this type of setup. This is useful if you don't want to incur EBS performance and reliability penalties, and yet you have a data set that is larger than the 400 GB offered by an individual ephemeral drive. Of course, if your instance dies, so do the ephemeral drives (after all they are named like this for a reason...) 
-- so make sure you have a good backup/disaster recovery strategy for the data you store there!&lt;/i&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In the following, I will assume you want to set up RAID 0 across the four ephemeral drives that come with an EC2 m1.xlarge instance, and which are exposed as devices /dev/sdb through /dev/sde. By default, /dev/sdb is mounted as /mnt, while the other drives aren't mounted.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;I also assume you want to create 1 volume group encompassing the RAID 0 array, and within that volume group you want to create 2 logical volumes with associated XFS file systems, and also 1 logical volume for swap.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;b&gt;Step 1 - unmount /dev/sdb&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;# umount /dev/sdb&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(also comment out the entry corresponding to /dev/sdb in /etc/fstab)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 2 - install lvm2 and mdadm&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For an unattended install of these packages (slightly complicated by the fact that mdadm also needs postfix), I do:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# DEBIAN_FRONTEND=noninteractive apt-get -y install mdadm lvm2&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 3 - manually load the dm-mod module&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# modprobe dm-mod&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(this seems to be a &lt;a href="https://bugs.launchpad.net/ubuntu/+source/devmapper/+bug/106696"&gt;bug&lt;/a&gt; in devmapper in Ubuntu)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you want to set up RAID 0 via lvm directly, you can skip steps 4 and 5. From what I've read, you get better performance if you do the RAID 0 setup with mdadm. 
Also, if you need any other RAID level, you need to use mdadm.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 4 - configure RAID 0 array via mdadm&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;#&amp;nbsp;mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Verify:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# mdadm --detail /dev/md0&lt;/div&gt;&lt;div&gt;/dev/md0:&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;Version : 00.90&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Creation Time : Mon May 23 22:35:20 2011&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; Raid Level : raid0&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; Array Size : 1761463296 (1679.86 GiB 1803.74 GB)&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; Raid Devices : 4&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Total Devices : 4&lt;/div&gt;&lt;div&gt;Preferred Minor : 0&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Persistence : Superblock is persistent&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Update Time : Mon May 23 22:35:20 2011&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;State : clean&lt;/div&gt;&lt;div&gt;&amp;nbsp;Active Devices : 4&lt;/div&gt;&lt;div&gt;Working Devices : 4&lt;/div&gt;&lt;div&gt;&amp;nbsp;Failed Devices : 0&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Spare Devices : 0&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; Chunk Size : 256K&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; UUID : 03f63ee3:607fb777:f9441841:42247c4d (local to host adb08lvm)&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; Events : 0.1&lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;Number &amp;nbsp; Major &amp;nbsp; Minor &amp;nbsp; RaidDevice State&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; 0 &amp;nbsp; &amp;nbsp; &amp;nbsp; 8 &amp;nbsp; &amp;nbsp; &amp;nbsp; 16 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;0 &amp;nbsp; &amp;nbsp; &amp;nbsp;active sync &amp;nbsp; /dev/sdb&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; 1 &amp;nbsp; &amp;nbsp; &amp;nbsp; 8 &amp;nbsp; &amp;nbsp; &amp;nbsp; 32 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;1 &amp;nbsp; &amp;nbsp; &amp;nbsp;active sync &amp;nbsp; /dev/sdc&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; 2 &amp;nbsp; &amp;nbsp; &amp;nbsp; 8 &amp;nbsp; &amp;nbsp; &amp;nbsp; 48 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;2 &amp;nbsp; &amp;nbsp; &amp;nbsp;active sync &amp;nbsp; /dev/sdd&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; 3 &amp;nbsp; &amp;nbsp; &amp;nbsp; 8 &amp;nbsp; &amp;nbsp; &amp;nbsp; 64 &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;3 &amp;nbsp; &amp;nbsp; &amp;nbsp;active sync &amp;nbsp; /dev/sde&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 5 - increase the read-ahead on the array for better performance&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# blockdev --setra 65536 /dev/md0&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(--setra sets the read-ahead, not the block size, and takes a count of 512-byte sectors, so 65536 corresponds to 32 MB of read-ahead)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 6 - create physical volume from the RAID 0 array&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# pvcreate /dev/md0&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;(if you didn't want to use mdadm, you would call pvcreate against each of the /dev/sdb through /dev/sde devices)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 7 - create volume group called vg0 spanning the RAID 0 array&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# vgcreate vg0 /dev/md0&lt;/div&gt;&lt;div&gt;&lt;br 
/&gt;&lt;/div&gt;&lt;div&gt;(if you didn't want to use mdadm, you would run vgcreate and specify the 4 devices /dev/sdb through /dev/sde)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Verify:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# vgscan&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Reading all physical volumes. &amp;nbsp;This may take a while...&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Found volume group "vg0" using metadata type lvm2&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# pvscan&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;PV /dev/md0 &amp;nbsp; VG vg0 &amp;nbsp; lvm2 [1.64 TiB / 679.86 GiB free]&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Total: 1 [1.64 TiB] / in use: 1 [1.64 TiB] / in no VG: 0 [0 &amp;nbsp; ]&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 8 - create 3 logical volumes within the vg0 volume group&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Each local drive is 400 GB, so the total size for the volume group is 1.6 TB. 
I'll create 2 logical volumes at 500 GB each, and a 10 GB logical volume for swap.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;#&amp;nbsp;lvcreate --name data1 --size 500G vg0&lt;/div&gt;&lt;div&gt;#&amp;nbsp;lvcreate --name data2 --size 500G vg0&lt;/div&gt;&lt;div&gt;#&amp;nbsp;lvcreate --name swap --size 10G vg0&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Verify:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# lvscan&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;ACTIVE &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;'/dev/vg0/data1' [500.00 GiB] inherit&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;ACTIVE &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;'/dev/vg0/data2' [500.00 GiB] inherit&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;ACTIVE &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;'/dev/vg0/swap' [10.00 GiB] inherit&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 9 - create XFS file systems and mount them&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We'll create XFS file systems for the data1 and data2 logical volumes. The names of the devices used for mkfs are the ones displayed via the lvscan command above. 
Then we'll mount the 2 file systems as /data1 and /data2.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# mkfs.xfs /dev/vg0/data1&lt;/div&gt;&lt;div&gt;# mkfs.xfs /dev/vg0/data2&lt;/div&gt;&lt;div&gt;# mkdir /data1&lt;/div&gt;&lt;div&gt;# mkdir /data2&lt;/div&gt;&lt;div&gt;# mount -t xfs -o noatime /dev/vg0/data1 /data1&lt;/div&gt;&lt;div&gt;# mount -t xfs -o noatime /dev/vg0/data2 /data2&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Step 10 - create and enable swap partition&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;# mkswap /dev/vg0/swap&lt;/div&gt;&lt;div&gt;# swapon /dev/vg0/swap&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point, you should have a fully functional setup. The slight problem is that if you add the newly created file systems to /etc/fstab and reboot, you may not be able to ssh back into your instance -- at least that's what happened to me. I was able to ping the IP of the instance, but ssh would fail.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I finally redid the whole thing on a new instance (I created the RAID 0 directly with lvm, bypassing the mdadm step), but didn't add the file systems to /etc/fstab. 
After rebooting and running lvscan, I noticed that the logical volumes I had created were all marked as 'inactive':&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# lvscan&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;inactive &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;'/dev/vg0/data1' [500.00 GiB] inherit&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;inactive &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;'/dev/vg0/data2' [500.00 GiB] inherit&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;inactive &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;'/dev/vg0/swap' [10.00 GiB] inherit&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;This was after I ran 'modprobe dm-mod' manually, otherwise the lvscan command would complain:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;/proc/misc: No entry for device-mapper found&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Is device-mapper driver missing from kernel?&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;Failure to communicate with kernel device-mapper driver.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A Google search revealed &lt;a href="http://osdir.com/ml/ec2ubuntu/2010-12/msg00022.html"&gt;this thread&lt;/a&gt; which offered a solution: run 'lvchange -ay' against each logical volume so that the volume becomes active. 
Only after doing this was I able to see the logical volumes and mount them.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So I added these lines to /etc/rc.local:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;/sbin/modprobe dm-mod&lt;/div&gt;&lt;div&gt;/sbin/lvscan&lt;/div&gt;&lt;div&gt;/sbin/lvchange -ay /dev/vg0/data1&lt;/div&gt;&lt;div&gt;/sbin/lvchange -ay /dev/vg0/data2&lt;/div&gt;&lt;div&gt;/sbin/lvchange -ay /dev/vg0/swap&lt;/div&gt;&lt;div&gt;/bin/mount -t xfs -o noatime /dev/vg0/data1 /data1&lt;/div&gt;&lt;div&gt;&lt;div&gt;/bin/mount -t xfs -o noatime /dev/vg0/data2 /data2&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;/sbin/swapon /dev/vg0/swap&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After a reboot, everything was working as expected. Note that I mount the file systems and enable the swap from the rc.local script, and not via /etc/fstab. If you try to do it in fstab, it is too early in the boot sequence, so the logical volumes will be inactive and the mount will fail, with the dire consequence that you won't be able to ssh back into your instance (at least in my case).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This was still not enough when creating the RAID 0 array with mdadm. When I used mdadm, even after adding the lines above to /etc/rc.local, the /dev/md0 device was not there after the reboot, so the mount would still fail. The thread I mentioned above does discuss this case at some point, and I also found a Server Fault &lt;a href="http://serverfault.com/questions/141991/ubuntu-software-raid-0-on-aws-does-not-survive-reboot"&gt;thread&lt;/a&gt; on this topic. 
The solution in my case was to modify the mdadm configuration file /etc/mdadm/mdadm.conf and:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;a) change the DEVICE variable to point to my 4 devices:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;DEVICE /dev/sdb /dev/sdc /dev/sdd /dev/sde&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;b) add an ARRAY variable containing the UUID of /dev/md0 (which you can get via 'mdadm --detail /dev/md0'):&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;ARRAY /dev/md0 level=raid0 num-devices=4 UUID=03f63ee3:607fb777:f9441841:42247c4d&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This change, together with the custom lines in /etc/rc.local, finally enabled me to have a functional RAID 0 array and functional file systems and swap across the ephemeral drives in my EC2 instance.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;I hope this will be useful to somebody out there and will avoid some head-against-the-wall moments that I had to go through....&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2732183998365137373?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2732183998365137373/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2732183998365137373' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2732183998365137373'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2732183998365137373'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/05/setting-up-raid-0-across-ephemeral.html' title='Setting up RAID 0 across ephemeral 
drives on EC2 instances (and surviving reboots!)'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1847226596619393629</id><published>2011-05-09T15:05:00.000-07:00</published><updated>2011-05-09T15:05:05.592-07:00</updated><title type='text'>Managing infrastructures in the cloud, with lessons learned the hard way</title><content type='html'>Here is a collection of blog posts I wrote over the last 3 years or so. Some of them are practical step-by-step tutorials on using various tools for managing cloud instances, while others talk about lessons learned the hard way, by deploying large-scale infrastructures in the cloud. I am aggregating them here for ease of future reference:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Lessons learned&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2009/04/experiences-deploying-large-scale.html"&gt;Experiences deploying a large-scale infrastructure in Amazon EC2&lt;/a&gt;&amp;nbsp;(April 2009)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2011/04/lessons-learned-from-deploying.html"&gt;Lessons learned from deploying a production database in EC2&lt;/a&gt;&amp;nbsp;(April 2011)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2009/07/dark-launching-and-other-lessons-from.html"&gt;Dark launching and other lessons from Facebook on massive deployments&lt;/a&gt;&amp;nbsp;(July 2009)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2009/02/youre-not-cloud-provider-if-you-dont.html"&gt;You're not a cloud provider if you don't provide an API&lt;/a&gt;&amp;nbsp;(February 2009)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;b&gt;Working with EC2-specific 
tools&lt;/b&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2008/09/experiences-with-amazon-ec2-and-ebs.html"&gt;Experiences with Amazon EC2 and EBS&lt;/a&gt;&amp;nbsp;(September 2008)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2008/10/update-on-ec2-and-ebs.html"&gt;Update on EC2 and EBS&lt;/a&gt; (October 2008)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2008/12/deploying-ec2-instances-from-command.html"&gt;Deploying EC2 instances from the command line&lt;/a&gt; (December 2008)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2008/12/working-with-amazon-ec2-regions.html"&gt;Working with Amazon EC2 regions&lt;/a&gt; (December 2008)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;b&gt;Load balancing (ELB and HAProxy)&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2011/01/using-aws-elastic-load-balancing-with.html"&gt;Using AWS Elastic Load Balancing with a password-protected site&lt;/a&gt;&amp;nbsp;(January 2011)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2009/02/load-balancing-in-amazon-ec2-with.html"&gt;Load balancing in Amazon EC2 with HAProxy&lt;/a&gt; (February 2009)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2009/03/haproxy-x-forwarded-for-geoip-keepalive.html"&gt;HAProxy, X-Forwarded-For, GeoIP, KeepAlive&lt;/a&gt; (March 2009)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2009/03/haproxy-and-apache-performance-tuning.html"&gt;HAProxy and Apache performance tuning tips&lt;/a&gt; (March 2009)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;div&gt;&lt;div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;"&gt;&lt;b&gt;Working in the multi-cloud with libcloud&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2010/12/using-libcloud-to-manage-instances.html"&gt;Using libcloud to manage instances 
across multiple cloud providers&lt;/a&gt;&amp;nbsp;(December 2010)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2011/01/libcloud-042-and-ssl.html"&gt;libcloud 0.4.2 and SSL&lt;/a&gt;&amp;nbsp;(January 2011)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2011/01/passing-user-data-to-ec2-ubuntu.html"&gt;Passing user data to EC2 Ubuntu instances with libcloud&lt;/a&gt;&amp;nbsp;(January 2011)&lt;/li&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2011/03/working-in-multi-cloud-with-libcloud.html"&gt;Slides for 'Working in the multi-cloud with libcloud' presentation&lt;/a&gt;&amp;nbsp;(March 2011)&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;b&gt;Backups to S3&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2008/05/incremental-backups-to-amazon-s3.html"&gt;Incremental encrypted backup to Amazon S3 using duplicity&lt;/a&gt; (May 2008)&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Mail server setup in EC2&lt;/b&gt;&lt;/div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2010/05/setting-up-mail-server-in-ec2-with.html"&gt;Setting up a mail server in EC2 with postfix and Postini&lt;/a&gt;&amp;nbsp;(May 2010)&lt;/li&gt;&lt;/ul&gt;&lt;b&gt;Rackspace CloudFiles&lt;/b&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://agiletesting.blogspot.com/2010/09/managing-rackspace-cloudfiles-with.html"&gt;Managing Rackspace CloudFiles with python-cloudfiles&lt;/a&gt; (September 2010)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1847226596619393629?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1847226596619393629/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1847226596619393629' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1847226596619393629'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1847226596619393629'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/05/managing-infrastructures-in-cloud-with.html' title='Managing infrastructures in the cloud, with lessons learned the hard way'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-7810135880572895796</id><published>2011-05-06T11:50:00.000-07:00</published><updated>2011-05-06T11:57:52.583-07:00</updated><title type='text'>Upgrading the GD library in Ubuntu</title><content type='html'>We needed to use &lt;a href="http://finnrudolph.de/ImageFlow/Installation"&gt;ImageFlow&lt;/a&gt; for some internal testing of image manipulations (esp. reflections). With a stock php5/libgd2 install in Ubuntu 10.04, some calls to the ImageFlow library would fail with:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;"GD library is too old. 
Version 2.0.1 or later is required, and 2.0.28 is strongly recommended."&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;The libraries installed by Ubuntu were:&lt;br /&gt;&lt;pre class="code"&gt;$ dpkg -l | grep libgd2&lt;br /&gt;rc  libgd2-noxpm                               2.0.36~rc1~dfsg-3ubuntu1.9.04.1         GD Graphics Library version 2 (without XPM s&lt;br /&gt;ii  libgd2-xpm                                 2.0.36~rc1~dfsg-3ubuntu1.9.04.1         GD Graphics Library version 2&lt;br /&gt;$ dpkg -l | grep php5-gd&lt;br /&gt;ii  php5-gd                                    5.2.6.dfsg.1-3ubuntu4.6                 GD module for php5&lt;br /&gt;&lt;/pre&gt;The issue here is that Ubuntu does not use the version of GD which is bundled with PHP. See this &lt;a href="https://bugs.launchpad.net/ubuntu/+source/php5/+bug/74647"&gt;discussion&lt;/a&gt; for more details.&lt;br /&gt;&lt;br /&gt;So...some googling around later, I stumbled on this great howtoforge post by patusovniak on "&lt;a href="http://www.howtoforge.com/recompiling-php5-with-bundled-support-for-gd-on-ubuntu"&gt;Recompiling PHP5 with bundled support for GD in Ubuntu&lt;/a&gt;". It also serves as a good overview of building Ubuntu packages from source. The only observation I have is that after I ran the step&lt;pre class="code"&gt;dpkg-buildpackage -rfakeroot&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;I had to install all .deb packages in /usr/src. 
So I did&lt;br /&gt;&lt;pre class="code"&gt;cd /usr/src&lt;br /&gt;dpkg -i *.deb&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;When running phpinfo(), the GD section now looks like this:&lt;br /&gt;&lt;pre class="code"&gt;gd&lt;br /&gt;&lt;br /&gt;GD Support enabled&lt;br /&gt;GD Version bundled (2.0.34 compatible)&lt;br /&gt;FreeType Support enabled&lt;br /&gt;FreeType Linkage with freetype&lt;br /&gt;FreeType Version 2.3.11&lt;br /&gt;T1Lib Support enabled&lt;br /&gt;GIF Read Support enabled&lt;br /&gt;GIF Create Support enabled&lt;br /&gt;JPEG Support enabled&lt;br /&gt;libJPEG Version 6b&lt;br /&gt;PNG Support enabled&lt;br /&gt;libPNG Version 1.2.42&lt;br /&gt;WBMP Support enabled&lt;br /&gt;XPM Support enabled&lt;br /&gt;XBM Support enabled&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Hopefully this will be useful to someone out there desperately trying to use a newer version of GD with PHP in Ubuntu...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-7810135880572895796?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/7810135880572895796/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=7810135880572895796' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/7810135880572895796'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/7810135880572895796'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/05/upgrading-gd-library-in-ubuntu.html' title='Upgrading the GD library in Ubuntu'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1835585678169065935</id><published>2011-04-27T12:30:00.000-07:00</published><updated>2011-04-27T19:47:44.815-07:00</updated><title type='text'>Lessons learned from deploying a production database in EC2</title><content type='html'>In light of the &lt;a href="http://highscalability.com/blog/2011/4/25/the-big-list-of-articles-on-the-amazon-outage.html"&gt;Big EC2 Outage of 2011&lt;/a&gt;, I thought I'd offer some perspective on my experiences in deploying and maintaining a production database in EC2. I find it amusing to read some blog posts (esp. the one from SmugMug) where people brag about how they never went down during the EC2 outage, while saying in the same breath that their database infrastructure was hosted somewhere else...duh!&lt;br /&gt;&lt;br /&gt;I will give a short history of why we (Evite) ended up hosting our database infrastructure in EC2. After all, this is &lt;b&gt;not&lt;/b&gt; how people start using the cloud, since it's much easier to deploy web or app servers in the cloud. I will also highlight some lessons learned along the way.&lt;br /&gt;&lt;br /&gt;In the summer of 2009 we decided to follow the &lt;a href="http://bret.appspot.com/entry/how-friendfeed-uses-mysql"&gt;example of FriendFeed&lt;/a&gt; and store our data in an almost schema-less fashion, but still use MySQL. At the time, NoSQL products such as Cassandra, Riak, etc were still very much untested at large scale, so we thought we'd use something we're familiar with. We designed our database layer from the get go to be horizontally scalable by sharding at the application layer. 
We store our data in what we call 'buckets', which are MySQL tables with an ID/key and a blob of JSON data corresponding to that ID, plus a few other date/time-related columns for storing the creation and update timestamps for the JSON blob. We started with 1,024 such buckets spread across 8 MySQL instances, so 128 buckets per instance. The number 8 seemed like a good compromise between capacity and cost, and we also did some initial load testing against one server to confirm this number.&lt;br /&gt;&lt;br /&gt;We initially rolled out the database infrastructure on 8 Dell PE2970s, each with 16 GB of RAM and 2 quad-core CPUs. Each server ran 2 MySQL instances, for a total of 16, of which 8 were active at any time, and the other 8 were passive -- each of the active MySQL instances was in a master-master pair with a passive instance running on a different server. This was done so that if any server went down, we still had 8 active MySQL instances in the mix. We had HAProxy load balancing across each pair of active/passive instances, sending all traffic to the active one, unless it went down, at which point traffic would be sent automatically to the passive one (I blogged about this setup and its caveats &lt;a href="http://agiletesting.blogspot.com/2010/10/mysql-load-balancing-with-haproxy.html"&gt;here&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;As for the version of MySQL, we ran 5.1.37, which was pretty new at the time.&lt;br /&gt;&lt;br /&gt;At this point, we did a lot of load testing using &lt;a href="http://browsermob.com/website-load-testing"&gt;BrowserMob&lt;/a&gt;, which allowed us to exercise our application in an &lt;b&gt;end-to-end&lt;/b&gt; fashion, in the same way a regular user would. 
All load tests pointed to the fact that we had indeed sufficient firepower at our disposal for the DB layer.&lt;br /&gt;&lt;br /&gt;Two important things to note here:&lt;br /&gt;&lt;br /&gt;1) We ran the load test against empty databases;&lt;br /&gt;2) We couldn't do a proper '&lt;a href="http://www.facebook.com/note.php?note_id=96390263919"&gt;dark launching&lt;/a&gt;' for a variety of reasons, the main one being that the 'legacy' code we were replacing was in a state where nobody dared to touch it -- so we couldn't send existing production traffic to our new DB infrastructure;&lt;br /&gt;&lt;br /&gt;We deployed this infrastructure in production in May/June 2009, and it performed well for a few months. At some point, in late September 2009, and with our highest traffic of the year expected to start before Halloween, we started to see a performance degradation. The plain vanilla version of MySQL we used didn't seem to exercise the CPU cores uniformly, and CPU wait time was also increasing.&lt;br /&gt;&lt;br /&gt;I should also point out here that our application is *very* write-intensive, so the fact that we had 2 MySQL instances per server, both in a master-master setup with another 2 instances running on a different server, started to tax more and more the CPU and RAM resources of each server. In particular, because each server had only 16 GB RAM, the innodb_buffer_pool_size (set initially at 4 GB for each of the 2 MySQL instances) was becoming insufficient, due also to the constant increase of our database size. It also turned out we were updating the JSON blobs too frequently and in some cases unnecessarily, thus causing even more I/O.&lt;br /&gt;&lt;br /&gt;At this point, we had a choice of either expanding our hardware at the data center, or shooting for 'infinite' horizontal scale by deploying in EC2. We didn't want to wait 2-3 weeks for the former to happen, so we decided to go into the cloud. 
We also took the opportunity to do the following:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;we replaced vanilla MySQL with the Percona XtraDB distribution, which includes a multitude of patches that improve the performance of MySQL especially on multi-core servers&lt;/li&gt;&lt;li&gt;we engaged Percona consultants to audit our MySQL setup and recommend improvements, especially in the I/O area&lt;/li&gt;&lt;li&gt;we 'flattened' our MySQL server farm by deploying 16 MySQL masters (each an m1.xlarge in EC2) backed by 16 MySQL slaves (each an m1.large in EC2); we moved away from master-master to a simpler master-slave, because the complexities and the potential subtle issues of the master-master setup were not worth the hassle (in short, we have seen at least one case where the active master was overloaded, so it stopped responding to the HAProxy health checks; this caused HAProxy to fail over to the passive master, which wasn't fully caught up replication-wise with the active one; this caused a lot of grief to us)&lt;/li&gt;&lt;li&gt;we eliminated the unnecessary JSON blob updates, which tremendously reduced the writes to our database&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Both moving to the Percona distribution and engaging the Percona experts turned out to be really beneficial to us. 
Here are just some of the recommendations from Percona that we applied on the master DBs:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;increase innodb_buffer_pool_size to 8 GB&lt;/li&gt;&lt;li&gt;store different MySQL data file types on different EBS volumes; we set apart 1 EBS volume for each of these types of files:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;data files&lt;/li&gt;&lt;li&gt;innodb transaction logs&lt;/li&gt;&lt;li&gt;binlogs&lt;/li&gt;&lt;li&gt;temporary files (we actually have 3 EBS volumes for 3 temp directories that we specify in my.cnf)&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div&gt;These recommendations, plus the fact that we were only running one MySQL instance per server, plus the reduction of unnecessary blob updates, gave us a #winning formula at the time. We were able to sustain what is for us the highest traffic of the year, the week after Thanksgiving. But....not all was rosy. On the Tuesday of that week we lost one DB master due to the fact that the EBS volume corresponding to the MySQL data directory went AWOL, causing the CPU to get pegged at 100% I/O wait. We had to fail over to the slave, and rebuild another master from scratch. Same thing happened again that Thursday. We thought it was an unfortunate coincidence at the time, but knowing what we know now, I believe we melted those EBS volumes with our writes. Apologies to the other EC2 customers sharing those volumes with us...&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Ever since the move to EC2 we've been relatively happy with the setup, with the exception of fairly frequent EBS issues. The main symptom of such an EBS issue is I/O wait pegged at &amp;gt; 90% for that specific server, which triggers elevated errors across our application server pool. 
The usual MO for us is to give that server 15-30 minutes to recover, then if it doesn't, to go ahead and fail over to the slave.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One good thing is that we got to be *really* good at this failover procedure. My colleague Marco Garcia and I can do the following real quick even if you wake us up at 1 AM (like the EC2 outage did last week):&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;fail over the DB master1 to a slave (call it slave1)&lt;/li&gt;&lt;li&gt;launch another m1.xlarge EC2 instance to act as a new master (call it master2); this instance is automatically set up via Chef&lt;/li&gt;&lt;li&gt;take an xtrabackup of slave1 to another EBS volume (I described this in more detail &lt;a href="http://agiletesting.blogspot.com/2010/09/mysql-innodb-hot-backups-and-restores.html"&gt;here&lt;/a&gt;)&lt;/li&gt;&lt;li&gt;take a snapshot of the EBS volume, then create another volume out of the snapshot, in the zone of the master2&lt;/li&gt;&lt;li&gt;restore the xtrabackup files from the new EBS volume into master2&lt;/li&gt;&lt;li&gt;configure master2 as a slave to slave1, let replication catch up&lt;/li&gt;&lt;li&gt;at an appropriate time, switch the application from slave1 to master2 &lt;/li&gt;&lt;li&gt;configure slave1 back as a slave to master2&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;When I say 'real quick', I have to qualify it -- we have to wait quite a bit for the xtrabackup to happen, then for the backup files to be transferred over to the new master, either via EBS snapshot or via scp. That's where most of the time goes in this disaster recovery procedure -- think 'hours'. &lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;You may question our use of EBS volumes. Because we wanted to split the various MySQL file types across multiple disks, and because we wanted to make sure we have enough disk capacity, we couldn't just use ephemeral disks. 
Note that we did also try to stripe multiple EBS volumes into a RAID 0 array, especially for the MySQL datadir, but we didn't notice a marked performance improvement, while the overall reliability of the array was still tied to the least performing of the volumes in the stripe. Not #winning.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We've been quite patient with this setup, even with the almost constant need to babysit or account for flaky EBS volumes, until the EC2 outage of last week. We thought we were protected against massive EC2 failures because each MySQL master had its slave in a different EC2 availability zone -- however our mistake was that all of the zones were within the same region, US East. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;During the first few hours of the outage, all of our masters and slaves in zone us-east-1a got frozen. The first symptom was that all existing queries within MySQL would not complete and would just hang there, since they couldn't write to disk. Then things got worse and we couldn't even connect to MySQL. So we failed over all masters to their corresponding slaves. This was fine until mid-day on the 1st day of the outage, when we had another master fail, this time in zone us-east-1b. To compound the issue, that master happened to have the slave in us-east-1a, so we were hosed at that point.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;It was time for plan B, which was to launch a replacement master in another region (we chose US West) and yet another server in another cloud (we chose Rackspace), then to load the database from backups. We take full mysqldump backups of all our databases every 8 hours, and incrementals (which in our case is the data from the last 24 hours) every hour. We save those to S3 and to Rackspace CloudFiles. So at least there we were well equipped to do a restore. 
We also had the advantage of having deployed a slave in Rackspace via &lt;a href="https://github.com/tobami/littlechef"&gt;LittleChef&lt;/a&gt;, so we had all that setup (we couldn't use our regular Chef server setup in EC2 at the time). However, while we were busy recovering that server, we got lucky and the server that misbehaved in us-east-1b came back online, so we were able to put it back into the production pool. We did take a maintenance window while this was happening for around 2 hours, but that was the only downtime we had during the whole EC2 outage. Not bad when everything is said and done.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;One important thing to note is that even though we were up and running, we had largely lost our redundancy -- we either lost masters, so we failed over to slaves, or we lost slaves. In each case, we had only 1 MySQL server to rely on, which didn't give us a warm and fuzzy feeling. So we spent most of last week &lt;b&gt;rebuilding our redundancy&lt;/b&gt;. BTW, this is something that I haven't seen emphasized enough in the blog posts about the EC2 outage. Many people bragged about how they never went down, but they never mentioned the fact that they needed to spend a long time rebuilding their redundancy. This is probably because they never did, instead banking on Amazon to recover the affected zone.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point, we are shooting for moving our database server pool back in the data center, this time on much beefier hardware. We are hoping to consolidate the 16 masters that we currently have on 4 Dell C2100s maxed out with CPUs, RAM and disks, with 4 slaves that we will deploy at a different data center. The proper sizing of the new DB pool is to be determined though at this point. We plan on starting with one Dell C2100 which will replace one of the existing masters, then start consolidating more masters, all while production traffic is hitting it. 
Another type of dark launching if you will -- because &lt;b&gt;there's nothing like production&lt;/b&gt;!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I still think going into EC2 wasn't such a bad idea, because it allowed us to observe our data access patterns and how they affect MySQL. The fact that we were horizontally scalable from day one gave us the advantage of being able to launch new instances and add capacity that way if needed. At this point, we could choose to double our database server count in EC2, but this means double the headaches in babysitting those EBS volumes....so we decided against it. We are able though to take everything we learned in EC2 during the past 6 months and easily deploy anywhere else.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;OK, so now a recap with some of the lessons learned:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;b&gt;Do dark launches whenever possible&lt;/b&gt; -- I said it before, and the above story says it again, it's very hard to replicate production traffic at scale. Even with a lot of load testing, you won't uncover issues that will become apparent in production. This is partly due to the fact that many issues arise after a certain time, or after a certain volume (database size, etc) is reached, and load testing generally doesn't cover those situations.&lt;/li&gt;&lt;li&gt;&lt;b&gt;It's hard to scale a database&lt;/b&gt; -- everybody knows that. If we were to design our data layer today, we would probably look at one of the more mature NoSQL solutions out there (although that is still a fairly risky endeavor in my mind). Our sharded MySQL solution (which we use like I said in an almost NoSQL fashion) is OK, but comes with a whole slew of issues of its own, not the least being that maintenance and disaster recovery are not trivial.&lt;/li&gt;&lt;li&gt;&lt;b&gt;If you use a write-intensive MySQL database, use real hardware&lt;/b&gt; -- virtualization doesn't cut it, and EBS especially so. 
And related to this:&lt;/li&gt;&lt;li&gt;&lt;b&gt;Engage technology experts early&lt;/b&gt; -- looking back, we should have engaged Percona much earlier in the game, and we should have asked for their help in properly sizing our initial DB cluster&lt;/li&gt;&lt;li&gt;&lt;b&gt;Failover can be easy, but rebuilding redundancy at the database layer is always hard&lt;/b&gt; -- I'd like to see more discussion on this issue, but this has been my conclusion based on our experiences. And related to this:&lt;/li&gt;&lt;li&gt;&lt;b&gt;Automated deployments and configuration management can be quick and easy, but restoring the data is a time sink&lt;/b&gt; -- it's relatively easy to launch a new instance or set up a new server with Chef/LittleChef/Puppet/etc. It's what happens afterwards that takes a long time, namely restoring the data in order to bring that server into the production pool. Here I am talking mostly about database servers. It's much easier if you only have web/app servers to deal with that have little or no state of their own (looking at you SmugMug). This being said, you need to have an automated deployment/config mgmt strategy if you use the cloud, otherwise you're doing it wrong.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Rehearse your disaster recovery procedures&lt;/b&gt; -- we were forced to do it due to the frequent failures we had in EC2. This turned out to be an advantage for us during the Big Outage.&lt;/li&gt;&lt;li&gt;&lt;b&gt;Don't blame 'the cloud' for your outages&lt;/b&gt; -- this has already been rehashed to death by all the post-mortem blogs after the EC2 outage, but it does bear repeating. If you use 'the cloud', expect that each and every instance can go down at any moment, no matter who your cloud provider is. 
Architect your infrastructure accordingly.&lt;/li&gt;&lt;li&gt;&lt;b&gt;If you do use the cloud, use more than one&lt;/b&gt; -- I think that multi-cloud architectures will become the norm, especially after the EC2 outage.&lt;/li&gt;&lt;li&gt;&lt;b&gt;It's not in production if it is not monitored and graphed&lt;/b&gt; -- this is a no-brainer, but it's surprising how often this rule is breached in practice. The first thing we do after building a new server is put it in Nagios, Ganglia and Munin.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1835585678169065935?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1835585678169065935/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1835585678169065935' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1835585678169065935'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1835585678169065935'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/04/lessons-learned-from-deploying.html' title='Lessons learned from deploying a production database in EC2'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2324208429558830564</id><published>2011-04-15T16:53:00.000-07:00</published><updated>2011-04-15T17:09:39.868-07:00</updated><title type='text'>Installing and configuring Graphite</title><content type='html'>Here are some notes I jotted down while installing and configuring Graphite, which isn't a trivial task, although the &lt;a href="http://graphite.wikidot.com/documentation"&gt;official documentation&lt;/a&gt; isn't too bad. The next step is to turn them into a Chef recipe. These instructions apply to Ubuntu 10.04 32-bit with Python 2.6.5 so YMMV.&lt;br /&gt;&lt;b&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Install pre-requisites&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;# apt-get install python-setuptools&lt;br /&gt;# apt-get install python-memcache python-sqlite&lt;br /&gt;# apt-get install apache2 libapache2-mod-python pkg-config&lt;br /&gt;# easy_install-2.6 django&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install pixman, cairo and pycairo&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;# wget http://cairographics.org/releases/pixman-0.20.2.tar.gz&lt;br /&gt;# tar xvfz pixman-0.20.2.tar.gz&lt;br /&gt;# cd pixman-0.20.2&lt;br /&gt;# ./configure; make; make install&lt;br /&gt;&lt;br /&gt;# wget http://cairographics.org/releases/cairo-1.10.2.tar.gz&lt;br /&gt;# tar xvfz cairo-1.10.2.tar.gz&lt;br /&gt;# cd cairo-1.10.2&lt;br /&gt;# ./configure; make; make install&lt;br /&gt;&lt;br /&gt;BTW, the pycairo install was the funkiest I've seen so far for a Python package, and that says a lot:&lt;br /&gt;&lt;br /&gt;# wget http://cairographics.org/releases/py2cairo-1.8.10.tar.gz&lt;br /&gt;# tar xvfz py2cairo-1.8.10.tar.gz&lt;br /&gt;# cd pycairo-1.8.10&lt;br /&gt;# ./configure --prefix=/usr&lt;br /&gt;# make; make install&lt;br /&gt;# echo '/usr/local/lib' &amp;gt; /etc/ld.so.conf.d/pycairo.conf&lt;br /&gt;# ldconfig&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Install graphite packages (carbon, 
whisper, graphite webapp)&lt;/b&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;# wget http://launchpad.net/graphite/0.9/0.9.8/+download/graphite-web-0.9.8.tar.gz&lt;br /&gt;# wget http://launchpad.net/graphite/0.9/0.9.8/+download/carbon-0.9.8.tar.gz&lt;br /&gt;# wget http://launchpad.net/graphite/0.9/0.9.8/+download/whisper-0.9.8.tar.gz&lt;br /&gt;&lt;br /&gt;# tar xvfz whisper-0.9.8.tar.gz&lt;br /&gt;# cd whisper-0.9.8&lt;br /&gt;# python setup.py install&lt;br /&gt;&lt;br /&gt;# tar xvfz carbon-0.9.8.tar.gz&lt;br /&gt;# cd carbon-0.9.8&lt;br /&gt;# python setup.py install&lt;br /&gt;# cd /opt/graphite/conf&lt;br /&gt;# cp carbon.conf.example carbon.conf&lt;br /&gt;# cp storage-schemas.conf.example storage-schemas.conf&lt;br /&gt;&lt;br /&gt;# tar xvfz graphite-web-0.9.8.tar.gz&lt;br /&gt;# cd graphite-web-0.9.8&lt;br /&gt;# python check-dependencies.py&lt;br /&gt;# python setup.py install&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Configure Apache virtual host for graphite webapp&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Although the Graphite source distribution comes with an example vhost configuration for Apache, it didn't quite work for me. 
Here's what ended up working -- many thanks to my colleague Marco Garcia for figuring this out.&lt;/div&gt;&lt;div&gt;# cd /etc/apache2/sites-available/&lt;br /&gt;# cat graphite&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;&amp;lt;VirtualHost *:80&amp;gt;        &lt;br /&gt;ServerName graphite.mysite.com        &lt;br /&gt;DocumentRoot "/opt/graphite/webapp"        &lt;br /&gt;ErrorLog /opt/graphite/storage/log/webapp/error.log        &lt;br /&gt;CustomLog /opt/graphite/storage/log/webapp/access.log common        &lt;br /&gt;&amp;lt;Location "/"&amp;gt;                &lt;br /&gt;SetHandler python-program                &lt;br /&gt;PythonPath "['/opt/graphite/webapp'] + sys.path"                &lt;br /&gt;PythonHandler django.core.handlers.modpython                &lt;br /&gt;SetEnv DJANGO_SETTINGS_MODULE graphite.settings                &lt;br /&gt;PythonDebug Off                &lt;br /&gt;PythonAutoReload Off        &lt;br /&gt;&amp;lt;/Location&amp;gt;        &lt;br /&gt;&amp;lt;Location "/content/"&amp;gt;&lt;br /&gt;SetHandler None        &lt;br /&gt;&amp;lt;/Location&amp;gt;&lt;br /&gt;&amp;lt;Location "/media/"&amp;gt;&lt;br /&gt;SetHandler None&lt;br /&gt;&amp;lt;/Location&amp;gt;&lt;br /&gt;Alias /media/ "/usr/local/lib/python2.6/dist-packages/Django-1.3-py2.6.egg/django/contrib/admin/media/"&lt;br /&gt;&amp;lt;/VirtualHost&amp;gt;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;# cd /etc/apache2/sites-enabled/&lt;br /&gt;# ln -s ../sites-available/graphite 001-graphite&lt;/div&gt;&lt;div&gt;&lt;br /&gt;Make sure mod_python is enabled:&lt;br /&gt;&lt;br /&gt;# ls -la /etc/apache2/mods-enabled/python.load&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Create Django database for graphite webapp&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;# cd /opt/graphite/webapp/graphite&lt;br /&gt;# python manage.py syncdb&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Apply permissions on storage directory&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;# chown -R www-data:www-data /opt/graphite/storage/&lt;br /&gt;&lt;br 
/&gt;&lt;b&gt;Restart Apache&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;# service apache2 restart&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Start data collection server (carbon-cache)&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;# cd /opt/graphite/bin&lt;br /&gt;# ./carbon-cache.py start&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;At this point, if you go to graphite.mysite.com, you should see the dashboard of the Graphite web app.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Test data collection&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The Graphite source distribution comes with an example client written in Python that sends data to the Carbon collecting server every minute. You can find it in graphite-web-0.9.8/examples/example-client.py.&lt;br /&gt;&lt;br /&gt;Sending data is very easy -- like we say in Devops, just open a socket!&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;import sys&lt;br /&gt;import time&lt;br /&gt;import os&lt;br /&gt;import platform&lt;br /&gt;import subprocess&lt;br /&gt;from socket import socket&lt;br /&gt;&lt;br /&gt;CARBON_SERVER = '127.0.0.1'&lt;br /&gt;CARBON_PORT = 2003&lt;br /&gt;delay = 60 &lt;br /&gt;if len(sys.argv) &gt; 1:  &lt;br /&gt;    delay = int( sys.argv[1] )&lt;br /&gt;&lt;br /&gt;def get_loadavg():    &lt;br /&gt;    # For more details, "man proc" and "man uptime"      &lt;br /&gt;        if platform.system() == "Linux":&lt;br /&gt;            return open('/proc/loadavg').read().strip().split()[:3]    &lt;br /&gt;        else:&lt;br /&gt;            command = "uptime"&lt;br /&gt;            process = subprocess.Popen(command, stdout=subprocess.PIPE, shell=True)                      &lt;br /&gt;            os.waitpid(process.pid, 0)&lt;br /&gt;            output = process.stdout.read().replace(',', ' ').strip().split()          &lt;br /&gt;            length = len(output)&lt;br /&gt;            return output[length - 3:length]&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;sock = socket()&lt;br /&gt;try:&lt;br /&gt;    sock.connect((CARBON_SERVER,CARBON_PORT))&lt;br 
/&gt;except:&lt;br /&gt;    print "Couldn't connect to %(server)s on port %(port)d" % {'server':CARBON_SERVER, 'port':CARBON_PORT}&lt;br /&gt;    sys.exit(1)&lt;br /&gt;&lt;br /&gt;while True:&lt;br /&gt;    now = int( time.time() )&lt;br /&gt;    lines = []&lt;br /&gt;    # We're gonna report all three loadavg values&lt;br /&gt;    loadavg = get_loadavg()&lt;br /&gt;    lines.append("system.loadavg_1min %s %d" % (loadavg[0],now))    &lt;br /&gt;    lines.append("system.loadavg_5min %s %d" % (loadavg[1],now))    &lt;br /&gt;    lines.append("system.loadavg_15min %s %d" % (loadavg[2],now))    &lt;br /&gt;&lt;br /&gt;    message = '\n'.join(lines) + '\n' &lt;br /&gt;    #all lines must end in a newline&lt;br /&gt;    print "sending message\n"&lt;br /&gt;    print '-' * 80&lt;br /&gt;    print message&lt;br /&gt;    print&lt;br /&gt;    sock.sendall(message)&lt;br /&gt;    time.sleep(delay)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Some observations about the above code snippet:&lt;br /&gt;&lt;ul&gt;&lt;li&gt;the format of a message to be sent to a Graphite/Carbon server is very simple: "metric_path value timestamp\n"&lt;/li&gt;&lt;li&gt;metric_path is a completely arbitrary name -- it is a string containing substrings delimited by dots. Think of it as an SNMP OID, where the most general name is at the left and the most specific is at the right&lt;/li&gt;&lt;ul&gt;&lt;li&gt;in the example above, the 3 metric_path strings are system.loadavg_1min, system.loadavg_5min and system.loadavg_15min&lt;/li&gt;&lt;/ul&gt;&lt;/ul&gt;&lt;div&gt;&lt;b&gt;Establish retention policies&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This is explained very well in the '&lt;a href="http://graphite.wikidot.com/getting-your-data-into-graphite"&gt;Getting your data into Graphite&lt;/a&gt;' portion of the docs. What you want to do is to specify a retention configuration for each set of metrics that you send to Graphite. 
This is accomplished by editing the /opt/graphite/storage/schemas file. For the example above, which sends the load averages for 1, 5 and 15 min to Graphite every minute, we can specify the following retention policy:&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;[loadavg]&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;priority = 100&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;pattern = ^system\.loadavg*&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"&gt;&lt;span class="Apple-style-span" style="font-size: x-small;"&gt;retentions = 60:43200,900:350400&lt;/span&gt;&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This tells graphite that all metric_paths starting with system.loadavg should be stored with a retention policy that keeps per-minute (60 seconds) precision data for 30 days (43,200 data points), and per-15 min (900 sec) precision data for 10 years (350,400 data points).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Go wild with stats!&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point, if you run the example client, you should be able to go to the Graphite dashboard and expand the Graphite-&amp;gt;system path and see the 3 metrics being captured: loadavg_1min, loadavg_5min and loadavg_15min. Clicking on each one will populate the graph with the corresponding data line. 
If you're logged into the dashboard, you can also save a given graph.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The sky is the limit at this point in terms of the data you can capture and visualize with Graphite. As an example, I parse a common maillog file that captures all email sent out through our system. I 'tail' the file every minute and I count how many messages were sent out in total, and per mail server in our mail cluster. I send this data to Graphite and I watch it in near-realtime (the retention policy in my case is similar to the loadavg one above).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's what the Graphite graph looks like:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;br /&gt;&lt;a href="http://2.bp.blogspot.com/-dafCw_iJ_T4/TajdAWEGUTI/AAAAAAAAACY/0oxjUB9usNA/s1600/mailout.png" onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 320px; height: 153px;" src="http://2.bp.blogspot.com/-dafCw_iJ_T4/TajdAWEGUTI/AAAAAAAAACY/0oxjUB9usNA/s320/mailout.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5595965535000351026" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In another blog post I'll talk about Etsy's &lt;a href="https://github.com/etsy/statsd"&gt;statsd&lt;/a&gt; and its Python equivalent &lt;a href="https://github.com/sivy/py-statsd"&gt;pystatsd&lt;/a&gt;, to which my colleague Josh Frederick contributed the &lt;a href="http://www.monkinetic.com/2011/02/python-statsd-server.html"&gt;server-side code&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2324208429558830564?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' 
href='http://agiletesting.blogspot.com/feeds/2324208429558830564/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2324208429558830564' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2324208429558830564'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2324208429558830564'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/04/installing-and-configuring-graphite.html' title='Installing and configuring Graphite'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-dafCw_iJ_T4/TajdAWEGUTI/AAAAAAAAACY/0oxjUB9usNA/s72-c/mailout.png' height='72' width='72'/><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-8865628768434232405</id><published>2011-03-28T14:06:00.001-07:00</published><updated>2011-03-28T14:06:19.547-07:00</updated><title type='text'>Working in the multi-cloud with libcloud</title><content type='html'>I just posted my slides on "&lt;a href="http://www.slideshare.net/ggheorghiu/working-in-the-multicloud-with-libcloud"&gt;Working on the multi-cloud with libcloud&lt;/a&gt;" to Slideshare. 
It's a talk I gave at the SoCal Piggies meeting in February 2011.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-8865628768434232405?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/8865628768434232405/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=8865628768434232405' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/8865628768434232405'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/8865628768434232405'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/03/working-in-multi-cloud-with-libcloud.html' title='Working in the multi-cloud with libcloud'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2754038394811800509</id><published>2011-03-25T13:54:00.000-07:00</published><updated>2011-03-25T16:15:56.010-07:00</updated><title type='text'>ABM - "Always Be Monitoring"</title><content type='html'>What prompted this post was an incident we had in the very early hours of this past Tuesday, when we started to see a lot of packet loss, increased latency and timeouts between some of our servers hosted at a data center on the US East Coast, and some instances we have running in EC2, also in the US East region. 
The symptoms were increased error rates in some application calls that we were making from one back-end server cluster at the data center into another back-end cluster in EC2. These errors weren't affecting our customers too much, because all failed requests were posted to various queues and reprocessed.&lt;br /&gt;&lt;br /&gt;There had also been network maintenance done that night within the data center, so we weren't sure initially whether our outbound connectivity into EC2 or general inbound connectivity into EC2 was the culprit. What was strange (and unexpected) too was that several EC2 availability zones seemed to be affected -- mostly us-east-1d, but we were also seeing increased latency and timeouts into 1b and 1a. That made it hard to decide whether the issue was with EC2 or with us.&lt;br /&gt;&lt;br /&gt;Running traceroutes from different source machines (some being our home machines in California, another one being a Rackspace cloud server instance in Chicago) revealed that packet loss and increased latency occurred almost all the time at the same hop: a router within the Level 3 network upstream from the Amazon EC2 network. What was frustrating too was that the AWS Status dashboard showed everything absolutely green. Now you can argue that this wasn't necessarily an EC2 issue, but if I were Amazon I would like to monitor the major inbound network paths into my infrastructure -- especially when a problem there has the potential to affect several availability zones at once.&lt;br /&gt;&lt;br /&gt;This whole issue lasted approximately 3.5 hours, then it miraculously stopped. Somebody must have fixed a defective router. Twitter reports from other people experiencing the exact same issue showed that it cleared up for them at the very minute that it cleared up for us.&lt;br /&gt;&lt;br /&gt;This incident brought home a valuable point for me though: we needed more monitors than we had available. 
We were monitoring connectivity &lt;b&gt;1) within the data center&lt;/b&gt;, &lt;b&gt;2) within EC2&lt;/b&gt;, and &lt;b&gt;3) between our data center and EC2&lt;/b&gt;. However, we also needed to monitor &lt;b&gt;4) inbound connectivity into EC2 going from sources that were outside of our data center&lt;/b&gt; infrastructure. Only by &lt;b&gt;triangulating&lt;/b&gt; (for lack of a better term) our monitoring in this manner would we be sure which network path was to blame. Note that we already had Pingdom set up to monitor various URLs within our site, but like I said, the front-end stuff wasn't affected too much by that particular issue that night.&lt;br /&gt;&lt;br /&gt;So...the next day we started up a small Rackspace cloud server in Chicago, and a small Linode VPS in Fremont, California, and we added them to our Nagios installation. We run the same exact checks from these servers into EC2 that we run from our data center into EC2. This makes network issues faster to troubleshoot, although unfortunately not easier to solve -- because we could be depending on a 3rd party to solve them.&lt;br /&gt;&lt;br /&gt;I guess a bigger point to make, other than &lt;b&gt;ABM/Always Be Monitoring&lt;/b&gt;, is &lt;b&gt;OYA/Own Your Availability&lt;/b&gt; (I didn't come up with this, I personally first saw it mentioned by the @fastip guys). To me, what this means is to deploy your infrastructure across multiple providers (data centers/clouds) so that you don't have a single point of failure at the provider level. 
This is obviously easier said than done....but we're working on it as far as our infrastructure goes.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2754038394811800509?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2754038394811800509/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2754038394811800509' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2754038394811800509'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2754038394811800509'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/03/abm-always-be-monitoring.html' title='ABM - &quot;Always Be Monitoring&quot;'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1512142797854630775</id><published>2011-03-16T11:06:00.000-07:00</published><updated>2011-03-16T11:06:09.313-07:00</updated><title type='text'>What I like and don't like to see in a technical presentation</title><content type='html'>&lt;b&gt;What I like to see:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Live demo of the technology/tool/process you are describing (or at least a screencast)&lt;/li&gt;&lt;li&gt;Lessons learned -- the most interesting ones are the failures&lt;/li&gt;&lt;li&gt;If you're presenting something you created:&lt;/li&gt;&lt;ul&gt;&lt;li&gt;compare and contrast it with existing 
solutions&amp;nbsp;&lt;/li&gt;&lt;li&gt;convince me you're not suffering from the NIH syndrome&lt;/li&gt;&lt;li&gt;convince me your creation was born out of necessity, ideally from issues you needed to solve in &lt;b&gt;production&lt;/b&gt;&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Hard data (charts/dashboards)&lt;/li&gt;&lt;li&gt;Balance between being too shallow and going too deep when covering your topic&lt;/li&gt;&lt;ul&gt;&lt;li&gt;keep in mind both the HOW and the WHY of the topic&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Going above and beyond the information I can obtain with a simple Google search for that topic&lt;/li&gt;&lt;li&gt;Pointers to any tools/resources you reference (GitHub pages preferred)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;What I don't like to see:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;Cute slides with images and only a couple of words (unless you provide generous slide notes in some form)&lt;/li&gt;&lt;li&gt;Humor is fine, but not if it's all there is&lt;/li&gt;&lt;li&gt;Hand-waving / chest-pounding&lt;/li&gt;&lt;li&gt;Vaporware&lt;/li&gt;&lt;li&gt;No knowledge of existing solutions with established communities&lt;/li&gt;&lt;ul&gt;&lt;li&gt;you're telling me you're smarter than everybody else in the room but you're not backing up that assertion&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;Simple usage examples that I can also get via Google searches&lt;/li&gt;&lt;li&gt;Abandoning the WHY for the HOW&lt;/li&gt;&lt;li&gt;Abandoning the HOW for the WHY&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1512142797854630775?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1512142797854630775/comments/default' title='Post Comments'/><link rel='replies' type='text/html' 
href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1512142797854630775' title='6 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1512142797854630775'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1512142797854630775'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/03/what-i-like-and-dont-like-to-see-in.html' title='What I like and don&apos;t like to see in a technical presentation'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>6</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-269112057178681519</id><published>2011-03-14T16:02:00.000-07:00</published><updated>2011-03-14T16:02:27.785-07:00</updated><title type='text'>Deployment and hosting open space at PyCon</title><content type='html'>One of the most interesting events for me this year at PyCon was an Open Space session organized by &lt;a href="http://twitter.com/#!/natea"&gt;Nate Aune&lt;/a&gt; on deployment, hosting and configuration management. The session was very well attended, and it included representatives of a large range of companies. Here are some of them, if memory serves well: Disqus, NASA,&amp;nbsp;Opscode,&amp;nbsp;DjangoZoom, Eucalyptus, ep.io, Gondor, Whiskey Media ... 
and many more that I wish I could remember (if you were there and want to add anything, please leave a comment here).&lt;br /&gt;&lt;br /&gt;Here are some things my tired brain remembers from the discussions we had:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;everybody seems to be using virtualenv when deploying their Python applications&lt;/li&gt;&lt;li&gt;everybody seems to be using &lt;a href="http://docs.fabfile.org/en/1.0.0/index.html"&gt;Fabric&lt;/a&gt; in one way or another to push changes to remote nodes&lt;/li&gt;&lt;li&gt;the participants seemed to be split almost equally between &lt;a href="http://projects.puppetlabs.com/projects/1/wiki/Documentation_Start"&gt;Puppet&lt;/a&gt; and &lt;a href="http://wiki.opscode.com/display/chef/Home"&gt;Chef&lt;/a&gt; for provisioning&lt;/li&gt;&lt;li&gt;the more disciplined of the companies (ep.io for example) use Puppet/Chef for both provisioning and application deployment and configuration (although ep.io still uses Fabric for stopping/starting services on remote nodes)&lt;/li&gt;&lt;li&gt;other companies (including us at Evite) use Chef/Puppet for automated provisioning of the OS + pre-requisite packages, then use Fabric to push the deployment of the application because they prefer the synchronous aspect of a push approach&lt;/li&gt;&lt;li&gt;upgrading database schemas is hard; many people only do additive changes (NoSQL makes this easier, and as far as relational databases go, PostgreSQL makes it easier than MySQL)&lt;/li&gt;&lt;li&gt;many people struggle with how best to bundle their application with other types of files, such as haproxy or nginx configurations&lt;/li&gt;&lt;ul&gt;&lt;li&gt;at Evite we face the same issue, and we came up with the notion of a bundle, a directory structure that contains the virtualenv of the application, the configuration files for the application, and all the other configuration files for programs that interact with our application -- haproxy, nginx, supervisord for 
example&lt;/li&gt;&lt;li&gt;when we do a deploy, we check out a bundle via a revision tag, then we push the bundle to a given app server&lt;/li&gt;&lt;/ul&gt;&lt;li&gt;some people prefer to take the OS package approach here, and bundle all the above types of files in an rpm or deb package&lt;/li&gt;&lt;li&gt;&lt;a href="http://twitter.com/#!/kantrn"&gt;Noah Kantrowitz&lt;/a&gt;&amp;nbsp;has released 2 Chef-related Python tools that I was not aware of: &lt;a href="https://github.com/coderanger/pychef"&gt;PyChef&lt;/a&gt; (a Python client that knows how to query a Chef server) and &lt;a href="https://github.com/coderanger/commis"&gt;commis&lt;/a&gt; (a Python implementation of a Chef server, with the goal of being less complicated to install than its Ruby counterpart)&lt;/li&gt;&lt;li&gt;&lt;a href="https://github.com/tobami/littlechef"&gt;LittleChef&lt;/a&gt; was mentioned as a way to run Chef Solo on a remote node via Fabric, thus giving you the control of a 'push' method combined with the advantage of using community cookbooks already published for Chef&lt;/li&gt;&lt;li&gt;I had to leave towards the end of the meeting, when people started to discuss the hosting aspect, so I don't have a lot to add here -- but it is interesting to me to see quite a few companies that have Platform-as-a-Service (PaaS) offerings for Python hosting: DjangoZoom, ep.io, Gondor (ep.io can host any WSGI application, while DjangoZoom and Gondor are focused on Django)&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;All in all, there were some very interesting discussions that showed that pretty much everybody is struggling with similar issues. There is no silver bullet, but there are some tools and approaches that can help make your life easier in this area. My impression is that the field of automated deployments and configuration management, even though changing fast, is also maturing fast, with a handful of tools dominating the space. 
It's an exciting space to play in!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-269112057178681519?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/269112057178681519/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=269112057178681519' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/269112057178681519'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/269112057178681519'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/03/deployment-and-hosting-open-space-at.html' title='Deployment and hosting open space at PyCon'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-3949762548813469757</id><published>2011-03-08T16:34:00.000-08:00</published><updated>2011-03-08T16:34:49.985-08:00</updated><title type='text'>Monitoring is for ops what testing is for dev</title><content type='html'>Devops. It's the new buzzword. Go to any tech conference these days and you're sure to find an expert panel on the 'what' and 'why' of devops. These panels tend to be light on the 'how', because that's where the rubber meets the road. 
I tried to give a step-by-step description of how you can become a Ninja Rockstar Internet Samurai devops in my blog post on '&lt;a href="http://agiletesting.blogspot.com/2010/11/how-to-whip-your-infrastructure-into.html"&gt;How to whip your infrastructure into shape&lt;/a&gt;'.&lt;br /&gt;&lt;br /&gt;Here I just want to say that I am struck by the parallels that exist between the activities of developer testing and operations monitoring. It's not a new idea by any means, but it's been growing on me recently.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Test-infected vs. monitoring-infected&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Good developers are test-infected. It doesn't matter too much whether they write tests before or after writing their code -- what matters is that they do write those tests as soon as possible, and that they don't consider their code 'done' until it has a comprehensive suite of tests. And of course test-infected developers are addicted to watching those dots in the output of their favorite test runner.&lt;br /&gt;&lt;br /&gt;Good ops engineers are monitoring-infected. They don't consider their infrastructure build-out 'done' until it has a comprehensive suite of monitoring checks, notifications and alerting rules, and also one or more dashboard-type systems that help them visualize the status of the resources in the infrastructure.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Adding tests vs. adding monitoring checks&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Whenever a bug is found, a good developer will add a unit test for it. It serves as a proof that the bug is now fixed, and also as a regression test for that bug.&lt;br /&gt;&lt;br /&gt;Whenever something unexpectedly breaks within the systems infrastructure, a good ops engineer will add a monitoring check for it, and if possible a graph showing metrics related to the resource that broke. 
This ensures that alerts will go out in a timely manner next time things break, and that correlations can be made by looking at the metrics graphs for the various resources involved.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Ignoring broken tests vs. ignoring monitoring alerts&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;When a test starts failing, you can either fix it so that the bar goes green, or you can ignore it. Similarly, if a monitoring alert goes off, you can either fix the underlying issue, or you can ignore it by telling yourself it's not really critical.&lt;br /&gt;&lt;br /&gt;The problem with ignoring broken tests and monitoring alerts is that this attitude leads slowly but surely to the&amp;nbsp;&lt;a href="http://en.wikipedia.org/wiki/Broken_windows_theory"&gt;Broken Window Syndrome&lt;/a&gt;. You train yourself to ignore issues that sooner or later will become critical (it's a matter of when, not if).&lt;br /&gt;&lt;br /&gt;A good developer will make sure there are no broken tests in their Continuous Integration system, and a good ops engineer will make sure all alerts are accounted for and the underlying issues fixed.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Improving test coverage vs. improving monitoring coverage&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Although 100% test coverage is not sufficient for your code to be bug-free, still, having something around 80-90% code coverage is a good measure that you as a developer are disciplined in writing those tests. This makes you sleep better at night and gives you pride in producing quality code.&lt;br /&gt;&lt;br /&gt;For ops engineers, sleeping better at night is definitely directly proportional to the quantity and quality of the monitors that are in place for their infrastructure. 
The more monitors, the better the chances that issues are caught early and fixed before they escalate into the dreaded 2 AM pager alert.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Measure and graph everything&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The more dashboards you have as a devops, the better insight you have into how your infrastructure behaves, from both a code and an operational point of view. I am inspired in this area by the work that's done at Etsy, where they are graphing every interesting metric they can think of (see their '&lt;a href="http://codeascraft.etsy.com/2011/02/15/measure-anything-measure-everything/"&gt;Measure Anything, Measure Everything&lt;/a&gt;' blog post).&lt;br /&gt;&lt;br /&gt;As a developer, you want to see your code coverage graphs showing decent values, close to that mythical 100%. As an ops engineer, you want to see uptime graphs that are close to the mythical 5 9's.&lt;br /&gt;&lt;br /&gt;But maybe even more importantly, you want insight into metrics that tie directly into your business. At Evite, processing messages and sending email reliably is our bread and butter, so we track those processes closely and we have dashboards for metrics related to them. Spikes, either up or down, are investigated quickly.&lt;br /&gt;&lt;br /&gt;Here are some examples of the dashboards we have. 
For now these use homegrown data collection tools and the &lt;a href="http://code.google.com/apis/charttools/index.html"&gt;Google Visualization API&lt;/a&gt;, but we're looking into using &lt;a href="http://graphite.wikidot.com/"&gt;Graphite&lt;/a&gt; soon.&lt;br /&gt;&lt;br /&gt;Outgoing email messages in the last hour (spiking at close to 100 messages/second):&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-5AgRNYEl0_Q/TXbHjMh_jSI/AAAAAAAAACM/kwvoqlNe1Io/s1600/outgoing_email.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="246" src="https://lh6.googleusercontent.com/-5AgRNYEl0_Q/TXbHjMh_jSI/AAAAAAAAACM/kwvoqlNe1Io/s400/outgoing_email.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;Size of various queues we use to process messages (using a homegrown queuing mechanism):&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh3.googleusercontent.com/-vpZ2Ey1v5fY/TXbIJ3WBapI/AAAAAAAAACQ/CTCIoCHO3HA/s1600/queue_size.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="252" src="https://lh3.googleusercontent.com/-vpZ2Ey1v5fY/TXbIJ3WBapI/AAAAAAAAACQ/CTCIoCHO3HA/s400/queue_size.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;Percentage of errors across some of our servers:&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="https://lh6.googleusercontent.com/-HfLeUCaT1c4/TXbJC9-U3GI/AAAAAAAAACU/CbSzuUHkRD0/s1600/prod_errors.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"&gt;&lt;img border="0" height="286" src="https://lh6.googleusercontent.com/-HfLeUCaT1c4/TXbJC9-U3GI/AAAAAAAAACU/CbSzuUHkRD0/s400/prod_errors.png" width="400" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;br /&gt;&lt;br /&gt;Associated with these metrics we have Nagios alerts that fire when certain thresholds 
are being met. This combination allows our devops team to sleep better at night.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-3949762548813469757?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/3949762548813469757/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=3949762548813469757' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3949762548813469757'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3949762548813469757'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/03/monitoring-is-for-ops-what-testing-is.html' title='Monitoring is for ops what testing is for dev'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='https://lh6.googleusercontent.com/-5AgRNYEl0_Q/TXbHjMh_jSI/AAAAAAAAACM/kwvoqlNe1Io/s72-c/outgoing_email.png' height='72' width='72'/><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1507439160357609627</id><published>2011-02-26T09:02:00.000-08:00</published><updated>2011-02-26T09:02:51.314-08:00</updated><title type='text'>AWS CloudFormation is a provisioning and not a config mgmt tool</title><content type='html'>There's a lot of buzz on Twitter on how the recently announced &lt;a href="http://aws.amazon.com/cloudformation/"&gt;AWS CloudFormation&lt;/a&gt; service spells the death of configuration management 
tools such as Puppet/Chef/cfengine/bcfg2. I happen to think that the opposite is true.&lt;br /&gt;&lt;br /&gt;CloudFormation is a great way to &lt;b&gt;provision&lt;/b&gt; what it calls a 'stack' in your EC2 infrastructure. A stack comprises several AWS resources such as EC2 instances, EBS volumes, Elastic Load Balancers, Elastic IPs, RDS databases, etc. Note that it was always possible to do this via your own homegrown tools by calling in concert the various APIs offered by these services/resources. What CloudFormation brings to the table is an easy way to describe the relationships between these resources via a JSON file that it calls a template.&lt;br /&gt;&lt;br /&gt;Some people get tripped up by the inclusion in the CloudFormation &lt;a href="http://aws.amazon.com/code/AWS-CloudFormation/3233063344774450"&gt;sample templates&lt;/a&gt; of applications such as WordPress, Joomla or Redmine -- they think that CloudFormation deals with application deployments and configuration management. If you look closely at one of these sample templates, let's say the Joomla one, you'll see that what happens is simply that a pre-baked AMI containing the Joomla installation is used when launching the EC2 instances included in the CloudFormation stack. Also, the UserData mechanism is used to pass certain values to the instance. 
They do add a nice feature here where you can reference attributes defined in other parts of the stack template, such as DB endpoint address in this example:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;"UserData": {&lt;br /&gt;          "Fn::Base64": {&lt;br /&gt;            "Fn::Join": [&lt;br /&gt;              ":",&lt;br /&gt;              [&lt;br /&gt;                {&lt;br /&gt;                  "Ref": "JoomlaDBName"&lt;br /&gt;                },&lt;br /&gt;                {&lt;br /&gt;                  "Ref": "JoomlaDBUser"&lt;br /&gt;                },&lt;br /&gt;                {&lt;br /&gt;                  "Ref": "JoomlaDBPwd"&lt;br /&gt;                },&lt;br /&gt;                {&lt;br /&gt;                  "Ref": "JoomlaDBPort"&lt;br /&gt;                },&lt;br /&gt;                {&lt;br /&gt;                  "Fn::GetAtt": [&lt;br /&gt;                    "JoomlaDB",&lt;br /&gt;                    "Endpoint.Address"&lt;br /&gt;                  ]&lt;br /&gt;                },&lt;br /&gt;                {&lt;br /&gt;                  "Ref": "WebServerPort"&lt;br /&gt;                },&lt;br /&gt;                {&lt;br /&gt;                  "Fn::GetAtt": [&lt;br /&gt;                    "ElasticLoadBalancer",&lt;br /&gt;                    "DNSName"&lt;br /&gt;                  ]&lt;br /&gt;                }&lt;br /&gt;              ]&lt;br /&gt;            ]&lt;br /&gt;          }&lt;br /&gt;        },&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;However, all this was also possible before CloudFormation. You were always able to bake your own AMI containing your own application, and use the UserData mechanism to run whatever you want at instance creation time. Nothing new here. This is &lt;b&gt;NOT&lt;/b&gt; configuration management. This will &lt;b&gt;NOT&lt;/b&gt; replace the need for a solid deployment and configuration management tool. Why? Because rolling your own AMI results in an opaque 'black box' deployment. 
You need to document and version your pre-baked AMIs carefully, then develop a mechanism for associating an AMI ID with a list of packages installed on that AMI. If you think about it, you actually end up writing an asset management tool. Then if you need to deploy a new version of the application, you either bake a new AMI (painful), or you reach for a real deployment/config mgmt tool to do it. &lt;br /&gt;&lt;br /&gt;The alternative, which I espouse, is to start with a bare-bones AMI (I use the &lt;a href="https://help.ubuntu.com/community/EC2StartersGuide"&gt;official Ubuntu AMIs&lt;/a&gt; provided by Canonical) and employ the UserData mechanism to bootstrap the installation of a configuration management client such as chef-client or the Puppet client. The newly created instance then 'phones home' to your central configuration management server (Chef server or Puppetmaster for example) and finds out how to configure itself. The beauty of this approach is that the config mgmt server keeps track of the customizations made on the client. No need for you to document that separately -- just use the search functions provided by the config mgmt tool to find out which packages and applications have been installed on the client.&lt;br /&gt;&lt;br /&gt;The bare-bones AMI + config mgmt mechanism does result in EC2 instances taking longer to get fully configured initially (as opposed to the pre-baked AMI technique), but the flexibility and control you gain over those instances are well worth it.&lt;br /&gt;&lt;br /&gt;One other argument, which I almost don't need to make, is that the pre-baked AMI technique is very specific to EC2. You will have to reinvent the wheel if you want to deploy your infrastructure to a different cloud provider, or inside your private cloud or data center.&lt;br /&gt;&lt;br /&gt;So...do continue to hone your skills with a good configuration management tool. 
It will serve you well, both in EC2 and in other environments.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1507439160357609627?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1507439160357609627/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1507439160357609627' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1507439160357609627'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1507439160357609627'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/02/aws-cloudformation-is-provisioning-and.html' title='AWS CloudFormation is a provisioning and not a config mgmt tool'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1617927864899619999</id><published>2011-02-22T16:19:00.000-08:00</published><updated>2011-02-22T16:19:47.658-08:00</updated><title type='text'>Cheesecake project now on GitHub</title><content type='html'>I received a feature request for the &lt;a href="http://www.pycheesecake.org/"&gt;Cheesecake&lt;/a&gt; project last week (thanks Joost Cassee!), so as an experiment I also put the &lt;a href="https://github.com/griggheo/cheesecake"&gt;code&lt;/a&gt; up on GitHub. Hopefully the 'social coding' aspect will kick in and more people will be interested in the project. 
One can dream.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1617927864899619999?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1617927864899619999/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1617927864899619999' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1617927864899619999'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1617927864899619999'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/02/cheesecake-project-now-on-github.html' title='Cheesecake project now on GitHub'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-5697290851512901376</id><published>2011-02-22T12:41:00.000-08:00</published><updated>2011-02-22T12:41:39.128-08:00</updated><title type='text'>HAProxy monitoring with Nagios and Munin</title><content type='html'>&lt;a href="http://haproxy.1wt.eu/"&gt;HAProxy&lt;/a&gt; is one of the most widely used (if not THE most widely used) software load balancing solutions out there. I definitely recommend it if you're looking for a very solid and very fast piece of software for your load balancing needs. 
I blogged about it &lt;a href="http://agiletesting.blogspot.com/2009/02/load-balancing-in-amazon-ec2-with.html"&gt;before&lt;/a&gt;, but here I want to describe ways to monitor it with Nagios (for alerting purposes) and Munin (for resource graphing purposes).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;HAProxy Nagios plugin&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Near the top of Google searches for 'haproxy nagios plugin' is this &lt;a href="http://comments.gmane.org/gmane.comp.web.haproxy/2469"&gt;message&lt;/a&gt;&amp;nbsp;to the haproxy mailing list from Jean-Christophe Toussaint which contains links to a Nagios plugin he wrote for checking HAProxy. This plugin is what I ended up using. It's a Perl script which needs the Nagios::Plugin CPAN module installed. Once the module is installed, drop&amp;nbsp;check_haproxy.pl in your Nagios libexec directory, then configure it to check the HAProxy stats with a command line similar to this:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;/usr/local/nagios/libexec/check_haproxy.pl -u 'http://your.haproxy.server.ip:8000/haproxy;csv' -U hauser -P hapasswd&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This assumes that you have HAProxy configured to output its statistics on port 8000. I have these lines in /etc/haproxy/haproxy.cfg:&lt;br /&gt;&lt;pre class="code"&gt;# status page.&lt;br /&gt;listen stats 0.0.0.0:8000&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;mode http&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;stats enable&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;stats uri /haproxy&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;stats realm HAProxy&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;stats auth hauser:hapasswd&lt;br /&gt;&lt;/pre&gt;&lt;div&gt;&lt;br /&gt;Note that the Nagios plugin actually requests the stats in CSV format (hence the ';csv' suffix on the stats URL above). 
The output of the plugin is something like:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;pre class="code"&gt;HAPROXY OK - &amp;nbsp;cluster1 (Active: 60/60) cluster2 (Active: 169/169) | t=0.131051s;2;10;0; sess_cluster1=0sessions;;;0;20000 sess_cluster2=78sessions;;;0;20000&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;It shows the active clusters in your HAProxy configuration (e.g. cluster2), together with the number of backends that are UP among the total number of backends for that cluster (e.g 169/169), and also with the number of active sessions for each cluster. If any backend is DOWN, the check status code is critical and you'll get a Nagios alert.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;HAProxy Munin plugins&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Another Google search, this time for HAProxy and Munin, reveals another &lt;a href="http://www.mail-archive.com/haproxy@formilux.org/msg01962.html"&gt;message&lt;/a&gt; to the haproxy mailing list with links to 4 Munin plugins written by Bart van der Schans:&lt;br /&gt;&lt;br /&gt;- haproxy_check_duration: monitor the duration of the health checks per server&lt;br /&gt;- haproxy_errors: monitor the rate of 5xx response headers per backend&lt;br /&gt;- haproxy_sessions: monitors the rate of (tcp) sessions per backend&lt;br /&gt;- haproxy_volume: monitors the bps in and out per backend&lt;br /&gt;&lt;br /&gt;I downloaded the plugins, dropped them into /usr/share/munin/plugins, symlink-ed them into /etc/munin/plugins, and added this stanza to /etc/munin/plugin-conf.d/munin-node:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;[haproxy*]&lt;br /&gt;user haproxy&lt;br /&gt;env.socket /var/lib/haproxy/stats.socket&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;However, note that for the plugins to work properly you need 2 things:&lt;br /&gt;&lt;br /&gt;1) Configure HAProxy to use a socket that can be queried for stats. 
I did this by adding these lines to the global section in my haproxy.cfg file:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;chroot /var/lib/haproxy&lt;br /&gt;user haproxy&lt;br /&gt;group haproxy&lt;br /&gt;stats socket /var/lib/haproxy/stats.socket uid 1002 gid 1002&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(where in my case 1002 is the uid of the haproxy user, and 1002 the gid of the haproxy group)&lt;br /&gt;&lt;br /&gt;After doing 'service haproxy reload', you can check that the socket stats work as expected by doing something like this (assuming you have socat installed):&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;echo 'show stat' | socat unix-connect:/var/lib/haproxy/stats.socket stdio&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This should output the HAProxy stats in CSV format.&lt;br /&gt;&lt;br /&gt;2) Edit the 4 plugins and change each 'exit 1' statement to 'exit 0' in the block at the top of each plugin, so that it reads:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;if ( $ARGV[0] eq "autoconf" ) {&lt;br /&gt;    print_autoconf();&lt;br /&gt;    exit 0;&lt;br /&gt;} elsif ( $ARGV[0] eq "config" ) {&lt;br /&gt;    print_config();&lt;br /&gt;    exit 0;&lt;br /&gt;} elsif ( $ARGV[0] eq "dump" ) {&lt;br /&gt;    dump_stats();&lt;br /&gt;    exit 0;&lt;br /&gt;} else {&lt;br /&gt;    print_values();&lt;br /&gt;    exit 0;&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;If you don't do this, the plugins will exit with code 1 even in the case of success, and this will be interpreted by munin-node as an error. Consequently, you will scratch your head wondering why no haproxy-related links and graphs are showing up on your munin stats page.&lt;br /&gt;&lt;br /&gt;Once you do all this, do 'service munin-node reload' on the node running the HAProxy Munin plugins, then check that the plugins are working as expected by cd-ing into the /etc/munin/plugins directory and running each plugin through the 'munin-run' utility. 
For example:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;# munin-run haproxy_sessions &lt;br /&gt;cluster2.value 146761052&lt;br /&gt;cluster1.value 0&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;That's it. These plugins make it fairly easy for you to get more peace of mind and a better sleep at night. Although it's well known that in #devops we don't sleep that much anyway...&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-5697290851512901376?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/5697290851512901376/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=5697290851512901376' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5697290851512901376'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5697290851512901376'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/02/haproxy-monitoring-with-nagios-and.html' title='HAProxy monitoring with Nagios and Munin'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-7811821135762608674</id><published>2011-01-27T14:26:00.000-08:00</published><updated>2011-01-27T14:26:04.943-08:00</updated><title type='text'>Printing to the cloud</title><content type='html'>A movie is worth 10,000 words...&lt;br /&gt;&lt;br /&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;object width="320" 
height="266" class="BLOGGER-youtube-video" classid="clsid:D27CDB6E-AE6D-11cf-96B8-444553540000" codebase="http://download.macromedia.com/pub/shockwave/cabs/flash/swflash.cab#version=6,0,40,0" data-thumbnail-src="http://i.ytimg.com/vi/f5-eqSfyeAg/0.jpg"&gt;&lt;param name="movie" value="http://www.youtube.com/v/f5-eqSfyeAg?f=user_uploads&amp;c=google-webdrive-0&amp;app=youtube_gdata" /&gt;&lt;param name="bgcolor" value="#FFFFFF" /&gt;&lt;embed width="320" height="266" src="http://www.youtube.com/v/f5-eqSfyeAg?f=user_uploads&amp;c=google-webdrive-0&amp;app=youtube_gdata" type="application/x-shockwave-flash"&gt;&lt;/embed&gt;&lt;/object&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-7811821135762608674?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/7811821135762608674/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=7811821135762608674' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/7811821135762608674'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/7811821135762608674'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/01/printing-to-cloud.html' title='Printing to the cloud'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1711046759334918623</id><published>2011-01-25T15:57:00.000-08:00</published><updated>2011-01-25T15:57:22.966-08:00</updated><title type='text'>Using AWS Elastic Load Balancing with a password-protected site</title><content type='html'>Scenario: you have a password-protected site running in EC2 that you want handled via Amazon Elastic Load Balancing. The problem with that is that the HTTP healthchecks from the ELB to the instance hosting your site will fail because they will get a 401 HTTP status code instead of 200. Hence the instance will be marked as 'out of service' by the ELB.&lt;br /&gt;&lt;br /&gt;My solution was to serve one static file (I called it 'check.html' containing the text 'it works!') without password protection.&lt;br /&gt;&lt;br /&gt;In my case, I have nginx handling both the dynamic app (which is a Django app running on port 8000) and the static files. 
Here are the relevant excerpts from nginx.conf (check.html is in /usr/local/nginx/static-content):&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;http {&lt;br /&gt;    include       mime.types;&lt;br /&gt;    default_type  application/octet-stream;&lt;br /&gt;&lt;br /&gt;    upstream django {&lt;br /&gt;        server 127.0.0.1:8000;&lt;br /&gt;    }&lt;br /&gt;&lt;br /&gt;    server {&lt;br /&gt;        listen       80;&lt;br /&gt;&lt;br /&gt;        location / {&lt;br /&gt;            proxy_pass http://django/;&lt;br /&gt;            auth_basic            "Restricted";&lt;br /&gt;            auth_basic_user_file  /usr/local/nginx/conf/.htpasswd;&lt;br /&gt;        }&lt;br /&gt;&lt;br /&gt;        location ~* ^.+check\.html$&lt;br /&gt;        {&lt;br /&gt;            root   /usr/local/nginx/static-content;&lt;br /&gt;        }&lt;br /&gt;    }&lt;br /&gt;}&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1711046759334918623?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1711046759334918623/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1711046759334918623' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1711046759334918623'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1711046759334918623'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/01/using-aws-elastic-load-balancing-with.html' title='Using AWS Elastic Load Balancing with a password-protected site'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image 
rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-217020473148446889</id><published>2011-01-19T13:58:00.000-08:00</published><updated>2011-01-19T13:58:36.652-08:00</updated><title type='text'>Passing user data to EC2 Ubuntu instances with libcloud</title><content type='html'>While I'm on the topic of libcloud, I've been trying to pass user data to newly created EC2 instances running Ubuntu. The libcloud EC2 driver has an extra parameter called &lt;a href="http://ci.apache.org/projects/libcloud/apidocs/libcloud.drivers.ec2.EC2NodeDriver.html#create_node"&gt;ex_userdata&lt;/a&gt; for the create_node method, and that's what I've been trying to use.&lt;br /&gt;&lt;br /&gt;However, the gotcha here is that the value of that argument needs to be the &lt;b&gt;contents&lt;/b&gt; of the user data file, and not the path to the file.&lt;br /&gt;&lt;br /&gt;So...here's what worked for me:&lt;br /&gt;&lt;br /&gt;1) Created a test user data file with the following contents:&lt;br /&gt;&lt;pre class="code"&gt;#!/bin/bash&lt;br /&gt;&lt;br /&gt;apt-get update&lt;br /&gt;apt-get install -y munin-node python2.6-dev&lt;br /&gt;hostname coolstuff&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;2) Used the following script to create the node (I also created a keypair whose name I passed to create_node as the ex_keyname argument):&lt;br /&gt;&lt;pre class="code"&gt;#!/usr/bin/env python&lt;br /&gt;&lt;br /&gt;import os, sys&lt;br /&gt;from libcloud.types import Provider &lt;br /&gt;from libcloud.providers import get_driver &lt;br /&gt;from libcloud.base import NodeImage, NodeSize, NodeLocation&lt;br /&gt; &lt;br /&gt;EC2_ACCESS_ID     = 'MyAccessID'&lt;br /&gt;EC2_SECRET_KEY    = 'MySecretKey'&lt;br /&gt; &lt;br /&gt;EC2Driver = get_driver(Provider.EC2) &lt;br /&gt;conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)&lt;br /&gt;&lt;br /&gt;keyname = 
sys.argv[1]&lt;br /&gt;resp = conn.ex_create_keypair(name=keyname)&lt;br /&gt;key_material = resp.get('keyMaterial')&lt;br /&gt;if not key_material:&lt;br /&gt;    sys.exit(1)&lt;br /&gt;private_key = '/root/.ssh/%s.pem' % keyname&lt;br /&gt;f = open(private_key, 'w')&lt;br /&gt;f.write(key_material + '\n')&lt;br /&gt;f.close()&lt;br /&gt;os.chmod(private_key, 0600)&lt;br /&gt;&lt;br /&gt;ami = "ami-88f504e1" # Ubuntu 10.04 32-bit&lt;br /&gt;i = NodeImage(id=ami, name="", driver="")&lt;br /&gt;s = NodeSize(id="m1.small", name="", ram=None, disk=None, bandwidth=None, price=None, driver="")&lt;br /&gt;locations = conn.list_locations()&lt;br /&gt;for location in locations:&lt;br /&gt;    if location.availability_zone.name == 'us-east-1b':&lt;br /&gt;        break&lt;br /&gt;&lt;br /&gt;userdata_file = "/root/proj/test_libcloud/userdata.sh"&lt;br /&gt;# ex_userdata takes the contents of the file, not its path&lt;br /&gt;userdata_contents = open(userdata_file).read()&lt;br /&gt;&lt;br /&gt;node = conn.create_node(name='tst', image=i, size=s, location=location, ex_keyname=keyname, ex_userdata=userdata_contents)&lt;br /&gt;print node.__dict__&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;3) Waited for the newly created node to get to the Running state, then ssh-ed into the node using the key I created and verified that munin-node and python2.6-dev were installed, and also that the hostname was changed to 'coolstuff'.&lt;br /&gt;&lt;pre class="code"&gt;# ssh -i ~/.ssh/lc1.pem ubuntu@domU-12-31-38-00-2C-3B.compute-1.internal&lt;br /&gt;&lt;br /&gt;ubuntu@coolstuff:~$ dpkg -l | grep munin&lt;br /&gt;ii  munin-common                      1.4.4-1ubuntu1                    network-wide graphing framework (common)&lt;br /&gt;ii  munin-node                        1.4.4-1ubuntu1                    network-wide graphing framework (node)&lt;br /&gt;&lt;br /&gt;ubuntu@coolstuff:~$ dpkg -l | grep python2.6-dev&lt;br /&gt;ii  python2.6-dev                     2.6.5-1ubuntu6                    Header files and a static library for Python&lt;br 
/&gt;&lt;br /&gt;ubuntu@coolstuff:~$ hostname&lt;br /&gt;coolstuff&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Anyway....hope this will be useful to somebody one day, even if that somebody is myself ;-)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-217020473148446889?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/217020473148446889/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=217020473148446889' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/217020473148446889'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/217020473148446889'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/01/passing-user-data-to-ec2-ubuntu.html' title='Passing user data to EC2 Ubuntu instances with libcloud'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2752310147869676643</id><published>2011-01-19T08:49:00.000-08:00</published><updated>2011-01-19T08:49:25.328-08:00</updated><title type='text'>libcloud 0.4.2 and SSL</title><content type='html'>Libcloud 0.4.2 was &lt;a href="http://mail-archives.apache.org/mod_mbox/www-announce/201101.mbox/%3C9A2A5D8D-423C-42B2-94AA-606489483FA3@apache.org%3E"&gt;released&lt;/a&gt; yesterday. Among its new features is an important one: SSL certificate validation is now supported when opening a connection to a cloud provider. 
However, for this to work, you have to jump through a couple of hoops.&lt;br /&gt;&lt;br /&gt;1) Python 2.5 doesn't have the ssl module installed (2.6 does) -- so you need to install it from PyPI. The current version for ssl is 1.15.&lt;br /&gt;&lt;br /&gt;2) By default, SSL cert validation is disabled in libcloud.&lt;br /&gt;&lt;br /&gt;If you open a connection to a provider you get:&lt;br /&gt;&lt;br /&gt;&lt;span class="Apple-style-span" style="border-collapse: collapse; font-family: arial, sans-serif; font-size: 13px;"&gt;/usr/lib/python2.5/site-&lt;wbr&gt;&lt;/wbr&gt;packages/libcloud/httplib_ssl.&lt;wbr&gt;&lt;/wbr&gt;py:55:&lt;br /&gt;UserWarning: SSL certificate verification is disabled, this can pose a&lt;br /&gt;security risk. For more information how to enable the SSL certificate&lt;br /&gt;verification, please visit the libcloud documentation.&lt;br /&gt;&amp;nbsp;warnings.warn(libcloud.&lt;wbr&gt;&lt;/wbr&gt;security.VERIFY_SSL_DISABLED_&lt;wbr&gt;&lt;/wbr&gt;MSG)&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To get past the warning, you need to enable SSL cert validation and also provide a path to a file containing common CA certificates (if you don't have that file, you can download cacert.pem from&amp;nbsp;&lt;a href="http://curl.haxx.se/docs/caextract.html"&gt;http://curl.haxx.se/docs/caextract.html&lt;/a&gt;&amp;nbsp;for example). 
Add these lines before opening a connection:&lt;br /&gt;&lt;pre class="code"&gt;import libcloud.security&lt;br /&gt;libcloud.security.VERIFY_SSL_CERT = True&lt;br /&gt;libcloud.security.CA_CERTS_PATH.append("/path/to/cacert.pem")&lt;br /&gt;&lt;/pre&gt;As an aside, the libcloud &lt;a href="http://wiki.apache.org/incubator/LibcloudSSL"&gt;wiki page on SSL&lt;/a&gt; is very helpful and I used it to figure out what to do.&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2752310147869676643?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2752310147869676643/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2752310147869676643' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2752310147869676643'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2752310147869676643'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2011/01/libcloud-042-and-ssl.html' title='libcloud 0.4.2 and SSL'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2119799468040032946</id><published>2010-12-21T12:18:00.000-08:00</published><updated>2011-02-24T13:49:04.951-08:00</updated><title type='text'>Using libcloud to manage instances across multiple cloud providers</title><content type='html'>More and more organizations are moving to ‘the 
cloud’ these days. In most cases, using ‘the cloud’ means buying compute and storage capacity from a public cloud vendor such as Amazon, Rackspace, GoGrid, Linode, etc. I believe that the next step in cloud usage will be deploying instances across multiple cloud providers, mainly for high availability, but also for performance reasons (for example if a specific provider has a presence in a geographical region closer to your user base).&lt;br /&gt;&lt;br /&gt;All cloud vendors offer APIs for accessing their services -- if they don’t, they’re not genuine cloud vendors, in my book at least. The onus is on you as a system administrator to learn how to use these APIs, which can vary wildly from one provider to another. Enter &lt;a href="http://libcloud.org/"&gt;libcloud&lt;/a&gt;, a Python-based package that offers a unified interface to various cloud provider APIs. The list of supported vendors is impressive, and more are added all the time. Libcloud was started by &lt;a href="https://www.cloudkick.com/"&gt;Cloudkick&lt;/a&gt; but has since migrated to the Apache Foundation as an Incubator Project.&lt;br /&gt;&lt;br /&gt;One thing to note is that libcloud goes for breadth at the expense of depth, in that it only supports a subset of the available provider APIs -- things such as creating, rebooting, and destroying an instance, and listing all instances. If you need to go in-depth with a given provider’s API, you need to use other libraries that cover all or at least a large portion of the functionality exposed by the API. Examples of such libraries are &lt;a href="https://github.com/boto/boto"&gt;boto&lt;/a&gt; for Amazon EC2 and &lt;a href="https://github.com/rackspace/python-cloudservers"&gt;python-cloudservers&lt;/a&gt; for Rackspace.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Introducing libcloud&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The current stable version of libcloud is 0.4.0. 
You can install it from PyPI via&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;# easy_install apache-libcloud&lt;/pre&gt;&lt;br /&gt;The main concepts of libcloud are &lt;b&gt;providers&lt;/b&gt;, &lt;b&gt;drivers&lt;/b&gt;, &lt;b&gt;images&lt;/b&gt;, &lt;b&gt;sizes&lt;/b&gt; and &lt;b&gt;locations&lt;/b&gt;.&lt;br /&gt;&lt;br /&gt;A &lt;b&gt;provider&lt;/b&gt; is a cloud vendor such as Amazon EC2 and Rackspace. Note that currently each EC2 region (US East, US West, EU West, Asia-Pacific Southeast) is exposed as a different provider, although they may be unified in the future.&lt;br /&gt;&lt;br /&gt;The common operations supported by libcloud are exposed for each provider through a &lt;b&gt;driver&lt;/b&gt;. If you want to add another provider, you need to create a new driver and implement the interface common to all providers (in the Python code, this is done by subclassing a base NodeDriver class and overriding/adding methods appropriately, according to the specific needs of the provider).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Images&lt;/b&gt; are provider-dependent, and generally represent the OS flavors available for deployment for a given provider. In EC2-speak, they are equivalent to an AMI.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Sizes&lt;/b&gt; are provider-dependent, and represent the amount of compute, storage and network capacity that a given instance will use when deployed. The more capacity, the more you pay and the happier the provider.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Locations&lt;/b&gt; correspond to geographical data center locations available for a given provider; however, they are not very well represented in libcloud. For example, in the case of Amazon EC2, they currently map to EC2 regions rather than EC2 availability zones. However, this will change in the near future (as I will describe below, proper EC2 availability zone management is being implemented). 
As another example, Rackspace is represented in libcloud as a single location, listed currently as DFW1; however, your instances will get deployed at a data center determined at your Rackspace account creation time (thanks to Paul Querna for clarifying this aspect).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Managing instances with libcloud&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Getting a connection to a provider via a driver&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;All the interactions with a given cloud provider happen in libcloud across a connection obtained via the driver for that provider. Here is the canonical code snippet for that, taking EC2 as an example:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;from libcloud.types import Provider&lt;br /&gt;from libcloud.providers import get_driver&lt;br /&gt;EC2_ACCESS_ID = 'MY ACCESS ID'&lt;br /&gt;EC2_SECRET_KEY = 'MY SECRET KEY'&lt;br /&gt;EC2Driver = get_driver(Provider.EC2)&lt;br /&gt;conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;For Rackspace, the code looks like this:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;USER = 'MyUser'&lt;br /&gt;API_KEY = 'MyApiKey'&lt;br /&gt;Driver = get_driver(Provider.RACKSPACE)&lt;br /&gt;conn = Driver(USER, API_KEY)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Getting a list of images available for a provider&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Once you get a connection, you can call a variety of informational methods on that connection, for example &lt;b&gt;list_images&lt;/b&gt;, which returns a list of NodeImage objects. Be prepared for this call to take a while, especially in Amazon EC2, which in the US East region returns no less than 6,932 images currently. 
Here is a code snippet that prints the number of available images, and the first 5 images returned in the list:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;EC2Driver = get_driver(Provider.EC2)&lt;br /&gt;conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)&lt;br /&gt;images = conn.list_images()&lt;br /&gt;print len(images)&lt;br /&gt;print images[:5]&lt;br /&gt;6982&lt;br /&gt;[&amp;lt;NodeImage: id=aki-00806369, name=karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091001-test-04.manifest.xml, driver=Amazon EC2 (us-east-1)  ...&amp;gt;, &amp;lt;NodeImage: id=aki-00896a69, name=karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091002-test-04.manifest.xml, driver=Amazon EC2 (us-east-1)  ...&amp;gt;, &amp;lt;NodeImage: id=aki-008b6869, name=redhat-cloud/RHEL-5-Server/5.4/x86_64/kernels/kernel-2.6.18-164.x86_64.manifest.xml, driver=Amazon EC2 (us-east-1)  ...&amp;gt;, &amp;lt;NodeImage: id=aki-00f41769, name=karmic-kernel-zul/ubuntu-kernel-2.6.31-301-ec2-i386-20091012-test-06.manifest.xml, driver=Amazon EC2 (us-east-1)  ...&amp;gt;, &amp;lt;NodeImage: id=aki-010be668, name=ubuntu-kernels-milestone-us/ubuntu-lucid-i386-linux-image-2.6.32-301-ec2-v-2.6.32-301.4-kernel.img.manifest.xml, driver=Amazon EC2 (us-east-1)  ...&amp;gt;]&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Here is the output of same code running against the Rackspace driver:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;23&lt;br /&gt;[&amp;lt;NodeImage: id=58, name=Windows Server 2008 R2 x64 - MSSQL2K8R2, driver=Rackspace  ...&amp;gt;, &amp;lt;NodeImage: id=71, name=Fedora 14, driver=Rackspace  ...&amp;gt;, &amp;lt;NodeImage: id=29, name=Windows Server 2003 R2 SP2 x86, driver=Rackspace  ...&amp;gt;, &amp;lt;NodeImage: id=40, name=Oracle EL Server Release 5 Update 4, driver=Rackspace  ...&amp;gt;, &amp;lt;NodeImage: id=23, name=Windows Server 2003 R2 SP2 x64, driver=Rackspace  ...&amp;gt;]&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note that a NodeImage object for a given provider may have 
provider-specific information stored in most cases in a variable called ‘extra’. It pays to inspect the NodeImage objects by printing their __dict__ member variable. Here is an example for EC2:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;print images[0].__dict__&lt;br /&gt;{'extra': {}, 'driver': &amp;lt;libcloud.drivers.ec2.EC2NodeDriver object at 0xb7eebfec&amp;gt;, 'id': 'aki-00806369', 'name': 'karmic-kernel-zul/ubuntu-kernel-2.6.31-300-ec2-i386-20091001-test-04.manifest.xml'}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;In this case, the NodeImage object has an id, a name and a driver, with no ‘extra’ information.&lt;br /&gt;&lt;br /&gt;Same code running against Rackspace, with similar information being returned:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;print images[0].__dict__&lt;br /&gt;{'extra': {'serverId': None}, 'driver': &amp;lt;libcloud.drivers.rackspace.RackspaceNodeDriver object at 0x88b506c&amp;gt;, 'id': '4', 'name': 'Debian 5.0 (lenny)'}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Getting a list of sizes available for a provider&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;When you call &lt;b&gt;list_sizes&lt;/b&gt; on a connection to a provider, you retrieve a list of NodeSize objects representing the available sizes for that provider.&lt;br /&gt;&lt;br /&gt;Amazon EC2 example:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;EC2Driver = get_driver(Provider.EC2)&lt;br /&gt;conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)&lt;br /&gt;sizes = conn.list_sizes()&lt;br /&gt;print len(sizes)&lt;br /&gt;print sizes[:5]&lt;br /&gt;print sizes[0].__dict__&lt;br /&gt;9&lt;br /&gt;[&amp;lt;NodeSize: id=m1.large, name=Large Instance, ram=7680 disk=850 bandwidth=None price=.38 driver=Amazon EC2 (us-east-1) ...&amp;gt;, &amp;lt;NodeSize: id=c1.xlarge, name=High-CPU Extra Large Instance, ram=7680 disk=1690 bandwidth=None price=.76 
driver=Amazon EC2 (us-east-1) ...&amp;gt;, &amp;lt;NodeSize: id=m1.small, name=Small Instance, ram=1740 disk=160 bandwidth=None price=.095 driver=Amazon EC2 (us-east-1) ...&amp;gt;, &amp;lt;NodeSize: id=c1.medium, name=High-CPU Medium Instance, ram=1740 disk=350 bandwidth=None price=.19 driver=Amazon EC2 (us-east-1) ...&amp;gt;, &amp;lt;NodeSize: id=m1.xlarge, name=Extra Large Instance, ram=15360 disk=1690 bandwidth=None price=.76 driver=Amazon EC2 (us-east-1) ...&amp;gt;]&lt;br /&gt;{'name': 'Large Instance', 'price': '.38', 'ram': 7680, 'driver': &amp;lt;libcloud.drivers.ec2.EC2NodeDriver object at 0xb7f49fec&amp;gt;, 'bandwidth': None, 'disk': 850, 'id': 'm1.large'}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Same code running against Rackspace:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;7&lt;br /&gt;[&amp;lt;NodeSize: id=1, name=256 server, ram=256 disk=10 bandwidth=None price=.015 driver=Rackspace ...&amp;gt;, &amp;lt;NodeSize: id=2, name=512 server, ram=512 disk=20 bandwidth=None price=.030 driver=Rackspace ...&amp;gt;, &amp;lt;NodeSize: id=3, name=1GB server, ram=1024 disk=40 bandwidth=None price=.060 driver=Rackspace ...&amp;gt;, &amp;lt;NodeSize: id=4, name=2GB server, ram=2048 disk=80 bandwidth=None price=.120 driver=Rackspace ...&amp;gt;, &amp;lt;NodeSize: id=5, name=4GB server, ram=4096 disk=160 bandwidth=None price=.240 driver=Rackspace ...&amp;gt;]&lt;br /&gt;{'name': '256 server', 'price': '.015', 'ram': 256, 'driver': &amp;lt;libcloud.drivers.rackspace.RackspaceNodeDriver object at 0x841506c&amp;gt;, 'bandwidth': None, 'disk': 10, 'id': '1'}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Getting a list of locations available for a provider&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;As I mentioned before, locations are somewhat ambiguous currently in libcloud.&lt;br /&gt;&lt;br 
/&gt;For example, when you call &lt;b&gt;list_locations&lt;/b&gt; on a connection to the EC2 provider (which represents the EC2 US East region), you get information about the region and not about the availability zones (AZs) included in that region:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;EC2Driver = get_driver(Provider.EC2)&lt;br /&gt;conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)&lt;br /&gt;print conn.list_locations()&lt;br /&gt;[&amp;lt;NodeLocation: id=0, name=Amazon US N. Virginia, country=US, driver=Amazon EC2 (us-east-1)&amp;gt;]&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;However, there is a patch sent by Tomaž Muraus to the libcloud mailing list which adds support for EC2 availability zones. For example, the US East region has 4 AZs: us-east-1a, us-east-1b, us-east-1c, us-east-1d. These AZs should be represented by libcloud locations, and indeed the code with the patch applied shows just that:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;print conn.list_locations()&lt;br /&gt;[&amp;lt;EC2NodeLocation: id=0, name=Amazon US N. Virginia, country=US, availability_zone=us-east-1a driver=Amazon EC2 (us-east-1)&amp;gt;, &amp;lt;EC2NodeLocation: id=1, name=Amazon US N. Virginia, country=US, availability_zone=us-east-1b driver=Amazon EC2 (us-east-1)&amp;gt;, &amp;lt;EC2NodeLocation: id=2, name=Amazon US N. Virginia, country=US, availability_zone=us-east-1c driver=Amazon EC2 (us-east-1)&amp;gt;, &amp;lt;EC2NodeLocation: id=3, name=Amazon US N. 
Virginia, country=US, availability_zone=us-east-1d driver=Amazon EC2 (us-east-1)&amp;gt;]&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Hopefully the patch will soon make it into the &lt;a href="https://github.com/apache/libcloud"&gt;libcloud GitHub repository&lt;/a&gt;, and then into the next libcloud release.&lt;br /&gt;&lt;br /&gt;(&lt;b&gt;Update 02/24/11:&lt;/b&gt; The patch did make it into the latest libcloud release, which is 0.4.2 at this time.)&lt;br /&gt;&lt;br /&gt;If you run list_locations on a Rackspace connection, you get back DFW1, even though your instances may actually get deployed at a different data center. Hopefully this too will be fixed soon in libcloud:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;Driver = get_driver(Provider.RACKSPACE)&lt;br /&gt;conn = Driver(USER, API_KEY)&lt;br /&gt;print conn.list_locations()&lt;br /&gt;[&amp;lt;NodeLocation: id=0, name=Rackspace DFW1, country=US, driver=Rackspace&amp;gt;]&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Launching an instance&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The API call for launching an instance with libcloud is &lt;b&gt;create_node&lt;/b&gt;. It has 3 required parameters: a name for your new instance, a NodeImage and a NodeSize. You can also specify a NodeLocation (if you don’t, the default location for that provider will be used).&lt;br /&gt;&lt;br /&gt;&lt;b&gt;EC2 node creation example&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;A given provider driver may accept other parameters to the &lt;b&gt;create_node&lt;/b&gt; call. For example, EC2 accepts an &lt;b&gt;ex_keyname&lt;/b&gt; argument for specifying the EC2 key you want to use when creating the instance.&lt;br /&gt;&lt;br /&gt;Note that to create a node, you have to know what image and what size you want to use for that node. The code snippets I showed above for retrieving the images and sizes available for a given provider come in handy here. 
You can either retrieve the full list and iterate through the list until you find your desired image and size (either by name or by id), or you can construct NodeImage and NodeSize objects from scratch, based on the desired id.&lt;br /&gt;&lt;br /&gt;Example of a NodeImage object for EC2 corresponding to a specific AMI:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;image = NodeImage(id="ami-014da868", name="", driver="")&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Example of a NodeSize object for EC2 corresponding to an m1.small instance size:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;size = NodeSize(id="m1.small", name="", ram=None, disk=None, bandwidth=None, price=None, driver="")&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note that in both examples the only parameter that needs to be set is the id, but all the other parameters need to be present in the call, even if they are set to None or the empty string.&lt;br /&gt;&lt;br /&gt;In the case of EC2, for the instance to be actually usable via ssh, you also need to pass the &lt;b&gt;ex_keyname&lt;/b&gt; parameter and set it to a keypair name that exists in your EC2 account for that region. Libcloud provides a way to create or import a keypair programmatically. 
Here is a code snippet that creates a keypair via the &lt;b&gt;ex_create_keypair&lt;/b&gt; call (specific to the libcloud EC2 driver), then saves the private key in a file in /root/.ssh on the machine running the code:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;keyname = sys.argv[1]&lt;br /&gt;resp = conn.ex_create_keypair(name=keyname)&lt;br /&gt;key_material = resp.get('keyMaterial')&lt;br /&gt;if not key_material:&lt;br /&gt;    sys.exit(1)&lt;br /&gt;private_key = '/root/.ssh/%s.pem' % keyname&lt;br /&gt;f = open(private_key, 'w')&lt;br /&gt;f.write(key_material + '\n')&lt;br /&gt;f.close()&lt;br /&gt;os.chmod(private_key, 0600)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;You can also pass the name of an EC2 security group to create_node via the &lt;b&gt;ex_securitygroup&lt;/b&gt; parameter. Libcloud also allows you to create security groups programmatically by means of the &lt;b&gt;ex_create_security_group&lt;/b&gt; method specific to the libcloud EC2 driver.&lt;br /&gt;&lt;br /&gt;Now, armed with the NodeImage and NodeSize objects constructed above, as well as the keypair name, we can launch an instance in EC2:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;node = conn.create_node(name='test1', image=image, size=size, ex_keyname=keyname)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note that we didn’t specify any location, so we have no control over the availability zone where the instance will be created. With Tomaž’s patch we can actually get a location corresponding to our desired availability zone, then launch the instance in that zone. 
Here is an example for us-east-1b:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;locations = conn.list_locations()&lt;br /&gt;for location in locations:&lt;br /&gt;    if location.availability_zone.name == 'us-east-1b':&lt;br /&gt;        break&lt;br /&gt;node = conn.create_node(name='tst', image=image, size=size, location=location, ex_keyname=keyname)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Once the node is created, you can call the &lt;b&gt;list_nodes&lt;/b&gt; method on the connection object and inspect the current status of the node, along with other information about that node. In EC2, a new instance is initially shown with a status of ‘pending’. Once the status changes to ‘running’, you can ssh into that instance using the private key created above.&lt;br /&gt;&lt;br /&gt;Printing node.__dict__ for a newly created instance shows it with ‘pending’ status:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;{'name': 'i-f692ae9b', 'extra': {'status': 'pending', 'productcode': [], 'groups': None, 'instanceId': 'i-f692ae9b', 'dns_name': '', 'launchdatetime': '2010-12-14T20:25:22.000Z', 'imageId': 'ami-014da868', 'kernelid': None, 'keyname': 'k1', 'availability': 'us-east-1d', 'launchindex': '0', 'ramdiskid': None, 'private_dns': '', 'instancetype': 'm1.small'}, 'driver': &amp;lt;libcloud.drivers.ec2.EC2NodeDriver object at 0x9e088ec&amp;gt;, 'public_ip': [''], 'state': 3, 'private_ip': [''], 'id': 'i-f692ae9b', 'uuid': '76fcd974aab6f50092e5a637d6edbac140d7542c'}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Printing node.__dict__ a few minutes after the instance was launched shows the instance with ‘running’ status:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;{'name': 'i-f692ae9b', 'extra': {'status': 'running', 'productcode': [], 'groups': ['default'], 'instanceId': 'i-f692ae9b', 'dns_name': 'ec2-184-72-92-114.compute-1.amazonaws.com', 'launchdatetime': '2010-12-14T20:25:22.000Z', 'imageId': 'ami-014da868', 'kernelid': None, 'keyname': 
'k1', 'availability': 'us-east-1d', 'launchindex': '0', 'ramdiskid': None, 'private_dns': 'domU-12-31-39-04-65-11.compute-1.internal', 'instancetype': 'm1.small'}, 'driver': &amp;lt;libcloud.drivers.ec2.EC2NodeDriver object at 0x93f42cc&amp;gt;, 'public_ip': ['ec2-184-72-92-114.compute-1.amazonaws.com'], 'state': 0, 'private_ip': ['domU-12-31-39-04-65-11.compute-1.internal'], 'id': 'i-f692ae9b', 'uuid': '76fcd974aab6f50092e5a637d6edbac140d7542c'}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note also that the ‘extra’ member variable of the node object shows a wealth of information specific to EC2 -- things such as security group, AMI id, kernel id, availability zone, private and public DNS names, etc. Another interesting thing to note is that the &lt;b&gt;name&lt;/b&gt; member variable of the node object is now set to the EC2 instance id, thus guaranteeing uniqueness of names across EC2 node objects.&lt;br /&gt;&lt;br /&gt;At this point (assuming the machine where you run the libcloud code is allowed ssh access into the default EC2 security group) you should be able to ssh into the newly created instance using the private key corresponding to the keypair you used to create the instance. In my case, I used the k1.pem private key file created via &lt;b&gt;ex_create_keypair&lt;/b&gt; and I ssh-ed into the private IP address of the new instance, because I was already on an EC2 instance in the same availability zone:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;# ssh -i ~/.ssh/k1.pem domU-12-31-39-04-65-11.compute-1.internal&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Rackspace node creation example&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Here is another example of calling &lt;b&gt;create_node&lt;/b&gt;, this time using Rackspace as the provider. Before I ran this code, I had already called list_images and list_sizes on the Rackspace connection object, so I know that I want the NodeImage with id 71 (which happens to be Fedora 14) and the NodeSize with id 1 (the smallest one). 
The code snippet below will create the node using the image and the size I specify, with a &lt;b&gt;name&lt;/b&gt; that I also specify (this name needs to be different for each call of create_node):&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;Driver = get_driver(Provider.RACKSPACE)&lt;br /&gt;conn = Driver(USER, API_KEY)&lt;br /&gt;images = conn.list_images()&lt;br /&gt;for image in images:&lt;br /&gt;    if image.id == '71':&lt;br /&gt;        break&lt;br /&gt;sizes = conn.list_sizes()&lt;br /&gt;for size in sizes:&lt;br /&gt;    if size.id == '1':&lt;br /&gt;        break&lt;br /&gt;node = conn.create_node(name='testrackspace', image=image, size=size)&lt;br /&gt;print node.__dict__&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;The code prints out:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;{'name': 'testrackspace', 'extra': {'metadata': {}, 'password': 'testrackspaceO1jk6O5jV', 'flavorId': '1', 'hostId': '9bff080afbd3bec3ca140048311049f9', 'imageId': '71'}, 'driver': &amp;lt;libcloud.drivers.rackspace.RackspaceNodeDriver object at 0x877c3ec&amp;gt;, 'public_ip': ['184.106.187.226'], 'state': 3, 'private_ip': ['10.180.67.242'], 'id': '497741', 'uuid': '1fbf7c3fde339af9fa901af6bf0b73d4d10472bb'}&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Note that the &lt;b&gt;name&lt;/b&gt; variable of the node object was set to the name we specified in the create_node call. 
You don’t log in with a key (at least initially) to a Rackspace node, but instead you’re given a password you can use to log in as root to the public IP that is also returned in the node information:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;ssh root@184.106.187.226&lt;br /&gt;root@184.106.187.226's password:&lt;br /&gt;[root@testrackspace ~]#&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Rebooting and destroying instances&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Once you have a list of nodes in a given provider, it’s easy to iterate through the list and choose a given node based on its unique name -- which as we’ve seen is the instance id for EC2 and the hostname for Rackspace. Once you identify a node, you can call &lt;b&gt;destroy_node&lt;/b&gt; or &lt;b&gt;reboot_node&lt;/b&gt; on the connection object to terminate or reboot that node.&lt;br /&gt;&lt;br /&gt;Here is a code snippet that performs a &lt;b&gt;destroy_node&lt;/b&gt; operation for an EC2 instance with a specific instance id:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;EC2Driver = get_driver(Provider.EC2)&lt;br /&gt;conn = EC2Driver(EC2_ACCESS_ID, EC2_SECRET_KEY)&lt;br /&gt;nodes = conn.list_nodes()&lt;br /&gt;for node in nodes:&lt;br /&gt;    if node.name == 'i-66724d0b':&lt;br /&gt;        conn.destroy_node(node)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;Here is another code snippet that performs a &lt;b&gt;reboot_node&lt;/b&gt; operation for a Rackspace node with a specific hostname:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;Driver = get_driver(Provider.RACKSPACE)&lt;br /&gt;conn = Driver(USER, API_KEY)&lt;br /&gt;nodes = conn.list_nodes()&lt;br /&gt;for node in nodes:&lt;br /&gt;    if node.name == 'testrackspace':&lt;br /&gt;        conn.reboot_node(node)&lt;/pre&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;The Overmind project&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;I would be remiss if I didn’t mention a new but very promising project started by &lt;a href="http://tobami.wordpress.com/"&gt;Miquel Torres&lt;/a&gt;: 
&lt;a href="https://github.com/tobami/overmind"&gt;Overmind&lt;/a&gt;. The goal of Overmind is to be a complete server provisioning and configuration management system. For the server provisioning portion, Overmind uses libcloud, while also offering a Django-based Web interface for managing providers and nodes. EC2 and Rackspace are supported currently, but it should be easy to add new providers. If you are interested in trying out Overmind and contributing code or tests, please send a message to the &lt;a href="http://groups.google.com/group/overmind-dev?pli=1"&gt;overmind-dev&lt;/a&gt; mailing list. Next versions of Overmind aim to add configuration management capabilities using &lt;a href="http://www.opscode.com/chef"&gt;Opscode Chef&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Further reading&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;libcloud &lt;a href="http://incubator.apache.org/libcloud/getting-started.html"&gt;Getting Started&lt;/a&gt; page&lt;/li&gt;&lt;li&gt;libcloud &lt;a href="http://ci.apache.org/projects/libcloud/apidocs/"&gt;API documentation&lt;/a&gt;&lt;/li&gt;&lt;li&gt;libcloud &lt;a href="https://issues.apache.org/jira/browse/LIBCLOUD"&gt;JIRA issue tracker&lt;/a&gt;&lt;/li&gt;&lt;li&gt;libcloud &lt;a href="http://mail-archives.apache.org/mod_mbox/incubator-libcloud/"&gt;mailing list archives&lt;/a&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2119799468040032946?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2119799468040032946/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2119799468040032946' title='3 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9238405/posts/default/2119799468040032946'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2119799468040032946'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/12/using-libcloud-to-manage-instances.html' title='Using libcloud to manage instances across multiple cloud providers'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-5047506547135488442</id><published>2010-12-10T12:33:00.000-08:00</published><updated>2011-09-28T11:46:50.424-07:00</updated><title type='text'>A Fabric script for striping EBS volumes</title><content type='html'>Here's a short Fabric script which might be useful to people who need to stripe EBS volumes in Amazon EC2. Striping is recommended if you want to improve the I/O of your EBS-based volumes. However, striping won't help if one of the member EBS volumes goes AWOL or suffers performance issues. 
In any case, here's the Fabric script:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;import commands&lt;br /&gt;from fabric.api import *&lt;br /&gt;&lt;br /&gt;# Globals&lt;br /&gt;&lt;br /&gt;env.project='EBSSTRIPING'&lt;br /&gt;env.user = 'myuser'&lt;br /&gt;&lt;br /&gt;DEVICES = [&lt;br /&gt;    "/dev/sdd",&lt;br /&gt;    "/dev/sde",&lt;br /&gt;    "/dev/sdf",&lt;br /&gt;    "/dev/sdg",&lt;br /&gt;]&lt;br /&gt;&lt;br /&gt;VOL_SIZE = 400 # GB&lt;br /&gt;&lt;br /&gt;# Tasks&lt;br /&gt;&lt;br /&gt;def install():&lt;br /&gt;    install_packages()&lt;br /&gt;    create_raid0()&lt;br /&gt;    create_lvm()&lt;br /&gt;    mkfs_mount_lvm()&lt;br /&gt;&lt;br /&gt;def install_packages():&lt;br /&gt;    run('DEBIAN_FRONTEND=noninteractive apt-get -y install mdadm')&lt;br /&gt;    run('apt-get -y install lvm2')&lt;br /&gt;    run('modprobe dm-mod')&lt;br /&gt;    &lt;br /&gt;def create_raid0():&lt;br /&gt;    cmd = 'mdadm --create /dev/md0 --level=0 --chunk=256 --raid-devices=4 '&lt;br /&gt;    for device in DEVICES:&lt;br /&gt;        cmd += '%s ' % device&lt;br /&gt;    run(cmd)&lt;br /&gt;    run('blockdev --setra 65536 /dev/md0')&lt;br /&gt;&lt;br /&gt;def create_lvm():&lt;br /&gt;    run('pvcreate /dev/md0')&lt;br /&gt;    run('vgcreate vgm0 /dev/md0')&lt;br /&gt;    run('lvcreate --name lvm0 --size %dG vgm0' % VOL_SIZE)&lt;br /&gt;&lt;br /&gt;def mkfs_mount_lvm():&lt;br /&gt;    run('mkfs.xfs /dev/vgm0/lvm0')&lt;br /&gt;    run('mkdir -p /mnt/lvm0')&lt;br /&gt;    run('echo "/dev/vgm0/lvm0 /mnt/lvm0 xfs defaults 0 0" &amp;gt;&amp;gt; /etc/fstab')&lt;br /&gt;    run('mount /mnt/lvm0')&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;A few things to note:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;I assume that you already created and attached 4 EBS volumes to your instance with device names /dev/sdd through /dev/sdg; if your device names or volume count are different, modify the DEVICES list appropriately&lt;/li&gt;&lt;li&gt;The size of your target RAID0 volume is set in the 
VOL_SIZE variable&lt;/li&gt;&lt;li&gt;the helper functions are pretty self-explanatory:&amp;nbsp;&lt;/li&gt;&lt;ol&gt;&lt;li&gt;we use mdadm to create a RAID0 device called /dev/md0 with a 256 KB chunk size; we also set the read-ahead on /dev/md0 to 65536 sectors (32 MB with 512-byte sectors) via the blockdev call&lt;/li&gt;&lt;li&gt;we create a physical LVM volume on /dev/md0&lt;/li&gt;&lt;li&gt;we create a volume group called vgm0 on /dev/md0&lt;/li&gt;&lt;li&gt;we create a logical LVM volume called lvm0 of size VOL_SIZE, inside the vgm0 group&lt;/li&gt;&lt;li&gt;we format the logical volume as XFS, then we mount it and also modify /etc/fstab&lt;/li&gt;&lt;/ol&gt;&lt;/ul&gt;&lt;div&gt;That's it. Hopefully it will be useful to somebody out there.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-5047506547135488442?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/5047506547135488442/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=5047506547135488442' title='9 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5047506547135488442'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5047506547135488442'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/12/fabric-script-for-striping-ebs-volumes.html' title='A Fabric script for striping EBS volumes'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' 
src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>9</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1850099966073387805</id><published>2010-12-09T11:06:00.001-08:00</published><updated>2010-12-09T11:06:33.605-08:00</updated><title type='text'>Automated deployments with LittleChef</title><content type='html'>See my &lt;a href="http://sysadvent.blogspot.com/2010/12/day-9-automated-deployments-with.html"&gt;post&lt;/a&gt; on the Sysadvent blog.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1850099966073387805?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1850099966073387805/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1850099966073387805' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1850099966073387805'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1850099966073387805'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/12/automated-deployments-with-littlechef.html' title='Automated deployments with LittleChef'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-6503775229157849943</id><published>2010-11-30T21:38:00.000-08:00</published><updated>2010-11-30T21:38:23.587-08:00</updated><title type='text'>Working with Chef attributes</title><content type='html'>It took me a while to really get how to use 
&lt;a href="http://wiki.opscode.com/display/chef/Attributes"&gt;Chef attributes&lt;/a&gt;. It's fairly easy to understand what they are and where they are referenced in recipes, but it's not so clear from the documentation how and where to override them. Here's a quick note to clarify this.&lt;br /&gt;&lt;br /&gt;A Chef attribute can be seen as a variable that:&lt;br /&gt;&lt;br /&gt;1) &lt;b&gt;gets initialized to a default value&lt;/b&gt; in cookbooks/mycookbook/attributes/default.rb&lt;br /&gt;&lt;br /&gt;Examples:&lt;br /&gt;&lt;br /&gt;default[:mycookbook][:swapfilesize] = '10485760'&lt;br /&gt;default[:mycookbook][:tornado_version] = '1.1'&lt;br /&gt;default[:mycookbook][:haproxy_version] = '1.4.8'&lt;br /&gt;default[:mycookbook][:nginx_version] = '0.8.20'&lt;br /&gt;&lt;div&gt;&lt;/div&gt;&lt;br /&gt;2) &lt;b&gt;gets used in cookbook recipes&lt;/b&gt; such as cookbooks/mycookbook/recipes/default.rb or any other myrecipefile.rb in the recipes directory; the syntax for using the attribute's value is of the form #{node[:mycookbook][:attribute_name]}&lt;br /&gt;&lt;br /&gt;Example of using the haproxy_version attribute in a recipe called haproxy.rb:&lt;br /&gt;&lt;br /&gt;# install haproxy from source&lt;br /&gt;haproxy = "haproxy-&lt;b&gt;#{node[:mycookbook][:haproxy_version]}&lt;/b&gt;"&lt;br /&gt;haproxy_pkg = "#{haproxy}.tar.gz"&lt;br /&gt;&lt;br /&gt;downloads = [&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;"#{haproxy_pkg}",&lt;br /&gt;]&lt;br /&gt;&lt;br /&gt;downloads.each do |file|&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;remote_file "/tmp/#{file}" do&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;source "http://myserver.com/download/haproxy/#{file}"&lt;br /&gt;&amp;nbsp;&amp;nbsp; &amp;nbsp;end&lt;br /&gt;end&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;....etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Example of using the&amp;nbsp;swapfilesize attribute in the default recipe 
default.rb:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# create swap file if it doesn't exist&lt;/div&gt;&lt;div&gt;SWAPFILE=/data/swapfile1&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;if [ -e "$SWAPFILE" ]&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;then&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; exit 0&lt;/div&gt;&lt;div&gt;fi&lt;/div&gt;&lt;div&gt;dd if=/dev/zero of=$SWAPFILE bs=1024 count=&lt;b&gt;#{node[:mycookbook][:swapfilesize]}&lt;/b&gt;&lt;/div&gt;&lt;div&gt;mkswap $SWAPFILE&lt;/div&gt;&lt;div&gt;swapon $SWAPFILE&lt;/div&gt;&lt;div&gt;echo "$SWAPFILE swap swap defaults 0 0" &amp;gt;&amp;gt; /etc/fstab&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;3) &lt;b&gt;can be overridden&lt;/b&gt; at either the role or the node level (and some other more obscure levels that I haven't used in practice)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I prefer to override attributes at the role level, because I want those overridden values to apply to all nodes pertaining to that role.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For example, I have a role called appserver, which is defined in a file called appserver.rb in the roles directory:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;# cat appserver.rb&amp;nbsp;&lt;/div&gt;&lt;div&gt;name "appserver"&lt;/div&gt;&lt;div&gt;description "Installs required packages and applications for an app server"&lt;/div&gt;&lt;div&gt;run_list "recipe[mycookbook::appserver]", "recipe[mycookbook::haproxy]", "recipe[memcached]"&lt;/div&gt;&lt;div&gt;&lt;b&gt;override_attributes "mycookbook" =&amp;gt; { "swapfilesize" =&amp;gt; "&lt;span class="Apple-style-span" style="font-weight: normal;"&gt;&lt;b&gt;4194304&lt;/b&gt;&lt;/span&gt;" }, "memcached" =&amp;gt; { "memory" =&amp;gt; "4096" }&lt;/b&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here I override the 
default swapfilesize value (an attribute from the mycookbook cookbook) of 10 GB and set it to 4 GB. I also override the default memcached memory value (an attribute from the Opscode memcached cookbook) of 128 MB and set it to 4 GB.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If, however, you want to override some attributes at the node level, you can do so in the chef.json file on the node. For example, to set the swap size to 2 GB:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;"run_list": [ "role[base]", "role[appserver]" ],&lt;/div&gt;&lt;div&gt;"mycookbook": { "swapfilesize": "2097152" }&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Another question: when should you use Chef attributes? My own observation is that wherever I use hardcoded values in my recipes, it's almost always better to use an attribute instead, and set the default value of the attribute to that hardcoded value, which can then be overridden as needed.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-6503775229157849943?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/6503775229157849943/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=6503775229157849943' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/6503775229157849943'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/6503775229157849943'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/11/working-with-chef-attributes.html' title='Working with Chef attributes'/><author><name>Grig 
Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-8318391420718568852</id><published>2010-11-16T16:21:00.000-08:00</published><updated>2010-11-16T22:22:41.507-08:00</updated><title type='text'>How to whip your infrastructure into shape</title><content type='html'>It's easy, just follow these steps:&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 0.&lt;/b&gt;  If you're fortunate enough to participate in the design of your infrastructure (as opposed to being thrown at the deep end and having to maintain some 'legacy' one), then try to aim for horizontal scalability. It's easier to scale out than to scale up, and failures in this mode will hopefully impact a smaller percentage of your users.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 1.  Configure a good monitoring and alerting system&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This is the single most important thing you need to do for your infrastructure. It's also a great way to learn a new infrastructure that you need to maintain.&lt;br /&gt;&lt;br /&gt;I talked about different types of monitoring in &lt;a href="http://agiletesting.blogspot.com/2010/02/web-site-monitoring-techniques-and.html"&gt;another blog post&lt;/a&gt;. My preferred approach is to have 2 monitoring systems in place:&lt;br /&gt;&lt;ul&gt;&lt;li&gt; an internal monitoring system which I use to check the health of individual servers/devices&lt;/li&gt;&lt;li&gt;an external monitoring system used to check the behavior of the application/web site as a regular user would.&lt;/li&gt;&lt;/ul&gt;&lt;br /&gt;My preferred internal monitoring/alerting system is Nagios, but tools like Zabbix, OpenNMS, Zenoss, Monit etc. would definitely also do the job. 
I like Nagios because there is a wide variety of plugins already available (such as the extremely useful &lt;a href="http://exchange.nagios.org/directory/MySQL/check_mysql_health/details"&gt;check_mysql_health&lt;/a&gt; plugin) and also because it's very easy to write custom plugins for your specific application needs. It's also relatively easy to generate Nagios configuration files automatically.&lt;br /&gt;&lt;br /&gt;For external monitoring I use a combination of Pingdom and Akamai alerts. Pingdom runs checks against certain URLs within our application, whereas Akamai alerts us whenever the percentage of HTTP error codes returned by our application is greater than a certain threshold.&lt;br /&gt;&lt;br /&gt;I'll talk more about correlating internal and external alerts below.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 2. Configure a good resource graphing system&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;This is the second most important thing you need to do for your infrastructure. It gives you visibility into how your system resources are used. If you don't have this visibility, it's very hard to do proper capacity planning. It's also hard to correlate monitoring alerts with resource limits you might have reached.&lt;br /&gt;&lt;br /&gt;I use both Munin and Ganglia for resource graphing. I like the graphs that Munin produces and also some of the plugins that are available (such as the &lt;a href="https://github.com/kjellm/munin-mysql"&gt;munin-mysql&lt;/a&gt; plugin), and I also like Ganglia's nice graph aggregation feature, which allows me to watch the same system resource across a cluster of nodes. Munin has this feature too, but Ganglia was designed from the get-go to work on clusters of machines.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 3. Dashboards, dashboards, dashboards&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Everybody knows that whoever has the most dashboards wins. I am talking here about &lt;b&gt;application-specific metrics&lt;/b&gt; that you want to track over time. 
I gave an example before of a dashboard I built for &lt;a href="http://agiletesting.blogspot.com/2010/07/tracking-and-visualizing-mail-logs-with.html"&gt;visualizing the outgoing email count&lt;/a&gt; through our system.&lt;br /&gt;&lt;br /&gt;It's very easy to build such a dashboard with the &lt;a href="http://code.google.com/apis/charttools/index.html"&gt;Google Visualization API&lt;/a&gt;, so there's really no excuse for not having charts for critical metrics of your infrastructure. We use queuing a lot internally at Evite, so we have dashboards for tracking various queue sizes. We also track application errors from nginx logs and chart them in various ways: by server, by error code, by URL, aggregated, etc.&lt;br /&gt;&lt;br /&gt;Dashboards offered by external monitoring tools such as Pingdom/Keynote/Gomez/Akamai are also very useful.  They typically chart uptime and response time for various pages, and edge/origin HTTP traffic in the case of Akamai.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 4. Correlate errors with resource state and capacity&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;The combination of internal and external monitoring, resource charting and application dashboards is very powerful. As a rule, whenever you have an external alert firing off, you should have one or more internal ones firing off too. If you don't, then you don't have sufficient internal alerts, so you need to work on that aspect of your monitoring.&lt;br /&gt;&lt;br /&gt;Once you do have external and internal alerts firing off in unison, you will be able to correlate external issues (such as increased percentages of HTTP error codes, or timeouts in certain application URLs) with server capacity issues/bottlenecks within your infrastructure. 
Of course, the fact that you are charting resources over time, and that you have a baseline to go from, will help you quickly identify outliers such as spikes in CPU usage or drops in Akamai traffic.&lt;br /&gt;&lt;br /&gt;A typical work day for me starts with me opening a few tabs in my browser: the Nagios overview page, the Munin overview, the Ganglia overview, the Akamai HTTP content delivery dashboard, and various application-specific dashboards.&lt;br /&gt;&lt;br /&gt;Let's say I get an alert from Akamai that the percentage of HTTP 500 error codes is over 1%. I start by checking the resource graphs for our database servers. I look at in-depth MySQL metrics in Munin, and at CPU metrics (especially CPU I/O wait time) in Ganglia. If nothing is out of the ordinary, I look at our various application services (our application consists of a multitude of RESTful Web services). The nginx log dashboard may show increased HTTP 500 errors from a particular server, or it may show an increase in such errors across the board. This may point to insufficient capacity at our services layer. Time to deploy more services on servers with enough CPU/RAM capacity. I know which servers those are, because I keep tabs on them with Munin and Ganglia.&lt;br /&gt;&lt;br /&gt;As another example, I know that if the CPU I/O wait on my database servers approaches 30%, the servers will start huffing and puffing, and I'll see an increased number of slow queries. In this case, it's time to either identify queries to be optimized, or reduce the number of queries to the database -- or if everything else fails, time to add more database servers. 
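&lt;br /&gt;&lt;br /&gt;Here is a minimal sketch of how a threshold like this could be turned into a custom Nagios plugin. It is an illustration, not a check taken from our setup: the function names and the 20% warning level are made up for the example, and only the 30% critical level comes from the number above. The sketch samples the aggregate cpu line of /proc/stat twice and maps the I/O wait percentage to the standard Nagios exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL).&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;
import sys
import time


def parse_cpu_line(line):
    """Return user, nice, system, idle, iowait, irq, softirq, steal as ints."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # Guest time is already included in user/nice, so only 8 fields are kept.
    return [int(value) for value in parts[1:9]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in I/O wait between two samples."""
    old = parse_cpu_line(before)
    new = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(old, new)]
    total = sum(deltas)
    if total == 0:
        return 0.0
    return 100.0 * deltas[4] / total  # index 4 is the iowait column


def nagios_status(percent, warn=20.0, crit=30.0):
    """Map a percentage to a Nagios exit code and a one-line message."""
    if percent >= crit:
        return 2, "CRITICAL - CPU I/O wait at %.1f%%" % percent
    if percent >= warn:
        return 1, "WARNING - CPU I/O wait at %.1f%%" % percent
    return 0, "OK - CPU I/O wait at %.1f%%" % percent


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat, which holds the aggregate counters."""
    with open(path) as stat_file:
        return stat_file.readline()


def main(interval=5):
    """Sample twice, print the status line and exit the way Nagios expects."""
    before = read_cpu_line()
    time.sleep(interval)
    after = read_cpu_line()
    code, message = nagios_status(iowait_percent(before, after))
    print(message)
    sys.exit(code)
&lt;/pre&gt;&lt;br /&gt;A plugin script would simply call main(); the thresholds are the part you would tune against your own baseline graphs.&lt;br /&gt;&lt;br /&gt;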
(BTW, if you haven't yet read @allspaw's book "&lt;a href="http://www.amazon.com/Art-Capacity-Planning-Scaling-Resources/dp/0596518579"&gt;The Art of Capacity Planning&lt;/a&gt;", add reading it as a task for Step 0.)&lt;br /&gt;&lt;br /&gt;My point is that all these alerts and metric graphs are interconnected, and without looking at all of them, you're flying blind.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Step 5.  Expect failures and recover quickly and gracefully&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;It's not a question of whether failures will happen, but WHEN they will happen. When they do happen, you need to be prepared. Hopefully you designed your infrastructure in a way that allows you to bounce back quickly from failures. If a database server goes down, hopefully you have a slave that you can quickly promote to a master, or even better you have another passive master ready to become the active server. Even better, maybe you have a fancy self-healing distributed database -- kudos to you then ;-)&lt;br /&gt;&lt;br /&gt;One thing that you can do here is to have various knobs that turn on and off certain features or pieces of functionality within your application (again, John Allspaw has some blog posts and presentations on that from his days at Flickr and his current role at Etsy). These knobs allow you to survive an application server outage, or even (God forbid) a database outage, while still being able to present *something* to your end-users.&lt;br /&gt;&lt;br /&gt;To quickly bounce back from a server failure, I recommend you use automated deployment and configuration management tools such as Chef, Puppet, Fabric, etc. 
(see &lt;a href="http://agiletesting.blogspot.com/2010/03/automated-deployment-systems-push-vs.html"&gt;some&lt;/a&gt; &lt;a href="http://agiletesting.blogspot.com/2010/07/bootstrapping-ec2-instances-with-chef.html"&gt;posts&lt;/a&gt; of &lt;a href="http://agiletesting.blogspot.com/2010/10/introducing-project-overmind.html"&gt;mine&lt;/a&gt; on this &lt;a href="http://agiletesting.blogspot.com/2010/08/what-automated-deploymentconfig-mgmt.html"&gt;topic&lt;/a&gt;). I personally use a combination of Chef (to bootstrap a new machine and do things such as file system layout, prerequisite installation, etc.) and Fabric (to actually deploy the application code).&lt;br /&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;br /&gt;&lt;b&gt;Update #1:&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Comment on Twitter:&lt;br /&gt;&lt;br /&gt;@ericholscher:&lt;br /&gt;&lt;br /&gt;"@griggheo Good stuff. Any thoughts on figuring out symptom's from causes? eg. load balancer is having issues which causes db load to drop?"&lt;br /&gt;&lt;br /&gt;Good question, and something similar has actually happened to us. To me, it's a matter of knowing your baseline graphs. In our case, whenever I see an Akamai traffic drop, it's usually correlated with an increase in the percentage of HTTP 500 errors returned by our Web services. If I also see DB traffic dropping, then I know the bottleneck is at the application services layer. If the DB traffic is increasing, then the bottleneck is most likely the DB. 
Depending on the bottleneck, we need to add capacity at that layer, or to optimize code or DB queries at the respective layer.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-size: x-small;"&gt;&lt;b&gt;Main accomplishment of this post&lt;/b&gt;&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;I'm most proud of the fact that I haven't used the following words in the post above: 'devops', 'cloud', 'noSQL' and 'agile'.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-8318391420718568852?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/8318391420718568852/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=8318391420718568852' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/8318391420718568852'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/8318391420718568852'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/11/how-to-whip-your-infrastructure-into.html' title='How to whip your infrastructure into shape'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-3800334991399894914</id><published>2010-10-28T14:43:00.000-07:00</published><updated>2010-10-28T14:43:39.977-07:00</updated><title type='text'>MySQL load balancing with HAProxy</title><content type='html'>In an &lt;a 
href="http://agiletesting.blogspot.com/2010/02/use-haproxy-14-if-you-need-mysql-health.html"&gt;earlier blog post&lt;/a&gt; I was advising people to use HAProxy 1.4 and above if they need MySQL load balancing with health checks. It turns out that I didn't have much luck with that solution either. HAProxy shines when it load balances HTTP traffic, and its health checks are really meant to be run over HTTP and not plain TCP. So the solution I found was to have a small HTTP Web service (which I wrote using &lt;a href="http://www.tornadoweb.org/"&gt;tornado&lt;/a&gt;) listening on a configurable port on each &amp;nbsp;MySQL node.&lt;br /&gt;&lt;br /&gt;For the health check, the Web service connects via&amp;nbsp;MySQLdb to the MySQL instance running on a given port and issues a 'show databases' command. For more in-depth checking you can obviously run fancier SQL statements.&lt;br /&gt;&lt;br /&gt;The code for my small tornado server is &lt;a href="http://gist.github.com/652355"&gt;here&lt;/a&gt;. The default port it listens on is 31337.&lt;br /&gt;&lt;br /&gt;Now on the HAProxy side I have a "listen" section for each collection of MySQL nodes that I want to load balance. Example:&lt;br /&gt;&lt;pre class="code"&gt;listen mysql-m0 0.0.0.0:33306&lt;br /&gt;  mode tcp&lt;br /&gt;  option httpchk GET /mysqlchk/?port=3306&lt;br /&gt;  balance roundrobin&lt;br /&gt;  server db101 10.10.10.1:3306 check port 31337 inter 5000 rise 3 fall 3&lt;br /&gt;  server db201 10.10.10.2:3306 check port 31337 inter 5000 rise 3 fall 3 backup&lt;br /&gt;&lt;/pre&gt;In this case, HAProxy listens on port 33306 and load balances MySQL traffic between db101 and db201, with db101 being the primary node and db201 being the backup node (which means that traffic only goes to db101 unless it's considered down by the health check, in which case traffic is directed to db201). 
This scenario is especially useful when db101 and db201 are in a master-master replication setup, and you want traffic to hit only 1 of them at any given time. Note also that I could have had HAProxy listen on port 3306, but I preferred to have it listen and be contacted by the application on port 33306, in case I also wanted to run a MySQL server on port 3306 on the same server as HAProxy.&lt;br /&gt;&lt;br /&gt;I specify how to call the HTTP check handler via "option httpchk GET /mysqlchk/?port=3306". I specify the port the handler listens on via the "port" option in the "server" line. In my case the port is 31337. So HAProxy will do a GET against http://10.10.10.1:31337/mysqlchk/?port=3306. If the result is an HTTP error code, the health check will be considered failed.&lt;br /&gt;&lt;br /&gt;The other options "inter 5000 rise 3 fall 3" mean that the health check is issued by HAProxy every 5,000 ms, and that the health check needs to succeed 3 consecutive times ("rise 3") in order for the node to be considered up, and it needs to fail 3 consecutive times ("fall 3") in order for the node to be considered down.&lt;br /&gt;&lt;br /&gt;I hasten to add that the master-master load balancing has its disadvantages. It did save my butt one Sunday morning when db101 went down hard (after all, it was an EC2 instance), and traffic was directed by HAProxy to db201 in a fashion totally transparent to the application.&lt;br /&gt;&lt;br /&gt;But... I have also seen the situation where db201, as a slave to db101, lagged in its replication, and so when db101 was considered down and traffic was sent to db201, the state of the data was stale from an application point of view. I consider this disadvantage to weigh more than the automatic failover advantage, so I actually ended up taking db201 out of HAProxy. 
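&lt;br /&gt;&lt;br /&gt;As an example of the fancier checks mentioned earlier, the health check handler could in principle look at replication lag before declaring a node healthy. The sketch below is an illustration under stated assumptions and not the code in the gist: the function name and the 30-second limit are invented for the example, and it expects the row returned by SHOW SLAVE STATUS as a dictionary (for instance from a MySQLdb DictCursor). It returns the HTTP status code the handler would send back, relying on HAProxy treating anything other than a 2xx or 3xx response as a failed check.&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;
def replication_check_status(slave_status, max_lag_seconds=30):
    """Return 200 if the slave is replicating and fresh enough, 503 otherwise."""
    if not slave_status:
        # No row at all: the node is not configured as a slave.
        return 503
    lag = slave_status.get("Seconds_Behind_Master")
    if lag is None:
        # MySQL reports NULL here when replication is not running.
        return 503
    if int(lag) > max_lag_seconds:
        return 503
    return 200
&lt;/pre&gt;&lt;br /&gt;A check like this would keep HAProxy from sending traffic to a lagging node, at the price of having no node available at all if the primary happens to be down at the same time.&lt;br /&gt;&lt;br /&gt;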
If db101 ever goes down hard again, I'll just manually point HAProxy to db201, after making sure the state of the data on db201 is what I expect.&lt;br /&gt;&lt;br /&gt;So all this being said, I recommend the automated failover scenario only when you load balance against a read-only farm of MySQL servers, which are all probably slaves of some master. In this case, although reads can also get out of sync, at least you won't attempt to do creates/updates/deletes against stale data.&lt;br /&gt;&lt;br /&gt;The sad truth is that there is no good way of doing automated load balancing AND failover with MySQL without resorting to things such as &lt;a href="http://www.drbd.org/"&gt;DRBD&lt;/a&gt;&amp;nbsp;which are not cloud-friendly. I am aware of Yves Trudeau's blog posts on "&lt;a href="http://www.mysqlperformanceblog.com/2010/06/17/high-availability-for-mysql-on-amazon-ec2-part-1-intro/"&gt;High availability for MySQL on Amazon EC2&lt;/a&gt;" but the setup he describes strikes me as experimental and I wouldn't trust it in a large-scale production setup.&lt;br /&gt;&lt;br /&gt;In any case, I hope somebody will find the tornado handler I wrote useful for their own MySQL health checks, or actually any TCP-based health check they need to do within HAProxy.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-3800334991399894914?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/3800334991399894914/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=3800334991399894914' title='7 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3800334991399894914'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9238405/posts/default/3800334991399894914'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/10/mysql-load-balancing-with-haproxy.html' title='MySQL load balancing with HAProxy'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>7</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-4862310228319623015</id><published>2010-10-14T14:35:00.000-07:00</published><updated>2010-10-14T14:35:08.964-07:00</updated><title type='text'>Introducing project "Overmind"</title><content type='html'>&lt;a href="http://github.com/tobami/overmind"&gt;Overmind&lt;/a&gt; is the brainchild of Miquel Torres. In its current version, released today, Overmind is what is sometimes called a 'controller fabric' for managing cloud instances&lt;span id="goog_613523621"&gt;&lt;/span&gt;&lt;span id="goog_613523622"&gt;&lt;/span&gt;&lt;a href="http://draft.blogger.com/"&gt;&lt;/a&gt;, based on &lt;a href="http://incubator.apache.org/libcloud/"&gt;libcloud&lt;/a&gt;. However, Miquel's &lt;a href="http://github.com/tobami/overmind/wiki/Roadmap"&gt;Roadmap&lt;/a&gt; for the project is very ambitious, and includes things like automated configuration management and monitoring for the instances launched and managed via Overmind.&lt;br /&gt;&lt;br /&gt;A little bit of history: Miquel contacted me via email in late July because he read my blog post on "&lt;a href="http://agiletesting.blogspot.com/2010/03/automated-deployment-systems-push-vs.html"&gt;Automated deployment systems: push vs. pull&lt;/a&gt;" and he was interested in collaborating on a queue-based deployment/config management system. The first step in such a system is to actually deploy the instances you need configured. 
Hence the need for something like Overmind.&lt;br /&gt;&lt;br /&gt;I'm sure you're asking yourself -- why did these guys want to roll their own system? Why not use something like &lt;a href="http://www.openstack.org/"&gt;OpenStack&lt;/a&gt;? Note that in late July&amp;nbsp;OpenStack had only just been announced, and to this day (mid-October 2010) they have yet to release their controller fabric code. In the meantime, we have a pretty functional version of a deployment tool in Overmind, supporting Amazon EC2 and Rackspace, with a Django Web interface, and also with a REST API.&lt;br /&gt;&lt;br /&gt;I am aware there are many other choices out there in terms of managing and deploying cloud instances -- &lt;a href="https://www.cloudkick.com/"&gt;Cloudkick&lt;/a&gt;, &lt;a href="http://www.rightscale.com/"&gt;RightScale&lt;/a&gt;, &lt;a href="http://www.scalarium.com/"&gt;Scalarium&lt;/a&gt; ...and the list goes on. The problem is that none of these is Open Source. They do have great ideas that we can steal, though ;-)&lt;br /&gt;&lt;br /&gt;I am also aware of Ruby-based tools such as &lt;a href="http://marionette-collective.org/"&gt;Marionette Collective&lt;/a&gt;&amp;nbsp;and its close integration with Puppet (which is now even closer since it has been acquired by &lt;a href="http://www.puppetlabs.com/"&gt;Puppet Labs&lt;/a&gt;). The problem is that it's Ruby and not Python ;-)&lt;br /&gt;&lt;br /&gt;In short, what Overmind brings to the table today is a Python-based, Django-based, libcloud-based tool for deploying (and destroying, but be careful out there) cloud instances. For the next release, Miquel and I are planning to add some configuration management capabilities. 
We're looking at &lt;a href="http://github.com/samuel/kokki"&gt;kokki&lt;/a&gt; as a very interesting Python-based alternative to chef, although we're planning on supporting &lt;a href="http://wiki.opscode.com/display/chef/Chef+Solo"&gt;chef-solo&lt;/a&gt; too.&lt;br /&gt;&lt;br /&gt;If you're interested in contributing to the project, please do! Miquel is an amazingly talented, focused and relentless developer, but he can definitely use more help (my contributions have been minimal in terms of actual code; I mostly tested Miquel's code and did some design and documentation work, especially in the REST API area).&lt;br /&gt;&lt;br /&gt;Here are some pointers to Overmind-related resources:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;GitHub &lt;a href="http://github.com/tobami/overmind/"&gt;project&lt;/a&gt; and &lt;a href="http://github.com/tobami/overmind/wiki"&gt;wiki&lt;/a&gt;&lt;/li&gt;&lt;li&gt;&lt;a href="http://github.com/tobami/overmind/archives/0.1.0"&gt;Download page&lt;/a&gt; for release 0.1.0&lt;/li&gt;&lt;li&gt;&lt;a href="http://groups.google.com/group/overmind-dev"&gt;Overmind-dev&lt;/a&gt; Google Group (please sign up)&lt;/li&gt;&lt;li&gt;Miquel's &lt;a href="http://mail-archives.apache.org/mod_mbox/incubator-libcloud/201010.mbox/%3CAANLkTinZ-i0CoQABkdhA=E+veYhYrQh4sWLQtxDaWXPY@mail.gmail.com%3E"&gt;announcement&lt;/a&gt; to the libcloud mailing list&lt;/li&gt;&lt;/ul&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-4862310228319623015?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/4862310228319623015/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=4862310228319623015' title='2 Comments'/><link rel='edit' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9238405/posts/default/4862310228319623015'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/4862310228319623015'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/10/introducing-project-overmind.html' title='Introducing project &quot;Overmind&quot;'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-5546249766825461847</id><published>2010-09-27T14:27:00.000-07:00</published><updated>2010-09-27T14:30:38.307-07:00</updated><title type='text'>Getting detailed I/O stats with Munin</title><content type='html'>Ever since &lt;a href="http://vuksan.com/blog/"&gt;Vladimir Vuksan&lt;/a&gt; pointed me to his &lt;a href="http://github.com/ganglia/gmetric/blob/master/disk/diskio.pl/ganglia_disk_stats.pl"&gt;Ganglia script for getting detailed disk stats&lt;/a&gt;, I've been looking for something similar for Munin. The iostat and iostat_ios Munin plugins, which are enabled by default when you install Munin, do show disk stats across all devices detected on the system. I wanted more in-depth stats per device though. 
In my case, the devices I'm interested in are actually Amazon EBS volumes mounted on my database servers.&lt;br /&gt;&lt;br /&gt;I finally figured out how to achieve this, using the diskstat_ Munin plugin which gets installed by default when you install munin-node.&lt;br /&gt;&lt;br /&gt;If you run&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;/usr/share/munin/plugins/diskstat_ suggest&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;you will see the various symlinks you can create for the devices available on your server.&lt;br /&gt;&lt;br /&gt;In my case, I have 2 EBS volumes on each of my database servers, mounted as /dev/sdm and /dev/sdn. I created the following symlinks for /dev/sdm (and similar for /dev/sdn):&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;ln -snf /usr/share/munin/plugins/diskstat_ /etc/munin/plugins/diskstat_latency_sdm&lt;br /&gt;ln -snf /usr/share/munin/plugins/diskstat_ /etc/munin/plugins/diskstat_throughput_sdm&lt;br /&gt;ln -snf /usr/share/munin/plugins/diskstat_ /etc/munin/plugins/diskstat_iops_sdm&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Here's what metrics you get from these plugins:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;from diskstat_iops: Read I/O Ops/sec, Write I/O Ops/sec, Avg. Request Size, Avg. Read Request Size, Avg. Write Request Size&lt;/li&gt;&lt;li&gt;from diskstat_latency: Device Utilization, Avg. Device I/O Time, Avg. I/O Wait Time, Avg. Read I/O Wait Time, Avg. 
Write I/O Wait Time&lt;/li&gt;&lt;li&gt;from diskstat_throughput: Read Bytes, Write Bytes&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;My next step is to follow the advice of Mark Seger (the author of &lt;a href="http://collectl.sourceforge.net/Documentation.html"&gt;collectl&lt;/a&gt;) and graph the output of collectl in real time, so that the stats are displayed in fine-grained intervals of 5-10 seconds instead of the 5-minute averages that RRD-based tools offer.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-5546249766825461847?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/5546249766825461847/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=5546249766825461847' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5546249766825461847'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/5546249766825461847'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/09/getting-detailed-io-stats-with-munin.html' title='Getting detailed I/O stats with Munin'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-3482457913070659697</id><published>2010-09-21T14:52:00.000-07:00</published><updated>2010-09-21T14:52:23.559-07:00</updated><title type='text'>Quick note on installing and configuring Ganglia</title><content type='html'>I decided to give &lt;a 
href="http://ganglia.info/"&gt;Ganglia&lt;/a&gt; a try to see if I like its metric visualizations and its plugins better than Munin's. I am still in the very early stages of evaluating it. However, I already banged my head against the wall trying to understand how to configure it properly. Here are some quick notes:&lt;br /&gt;&lt;br /&gt;1) You can split your servers into clusters for ease of metric aggregation.&lt;br /&gt;&lt;br /&gt;2) Each node in a cluster needs to run gmond. In Ubuntu, you can do 'apt-get install ganglia-monitor' to install it. The config file is in /etc/ganglia/gmond.conf. More on the config file in a minute.&lt;br /&gt;&lt;br /&gt;3) Each node in a cluster can send its metrics to a designated node via UDP.&lt;br /&gt;&lt;br /&gt;4) One server in your infrastructure can be configured as both the overall metric collection server, and as the web front-end. This server needs to run gmetad, which in Ubuntu can be installed via 'apt-get install gmetad'. Its config file is /etc/gmetad.conf.&lt;br /&gt;&lt;br /&gt;Note that you can have a tree of gmetad nodes, with the root of the tree configured to actually display the metric graphs. I wanted to keep it simple, so I am running both gmetad and the Web interface on the same node.&lt;br /&gt;&lt;br /&gt;5) The gmetad server periodically polls one or more nodes in each cluster and retrieves the metrics for that cluster. It displays them via a PHP web interface which can be found in the source distribution.&lt;br /&gt;&lt;br /&gt;That's about it in a nutshell in terms of the architecture of Ganglia. The nice thing is that it's scalable. You split nodes into clusters, you designate one or more nodes in a cluster to gather metrics from all the other nodes, and you have one or more gmetad node(s) collecting the metrics from the designated nodes.&lt;br /&gt;&lt;br /&gt;Now for the actual configuration. I have a cluster of DB servers, each running gmond. 
I also have another server called bak01 that I keep around for backup purposes. I configured each DB server to be part of a cluster called 'db'. I also configured each DB server to send the metrics collected by gmond to bak01 (via UDP on the non-default port of 8650). To do this, I have these entries in /etc/ganglia/gmond.conf on each DB server:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;cluster {&lt;br /&gt;&amp;nbsp;&amp;nbsp;name = "db"&lt;br /&gt;&amp;nbsp;&amp;nbsp;owner = "unspecified"&lt;br /&gt;&amp;nbsp;&amp;nbsp;latlong = "unspecified"&lt;br /&gt;&amp;nbsp;&amp;nbsp;url = "unspecified"&lt;br /&gt;}&lt;br /&gt;&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;udp_send_channel {&amp;nbsp;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;host = bak01&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;port = 8650&lt;/div&gt;&lt;div&gt;}&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;On host bak01, I also defined a udp_recv_channel and a tcp_accept_channel:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;udp_recv_channel {&amp;nbsp;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;port = 8650&lt;/div&gt;&lt;div&gt;}&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;/* You can specify as many tcp_accept_channels as you like to share&amp;nbsp;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp; an xml description of the state of the cluster */&amp;nbsp;&lt;/div&gt;&lt;div&gt;tcp_accept_channel {&amp;nbsp;&lt;/div&gt;&lt;div&gt;&amp;nbsp;&amp;nbsp;port = 8649&amp;nbsp;&lt;/div&gt;&lt;div&gt;}&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;The udp_recv_channel is necessary so bak01 can receive the metrics from the gmond nodes. 
The tcp_accept_channel is necessary so that bak01 can be contacted by the gmetad node.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;That's it in terms of configuring gmond.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;On the gmetad node, I made one modification to the default /etc/gmetad.conf file by&amp;nbsp;specifying the cluster I want to collect metrics for, and the node where I want to collect the metrics from:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;data_source "eosdb" 60 bak01&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I then restarted gmetad via '/etc/init.d/gmetad restart'.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Ideally, these instructions would get you to a state where you would be able to see the graphs for all the nodes in the cluster.&amp;nbsp;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I automated the process of installing and configuring gmond on all the nodes via fabric. Maybe it all happened too fast for the collecting node (bak01), because it wasn't collecting metrics correctly for some of the nodes. I noticed that if I did 'telnet localhost 8649' on bak01, some of the nodes had no metrics associated with them. My solution was to stop and start gmond on those nodes, and that kicked things off. 
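&lt;br /&gt;&lt;br /&gt;To spot such nodes without scanning the telnet output by eye, something like the following sketch could help. It is only an illustration: the function names are made up, and it assumes the XML dump that gmond serves on its tcp_accept_channel port wraps each node in a HOST element with METRIC children. It lists the hosts that have no metrics at all.&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;
import socket
import xml.etree.ElementTree as ET


def fetch_gmond_xml(host="localhost", port=8649, timeout=10):
    """Read the full XML dump that gmond sends before closing the connection."""
    sock = socket.create_connection((host, port), timeout)
    chunks = []
    try:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    finally:
        sock.close()
    return b"".join(chunks)


def hosts_without_metrics(xml_data):
    """Return the sorted names of HOST elements that have no METRIC children."""
    root = ET.fromstring(xml_data)
    silent = []
    for host in root.iter("HOST"):
        if host.find("METRIC") is None:
            silent.append(host.get("NAME"))
    return sorted(silent)
&lt;/pre&gt;&lt;br /&gt;Running hosts_without_metrics(fetch_gmond_xml()) on bak01 would give the list of nodes whose gmond needs a restart.&lt;br /&gt;&lt;br /&gt;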
Strange though...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In any case, my next step is to install all kinds of Ganglia plugins, especially related to MySQL, but also for more in-depth disk I/O metrics.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-3482457913070659697?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/3482457913070659697/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=3482457913070659697' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3482457913070659697'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/3482457913070659697'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/09/quick-note-on-installing-and.html' title='Quick note on installing and configuring Ganglia'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-7337967365567834818</id><published>2010-09-15T08:52:00.000-07:00</published><updated>2010-09-15T08:52:55.565-07:00</updated><title type='text'>Managing Rackspace CloudFiles with python-cloudfiles</title><content type='html'>I've started to use Rackspace CloudFiles as an alternate storage for database backups. 
I have the backups now on various EBS volumes in Amazon EC2, AND in CloudFiles, so that should be good enough for Disaster Recovery purposes, one would hope ;-)&lt;br /&gt;&lt;br /&gt;I found the documentation for the &lt;a href="http://github.com/rackspace/python-cloudfiles"&gt;python-cloudfiles&lt;/a&gt; package a bit lacking, so here's a quick post that walks through the common scenarios you encounter when managing CloudFiles containers and objects. I am not interested in the CDN aspect of CloudFiles for my purposes, so for that you'll need to dig on your own.&lt;br /&gt;&lt;br /&gt;A CloudFiles container is similar to an Amazon S3 bucket, with one important difference: a container name cannot contain slashes, so you won't be able to mimic a file system hierarchy in CloudFiles the way you can do it in S3. A CloudFiles container, similar to an S3 bucket, contains objects -- which for CloudFiles have a max. size of 5 GB. So the CloudFiles storage landscape consists of 2 levels: a first level of containers (you can have an unlimited number of them), and a second level of objects embedded in containers. More details in the CloudFiles API Developer Guide (&lt;a href="http://docs.rackspacecloud.com/files/api/cf-devguide-latest.pdf"&gt;PDF&lt;/a&gt;).&lt;br /&gt;&lt;br /&gt;Here's how you can use the python-cloudfiles package to perform CRUD operations on containers and objects.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Getting a connection to CloudFiles&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;First you need to obtain a connection to your CloudFiles account. 
You need a user name and an API key (the key can be generated via the Web interface at https://manage.rackspacecloud.com).&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;import cloudfiles&lt;br /&gt;&lt;br /&gt;conn = cloudfiles.get_connection(username=USERNAME, api_key=API_KEY, serviceNet=True)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;When specifying serviceNet=True, the docs say that you will use the&amp;nbsp;Rackspace ServiceNet network to access Cloud Files, and not the public network.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Listing containers and objects&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Once you get a connection, you can list existing containers, and objects within a container:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;containers = conn.get_all_containers()&lt;br /&gt;for c in containers:&lt;br /&gt;    print "\nOBJECTS FOR CONTAINER: %s" % c.name&lt;br /&gt;    objects = c.get_objects()&lt;br /&gt;    for obj in objects:&lt;br /&gt;        print obj.name&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;Creating containers&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;container = conn.create_container(container_name)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;Creating objects in a container&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Assuming you have a list of filenames you want to upload to a given container:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;for f in files:&lt;br /&gt;    print 'Uploading %s to container %s' % (f, container_name)&lt;br /&gt;    basename = os.path.basename(f)&lt;br /&gt;    o = container.create_object(basename)&lt;br /&gt;    o.load_from_filename(f)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;(note that the overview in the python-cloudfiles index.html doc has a typo -- it specifies 'load_from_file' instead of the correct 'load_from_filename')&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Deleting containers and objects&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;You first need to delete all objects inside a container, then you can delete the container itself:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;print 
'Deleting container %s' % c.name&lt;br /&gt;print 'Deleting all objects first'&lt;br /&gt;objects = c.get_objects()&lt;br /&gt;for obj in objects:&lt;br /&gt;    c.delete_object(obj.name)&lt;br /&gt;print 'Now deleting the container'&lt;br /&gt;conn.delete_container(c.name)&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;&lt;b&gt;Retrieving objects from a container&lt;/b&gt;&lt;br /&gt;&lt;br /&gt;Remember that you don't have a backup process in place until you have tested restores. So let's see how you retrieve objects that are stored in a CloudFiles container:&lt;br /&gt;&lt;br /&gt;&lt;pre class="code"&gt;container_name = sys.argv[1]&lt;br /&gt;containers = conn.get_all_containers()&lt;br /&gt;# c stays None unless a container with the requested name is found&lt;br /&gt;c = None&lt;br /&gt;for candidate in containers:&lt;br /&gt;    if container_name == candidate.name:&lt;br /&gt;        c = candidate&lt;br /&gt;        break&lt;br /&gt;if c is None:&lt;br /&gt;    print "No container found with name %s" % container_name&lt;br /&gt;    sys.exit(1)&lt;br /&gt;&lt;br /&gt;target_dir = container_name&lt;br /&gt;os.system('mkdir -p %s' % target_dir)&lt;br /&gt;objects = c.get_objects()&lt;br /&gt;for obj in objects:&lt;br /&gt;    obj_name = obj.name&lt;br /&gt;    print "Retrieving object %s" % obj_name&lt;br /&gt;    target_file = "%s/%s" % (target_dir, obj_name)&lt;br /&gt;    obj.save_to_filename(target_file)&lt;br /&gt;&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-7337967365567834818?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/7337967365567834818/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=7337967365567834818' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/7337967365567834818'/><link rel='self' type='application/atom+xml' 
href='http://www.blogger.com/feeds/9238405/posts/default/7337967365567834818'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/09/managing-rackspace-cloudfiles-with.html' title='Managing Rackspace CloudFiles with python-cloudfiles'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>3</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-1567344840639820665</id><published>2010-09-01T12:23:00.000-07:00</published><updated>2010-10-06T20:05:10.089-07:00</updated><title type='text'>MySQL InnoDB hot backups and restores with Percona XtraBackup</title><content type='html'>I blogged a while ago about &lt;a href="http://agiletesting.blogspot.com/2009/05/mysql-fault-tolerance-and-disaster.html"&gt;MySQL fault-tolerance and disaster recovery techniques&lt;/a&gt;. At that time I was experimenting with the non-free &lt;a href="http://www.innodb.com/products/hot-backup/"&gt;InnoDB Hot Backup&lt;/a&gt; product. In the meantime I discovered Percona's &lt;a href="http://www.percona.com/docs/wiki/percona-xtrabackup:start"&gt;XtraBackup&lt;/a&gt; (thanks Robin!). 
Here's how I tested XtraBackup for doing a hot backup and a restore of a MySQL database running Percona XtraDB (XtraBackup works with vanilla InnoDB too).&lt;br /&gt;&lt;br /&gt;First of all, I use the following Percona .deb packages on a 64-bit Ubuntu Lucid EC2 instance:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# dpkg -l | grep percona&lt;br /&gt;ii  libpercona-xtradb-client-dev      5.1.43-xtradb-1.0.6-9.1-60.jaunty.11 Percona SQL database development files&lt;br /&gt;ii  libpercona-xtradb-client16        5.1.43-xtradb-1.0.6-9.1-60.jaunty.11 Percona SQL database client library&lt;br /&gt;ii  percona-xtradb-client-5.1         5.1.43-xtradb-1.0.6-9.1-60.jaunty.11 Percona SQL database client binaries&lt;br /&gt;ii  percona-xtradb-common             5.1.43-xtradb-1.0.6-9.1-60.jaunty.11 Percona SQL database common files (e.g. /etc&lt;br /&gt;ii  percona-xtradb-server-5.1         5.1.43-xtradb-1.0.6-9.1-60.jaunty.11 Percona SQL database server binaries&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I tried using the latest stable XtraBackup .deb package from the &lt;a href="http://www.percona.com/downloads/XtraBackup/XtraBackup-1.2/deb/jaunty/x86_64/"&gt;Percona downloads site&lt;/a&gt; but it didn't work for me. I started a hot backup with /usr/bin/innobackupex-1.5.1 and it ran for a while before dying with "InnoDB: Operating system error number 9 in a file operation." See this &lt;a href="https://bugs.launchpad.net/percona-xtrabackup/+bug/568087"&gt;bug report&lt;/a&gt; for more details.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;After unsuccessfully trying to compile XtraBackup from source, I tried XtraBackup-1.3-beta for Lucid from the &lt;a href="http://www.percona.com/downloads/XtraBackup/XtraBackup-1.3-beta/deb/lucid/x86_64/"&gt;Percona downloads&lt;/a&gt;. 
This worked fine.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here's the scenario I tested against a MySQL Percona XtraDB instance running with DATADIR=/var/lib/mysql/m10 and a customized configuration file /etc/mysql10/my.cnf. I created and attached an EBS volume which I mounted as /xtrabackup on the instance running MySQL.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) Take a hot backup of all databases under that instance:&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;/usr/bin/innobackupex-1.5.1 --defaults-file=/etc/mysql10/my.cnf --user=root --password=xxxxxx /xtrabackup&lt;/pre&gt;&lt;br /&gt;This will take a while and will create a timestamped directory under /xtrabackup, where it will store the database files from DATADIR. Note that the InnoDB log files are not created unless you apply step 2 below.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As the documentation says, make sure the output of innobackupex-1.5.1 ends with:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;100901 05:33:12 innobackupex-1.5.1: completed OK!&lt;br /&gt;&lt;br /&gt;2) Apply the transaction logs to the datafiles just created, so that the InnoDB logfiles are recreated in the target directory:&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;/usr/bin/innobackupex-1.5.1 --defaults-file=/etc/mysql10/my.cnf --user=root --password=xxxxxx --apply-log /xtrabackup/2010-09-01_05-21-36/&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;At this point, I tested a disaster recovery scenario by stopping MySQL and moving all files in DATADIR to a different location.&lt;br /&gt;&lt;br /&gt;To bring the databases back to normal from the XtraBackup hot backup, I did the following:&lt;br /&gt;&lt;br /&gt;1) Brought back up a functioning MySQL instance to be used by the XtraBackup restore operation:&lt;br /&gt;&lt;br /&gt;i) Copied the contents of the default /var/lib/mysql/mysql database under /var/lib/mysql/m10/ (or you can recreate the mysql DB from scratch)&lt;br /&gt;&lt;br 
/&gt;ii) Started mysqld_safe manually:&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;mysqld_safe --defaults-file=/etc/mysql10/my.cnf&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;This will create the data files and logs under DATADIR (/var/lib/mysql/m10) with the sizes specified in the configuration file. I had to wait until the messages in /var/log/syslog told me that the MySQL instance is ready and listening for connections.&lt;br /&gt;&lt;br /&gt;2) Copied back the files from the hot backup directory into DATADIR&lt;br /&gt;&lt;br /&gt;Note that the copy-back operation below initially errored out because it tried to copy the mysql directory too, and it found the directory already there under DATADIR. So the 2nd time I ran it, I moved /var/lib/mysql/m10/mysql to mysql.bak. The copy-back command is:&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;/usr/bin/innobackupex-1.5.1 --defaults-file=/etc/mysql10/my.cnf --user=root --copy-back /xtrabackup/2010-09-01_05-21-36/&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;You can also copy the files from  /xtrabackup/2010-09-01_05-21-36/ into DATADIR using vanilla cp.&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;b&gt;NOTE&lt;/b&gt;: verify the permissions on the restored files. In my case, some files in DATADIR were owned by root, so MySQL didn't start up properly because of that. Do a 'chown -R mysql:mysql DATADIR' to be sure.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;3) If everything went well in step 2, restart the MySQL instance to make sure everything is OK.&lt;br /&gt;&lt;br /&gt;At this point, your MySQL instance should have its databases restored to the point where you took the hot backup.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;IMPORTANT&lt;/b&gt;: if the newly restored instance needs to be set up as a slave to an existing master server, you need to set the correct master_log_file and master_log_pos parameters via a 'CHANGE MASTER TO' command. 
These parameters are saved by innobackupex-1.5.1 in a file called xtrabackup_binlog_info in the target backup directory.&lt;br /&gt;&lt;br /&gt;In my case, the xtrabackup_binlog_info file contained:&lt;br /&gt;&lt;br /&gt;mysql-bin.000041 23657066&lt;br /&gt;&lt;br /&gt;Here is an example of a CHANGE MASTER TO command I used:&lt;br /&gt;&lt;pre class="code"&gt;&lt;br /&gt;STOP SLAVE;&lt;br /&gt;&lt;br /&gt;CHANGE MASTER TO MASTER_HOST='masterhost', MASTER_PORT=3316, MASTER_USER='masteruser', MASTER_PASSWORD='masterpass', MASTER_LOG_FILE='mysql-bin.000041', MASTER_LOG_POS=23657066;&lt;br /&gt;&lt;br /&gt;START SLAVE;&lt;br /&gt;&lt;/pre&gt;&lt;br /&gt;Note that XtraBackup can also run in a 'stream' mode useful for compressing the files generated by the backup operation. Details in the &lt;a href="http://www.percona.com/docs/wiki/percona-xtrabackup:xtrabackup_howto"&gt;documentation&lt;/a&gt;.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-1567344840639820665?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/1567344840639820665/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=1567344840639820665' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1567344840639820665'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/1567344840639820665'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/09/mysql-innodb-hot-backups-and-restores.html' title='MySQL InnoDB hot backups and restores with Percona XtraBackup'/><author><name>Grig 
Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2189933157616141548</id><published>2010-08-31T11:58:00.000-07:00</published><updated>2010-08-31T11:58:33.432-07:00</updated><title type='text'>Poor man's MySQL disaster recovery in EC2 using EBS volumes</title><content type='html'>First of all, I want to emphasize that this is NOT a disaster recovery strategy I recommend. However, in a pinch, it might save your ass. Here's the scenario I have:&lt;br /&gt;&lt;br /&gt;&lt;ul&gt;&lt;li&gt;2 m1.large instances running Ubuntu 10.04 64-bit and the Percona XtraDB MySQL builds (for the record, the exact version I'm using is "Server version: 5.1.43-60.jaunty.11-log (Percona SQL Server (GPL), XtraDB 9.1, Revision 60")&lt;/li&gt;&lt;li&gt;I'll call the 2 servers db101 and db201&lt;/li&gt;&lt;li&gt;each server is running 2 MySQL instances -- I'll call them m1 and m2&lt;/li&gt;&lt;li&gt;instance m1 on db101 and instance m1 on db201 are set up in master-master replication (and similar for instance m2)&lt;/li&gt;&lt;li&gt;the DATADIR for m1 is /var/lib/mysql/m1 on each server; that file system is mounted from an EBS volume (and similar for m2)&lt;/li&gt;&lt;li&gt;the configuration files for m1 are in /etc/mysql1 on each server -- that directory was initially a copy of the Ubuntu /etc/mysql configuration directory, which I then customized (and similar for m2)&lt;/li&gt;&lt;li&gt;the init.d script for m1 is in /etc/init.d/mysql1 (similar for m2)&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;What I tested:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;I took a snapshot of each of the 2 EBS volumes associated with each of the DB servers (4 snapshots in all)&lt;/li&gt;&lt;li&gt;I terminated the 2 m1.large 
instances&lt;/li&gt;&lt;li&gt;I launched 2 m1.xlarge instances and installed the same Percona distribution (this was done via a Chef recipe at instance launch time); I'll call the 2 new instances xdb101 and xdb201&lt;/li&gt;&lt;li&gt;I pushed the configuration files for m1 and m2, as well as the init.d scripts (this was done via fabric)&lt;/li&gt;&lt;li&gt;I created new volumes from the EBS snapshots (note that these volumes can be created in any EC2 availability zone)&lt;/li&gt;&lt;li&gt;On xdb101, I attached the 2 volumes created from the EBS snapshots on db101; I specified /dev/sdm and /dev/sdn as the device names (similar on xdb201)&lt;/li&gt;&lt;li&gt;On xdb101, I created /var/lib/mysql/m1 and mounted /dev/sdm there; I also created /var/lib/mysql/m2 and mounted /dev/sdn there (similar on xdb201)&lt;/li&gt;&lt;li&gt;At this point, the DATADIR directories for both m1 and m2 are populated with 'live files' from the moment when I took the EBS snapshot&lt;/li&gt;&lt;li&gt;I made sure syslog-ng accepts UDP traffic from localhost (by default it doesn't), because by default in Ubuntu the mysql log messages are sent to syslog; to do this, I ensured that "udp(ip(127.0.0.1) port(514));" appears in the "source s_all" entry in /etc/syslog-ng/syslog-ng.conf&lt;/li&gt;&lt;/ul&gt;At this point, I started up the first MySQL instance on xdb101 via "/etc/init.d/mysql1 start". This script will most likely show [fail] on the console, because MySQL will not start up normally. 
If you look in /var/log/syslog, you'll see entries similar to:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Aug 31 18:03:21 xdb101 mysqld: 100831 18:03:21 [Note] Plugin 'FEDERATED' is disabled.&lt;/div&gt;&lt;div&gt;Aug 31 18:03:21&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: The InnoDB memory heap is disabled&lt;/div&gt;&lt;div&gt;Aug 31 18:03:21&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: Mutexes and rw_locks use GCC atomic builtins&lt;/div&gt;&lt;div&gt;Aug 31 18:03:22&amp;nbsp;xdb101&amp;nbsp;mysqld: 100831 18:03:22 &amp;nbsp;InnoDB: highest supported file format is Barracuda.&lt;/div&gt;&lt;div&gt;Aug 31 18:03:23&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: The log sequence number in ibdata files does not match&lt;/div&gt;&lt;div&gt;Aug 31 18:03:23&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: the log sequence number in the ib_logfiles!&lt;/div&gt;&lt;div&gt;Aug 31 18:03:23&amp;nbsp;xdb101&amp;nbsp;mysqld: 100831 18:03:23 &amp;nbsp;InnoDB: Database was not shut down normally!&lt;/div&gt;&lt;div&gt;Aug 31 18:03:23&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: Starting crash recovery.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;If you wait a bit longer (and if you're lucky), you'll see entries similar to:&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;div&gt;Aug 31 18:04:20&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: Restoring possible half-written data pages from the doublewrite&lt;/div&gt;&lt;div&gt;Aug 31 18:04:20&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: buffer...&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: In a MySQL replication slave the last master binlog file&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: position 0 15200672, file name mysql-bin.000015&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: and relay log file&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: position 0 15200817, file 
name ./mysqld-relay-bin.000042&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: InnoDB: Last MySQL binlog file position 0 17490532, file name /var/lib/mysql/m1/mysql-bin.000002&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: 100831 18:04:24 InnoDB Plugin 1.0.6-9.1 started; log sequence number 1844705956&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: 100831 18:04:24 [Note] Recovering after a crash using /var/lib/mysql/m1/mysql-bin&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: 100831 18:04:24 [Note] Starting crash recovery...&lt;/div&gt;&lt;div&gt;Aug 31 18:04:24&amp;nbsp;xdb101&amp;nbsp;mysqld: 100831 18:04:24 [Note] Crash recovery finished.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point, you can do "/etc/init.d/mysql1 restart" just to make sure that both stopping and starting that instance work as expected. Repeat for instance m2, and also repeat on server xdb201.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;So... IF you are lucky and the InnoDB crash recovery process did its job, you should have 2 functional MySQL instances on each of xdb101 and xdb201. 
I tested this with several pairs of servers and it worked for me every time, but I hasten to say that YMMV, so DO NOT bet on this as your disaster recovery strategy!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At this point I still had to re-establish the master-master replication between m1 on xdb101 and m1 on xdb201 (and similarly for m2).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When I initially set up this replication between the original m1.large servers, I used something like this on both db101 and db201:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;CHANGE MASTER TO MASTER_HOST='master1', MASTER_PORT=3306, MASTER_USER='masteruser', MASTER_PASSWORD='xxxxxx';&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The trick for me is that master1 points to db201 in db101's /etc/hosts, and vice-versa.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;On the newly created xdb101 and xdb201, there are no entries for master1 in /etc/hosts, so replication is broken. This is a good thing initially, because you want the MySQL instances on each server to be brought back up without throwing replication into the mix.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Once I added an entry for master1 in xdb101's /etc/hosts pointing to xdb201, and did the same on xdb201 (pointing back to xdb101), I did a 'stop slave; start slave; show slave status\G' on the m1 instance on each server. In all cases I tested, one of the slaves was showing everything OK, while the other one was complaining about not being able to read from the master's log file. This was fairly simple to fix. Let's assume xdb101 is the one complaining. 
I did the following:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;on xdb201, I ran 'show master status\G' and noted the file name (for example "mysql-bin.000017") and the file position (for example 106)&lt;/li&gt;&lt;li&gt;on xdb101, I ran the following command: "stop slave; change master to master_log_file='mysql-bin.000017', master_log_pos=106; start slave;"&lt;/li&gt;&lt;li&gt;now a 'show slave status\G' on xdb101 should show everything back to normal&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;Some lessons:&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;take periodic snapshots of your EBS volumes (at least 1/day)&lt;/li&gt;&lt;li&gt;&lt;b&gt;for a true disaster recovery strategy, use at least mysqldump to dump your DB to disk periodically, or something more advanced such as &lt;a href="http://www.percona.com/docs/wiki/percona-xtrabackup:xtrabackup_manual"&gt;Percona XtraBackup&lt;/a&gt;; I recommend dumping the DB to an EBS volume and taking periodic snapshots of that volume&lt;/b&gt;&lt;/li&gt;&lt;li&gt;the procedure I detailed above is handy when you want to grow your instance 'vertically' -- for example I went from m1.large to m1.xlarge&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2189933157616141548?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2189933157616141548/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2189933157616141548' title='1 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2189933157616141548'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2189933157616141548'/><link rel='alternate' type='text/html' 
href='http://agiletesting.blogspot.com/2010/08/poor-mans-mysql-disaster-recovery-in.html' title='Poor man&apos;s MySQL disaster recovery in EC2 using EBS volumes'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>1</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-2609948635734935214</id><published>2010-08-20T10:51:00.000-07:00</published><updated>2011-03-01T16:14:19.248-08:00</updated><title type='text'>Visualizing MySQL metrics with the munin-mysql plugin</title><content type='html'>&lt;a href="http://munin-monitoring.org/"&gt;Munin&lt;/a&gt; is a great tool for resource visualization. Sometimes though installing a 3rd party Munin plugin is not as straightforward as you would like. I have been struggling a bit with one such plugin, &lt;a href="http://github.com/kjellm/munin-mysql"&gt;munin-mysql&lt;/a&gt;, so I thought I'd spell it out for my future reference. My particular scenario is running multiple MySQL instances on various port numbers (3306 and up) on the same machine. I wanted to graph in particular the various InnoDB metrics that munin-mysql supports. 
I installed the plugin on various Ubuntu flavors such as Jaunty and Lucid.&lt;br /&gt;&lt;br /&gt;Here are the steps:&lt;br /&gt;&lt;br /&gt;1) Install 2 pre-requisite Perl modules for munin-mysql: IPC-ShareLite and Cache-Cache&lt;br /&gt;&lt;br /&gt;2) git clone http://github.com/kjellm/munin-mysql&lt;br /&gt;&lt;br /&gt;3) cd munin-mysql; edit Makefile and point PLUGIN_DIR to the directory where your munin plugins reside (if you installed Munin on Ubuntu via apt-get, that directory is /usr/share/munin/plugins)&lt;br /&gt;&lt;br /&gt;4) make install --&amp;gt; this will copy the mysql_ Perl script to PLUGIN_DIR, and the mysql_.conf file to /etc/munin/plugin-conf.d&lt;br /&gt;&lt;br /&gt;5) Edit /etc/munin/plugin-conf.d/mysql_.conf and customize it with your specific MySQL information.&lt;br /&gt;&lt;br /&gt;For example, if you run 2 MySQL instances on ports 3306 and 3307, you could have something like this in mysql_.conf:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;[mysql_3306_*]&lt;br /&gt;env.mysqlconnection DBI:mysql:mysql;host=127.0.0.1;port=3306&lt;br /&gt;env.mysqluser myuser1&lt;br /&gt;env.mysqlpassword mypassword1&lt;br /&gt;&lt;br /&gt;[mysql_3307_*]&lt;br /&gt;env.mysqlconnection DBI:mysql:mysql;host=127.0.0.1;port=3307&lt;br /&gt;env.mysqluser myuser2&lt;br /&gt;env.mysqlpassword mypassword2&lt;br /&gt;&lt;br /&gt;6) Run "/usr/share/munin/plugins/mysql_ suggest" to see what metrics are supported by the plugin. 
Then proceed to create symlinks in /etc/munin/plugins, adding the port number and the metric name as the suffix.&lt;br /&gt;&lt;br /&gt;For example, to track InnoDB I/O metrics for the MySQL instance running on port 3306, you would create this symlink:&lt;br /&gt;&lt;br /&gt;ln -s /usr/share/munin/plugins/mysql_ /etc/munin/plugins/mysql_3306_innodb_io&lt;br /&gt;&lt;br /&gt;(replace 3306 with 3307 to track this metric for the other MySQL instance running on port 3307)&lt;br /&gt;&lt;br /&gt;Of course, it's easy to automate this with a simple &lt;a href="http://gist.github.com/540801"&gt;shell script&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;7) Restart munin-node and wait 10-15 minutes for the munin master to receive the information about the new metrics.&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Important!&lt;/b&gt; If you need to troubleshoot this plugin (or any Munin plugin), do not make the mistake of simply running the plugin script directly in the shell. If you do this, it will not read the configuration file(s) correctly, and it will most probably fail. Instead, follow the "&lt;a href="http://munin-monitoring.org/wiki/Debugging_Munin_plugins"&gt;Debugging Munin plugins&lt;/a&gt;" documentation, and run the plugin through the munin-run utility. For example:&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;# munin-run mysql_3306_innodb_io&lt;br /&gt;ib_io_read.value 34&lt;br /&gt;ib_io_write.value 57870&lt;br /&gt;ib_io_log.value 8325&lt;br /&gt;ib_io_fsync.value 55476&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;One more thing: you should probably automate all of the above steps. I have most of them automated via a fabric script. The only thing I do by hand is to create the appropriate symlinks for the specific port numbers I have on each server.&lt;div&gt;&lt;br /&gt;That's it! 
Enjoy staring for hours at your brand new MySQL metrics!&lt;br /&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-2609948635734935214?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/2609948635734935214/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://www.blogger.com/comment.g?blogID=9238405&amp;postID=2609948635734935214' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2609948635734935214'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/9238405/posts/default/2609948635734935214'/><link rel='alternate' type='text/html' href='http://agiletesting.blogspot.com/2010/08/visualizing-mysql-metrics-with-munin.html' title='Visualizing MySQL metrics with the munin-mysql plugin'/><author><name>Grig Gheorghiu</name><uri>http://www.blogger.com/profile/17863511617654196370</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='24' height='32' src='http://agile.unisonis.com/gg.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-9238405.post-4478284693696395303</id><published>2010-08-16T12:07:00.000-07:00</published><updated>2010-08-16T12:07:49.351-07:00</updated><title type='text'>MySQL and AppArmor on Ubuntu</title><content type='html'>This is just a quick post that I hope will save some people some headache when they try to customize their MySQL setup on Ubuntu. I've spent some quality time with this problem over the weekend. 
I tried in vain for hours to have MySQL read its configuration files from a non-default location on an Ubuntu 9.04 server, only to figure out that it was all AppArmor's fault.&lt;br /&gt;&lt;br /&gt;My ultimate goal was to run multiple instances of MySQL on the same host. In the past &lt;a href="http://agiletesting.blogspot.com/2009/07/managing-multiple-mysql-instances-with.html"&gt;I achieved this with MySQL Sandbox&lt;/a&gt;, but this time I wanted to use MySQL installed from Debian packages and not from a tarball of the binary distribution, and MySQL Sandbox has some issues with that.&lt;br /&gt;&lt;br /&gt;Here's what I did: I copied /etc/mysql to /etc/mysql0, then I edited /etc/mysql0/my.cnf and modified the location of the socket file, the pid file and the datadir to non-default locations. Then I tried to run:&lt;br /&gt;&lt;br /&gt;/usr/bin/mysqld_safe --defaults-file=/etc/mysql0/my.cnf&lt;br /&gt;&lt;br /&gt;At this point, /var/log/daemon.log showed this error:&lt;br /&gt;&lt;br /&gt;mysqld[25133]: Could not open required defaults file: /etc/mysql0/my.cnf&lt;br /&gt;mysqld[25133]: Fatal error in defaults handling. Program aborted&lt;br /&gt;&lt;br /&gt;It took me as I said a few hours trying all kinds of crazy things until I noticed lines like these in /var/log/syslog:&lt;br /&gt;&lt;br /&gt;kernel: [18593519.090601] type=1503 audit(1281847667.413:22): operation="inode_permission" requested_mask="::r" denied_mask="::r" fsuid=0 name="/etc/mysql0/my.cnf"&lt;br /&gt;&amp;nbsp;pid=4884 profile="/usr/sbin/mysqld"&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This made me realize it's AppArmor preventing mysqld from opening non-default files. 
I don't need AppArmor on my servers, so I just stopped it with 'service apparmor stop' and chkconfig-ed it off....at which point every customization I had started to work perfectly.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At least 2 lessons:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) when you see mysterious, hair-pulling errors, check security-related processes on your server: iptables, AppArmor, SELinux etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;2) check all log files in /var/log -- I was focused on daemon.log and didn't notice the errors in syslog quickly enough&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Google didn't help when I searched for "mysqld&amp;nbsp;Could not open required defaults file". I couldn't find any reference to AppArmor, only to file permissions.&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/9238405-4478284693696395303?l=agiletesting.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://agiletesting.blogspot.com/feeds/4478284693696395303/comments/default' title='Post Comments'/><l
