Tuesday, January 24, 2012

An ode to running a database on bare metal


No, my muse is not quite as strong as to inspire me to write an ode, but I still want to emphasize a few points about the goodness of running a database on bare metal.

At Evite, we use sharded MySQL for our production database. We designed the current architecture in 2009, when NoSQL was still very much in its infancy, so MySQL seemed a solid choice, a technology that we could at least understand. As I explained elsewhere, we do use MySQL in an almost non-relational way, and we sharded from the get-go, with the idea that it's better to scale horizontally than vertically.

We initially launched with the database hosted at a data center on a few Dell PE2970 servers, each with 16 GB of RAM and 2 quad-core CPUs. Each server was running 2 MySQL instances. We didn't get a chance to dark launch, but the initial load testing we did showed that we should be OK. However, there is nothing like production traffic to really stress test your infrastructure, and we soon realized that we have an insufficient number of servers for the peak traffic we were expecting towards the end of the year.

We decided to scale horizontally in EC2, with one MySQL instance per m1.xlarge EC2 instance. At the time we also engaged Percona and they helped us fine-tune our Percona XtraDB MySQL configuration so we could get the most out of the m1.xlarge horsepower. We managed to scale sufficiently enough for our high season in 2010, although we had plenty of pain points. We chose to use EBS volumes for our database files, because at the time EBS still gave people the illusion of stability and durability. We were very soon confronted with severe performance issues, manifested as very high CPU I/O wait times, which were sometimes so high as to make the instance useless.

I described in a previous post how proficient we became at failing over from a master that went AWOL to a slave. Our issues with EBS volumes were compounded by the fact that our database access pattern is very write-intensive, and a shared medium such as EBS was far from ideal. Our devops team was constantly on the alert, and it seemed like we were always rebuilding instances and recovering from EC2 instance failures, although the end-user experience was not affected.

Long story short, we decided to bring the database back in-house, at the data center, on 'real' bare-metal servers. No virtualization, thanks. The whole process went relatively smoothly. One important point I want to make here is that we already had a year's worth of hard numbers at that point regarding the access patterns to our database, iops/sec, MySQL query types, etc, etc. So it made it easy to do proper capacity planning this time, in the presence of production traffic.

We started by buying 2 Dell C2100 servers, monster machines, with dual Intel Xeon X5650 processors (for a total of 24 cores), 144 GB RAM, and 12 x 1 TB hard disks out of which we prepared a 6 TB RAID 10 volume which we further divided in LVM logical volumes for specific types of MySQL files.

We put 2 MySQL instances on each server, and we engaged Percona again to help us fine-tune the configuration, this time including not only MySQL, but also the hardware and the OS. They were super helpful to us, as usual. Here are only some of the things they recommended, which we implemented:
  • set vm.swappiness kernel setting to 0 in /etc/sysctl.conf
  • set InnoDB flush method to O_DIRECT because we can rely on the RAID controller to do the caching (we also mounted XFS with the nobarrier option in conjunction with this change)
  • disable MySQL query cache, which uses a global mutex that can cause performance issues when used on a multi-core server
  • various other optimizations which were dependent on our setup, things like tweaking MySQL configuration options such as key_buffer_size and innodb_io_capacity
One important MySQL configuration option that we had to tweak was innodb_buffer_pool_size. If we set it too high, the server could start swapping. If we set it too low, the disk I/O on the server could become too problematic. Since we had 144 GB of RAM and we were running 2 MySQL instances per server, we decided to give each instance 60 GB of RAM. This proved to strike a good balance.

Once the fine-tuning was done, we directed production traffic away from 4 EC2 m1.xlarge instances to 2 x 2 MySQL instances, with each pair running on a C2100. We then sat back and wallowed for a while in the goodness of the I/O numbers we were observing. Basically, the servers were barely working. This is how life should be. 

We soon migrated all of our MySQL masters back into the data center. We left the slaves running in EC2 (still one m1.xlarge slave per MySQL master instance), but we changed them from being EBS-backed to using the local ephemeral disk in RAID 0 with LVM. We look at EC2 in this case as a secondary data center, used only in emergency situations.

One thing that bit us in our bare-metal setup was....a bare-metal issue around the LSI MegaRAID controllers. I already blogged about the problems we had with the battery relearning cycle, and with decreased performance in the presence of bad drives. But these things were easy to fix (again thanks to our friends at Percona for diagnosing these issues correctly in the first place...)

I am happy to report that we went through our high season for 2011 without a glitch in this setup. Our devops team slept much better at night too! One nice thing about having EC2 as a 'secondary data center' is that if need be, we can scale out horizontally   by launching more EC2 instances. In fact, we doubled the number of MySQL slave instances for the duration of our high season, with the thought that if we need to, we can double the number of shards at the application layer, and thus scale horizontally that way. We didn't have to do any tweaking fortunately, but we were able to -- a strategy which would otherwise be hard to pull off if we didn't have any cloud presence, unless we bought a lot of extra capacity at the data center.

This brings me to one of the points I want to make in this post: it is a very valuable strategy to be able to use the cloud to roll out a new architecture (which you designed from the get-go however to be horizontally scalable) and to gauge its performance in the presence of real production traffic. You will get less than optimal performance per instance (because of virtualization vs. real hardware) , but since you can scale horizontally, you should be able to sustain the desired level of traffic for your application. You will get hard numbers that will help you do capacity planning and you will be able to bring the database infrastructure back to real hardware if you so wish, like we did. Note that Zynga has a similar strategy -- they roll out new games in EC2 and once they get a handle on how much traffic a game has, they bring it back into the data center (although it looks like they still use a private cloud and not bare metal).

Another point I want to make is that the cloud is not ready yet for write-intensive transactional databases, mainly because of the very poor I/O performance that you get on virtual instances in the cloud (compounded by shared network storage such as EBS). Adrian Cockcroft will reply that Netflix is doing just fine and they're exclusively in EC2. I hope they are doing just fine, and I hope his devops team is getting some good sleep at night, but I'm not sure. I need to perhaps qualify my point and say that the cloud is not ready for traditional transactional databases such as MySQL and PostgreSQL, which require manual sharding to be horizontally scalable. If I had to look at redesigning our database architecture today, I'd definitely try out HBase, Riak and maybe Cassandra. The promise there at least is that adding a new node to the cluster in these technologies is much less painful than in the manual sharding and scaling scenario. This still doesn't guarantee that you won't end up paying for a lot of instances to compensate for poor individual I/O per instance. Maybe a cloud vendor like Joyent with their SmartMachines will make a difference in this area (in fact, it is on our TODO list to test out their Percona SmartMachine).

Note however that there's something to be said about using good ol' RDBMS technologies. Ryan Mack says this in a Facebook Engineering post:

"After a few discussions we decided to build on four of our core technologies: MySQL/InnoDB for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching. We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits."

The emphasis on the last sentence is mine. It's the operational aspect of a new architecture that will kill you first. With a well understood architecture, at least you have a chance to tame it.

Yet another point I'd like to make is: do not base your disaster recovery strategy in EC2 around EBS volumes, especially if you have a write-intensive database. It's not worth the performance loss, and most of all it's not worth the severe and unpredictable fluctuation in performance. It works much better in our experience to turn the ephemeral disks of an m1.xlarge EC2 instance into a RAID 0 array and put LVM on top of that, and use it for storing the various MySQL file types. We are then able to do LVM snapshots of that volume, and upload the snapshots to S3. To build a new slave, we can restore the snapshot from S3, then catch up the replication with the master. Works fine.

There you have it. An ode in prose to running your database on bare metal. Try it, you may sleep better at night!

Thursday, January 05, 2012

Graphing, alerting and mission control with Graphite and Nagios


We’ve been using Graphite more and more for graphing of OS- and application-related metrics (here are some old-ish notes of mine on installing and configuring Graphite.) We measure and graph variables as diverse as:
  • relative and absolute state of charge of the LSI MegaRAID controller battery (why? because we’ve been burned by battery issues before)
  • database server I/O wait time (critical for EC2 instances which are notorious for their poor I/O performance; at this point we only run MySQL slaves in EC2, and we do not repeat DO NOT use EBS volumes for DB servers, instead we stripe the local disks into a RAID 0 array with LVM)
  • memcached stats such as current connections, get hits and misses, delete hits and misses, evictions
  • percentage and absolute number of HTTP return codes as served by nginx and haproxy
  • count of messages in various queues (our own and Amazon SQS)
  • count of outgoing mail messages

We do have a large Munin install base, so we found Adam Jacob’s munin-graphite.rb script very useful in sending all data captured by Munin to Graphite.

Why Graphite and not Munin or Ganglia? Mainly because it’s so easy to send arbitrarily named metrics to Graphite, but also because we can capture measurements at 1 second granularity (although this is possible with some tweaking with RRD-based tools as well).

On the Graphite server side, we set up different retention policies depending on the type of data we capture. For example, for app server logs (nginx and haproxy) we have the following retention policy specified in /opt/graphite/conf/storage-schemas.conf:

[appserver]
pattern = ^appserver\.
retentions = 60s:30d,10m:180d,60m:2y


This tells Graphite we want to keep data aggregated at 60 second intervals for 30 days, 10 minute data for 6 months and hourly data for 2 years.

The main mechanism we use for sending data to Graphite is tailing various log files at different intervals, parsing the entries in order to extract the metrics we’re interested in, and sending those metrics to Graphite by the tried-and-true method called ‘just open a socket’.

For example, we tail the nginx access log file via a log tracker script written in Python (and run as a daemon with supervisor), and we extract values such as the timestamp, the request URL, the HTTP return code, bytes sent and the request time in milliseconds. The default interval for collecting these values is 1 minute. For HTTP return codes, we group the codes such as 2xx, 3xx, 4xx, 5xx together, so we can report on each type of return code. We aggregate the values per collection interval, then send the counts to Graphite, named something like appserver.app01.500.reqs, which represents the HTTP 500 error count on server app01.

A more elegant way would be to use a tool such as logster to capture various log entries, but we haven’t had the time to write logster plugins for the 2 main services we’re interested in, nginx and haproxy. Our solution is deemed temporary, but as we all know, there’s nothing more permanent than a temporary solution.

For some more unusual metrics that we measure ourselves, such as LSI MegaRaid battery charge state, we run a shell script in an infinite loop and produce a value every second, then we send it to Graphite. To obtain the value we run something that resembles line noise:

$ MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 | grep -i "Relative State of Charge" | awk '{print $5}'

(thanks to my colleague Marco Garcia for coming up with this)

Once the data is captured in Graphite, we can do several things with it:
  • visualize it using the Graphite dashboards
  • alert on it using custom Nagios plugins
  • capture it in our own more compact dashboards, so we can have multiple graphs displayed on one ‘mission control’ page

Nagios plugins

My colleague Marco Garcia wrote a Nagios plugin in Python to alert on HTTP 500 errors. To obtain the data from Graphite, he queries a special ‘rawData’ URL of this form:

http://graphite.yourdomain.com/render/?from=-5minutes&until=-1minutes&target=asPercent(sum(appserver.app01.500.*.reqs),sum(appserver.app01.*.*.reqs))&rawData

which returns something like

asPercent(sumSeries(appserver.app01.500.*.reqs),sumSeries(appserver.app01.*.*.reqs)),1325778360,1325778600,60|value1,value2,value3,value4

where the 4 values represent the 4 data points, one per minute, from 5 minutes ago to 1 minute ago. Each data point is the percentage of HTTP 500 errors calculated against the total number of HTTP requests.

The script then compares the values returned with a warning threshold (0.5) and a critical threshold (1.0). If all values are greater than the respective threshold, we issue a warning or a critical Nagios alert.

Mission control dashboard

My colleague Josh Frederick came up with the idea of presenting all relevant graphs based on data captured in Graphite in a single page which he dubbed ‘mission control’. This enables us to make correlations at a glance between things such as increased I/O wait on DB servers and spikes in HTTP 500 errors.

To generate a given graph, Josh uses some Javascript magic to come up with an URL such as:

http://graphite.yourdomain.com/render/?width=500&height=200&hideLegend=1&from=-60minutes&until=-0minutes&bgcolor=FFFFFF&fgcolor=000000&areaMode=stacked&title=500s%20as%20%&target=asPercent(sum(appserver.app*.500.*.reqs),%20sum(appserver.app*.*.*.reqs))&_ts=1325783046438

which queries Graphite for the percentage of HTTP 500 errors from all HTTP requests across all app servers for the last 60 minutes. The resulting graph looks like this:



Our mission control page currently has 29 such graphs.

We also have (courtesy of Josh again) a different set of charts based on the Google Visualization Python API. We get the data from Graphite in CSV format (by adding format=csv to the Graphite query URLs), then we display the data using the Google Visualization API.

If you don’t want to roll your own JS-based dashboard talking to Graphite, there’s a tool called statsdash that may be of help.

Thursday, December 22, 2011

Load balancing and SSL in EC2

Here is another post I wrote for this year's Sysadvent blog. It briefly mentions some ways you can do load balancing in EC2, and focuses on how to upload SSL certificates to an Elastic Load Balancer using command-line tools. Any comments appreciated!

Monday, December 12, 2011

Analyzing logs with Pig and Elastic MapReduce

This is a blog post I wrote for this year's Sysadvent series. If you're not familiar with the Sysadvent blog, you should be, if you are at all interested in system administration/devops topics. It is maintained by the indefatigable Jordan Sissel, and it contains posts contributed by various authors, 25 posts per year, from Dec 1st through Dec 25th. The posts cover a variety of topics, and to give you a taste, here are the articles so far this year:

Day 1: "Don't bash your process outputs" by Phil Hollenback
Day 2: "Strategies for Java deployment" by Kris Buytaert
Day 3: "Share skills and permissions with code" by Jordan Sissel
Day 4: "A guide to packaging systems" by Jordan Sissel
Day 5: "Tracking requests with Request Tracker" by Christopher Webber
Day 6: "Always be hacking" by John Vincent
Day 7: "Change and proximity of communication" by Aaron Nichols
Day 8: "Running services with systemd" by Jordan Sissel
Day 9: "Data in the shell" by Jordan Sissel
Day 10: "Analyzing logs with Pig and Elastic MapReduce" by yours truly
Day 11: "Simple disk-based server backups with rsnapshot" by Phil Hollenback
Day 12: "Reverse-engineer servers with Blueprint" by Richard Crowley

Jordan needs more articles for this year, so if you have something to contribute, please propose it on the mailing list.

Wednesday, December 07, 2011

Crowd Mood - an indicator of health for products/projects

I thought I just coined a new term -- Crowd Mood -- but a quick Google search revealed a 2009 paper on "Crowd Behavior at Mass Gatherings: A Literature Review" (PDF) which says:

In the mass-gathering literature, the use of terms “crowd behavior”, “crowd type”, “crowd management”, and “crowd mood” are used in variable contexts. More practically, the term “crowd mood” has become an accepted measure of probable crowd behavior outcomes. This is particularly true in the context of crowds during protests/riots, where attempts have been made to identify factors that lead to a change of mood that may underpin more violent behavior.

Instead of protests and riots, the crowd behavior I'm referring to is the reaction of users of software products or projects. I think the overall Crowd Mood of these users is a good indicator of the health of those products/projects. I may state the obvious here, and maybe it's been done already, but I'm not aware of large-scale studies that try to correlate the success or failure of a particular software project with the mood of its users, as expressed in messages to mailing lists, in blog posts, and of course Twitter.

I'm aware of Sentiment Analysis and I know there are companies who offer this service by mining Twitter. But a more rigorous study would include other data sources. I have in mind something similar to this study by Kaleev Leetaru: "Culturonomics 2.0: Forecasting large-scale human behavior using global news media tone in time and space". This study mined information primarily from the archives of the "Summary of World Broadcasts (SWB)" global news monitoring service. It analyzed the tone or mood of the news regarding a particular event/person/place, and it established correlations between the sentiment it mined (for example negative news regarding Egypt's then-president Mubarak) and events that happen shortly afterwards, such as the Arab Spring of 2011. The study actually talks not only about correlation, but also about forecasting events based on current news tone.

I believe similar studies mining the Crowd Mood would be beneficial to any large-scale software product or project. For example, Canonical would be well-advised to conduct such a study in order to determine whether their decision to drop Gnome in favor of Unity was good or not (my feeling? it was BAD! -- and I think the Crowd Mood surrounding this decision would confirm it).

Another example: Python 3 and the decision not to continue releasing Python 2 past 2.8. Good or bad? I say BAD! (see also Armin Ronacher's recent blog post on the subject, which brilliantly expresses the issues around this decision).

Yet one more example: the recent changes in the Google UI, especially GMail. BAD!

These examples have a common theme in my opinion: the unilateral decision by a company or project to make non-backwards-compatible changes without really consulting its users. I know Apple can pull this off, but they're the exception, not the rule. The attitude of "trust us, we know what's good for you" leads to failure in the long run.

Somebody should build a product (or even better, an Open Source project) around Crowd Mood analysis. Maybe based on Google's Prediction API, which has functionality for sentiment analysis.

Thursday, November 17, 2011

Seven years of blogging in less than 500 words


In a couple of days, this blog will be 7 years old. Hard to believe so much time has passed since my first Hello World post.

I keep reading that blogging is on the wane, and it's true, mainly because of the popularity of Twitter. But I strongly believe that blogging is still important, and that more people should do it. For me, it's a way to give back to the community. I can't even remember how many times I found solutions to my technical problems by reading a blog post. I personally try to post something on my blog every single time I solve an issue that I've struggled with. If you post documentation to a company wiki (assuming it's not confidential), I urge you to try to also blog publicly about it – think of it as a public documentation that can help both you and others.

Blogging is also a great way to further your career. Back in September 2008 I blogged about my experiences with EC2. I had launched an m1.small instance and I had started to play with it. Lo and behold, I got a job offer based on that post. I accepted the offer, and that job allowed me to greatly enhance my experience with cloud computing. This brings me to a more general point: if you want to work on something you know nothing about, start small on your own, persevere, and yes, blog about it! In my experience, you'll invariably find yourself in a position to be paid to work on it.

It's also pretty interesting for me to look at my blog posts and to recognize in them the evolution of my own career. I started to blog when I was working as a tester. I was passionate about automated testing, Python and agile processes and tools. This led to digging more into automation tools, then into automated deployments and configuration management in the context of system architecture. This led into cloud computing, and these days this is leading into Big Data analytics. Who knows what's next? Whatever it is, I'm sure it will be exciting, because I make an effort to make it so!

I think it's also important to recognize that learning a tool or a technique is necessary but not sufficient: at the end of the day it's what you do with it that matters. For me the progression has been testing->automation->cloud computing->data analytics->??? (I actually started my career as a programmer, but here I include only the period coinciding with my blogging years.)

This may be self-help-kind-of-corny, but a quote which is attributed (maybe wrongly) to Goethe really applies to everybody:

“Whatever you can do, or dream you can do, begin it. Boldness has genius, power, and magic in it. Begin it now."

Wednesday, November 09, 2011

Troubleshooting memory allocation errors in Elastic MapReduce

Yesterday we ran into an issue with some Hive scripts running within an Amazon Elastic MapReduce cluster. Here's the error we got:


Caused by: java.io.IOException: Spill failed
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:877)
 at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:474)
 at org.apache.hadoop.hive.ql.exec.ReduceSinkOperator.processOp(ReduceSinkOperator.java:289)
 ... 11 more
Caused by: java.io.IOException: Cannot run program "bash": java.io.IOException: error=12, Cannot allocate memory
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:176)
 at org.apache.hadoop.util.Shell.run(Shell.java:161)
 at org.apache.hadoop.fs.DF.getAvailable(DF.java:73)
 at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:329)
 at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
 at org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1238)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:703)
 at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1190)
Caused by: java.io.IOException: java.io.IOException: error=12, Cannot allocate memory
 at java.lang.UNIXProcess.(UNIXProcess.java:148)
 at java.lang.ProcessImpl.start(ProcessImpl.java:65)
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:453)



Googling around for java.io.IOException: java.io.IOException: error=12, Cannot allocate memory, it seems it's a common problem. See this AWS Developer Forums thread, this Hadoop core-user mailing list thread, and this explanation by Ken Krugler from Bixo Labs.

Basically, it boils down to the fact that when Java tries to fork a new process (in this case a bash shell), Linux will try to allocate as much memory as the current Java process, even though not all that memory will be required. There are several workarounds (read in particular the AWS Forum thread), but a solution that worked for us was to simply add swap space to the Elastic MapReduce slave nodes.

You can ssh into a slave node from the EMR master node by using the same private key you used when launching the EMR cluster, and by targeting the internal IP address of the slave node. In our case, the slaves are m1.xlarge instances, and they have 4 local disks (/dev/sdb through /dev/sde) mounted as /mnt, /mnt1, /mnt2 and /mnt3, with 414 GB available on each file system. I ran this simple script via sudo on each slave to add 4 swap files of 1 GB each, one on each of the 4 local disks.


$ cat make_swap.sh 
#!/bin/bash


SWAPFILES='
/mnt/swapfile1
/mnt1/swapfile1
/mnt2/swapfile1
/mnt3/swapfile1
'
for SWAPFILE in $SWAPFILES; do
dd if=/dev/zero of=$SWAPFILE bs=1024 count=1048576
mkswap $SWAPFILE
swapon $SWAPFILE
echo "$SWAPFILE swap swap defaults 0 0" >> /etc/fstab
done

This solved our issue. No more failed Map tasks, no more failed Reduce tasks. Maybe this will be of use to some other frantic admins out there (like I was yesterday) who are not sure how to troubleshoot the intimidating Hadoop errors they're facing.



Friday, November 04, 2011

Experiences with Amazon Elastic MapReduce


We started to use AWS Elastic MapReduce (EMR) in earnest a short time ago, with the help of Bradford Stephens from Drawn to Scale. We needed somebody to jumpstart our data analytics processes and workflow, and Bradford's help was invaluable. At some point we'll probably build our own Hadoop cluster either in EC2 or in-house, but for now EMR is doing the job just fine.

We started with an EMR cluster containing the master + 5 slave nodes, all m1.xlarge. We still have that cluster up and running, but in the mean time I've also experimented with launching clusters, running our data analytics processes on them, then shutting them down -- which is the 'elastic' type of workflow that takes full advantage of the pay-per-hour model of EMR.

Before I go into the details of launching and managing EMR clusters, here's the general workflow that we follow for our data analytics processes on a nightly basis:

  1. We gather data from various sources such as production databases, ad impression reports, mail server logs, and other 3rd party sources. All this data is in CSV format, mostly comma-separated or tab-separated. Each CSV file is timestamped with YYYY-MM-DD and corresponds to a 'table' or 'entity' that we want to analyse later.
  2. We gzip all CSV files and upload them to various S3 buckets.
  3. On the EMR Hadoop master node, we copy the csv.gz files we need from S3 into HDFS.
  4. We create Hive tables, one table for each type of 'entity'.
  5. We run Hive queries against these tables and save the results in HDFS.
  6. We export the results of the Hive queries to MySQL so we can further analyse them when we need to, and so we can visualize them in a dashboard.
I think this is a fairly common workflow for doing data analytics with Hadoop. It can definitely be optimized. Currently we create the Hive tables from scratch in step 4. Ideally we'll want to save them to S3, and only append new data to them, but the append operation seems to only exist in Hive 0.8 which is not yet available in EMR. But as suboptimal as it is, even if it takes a few hours each night, this process allows us to run queries that were simply impossible to execute outside of Hadoop.

Here are some experiments I've done with using EMR in its truly 'elastic' mode.

Installing the EMR Ruby CLI

Eventually we'll use something like boto's EMR bindings to manage our EMR clusters, but for quick experimentation I preferred a purely command-line tool, and the Ruby-based elastic-mapreduce tool seemed to be the only one available. To install, download the zip file from here, then unzip it somewhere on an EC2 instance where you can store your AWS credentials (a management-type instance usually). I installed in /opt/emr on one of our EC2 instances. At this point it's also a good idea to become familiar with the EMR Developer Guide, which has examples of various elastic-mapreduce use cases. I also found a good README on GitHub.

Next, create a credentials.json file containing some information about your AWS credentials and the keypair that will be used when launching the EMR cluster. The format of this JSON file is:

{
  "access-id": "YOUR_AWS_ACCESS_ID",
  "private-key": "YOUR_AWS_SECRET_KEY",
  "key-pair": "YOUR_EC2_KEYPAIR_NAME",
  "key-pair-file": "PATH_TO_PRIVATE_SSH_KEY_CORRESPONDING_TO_KEYPAIR",
  "region": "us-east-1",
  "log-uri": "s3://somebucket.yourcompany.com/logs"
}

Launching an EMR cluster

# ./elastic-mapreduce -c /opt/emr/credentials.json --create --name "test1" --alive --num-instances 3 --master-instance-type m1.small --slave-instance-type m1.small --hadoop-version 0.20 --hive-interactive --hive-versions 0.7.1

The command line above launches an EMR cluster called test1 with 3  instances (1 Hadoop master and 2 slave nodes), installs Hadoop 0.20 and Hive 0.71 on it, then keeps the cluster up and running (because we specified --alive). For experimentation purposes I recommend using m1.small instances.

This command returns a jobflow ID, which you'll need for all other commands that reference this specific cluster.

Getting information about a specific jobflow

# ./elastic-mapreduce --describe --jobflow JOBFLOWID

This command returns a JSON document containing a wealth of information about the EMR cluster. Here's an example output.

Listing the state of a jobflow

# ./elastic-mapreduce --list --jobflow JOBFLOWID
JOBFLOWID     WAITING        ec2-A-B-C-D.compute-1.amazonaws.com         test1
   COMPLETED      Setup Hive                    

This command is useful when you're trying to ascertain whether the initial configuration of the cluster is done. The state of the cluster immediately after you launch it will change from STARTING to RUNNING (during which step Hadoop and Hive are installed) and finally to WAITING.

Enabling and disabling termination protection for a jobflow

If you want to make sure that the jobflow won't be terminated, you can turn the termination protection on (it's off by default):

# ./elastic-mapreduce --set-termination-protection true --jobflow JOBFLOWID

To disable the termination protection, set it to false.

Adding core nodes to the EMR cluster

There are 2 types of Hadoop slave nodes: core (which contribute to the HDFS cluster) and task (which run Hadoop tasks but are not part of the HDFS cluster). If you want to add core nodes, you can use this command:

# ./elastic-mapreduce --modify-instance-group CORE --instance-count NEW_COUNT

where NEW_COUNT is the new overall count of core nodes that you are targetting (so for example if you had 5 core nodes and you wanted to add 2 more, the NEW_COUNT will be 7).

Note that you can only request an increased NEW_COUNT for core nodes, never a decreased count.

Also note that if one or more slave nodes are misbehaving (more on that in a bit), it's better to terminate them (via ElasticFox or the AWS console for example) than to add new core nodes. When you terminate them, the EMR jobflow will automatically launch extra nodes so that the core node count is kept the same.

Accessing the Hadoop master node and working with the HDFS cluster

You can ssh into the Hadoop master by using the private key you specified in credentials.json. 

# ssh -i /path/to/private_ssh_jey hadoop@public_ip_or_dns_of_master

Or you can use:

# ./elastic-mapreduce --jobflow JOBFLOWID --ssh

Once you're logged in as user hadoop on the master node, you can run various HDFS commands such as:
  • creating an HDFS directory: $ hadoop fs -mkdir /user/hadoop/mydir
  • copying files from S3 into HDFS: $ hadoop fs -cp s3://somebucket.yourcompany.com/dir1/data*gz /user/hadoop/mydir
  • listing files in an HDFS directory: $ hadoop fs -ls /user/hadoop/mydir
  • deleting files in an HDFS directory: $ hadoop fs -rm /user/hadoop/*.gz
  • copying files from HDFS to a local file system on the master node: $ hadoop fs -copyToLocal /user/hadoop/mydir/*.gz /home/hadoop/somedirTh
There is also a very useful admin-type command which allows you to see the state of the slave nodes, and the HDFS file system usage. Here's an example which I ran on a cluster with 3 slave nodes (I added the 3nd node at a later time, which is why its DFS usage is less than the usage on the first 2 nodes):

$ hadoop dfsadmin -report
Configured Capacity: 5319863697408 (4.84 TB)
Present Capacity: 5052263763968 (4.6 TB)
DFS Remaining: 4952298123264 (4.5 TB)
DFS Used: 99965640704 (93.1 GB)
DFS Used%: 1.98%
Under replicated blocks: 3
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Name: 10.68.125.161:9200
Decommission Status : Normal
Configured Capacity: 1773287899136 (1.61 TB)
DFS Used: 47647641600 (44.38 GB)
Non DFS Used: 89327800320 (83.19 GB)
DFS Remaining: 1636312457216(1.49 TB)
DFS Used%: 2.69%
DFS Remaining%: 92.28%
Last contact: Fri Nov 04 20:31:31 UTC 2011


Name: 10.191.121.76:9200
Decommission Status : Normal
Configured Capacity: 1773287899136 (1.61 TB)
DFS Used: 49191796736 (45.81 GB)
Non DFS Used: 89329020928 (83.19 GB)
DFS Remaining: 1634767081472(1.49 TB)
DFS Used%: 2.77%
DFS Remaining%: 92.19%
Last contact: Fri Nov 04 20:31:29 UTC 2011


Name: 10.68.105.36:9200
Decommission Status : Normal
Configured Capacity: 1773287899136 (1.61 TB)
DFS Used: 3126202368 (2.91 GB)
Non DFS Used: 88943112192 (82.83 GB)
DFS Remaining: 1681218584576(1.53 TB)
DFS Used%: 0.18%
DFS Remaining%: 94.81%
Last contact: Fri Nov 04 20:31:30 UTC 2011

When there's something wrong with any slave data node, e.g. it can't be contacted by the master, then the number of dead nodes will be non-zero, and the 'Last contact' date for that slave will be off compared to the healthy nodes.

Another useful admin command is 

$ hadoop fsck /

which should report HEALTHY:


Status: HEALTHY
 Total size: 99648130222 B
 Total dirs: 76
 Total files: 5044 (Files currently being written: 1)
 Total blocks (validated): 5215 (avg. block size 19107982 B) (Total open file blocks (not validated): 1)
 Minimally replicated blocks: 5215 (100.0 %)
 Over-replicated blocks: 0 (0.0 %)
 Under-replicated blocks: 3 (0.057526365 %)
 Mis-replicated blocks: 0 (0.0 %)
 Default replication factor: 1
 Average block replication: 1.0011505
 Corrupt blocks: 0
 Missing replicas: 21 (0.4022218 %)
 Number of data-nodes: 3
 Number of racks: 1




The filesystem under path '/' is HEALTHY

Terminating a jobflow

As you would expect, the command is:

# ./elastic-mapreduce --terminate --jobflow JOBFLOWID

Inspecting Hive logs

After running a Hive query on the Hadoop master node, you can inspect the logs created in this directory (this is for Hive 0.71): /mnt/var/lib/hive_07_1/tmp/history

Putting it all together

I put many of these pieces together in one script (see gist here) that I run every night from the EC2 management instance, and that does the following:
  1. Launches EMR cluster with desired instance types for the master node and the slave nodes, and with the desired instance count (this count includes the master node); the command line also specified that Hadoop 0.20 and Hive 0.71 need to be installed, and it keeps the cluster up and running via the --alive option.
  2. Waits in a loop for the state of the jobflow to be WAITING and sleeps 10 seconds in between checks.
  3. Retrieves the public DNS name of the Hadoop master from the JSON description of the new jobflow.
  4. Copies from the management node to the Hadoop master a local directory containing scripts that will be run on the master.
  5. Runs a script (called run_hive_scripts.sh) on the Hadoop master via ssh; this script does all the further Hadoop and Hive processing, which includes
    • creation of HDFS directories
    • copying of csv.gz files from S3 into HDFS
    • creation of Hive files
    • running of Hive queries
    • saving of Hive queries to a local directory on the Hadoop master
  6. Retrieves the Hive output files by scp-ing them back from the Hadoop master
  7. Terminates the EMR cluster
At this point, the Hive output files are processed and the data is inserted into a MySQL instance for further analysis and visualization.

That's about it for now. We have a lot more work to do before we declare ourselves satisfied with the state of our data analytics platform, and I'll blog more about it as soon as we cross more things off our todo list.

Friday, October 28, 2011

More fun with the LSI MegaRAID controllers

As I mentioned in a previous post, we've had some issues with the LSI MegaRAID controllers on our Dell C2100 database servers. Previously we noticed periodical slow-downs of the databases related to decreased I/O throughput. It turned out it was the LSI RAID battery going through its relearning cycle.

Last night we got paged again by increased load on one of the Dell C2100s. The load average went up to 25, when typically it's between 1 and 2. It turns out one of the drives in the RAID10 array managed by the LSI controller was going bad. You would think the RAID array would be OK even with a bad drive, but the drive didn't go completely offline, so the controller was busy servicing it and failing. This had the effect of decreasing the I/O throughput on the server, and making our database slow.

For my own reference, and hopefully for others out there, here's what we did to troubleshoot the issue. We used the MegaCli utilities (see this post for how to install them).

Check RAID event log for errors

# ./MegaCli64 -AdpEventLog -GetSinceReboot -f events.log -aALL

This will save all RAID-related events since the last reboot to events.log. In our case, we noticed these lines:

===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
Error: 3

seqNum: 0x00001654
Time: Fri Oct 28 04:45:26 2011

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 0b(e0x0c/s11) Path
5000c50034731385, CDB: 2a 00 0b a1 4c 00 00 00 18 00, Sense: 6/29/02
Event Data:
===========
Device ID: 11
Enclosure Index: 12
Slot Number: 11
CDB Length: 10
CDB Data:
002a 0000 000b 00a1 004c 0000 0000 0000 0018 0000 0000 0000 0000 0000
0000 0000 Sense Length: 18
Sense Data:
0070 0000 0006 0000 0000 0000 0000 000a 0000 0000 0000 0000 0029 0002
0002 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00001655

Time: Fri Oct 28 04:45:26 2011

Code: 0x00000060
Class: 1
Locale: 0x02
Event Description: Predictive failure: PD 0b(e0x0c/s11)
Event Data:



Check state of particular physical drive

You need the enclosure index (12 in the output above) and the device ID (11 in the output above). Then you run:

./MegaCli64 -pdInfo -PhysDrv [12:11] aALL

In our case, we saw that Media Error Count was non-zero.

Mark physical drive offline

We decided to mark the faulty drive offline, to see if that takes work off of the RAID controller and improves I/O throughput in the system.

# ./MegaCli64 -PDOffline  -PhysDrv [12:11] aALL

Indeed, we noticed how the I/O wait time and the load average on the system dropped to normal levels.

Hot swap faulty drive

We had a spare drive which we hot-swapped with the faulty drive. The new drive started to rebuild. You can see its status if you run:

# ./MegaCli64 -pdInfo -PhysDrv [12:11] aALL

Enclosure Device ID: 12
Slot Number: 11
Device Id: 13
Sequence Number: 3
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 931.512 GB [0x74706db0 Sectors]
Non Coerced Size: 931.012 GB [0x74606db0 Sectors]
Coerced Size: 931.0 GB [0x74600000 Sectors]
Firmware state: Rebuild
SAS Address(0): 0x5000c500347da135
SAS Address(1): 0x0
Connected Port Number: 0(path0) 
Inquiry Data: SEAGATE ST31000424SS    KS689WK4Z08J            
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Foreign State: None 
Device Speed: Unknown 
Link Speed: Unknown 
Media Type: Hard Disk Device


Exit Code: 0x00

Check progress of the rebuilding process

# ./MegaCli64 -PDRbld -ShowProg -PhysDrv [12:11] -aALL
                                     
Rebuild Progress on Device at Enclosure 12, Slot 11 Completed 47% in 88 Minutes.

Exit Code: 0x00

Check event log again

Once the rebuilding is done, you can check the event log by specifying for example that you want to see the last 10 events:

# ./MegaCli64 -AdpEventLog -GetLatest 10 -f t1.log -aALL

In our case we saw something like this:


# cat t1.log 



seqNum: 0x00001720
Time: Fri Oct 28 12:48:27 2011

Code: 0x000000f9
Class: 0
Locale: 0x01
Event Description: VD 00/0 is now OPTIMAL
Event Data:
===========
Target Id: 0


seqNum: 0x0000171f
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000051
Class: 0
Locale: 0x01
Event Description: State change on VD 00/0 from DEGRADED(2) to OPTIMAL(3)
Event Data:
===========
Target Id: 0
Previous state: 2
New state: 3


seqNum: 0x0000171e
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from REBUILD(14) to ONLINE(18)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 20
New state: 24


seqNum: 0x0000171d
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000064
Class: 0
Locale: 0x02
Event Description: Rebuild complete on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x0000171c
Time: Fri Oct 28 12:48:27 2011

Code: 0x00000063
Class: 0
Locale: 0x02
Event Description: Rebuild complete on VD 00/0
Event Data:
===========
Target Id: 0


seqNum: 0x000016b7
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from OFFLINE(10) to REBUILD(14)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 16
New state: 20


seqNum: 0x000016b6
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000006a
Class: 0
Locale: 0x02
Event Description: Rebuild automatically started on PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11


seqNum: 0x000016b5
Time: Fri Oct 28 08:55:42 2011

Code: 0x00000072
Class: 0
Locale: 0x02
Event Description: State change on PD 0d(e0x0c/s11) from UNCONFIGURED_GOOD(0) to OFFLINE(10)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11
Previous state: 0
New state: 16


seqNum: 0x000016b4
Time: Fri Oct 28 08:55:42 2011

Code: 0x000000f7
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11) Info: enclPd=0c, scsiType=0, portMap=00, sasAddr=5000c500347da135,0000000000000000
Event Data:
===========
Device ID: 13
Enclosure Device ID: 12
Enclosure Index: 1
Slot Number: 11
SAS Address 1: 5000c500347da135
SAS Address 2: 0


seqNum: 0x000016b3
Time: Fri Oct 28 08:55:42 2011

Code: 0x0000005b
Class: 0
Locale: 0x02
Event Description: Inserted: PD 0d(e0x0c/s11)
Event Data:
===========
Device ID: 13
Enclosure Index: 12
Slot Number: 11

Here are some other useful troubleshooting commands:

Check firmware state across all physical drives


This is a useful command if you want to see at a glance the state of all physical drives attached to the RAID controller:

# ./MegaCli64 -PDList -a0 | grep Firmware

In the normal case, all drives should be ONLINE. If any drive reports as FAILED you have a problem.

Check virtual drive state

You can also check the virtual drive state:

# ./MegaCli64 -LDInfo -L0 -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (Target Id: 0)
Name:
RAID Level: Primary-1, Secondary-0, RAID Level Qualifier-0
Size:5.455 TB
State: Degraded
Stripe Size: 64 KB
Number Of Drives per span:2
Span Depth:6
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Encryption Type: None


Exit Code: 0x00

Here you can see that the state is degraded. Running the previous PDlist command shows that one drive is failed:

# ./MegaCli64 -PDList -aALL | grep Firmware
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Failed
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online
Firmware state: Online


Our next step is to write a custom Nagios plugin to check for events that are out of the ordinary. A good indication of an error state is the transition from 'Previous state: 0' to 'New state: N' where N > 0, e.g.:


Previous state: 0

New state: 16

Thanks to my colleague Marco Garcia for digging deep into the MegaCli documentation and finding some of these obscure commands.

Resources