Tuesday, January 24, 2012

An ode to running a database on bare metal


No, my muse is not quite strong enough to inspire an actual ode, but I still want to emphasize a few points about the goodness of running a database on bare metal.

At Evite, we use sharded MySQL for our production database. We designed the current architecture in 2009, when NoSQL was still very much in its infancy, so MySQL seemed a solid choice, a technology that we could at least understand. As I explained elsewhere, we do use MySQL in an almost non-relational way, and we sharded from the get-go, with the idea that it's better to scale horizontally than vertically.

We initially launched with the database hosted at a data center on a few Dell PE2970 servers, each with 16 GB of RAM and 2 quad-core CPUs. Each server was running 2 MySQL instances. We didn't get a chance to dark launch, but the initial load testing we did showed that we should be OK. However, there is nothing like production traffic to really stress test your infrastructure, and we soon realized that we had too few servers for the peak traffic we were expecting towards the end of the year.

We decided to scale horizontally in EC2, with one MySQL instance per m1.xlarge EC2 instance. At the time we also engaged Percona, and they helped us fine-tune our Percona XtraDB MySQL configuration so we could get the most out of the m1.xlarge horsepower. We managed to scale sufficiently for our high season in 2010, although we had plenty of pain points. We chose EBS volumes for our database files, because at the time EBS still gave people the illusion of stability and durability. We were very soon confronted with severe performance issues, manifested as very high CPU I/O wait times, sometimes so high as to make the instance useless.

I described in a previous post how proficient we became at failing over from a master that went AWOL to a slave. Our issues with EBS volumes were compounded by the fact that our database access pattern is very write-intensive; a shared medium such as EBS was far from ideal. Our devops team was constantly on alert, and it seemed like we were always rebuilding instances and recovering from EC2 instance failures, although the end-user experience was not affected.

Long story short, we decided to bring the database back in-house, at the data center, on 'real' bare-metal servers. No virtualization, thanks. The whole process went relatively smoothly. One important point I want to make here is that by then we already had a year's worth of hard numbers on the access patterns to our database: iops/sec, MySQL query types, etc. That made it easy to do proper capacity planning this time, based on real production traffic.

We started by buying 2 Dell C2100 servers, monster machines with dual Intel Xeon X5650 processors (for a total of 24 cores), 144 GB RAM, and 12 x 1 TB hard disks, from which we built a 6 TB RAID 10 volume, further divided into LVM logical volumes for specific types of MySQL files.

We put 2 MySQL instances on each server, and we engaged Percona again to help us fine-tune the configuration, this time including not only MySQL, but also the hardware and the OS. They were super helpful to us, as usual. Here are only some of the things they recommended, which we implemented:
  • set vm.swappiness kernel setting to 0 in /etc/sysctl.conf
  • set InnoDB flush method to O_DIRECT because we can rely on the RAID controller to do the caching (we also mounted XFS with the nobarrier option in conjunction with this change)
  • disable MySQL query cache, which uses a global mutex that can cause performance issues when used on a multi-core server
  • various other optimizations which were dependent on our setup, things like tweaking MySQL configuration options such as key_buffer_size and innodb_io_capacity
One important MySQL configuration option we had to tweak was innodb_buffer_pool_size. Set it too high and the server could start swapping; set it too low and disk I/O would suffer. Since we had 144 GB of RAM and were running 2 MySQL instances per server, we gave each instance a 60 GB buffer pool, which proved to strike a good balance.
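Condensed into configuration form, those recommendations look roughly like this. This is only a sketch: the option names are standard sysctl/MySQL/InnoDB settings and the buffer pool size is the one discussed above, but the other values, device names and mount points are illustrative, not our exact production numbers:

```
# /etc/sysctl.conf
vm.swappiness = 0

# /etc/fstab -- XFS data volume mounted with nobarrier
# (device name and mount point are illustrative)
/dev/vgdb/mysql  /var/lib/mysql  xfs  noatime,nobarrier  0 0

# my.cnf (one per MySQL instance)
[mysqld]
innodb_flush_method     = O_DIRECT   # RAID controller does the caching
innodb_buffer_pool_size = 60G        # 2 instances x 60G on a 144 GB box
innodb_io_capacity      = 1000       # illustrative; tune to your storage
key_buffer_size         = 128M       # illustrative
query_cache_type        = 0          # disable the query cache...
query_cache_size        = 0          # ...and free its memory
```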

Once the fine-tuning was done, we directed production traffic away from 4 EC2 m1.xlarge instances to 2 x 2 MySQL instances, with each pair running on a C2100. We then sat back and wallowed for a while in the goodness of the I/O numbers we were observing. Basically, the servers were barely working. This is how life should be. 

We soon migrated all of our MySQL masters back into the data center. We left the slaves running in EC2 (still one m1.xlarge slave per MySQL master instance), but we changed them from being EBS-backed to using the local ephemeral disk in RAID 0 with LVM. We look at EC2 in this case as a secondary data center, used only in emergency situations.

One thing that bit us in our bare-metal setup was... a bare-metal issue around the LSI MegaRAID controllers. I already blogged about the problems we had with the battery relearning cycle, and with decreased performance in the presence of bad drives. But these things were easy to fix (again thanks to our friends at Percona for diagnosing these issues correctly in the first place).

I am happy to report that we went through our high season for 2011 without a glitch in this setup. Our devops team slept much better at night too! One nice thing about having EC2 as a 'secondary data center' is that, if need be, we can scale out horizontally by launching more EC2 instances. In fact, we doubled the number of MySQL slave instances for the duration of our high season, with the thought that if we needed to, we could double the number of shards at the application layer and thus scale horizontally that way. Fortunately we didn't have to, but we could have -- a strategy that would be hard to pull off without a cloud presence, unless we bought a lot of extra capacity at the data center.

This brings me to one of the points I want to make in this post: it is a very valuable strategy to use the cloud to roll out a new architecture (one you designed from the get-go to be horizontally scalable) and to gauge its performance in the presence of real production traffic. You will get less than optimal performance per instance (because of virtualization vs. real hardware), but since you can scale horizontally, you should be able to sustain the desired level of traffic for your application. You will get hard numbers that will help you do capacity planning, and you will be able to bring the database infrastructure back to real hardware if you so wish, like we did. Note that Zynga has a similar strategy -- they roll out new games in EC2, and once they get a handle on how much traffic a game generates, they bring it back into the data center (although it looks like they still use a private cloud and not bare metal).

Another point I want to make is that the cloud is not ready yet for write-intensive transactional databases, mainly because of the very poor I/O performance you get on virtual instances in the cloud (compounded by shared network storage such as EBS). Adrian Cockcroft will reply that Netflix is doing just fine, and they're exclusively in EC2. I hope they are doing just fine, and I hope his devops team is getting some good sleep at night, but I'm not sure. Perhaps I should qualify my point and say that the cloud is not ready for traditional transactional databases such as MySQL and PostgreSQL, which require manual sharding to be horizontally scalable. If I had to redesign our database architecture today, I'd definitely try out HBase, Riak and maybe Cassandra. The promise there, at least, is that adding a new node to the cluster is much less painful than in the manual sharding and scaling scenario. This still doesn't guarantee that you won't end up paying for a lot of instances to compensate for poor individual I/O per instance. Maybe a cloud vendor like Joyent with their SmartMachines will make a difference in this area (in fact, it is on our TODO list to test their Percona SmartMachine).

Note however that there's something to be said for using good ol' RDBMS technologies. Ryan Mack says this in a Facebook Engineering post:

"After a few discussions we decided to build on four of our core technologies: MySQL/InnoDB for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching. We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits."

I want to emphasize that last sentence: it's the operational aspect of a new architecture that will kill you first. With a well-understood architecture, at least you have a chance to tame it.

Yet another point I'd like to make: do not base your disaster recovery strategy in EC2 around EBS volumes, especially if you have a write-intensive database. It's not worth the performance loss, and most of all it's not worth the severe and unpredictable fluctuations in performance. It works much better in our experience to turn the ephemeral disks of an m1.xlarge EC2 instance into a RAID 0 array, put LVM on top of that, and use it for storing the various MySQL file types. We are then able to take LVM snapshots of that volume and upload the snapshots to S3. To build a new slave, we restore the snapshot from S3, then let replication catch up with the master. Works fine.
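In outline, the backup side of this looks something like the following runbook. This is a sketch only: the device names, volume group and snapshot names, sizes and the S3 upload tool are all illustrative (the four ephemeral devices correspond to an m1.xlarge), so adapt it to your own setup:

```
# RAID 0 across the four ephemeral disks, with LVM on top
mdadm --create /dev/md0 --level=0 --raid-devices=4 \
      /dev/sdb /dev/sdc /dev/sdd /dev/sde
pvcreate /dev/md0
vgcreate vgdb /dev/md0
# leave free space in the VG so snapshots have somewhere to live
lvcreate -L 1200G -n mysql vgdb
mkfs.xfs /dev/vgdb/mysql

# backup: snapshot the volume, ship it to S3, drop the snapshot
lvcreate -s -L 100G -n mysql-snap /dev/vgdb/mysql
mount -o ro,nouuid /dev/vgdb/mysql-snap /mnt/mysql-snap
tar czf /tmp/mysql-snap.tar.gz -C /mnt/mysql-snap .
# upload /tmp/mysql-snap.tar.gz to S3 with your tool of choice
umount /mnt/mysql-snap
lvremove -f /dev/vgdb/mysql-snap
```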

There you have it. An ode in prose to running your database on bare metal. Try it, you may sleep better at night!

Thursday, January 05, 2012

Graphing, alerting and mission control with Graphite and Nagios


We’ve been using Graphite more and more for graphing of OS- and application-related metrics (here are some old-ish notes of mine on installing and configuring Graphite.) We measure and graph variables as diverse as:
  • relative and absolute state of charge of the LSI MegaRAID controller battery (why? because we’ve been burned by battery issues before)
  • database server I/O wait time (critical for EC2 instances, which are notorious for their poor I/O performance; at this point we only run MySQL slaves in EC2, and we do not, repeat, DO NOT use EBS volumes for DB servers -- instead we stripe the local disks into a RAID 0 array with LVM)
  • memcached stats such as current connections, get hits and misses, delete hits and misses, evictions
  • percentage and absolute number of HTTP return codes as served by nginx and haproxy
  • count of messages in various queues (our own and Amazon SQS)
  • count of outgoing mail messages

We do have a large Munin install base, so we found Adam Jacob’s munin-graphite.rb script very useful in sending all data captured by Munin to Graphite.

Why Graphite and not Munin or Ganglia? Mainly because it’s so easy to send arbitrarily named metrics to Graphite, but also because we can capture measurements at 1 second granularity (although this is possible with some tweaking with RRD-based tools as well).

On the Graphite server side, we set up different retention policies depending on the type of data we capture. For example, for app server logs (nginx and haproxy) we have the following retention policy specified in /opt/graphite/conf/storage-schemas.conf:

[appserver]
pattern = ^appserver\.
retentions = 60s:30d,10m:180d,60m:2y


This tells Graphite we want to keep data aggregated at 60 second intervals for 30 days, 10 minute data for 6 months and hourly data for 2 years.

The main mechanism we use for sending data to Graphite is tailing various log files at different intervals, parsing the entries in order to extract the metrics we’re interested in, and sending those metrics to Graphite by the tried-and-true method called ‘just open a socket’.

For example, we tail the nginx access log file via a log tracker script written in Python (and run as a daemon with supervisor), and we extract values such as the timestamp, the request URL, the HTTP return code, bytes sent and the request time in milliseconds. The default interval for collecting these values is 1 minute. For HTTP return codes, we group the codes such as 2xx, 3xx, 4xx, 5xx together, so we can report on each type of return code. We aggregate the values per collection interval, then send the counts to Graphite, named something like appserver.app01.500.reqs, which represents the HTTP 500 error count on server app01.
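A stripped-down sketch of that pipeline, assuming Graphite's plaintext protocol on port 2003 (the status-code field position matches the default nginx 'combined' log format, but the metric naming and host name are illustrative, not our exact production code):

```python
import socket
import time
from collections import Counter

def bucket_status(code):
    """Group an HTTP status code into a 2xx/3xx/4xx/5xx bucket."""
    return "%dxx" % (int(code) // 100)

def aggregate(log_lines):
    """Count return codes per bucket for one collection interval."""
    counts = Counter()
    for line in log_lines:
        # in the default nginx 'combined' format, the status code
        # is the 9th whitespace-separated field
        fields = line.split()
        if len(fields) > 8:
            counts[bucket_status(fields[8])] += 1
    return counts

def send_to_graphite(counts, host="graphite.yourdomain.com", port=2003):
    """'Just open a socket': one 'name value timestamp' line per metric."""
    now = int(time.time())
    payload = "".join(
        "appserver.app01.%s.reqs %d %d\n" % (bucket, n, now)
        for bucket, n in counts.items())
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(payload.encode())
    finally:
        sock.close()
```

A daemon built around this would tail the access log, call aggregate() on each minute's worth of lines, and hand the result to send_to_graphite().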

A more elegant way would be to use a tool such as logster to capture various log entries, but we haven’t had the time to write logster plugins for the 2 main services we’re interested in, nginx and haproxy. Our solution is deemed temporary, but as we all know, there’s nothing more permanent than a temporary solution.

For some of the more unusual metrics we measure ourselves, such as the LSI MegaRAID battery charge state, we run a shell script in an infinite loop, producing a value every second and sending it to Graphite. To obtain the value we run something that resembles line noise:

$ MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 | grep -i "Relative State of Charge" | awk '{print $5}'

(thanks to my colleague Marco Garcia for coming up with this)

Once the data is captured in Graphite, we can do several things with it:
  • visualize it using the Graphite dashboards
  • alert on it using custom Nagios plugins
  • capture it in our own more compact dashboards, so we can have multiple graphs displayed on one ‘mission control’ page

Nagios plugins

My colleague Marco Garcia wrote a Nagios plugin in Python to alert on HTTP 500 errors. To obtain the data from Graphite, he queries a special ‘rawData’ URL of this form:

http://graphite.yourdomain.com/render/?from=-5minutes&until=-1minutes&target=asPercent(sum(appserver.app01.500.*.reqs),sum(appserver.app01.*.*.reqs))&rawData

which returns something like

asPercent(sumSeries(appserver.app01.500.*.reqs),sumSeries(appserver.app01.*.*.reqs)),1325778360,1325778600,60|value1,value2,value3,value4

where the 4 values represent the 4 data points, one per minute, from 5 minutes ago to 1 minute ago. Each data point is the percentage of HTTP 500 errors calculated against the total number of HTTP requests.

The script then compares the values returned with a warning threshold (0.5) and a critical threshold (1.0). If all values are greater than the respective threshold, we issue a warning or a critical Nagios alert.
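The parsing and threshold logic of such a plugin can be sketched like this (the rawData format is as shown above; the HTTP fetch and the Nagios plumbing are omitted, and the return values follow the standard Nagios exit-code convention):

```python
def parse_raw_data(raw):
    """Parse Graphite's rawData format: 'target,start,end,step|v1,v2,...'.
    Skips 'None' entries, which Graphite emits for missing data points."""
    header, _, values = raw.strip().partition("|")
    return [float(v) for v in values.split(",") if v != "None"]

def check(values, warn=0.5, crit=1.0):
    """Return a Nagios status code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
    Alert only if EVERY data point in the window exceeds the threshold,
    which filters out one-off spikes."""
    if not values:
        return 3  # UNKNOWN: no data came back
    if all(v > crit for v in values):
        return 2
    if all(v > warn for v in values):
        return 1
    return 0
```

The plugin would fetch the rawData URL, then call sys.exit(check(parse_raw_data(response))) so Nagios picks up the status from the exit code.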

Mission control dashboard

My colleague Josh Frederick came up with the idea of presenting all relevant graphs based on data captured in Graphite in a single page which he dubbed ‘mission control’. This enables us to make correlations at a glance between things such as increased I/O wait on DB servers and spikes in HTTP 500 errors.

To generate a given graph, Josh uses some Javascript magic to come up with a URL such as:

http://graphite.yourdomain.com/render/?width=500&height=200&hideLegend=1&from=-60minutes&until=-0minutes&bgcolor=FFFFFF&fgcolor=000000&areaMode=stacked&title=500s%20as%20%&target=asPercent(sum(appserver.app*.500.*.reqs),%20sum(appserver.app*.*.*.reqs))&_ts=1325783046438

which queries Graphite for the percentage of HTTP 500 errors from all HTTP requests across all app servers for the last 60 minutes. The resulting graph looks like this:
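In Python terms (our actual dashboard builds these URLs in Javascript), the URL construction amounts to something like this; the base URL and default parameters mirror the example above, and 'target' can be any Graphite expression:

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2

def render_url(target, base="http://graphite.yourdomain.com/render/",
               minutes=60, width=500, height=200):
    """Build a Graphite render URL for the last `minutes` of `target`."""
    params = [
        ("width", width),
        ("height", height),
        ("hideLegend", 1),
        ("from", "-%dminutes" % minutes),
        ("until", "-0minutes"),
        ("areaMode", "stacked"),
        ("target", target),
    ]
    return base + "?" + urlencode(params)
```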



Our mission control page currently has 29 such graphs.

We also have (courtesy of Josh again) a different set of charts based on the Google Visualization Python API. We get the data from Graphite in CSV format (by adding format=csv to the Graphite query URLs), then we display the data using the Google Visualization API.

If you don’t want to roll your own JS-based dashboard talking to Graphite, there’s a tool called statsdash that may be of help.