Tuesday, January 24, 2012

An ode to running a database on bare metal

No, my muse is not quite strong enough to inspire an actual ode, but I still want to emphasize a few points about the goodness of running a database on bare metal.

At Evite, we use sharded MySQL for our production database. We designed the current architecture in 2009, when NoSQL was still very much in its infancy, so MySQL seemed a solid choice, a technology that we could at least understand. As I explained elsewhere, we do use MySQL in an almost non-relational way, and we sharded from the get-go, with the idea that it's better to scale horizontally than vertically.
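The post doesn't spell out the sharding scheme, so purely as an illustration, here is a minimal sketch of the kind of application-level shard routing that manual sharding implies. The shard map, DSNs, and modulo hashing are all hypothetical, not Evite's actual design:

```python
# Hypothetical application-level shard routing for a manually sharded MySQL
# setup. The shard map, host names and modulo hashing are illustrative
# assumptions only.

SHARD_MAP = {
    0: "mysql://db0.internal:3306/evite",
    1: "mysql://db1.internal:3306/evite",
    2: "mysql://db2.internal:3306/evite",
    3: "mysql://db3.internal:3306/evite",
}

def shard_for(entity_id, num_shards=len(SHARD_MAP)):
    """Map an entity id to a shard number with simple modulo hashing."""
    return entity_id % num_shards

def dsn_for(entity_id):
    """Return the DSN of the MySQL instance holding this entity's data."""
    return SHARD_MAP[shard_for(entity_id)]
```

The catch with a scheme this simple is the one mentioned later in the post: changing the number of shards means rehashing and moving data, which is why doubling the shard count is the natural growth step.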

We initially launched with the database hosted at a data center on a few Dell PE2970 servers, each with 16 GB of RAM and 2 quad-core CPUs. Each server was running 2 MySQL instances. We didn't get a chance to dark launch, but the initial load testing we did showed that we should be OK. However, there is nothing like production traffic to really stress-test your infrastructure, and we soon realized that we had an insufficient number of servers for the peak traffic we were expecting towards the end of the year.

We decided to scale horizontally in EC2, with one MySQL instance per m1.xlarge EC2 instance. At the time we also engaged Percona, and they helped us fine-tune our Percona XtraDB MySQL configuration so we could get the most out of the m1.xlarge horsepower. We managed to scale well enough for our high season in 2010, although we had plenty of pain points. We chose to use EBS volumes for our database files, because at the time EBS still gave people the illusion of stability and durability. We were very soon confronted with severe performance issues, manifested as very high CPU I/O wait times, which were sometimes so high as to make the instance useless.

I described in a previous post how proficient we became at failing over from a master that went AWOL to a slave. Our issues with EBS volumes were compounded by the fact that our database access pattern is very write-intensive, and a shared medium such as EBS was far from ideal. Our devops team was constantly on the alert, and it seemed like we were always rebuilding instances and recovering from EC2 instance failures, although the end-user experience was not affected.

Long story short, we decided to bring the database back in-house, at the data center, on 'real' bare-metal servers. No virtualization, thanks. The whole process went relatively smoothly. One important point I want to make here is that we already had a year's worth of hard numbers at that point regarding the access patterns to our database, iops/sec, MySQL query types, etc, etc. This made it easy to do proper capacity planning this time, in the presence of production traffic.

We started by buying 2 Dell C2100 servers, monster machines, with dual Intel Xeon X5650 processors (for a total of 24 cores), 144 GB RAM, and 12 x 1 TB hard disks, from which we built a 6 TB RAID 10 volume, which we further divided into LVM logical volumes for specific types of MySQL files.
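As a sketch only, the storage layout described above would look roughly like the following. The device names, array layout, and volume sizes are assumptions (and in practice the RAID 10 array would be built on the LSI controller itself rather than with mdadm):

```shell
# Sketch of the C2100 storage layout; device names and sizes are assumptions.
# The RAID 10 array itself would normally be created on the LSI MegaRAID
# controller; mdadm is shown only for illustration.
mdadm --create /dev/md0 --level=10 --raid-devices=12 /dev/sd[b-m]

# Carve the ~6 TB array into LVM logical volumes per MySQL file type:
pvcreate /dev/md0
vgcreate dbvg /dev/md0
lvcreate -L 4T   -n data dbvg   # InnoDB data files
lvcreate -L 1T   -n logs dbvg   # binary logs and InnoDB log files
lvcreate -L 500G -n tmp  dbvg   # tmpdir

# XFS mounted with nobarrier, relying on the battery-backed RAID cache
# (as described below in conjunction with O_DIRECT):
mkfs.xfs /dev/dbvg/data
mount -o noatime,nobarrier /dev/dbvg/data /var/lib/mysql
```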

We put 2 MySQL instances on each server, and we engaged Percona again to help us fine-tune the configuration, this time including not only MySQL, but also the hardware and the OS. They were super helpful to us, as usual. Here are only some of the things they recommended, which we implemented:
  • set vm.swappiness kernel setting to 0 in /etc/sysctl.conf
  • set InnoDB flush method to O_DIRECT because we can rely on the RAID controller to do the caching (we also mounted XFS with the nobarrier option in conjunction with this change)
  • disable MySQL query cache, which uses a global mutex that can cause performance issues when used on a multi-core server
  • various other optimizations which were dependent on our setup, things like tweaking MySQL configuration options such as key_buffer_size and innodb_io_capacity
One important MySQL configuration option that we had to tweak was innodb_buffer_pool_size. If we set it too high, the server could start swapping. If we set it too low, disk I/O on the server became problematic. Since we had 144 GB of RAM and were running 2 MySQL instances per server, we decided to give each instance a 60 GB buffer pool. This proved to strike a good balance.
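Pulled together, the tuning above corresponds roughly to settings like these. Apart from vm.swappiness, the flush method, the disabled query cache, and the 60 GB buffer pool, the values are placeholders, not our actual numbers:

```ini
# /etc/sysctl.conf
vm.swappiness = 0

# my.cnf for one of the two instances on a C2100
[mysqld]
innodb_flush_method     = O_DIRECT   # RAID controller cache does the caching
query_cache_type        = 0          # avoid the query cache global mutex
query_cache_size        = 0
innodb_buffer_pool_size = 60G        # 2 instances sharing 144 GB of RAM
# tuned per workload -- placeholder values:
key_buffer_size         = 128M
innodb_io_capacity      = 2000
```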

Once the fine-tuning was done, we directed production traffic away from 4 EC2 m1.xlarge instances to 2 x 2 MySQL instances, with each pair running on a C2100. We then sat back and wallowed for a while in the goodness of the I/O numbers we were observing. Basically, the servers were barely working. This is how life should be. 

We soon migrated all of our MySQL masters back into the data center. We left the slaves running in EC2 (still one m1.xlarge slave per MySQL master instance), but we changed them from being EBS-backed to using the local ephemeral disk in RAID 0 with LVM. We look at EC2 in this case as a secondary data center, used only in emergency situations.

One thing that bit us in our bare-metal setup was... a bare-metal issue around the LSI MegaRAID controllers. I already blogged about the problems we had with the battery relearning cycle, and with decreased performance in the presence of bad drives. But these things were easy to fix (again thanks to our friends at Percona for diagnosing these issues correctly in the first place...)

I am happy to report that we went through our high season for 2011 without a glitch in this setup. Our devops team slept much better at night too! One nice thing about having EC2 as a 'secondary data center' is that, if need be, we can scale out horizontally by launching more EC2 instances. In fact, we doubled the number of MySQL slave instances for the duration of our high season, with the thought that if we needed to, we could double the number of shards at the application layer and thus scale horizontally that way. Fortunately we didn't have to, but we could have -- a strategy that would be hard to pull off without a cloud presence, unless we bought a lot of extra capacity at the data center.

This brings me to one of the points I want to make in this post: it is a very valuable strategy to be able to use the cloud to roll out a new architecture (one which you have, however, designed from the get-go to be horizontally scalable) and to gauge its performance in the presence of real production traffic. You will get less than optimal performance per instance (because of virtualization vs. real hardware), but since you can scale horizontally, you should be able to sustain the desired level of traffic for your application. You will get hard numbers that will help you do capacity planning, and you will be able to bring the database infrastructure back to real hardware if you so wish, like we did. Note that Zynga has a similar strategy -- they roll out new games in EC2, and once they get a handle on how much traffic a game has, they bring it back into the data center (although it looks like they use a private cloud rather than bare metal).

Another point I want to make is that the cloud is not ready yet for write-intensive transactional databases, mainly because of the very poor I/O performance that you get on virtual instances in the cloud (compounded by shared network storage such as EBS). Adrian Cockcroft would reply that Netflix is doing just fine, and they're exclusively in EC2. I hope they are doing just fine, and I hope their devops team is getting some good sleep at night, but I'm not sure. Perhaps I need to qualify my point and say that the cloud is not ready for traditional transactional databases such as MySQL and PostgreSQL, which require manual sharding to be horizontally scalable. If I had to redesign our database architecture today, I'd definitely try out HBase, Riak, and maybe Cassandra. The promise there, at least, is that adding a new node to the cluster is much less painful with these technologies than in the manual sharding and scaling scenario. This still doesn't guarantee that you won't end up paying for a lot of instances to compensate for poor individual I/O per instance. Maybe a cloud vendor like Joyent with their SmartMachines will make a difference in this area (in fact, it is on our TODO list to test out their Percona SmartMachine).

Note, however, that there's something to be said for using good ol' RDBMS technologies. Ryan Mack says this in a Facebook Engineering post:

"After a few discussions we decided to build on four of our core technologies: MySQL/InnoDB for storage and replication, Multifeed (the technology that powers News Feed) for ranking, Thrift for communications, and memcached for caching. We chose well-understood technologies so we could better predict capacity needs and rely on our existing monitoring and operational tool kits."

The emphasis in the last sentence is mine. It's the operational aspect of a new architecture that will kill you first. With a well understood architecture, at least you have a chance to tame it.

Yet another point I'd like to make is: do not base your disaster recovery strategy in EC2 around EBS volumes, especially if you have a write-intensive database. It's not worth the performance loss, and most of all it's not worth the severe and unpredictable fluctuations in performance. It works much better in our experience to turn the ephemeral disks of an m1.xlarge EC2 instance into a RAID 0 array, put LVM on top of that, and use it for storing the various MySQL file types. We are then able to take LVM snapshots of that volume and upload the snapshots to S3. To build a new slave, we restore the snapshot from S3, then let replication catch up with the master. Works fine.
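A sketch of that slave storage and backup flow might look like the following. Device names, volume sizes, the mount points, the S3 bucket, and the use of s3cmd are all assumptions on my part, not the exact commands:

```shell
# Sketch of the ephemeral-disk RAID 0 + LVM layout on an m1.xlarge slave;
# device names, sizes, bucket name and tooling are assumptions.

# RAID 0 across the four ephemeral disks, with LVM on top:
mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sd[b-e]
pvcreate /dev/md0
vgcreate mysqlvg /dev/md0
lvcreate -L 1T -n mysql mysqlvg     # holds the MySQL datadir

# Take an LVM snapshot and ship it to S3:
lvcreate -s -L 50G -n mysql_snap /dev/mysqlvg/mysql
mount -o ro /dev/mysqlvg/mysql_snap /mnt/snap
tar czf /mnt/backup/mysql-$(date +%F).tar.gz -C /mnt/snap .
s3cmd put /mnt/backup/mysql-$(date +%F).tar.gz s3://backup-bucket/
umount /mnt/snap
lvremove -f /dev/mysqlvg/mysql_snap
```

Note that because the ephemeral disks vanish when the instance dies, RAID 0 here buys performance only; the S3 snapshots plus replication from the master are what provide the durability.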

There you have it. An ode in prose to running your database on bare metal. Try it, you may sleep better at night!


Steve said...

Nice post: it's always good to find out what works and what doesn't—thanks a lot.

MichaƂ Kwiatkowski said...

Great post Grig, I especially liked this sentence: "We chose to use EBS volumes for our database files, because at the time EBS still gave people the illusion of stability and durability." :)

I've started playing with EC2 recently, so your posts finally start making sense to me ;) and the whole blog is super helpful. Thanks again for sharing your experience!

Grig Gheorghiu said...

Thanks Steve!

Grig Gheorghiu said...

Hey Michal - great to hear from you! Glad you found my post useful. Let me know if you need any help with the AWS stuff. You know how to find me ;-)

Justin Ellison said...

So, if you moved your MySQL instances back to bare metal in a datacenter, I'm guessing you moved your appserver instances to the same datacenter? If not, how in the world did you overcome the latency?

EJ said...

What is your ping time between your app and your data center vs aws? Did that pose a problem during the migration?

Raul Macias said...

Very interesting; thanks for sharing this

Grig Gheorghiu said...

Justin -- we've always had the app servers inside the data center. We moved the DB out into EC2 at some point to scale it horizontally, but we kept the app servers inside the DC. Latency has never been a problem.

Grig Gheorghiu said...

EJ -- latency between DC and EC2 has never been a problem for us.

Chris Smith said...

Hi Grig,

Great post and I really enjoyed the rest of your blog as well. I was wondering if you would be interested in having some of your recent posts featured on DZone.com. If so, please contact me for more information.



Sunil said...

Very good explanation, more informative it helped me a lot.

Anonymous said...

Great post. Full of facts and real outcomes. Very nice read.

Gheorghe Gheorghiu said...

I too really liked the "Muse" as a starting point, and see, everyone has nice words of appreciation. Bravo!
