Saturday, April 04, 2009

Experiences deploying a large-scale infrastructure in Amazon EC2

At OpenX we recently completed a large-scale deployment of one of our server farms to Amazon EC2. Here are some lessons learned from that experience.

Expect failures; what's more, embrace them

Things are bound to fail in any large-scale deployment, but especially when you're deploying virtual servers 'in the cloud', outside of your sphere of influence. You must be prepared for failure at every level. This is a Good Thing, because it forces you to think about failure scenarios upfront, and to design your system infrastructure in a way that minimizes single points of failure.

As an aside, I've been very impressed with the reliability of EC2. Like many other people, I didn't know what to expect, but I've been pleasantly surprised. Very rarely does an EC2 instance fail. In fact I haven't yet seen a total failure, only some instances that were marked as 'deteriorated'. When this happens, you usually get a heads-up via email, and you have a few days to migrate your instance, or launch a similar one and terminate the defective one.

Expecting things to fail at any time both motivates and depends on the next lesson learned, which is...

Fully automate your infrastructure deployments

There's simply no way around this. When you need to deal with tens or even hundreds of virtual instances, and when you need to scale up and down on demand (after all, this is THE main promise of cloud computing!), you need to fully automate your infrastructure deployment (servers, load balancers, storage, etc.).

The way we achieved this at OpenX was to write our own custom code on top of the EC2 API in order to launch and terminate instances and to create and destroy EBS volumes. We rolled our own AMI, which contains enough bootstrap code to make it 'call home' to a set of servers running slack. When we deploy a machine, we specify a list of slack 'roles' that the machine belongs to (for example 'web-server', 'master-db-server' or 'slave-db-server'). When the machine boots up, it runs a script that belongs to that specific slack role. In this script we install everything the machine needs to do its job -- prerequisite packages and the actual application with all its necessary configuration files.
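None of our actual deployment code is shown here, but to make the flow concrete, here is a minimal sketch using the boto library (the AMI ID, key pair, security group and role names are all placeholders, not our real ones) of launching an instance whose slack roles are passed in via EC2 user data:

```python
import boto

# credentials come from the environment or the boto config file
conn = boto.connect_ec2()

# launch an instance from our custom AMI; the slack roles ride along as user data
reservation = conn.run_instances(
    'ami-12345678',                        # placeholder AMI ID
    instance_type='c1.medium',
    key_name='openx-keypair',              # placeholder key pair
    security_groups=['web'],               # placeholder security group
    user_data='roles=web-server,memcache-server',
)
print 'Launched', reservation.instances[0].id
```

On the instance side, the bootstrap code can read the roles back from the metadata service (http://169.254.169.254/latest/user-data) and hand them to slack.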

I will blog separately about how exactly slack works for us, but let me just say that it is an extremely simple tool. It may seem overly simple, but that's exactly its strength, since it forces you to be creative with your postinstall scripts. I know that other people use puppet, or fabric, or cfengine. Whatever works for you, go ahead and use it -- just use SOME tool that helps with automated deployments.

The beauty of fully automating your deployments is that it truly allows you to scale infinitely (for some value of 'infinity' of course ;-). It almost goes without saying that your application infrastructure needs to be designed in such a way that allows this type of scaling. But having the building blocks necessary for automatically deploying any type of server that you need is invaluable.

Another thing we do which helps with automating various pieces of our infrastructure is that we keep information about our deployed instances in a database. This allows us to write tools that inspect the database and generate various configuration files (such as the all-important role configuration file used by slack), and other text files such as DNS zone files. This database becomes the one true source of information about our infrastructure. The DRY principle applies to system infrastructure, not only to software development.
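To make that concrete, here is a minimal sketch of the kind of table the launch tooling writes to (the schema and column names are invented for illustration, not our actual ones); the role files, DNS zone files and other configuration can then all be generated by querying it:

```python
import sqlite3  # we use a regular database server; sqlite3 just keeps the sketch self-contained

db = sqlite3.connect('infrastructure.db')
db.execute("""
    CREATE TABLE IF NOT EXISTS instances (
        instance_id TEXT PRIMARY KEY,
        hostname    TEXT,
        internal_ip TEXT,
        roles       TEXT,   -- comma-separated slack roles, e.g. 'web-server'
        status      TEXT    -- 'running', 'terminated', ...
    )
""")

def record_instance(instance_id, hostname, internal_ip, roles):
    """Called by the launch tool right after the EC2 API confirms the instance."""
    db.execute(
        "INSERT OR REPLACE INTO instances VALUES (?, ?, ?, ?, 'running')",
        (instance_id, hostname, internal_ip, ','.join(roles)),
    )
    db.commit()
```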

Speaking of DNS, specifically in the context of Amazon EC2, it's worth rolling out your own internal DNS servers, with zones that aren't even registered publicly, but for which your internal DNS servers are authoritative. Then all communication within the EC2 cloud can happen via internal DNS names, as opposed to IP addresses. Trust me, your tired brain will thank you. This would be very hard to achieve though if you were to manually edit BIND zone files. Our approach is to automatically generate those files from the master database I mentioned. Works like a charm. Thanks to Jeff Roberts for coming up with this idea and implementing it.
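As a rough illustration (the zone name, SOA values and file path are all placeholders), generating an internal zone file from that same hypothetical instances table could look like this:

```python
import sqlite3
import time

db = sqlite3.connect('infrastructure.db')
rows = db.execute(
    "SELECT hostname, internal_ip FROM instances WHERE status = 'running' ORDER BY hostname"
)

serial = time.strftime('%Y%m%d%H')   # bump the serial so BIND notices the change
zone = [
    '$TTL 300',
    '@ IN SOA ns1.internal.example. hostmaster.internal.example. (',
    '        %s ; serial' % serial,
    '        3600 ; refresh',
    '        900 ; retry',
    '        604800 ; expire',
    '        300 ) ; negative caching TTL',
    '@ IN NS ns1.internal.example.',
]
for hostname, ip in rows:
    zone.append('%-20s IN A %s' % (hostname, ip))

open('db.internal.example', 'w').write('\n'.join(zone) + '\n')
```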

While we're on the subject of fully automated deployments, I'd like to throw an idea out there that I first heard from Mike Todd, my boss at OpenX, who is an ex-Googler. One of his goals is for us never to have to ssh into any production server. We deploy the server using slack, the application gets installed automatically, monitoring agents get set up automatically, so there should really be no need to manually do stuff on the server itself. If you want to make a change, you make it in a slack role on the master slack server, and it gets pushed to production. If the server misbehaves or gets out of line with the other servers, you simply terminate that server instance and launch another one. Since you have everything automated, it's one command line for terminating the instance, and another one for deploying a brand new replacement. It's really beautiful.
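Here's a hedged sketch of that terminate-and-replace step with boto; the AMI ID, instance ID and role names are placeholders, and the real tools obviously do more bookkeeping (updating the instances database, the load balancers, and so on):

```python
import boto

def replace_instance(instance_id, ami_id, roles, instance_type='c1.medium'):
    """Terminate a misbehaving instance and launch an identical replacement.
    The new instance configures itself from its slack roles on first boot."""
    conn = boto.connect_ec2()
    conn.terminate_instances(instance_ids=[instance_id])
    reservation = conn.run_instances(
        ami_id,
        instance_type=instance_type,
        user_data='roles=' + ','.join(roles),
    )
    return reservation.instances[0]

# e.g. replace_instance('i-0badbeef', 'ami-12345678', ['web-server'])
```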

Design your infrastructure so that it scales horizontally

There are generally two ways to scale an infrastructure: vertically, by deploying your application on more powerful servers, and horizontally, by increasing the number of servers that support your application. For 'infinite' scaling in a cloud computing environment, you need to design your system infrastructure so that it scales horizontally. Otherwise you're bound to hit limits of individual servers that you will find very hard to get past. Horizontal scaling also eliminates single points of failure.

Here are a few ideas for deploying a Web site with a database back-end so that it uses multiple tiers, with each tier being able to scale horizontally:

1) Deploy multiple Web servers behind one or more load balancers. This is pretty standard these days, and this tier is the easiest to scale. However, you also want to maximize the work done by each Web server, so you need to find the sweet spot of that particular type of server in terms of the number of httpd processes it can handle. Too few processes and you're wasting CPU/RAM on the server, too many and you're overloading the server. You also need to be cognizant of the fact that each EC2 instance costs you money. It can become so easy to launch a new instance that you don't necessarily think of getting the most out of the existing instances. Don't go wild launching new instances unless absolutely necessary, or you'll get sticker shock when the bill from Amazon arrives at the end of the month.

2) Deploy multiple load balancers. Amazon doesn't yet offer load balancers, so what we've been doing is using HAProxy-based load balancers. Let's say you have an HAProxy instance that handles traffic for www.yourdomain.com. If your Web site becomes wildly successful, it is conceivable that a single HAProxy instance will not be able to handle all the incoming network traffic. One easy solution, which also helps eliminate single points of failure, is to use round-robin DNS, pointing www.yourdomain.com to several IP addresses, with each IP address handled by a separate HAProxy instance. All HAProxy instances can be identical in terms of back-end configuration, so your Web server farm will get 1/N of the overall traffic from each of your N load balancers. This has worked really well for us, and the traffic was spread out very uniformly among the HAProxies. You do need to make sure the TTL on the DNS record for www.yourdomain.com is low.

3) Deploy several database servers. If you're using MySQL, you can set up a master DB server for writes, and multiple slave DB servers for reads. The slave DBs can sit behind an HAProxy load balancer. In this scenario, you're limited by the capacity of the single master DB server. One thing you can do is to use sharding techniques, meaning you can partition the database into multiple instances that each handle writes for a subset of your application domain. Another thing you can do is to write to local databases deployed on the Web servers, either in memory or on disk, and then periodically write to the master DB server (of course, this assumes that you don't need that data right away; this technique is useful when you have to generate statistics or reports periodically for example).

4) Another way of dealing with databases is to not use them, or at least to avoid the overhead of making a database call each time you need something from the database. A common technique for this is to use memcache. Your application needs to be aware of memcache, but this is easy to implement in all of the popular programming languages. Once implemented, you can have your Web servers first check a value in memcache, and only if it's not there have them hit the database. The more memory you give to the memcached process, the better off you are.
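The cache-aside pattern described in item 4 fits in a few lines. This sketch uses the python-memcached client; the server addresses are placeholders and load_profile_from_db() is a hypothetical database helper, not part of our actual code:

```python
import memcache

mc = memcache.Client(['10.0.0.10:11211', '10.0.0.11:11211'])

def get_user_profile(user_id):
    """Check memcache first; only hit the database on a cache miss."""
    key = 'user_profile:%s' % user_id
    profile = mc.get(key)
    if profile is None:
        profile = load_profile_from_db(user_id)   # hypothetical DB call
        mc.set(key, profile, time=300)            # keep it cached for 5 minutes
    return profile
```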

Establish clear measurable goals

The most common reason for scaling an Internet infrastructure is to handle increased Web traffic. However, you need to keep in mind the quality of the user experience, which means that you need to keep the response time of the pages you serve under a certain limit that will hopefully meet or surpass the user's expectations. I found it extremely useful to have a very simple script that measures the response time of certain pages and graphs it inside a dashboard-type page (thanks to Mike Todd for the idea and the implementation). As we deployed more and more servers in order to keep up with the demands of increased traffic, we always kept an eye on our goal: keep response time/latency under N milliseconds (N will vary depending on your application). When we saw spikes in the latency chart, we knew we needed to act at some level of our infrastructure. And this brings me to the next point...
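The dashboard script itself isn't shown in this post, but a bare-bones version of that kind of latency check might look something like this (the URL and output format are placeholders):

```python
import time
import urllib2

def measure_latency(url):
    """Return the time, in milliseconds, needed to fetch the full page."""
    start = time.time()
    urllib2.urlopen(url).read()
    return (time.time() - start) * 1000.0

if __name__ == '__main__':
    latency = measure_latency('http://www.yourdomain.com/some/page')
    # append a timestamped data point that the dashboard page can chart
    print '%s %.1f' % (time.strftime('%Y-%m-%d %H:%M:%S'), latency)
```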

Be prepared to quickly identify and eliminate bottlenecks

As I already mentioned in the design section above, any large-scale Internet infrastructure will have different types of servers: web servers, application servers, database servers, memcache servers, and the list goes on. As you scale the servers at each tier/level, you need to be prepared to quickly identify bottlenecks. Examples:

1) Keep track of how many httpd processes are running on your Web servers; this depends on the values you set for MaxClients and ServerLimit in your Apache configuration files. If you're using an HAProxy-based load balancer, this also depends on the connection throttling that you might be doing at the backend server level. In any case, the more httpd processes are running on a given server, the more CPU and RAM they will use up. At some point, the server will run out of resources. At that point, you either need to scale the server up (by deploying to a larger EC2 instance, for example an m1.large with more RAM, or a c1.medium with more CPU), or you need to scale your Web server farm horizontally by adding more Web servers, so the load on each server decreases.

2) Keep track of the load on your database servers, and also of slow queries. A great tool for MySQL database servers is innotop, which allows you to see the slowest queries at a glance. Sometimes all it takes is a slow query to throw a spike into your latency chart (can you tell I've been there, done that?). Also keep track of the number of connections into your database servers. If you use MySQL, you will probably need to bump up the max_connections variable in order to be able to handle an increased number of concurrent connections from the Web servers into the database.

Since we're discussing database issues, I'd be willing to bet that if you were to look for the single biggest bottleneck in your application, you would find it at the database layer. That's why it is especially important to design that layer with scalability in mind (think memcache, and load balanced read-only slaves), and also to monitor the database servers carefully, with an eye towards slow queries that need to be optimized (thanks to Chris Nutting for doing some amazing work in this area!).

3) Use your load balancer's statistics page to keep track of things such as concurrent connections, queued connections, HTTP request or response errors, etc. One of your goals should be never to see queued connections, since that means that some user requests couldn't be serviced in time.
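As an example for item 3, HAProxy can serve its statistics page in CSV form by appending ';csv' to the stats URI, which makes it easy to script checks such as 'are any requests currently queued?'. The URL below is a placeholder and depends on your 'stats uri' configuration:

```python
import urllib2

STATS_URL = 'http://lb1.internal.example:8080/haproxy?stats;csv'   # placeholder

def queued_connections():
    """Sum the 'qcur' (currently queued requests) column across all proxies."""
    total = 0
    for line in urllib2.urlopen(STATS_URL).read().splitlines():
        if not line or line.startswith('#'):
            continue
        fields = line.split(',')
        if len(fields) > 2 and fields[2].isdigit():
            total += int(fields[2])
    return total

if __name__ == '__main__':
    queued = queued_connections()
    if queued:
        print 'WARNING: %d queued connections -- time to add web servers?' % queued
```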

I should mention that a good monitoring system is essential here. We're using Hyperic, and while I'm not happy at all with its limits (in the free version) in defining alerts at a global level, I really like its capabilities in presenting various metrics in both list and chart form: things like Apache bytes and requests served/second, memcached hit ratios, mysql connections, and many other statistics obtained by means of plugins specific to these services.

As you carefully watch various indicators of your systems' health, be prepared to....

Play whack-a-mole for a while, until things get stable

There's nothing like real-world network traffic, and I mean massive traffic -- we're talking hundreds of millions of hits/day -- to exercise your carefully crafted system infrastructure. I can almost guarantee that with all your planning, you'll still feel that a tsunami just hit you, and you'll scramble to solve one issue after another. For example, let's say you notice that your load balancer starts queuing HTTP requests. This means you don't have enough Web servers in the pool. You scramble to add more Web servers. But wait, this increases the number of connections to your database pool! What if you don't have enough servers there? You scramble to add more database servers. You also scramble to increase the memcache settings by giving more memory to memcached, so more items can be stored in the cache. What if you still see requests taking a long time to be serviced? You scramble to optimize slow database queries....and the list goes on.

You'll say that you've done lots of load testing before. This is very good....but it still will not prepare you for the sheer amount of traffic that the internets will throw at your application. That's when all the things I mentioned before -- automated deployment of new instances, charting of the important variables that you want to keep track of, quick identification of bottlenecks -- become very useful.

That's it for this installment. Stay tuned for more lessons learned, as I slowly and sometimes painfully learn them :-) Overall, it's been a blast though. I'm really happy with the infrastructure we've built, and of course with the fact that most if not all of our deployment tools are written in Python.

48 comments:

Anonymous said...

Grig,
That was brilliant.
Steve Williams

Alex UK said...

That's a great article. Thank you very much for sharing your experience.

JoAnne said...

So, while you're embracing your failures, are you posting them somewhere? I'm new to Agile testing and need to come up with the defect tracking process for it. Does that exist for Agile? How are the developers notified that something failed? Are you using a tool to track defects? Is it then assumed the defect will be fixed immediately for the next sprint? Is there any defect remediation involved?
Thank you for the wealth of information you share.

Grig Gheorghiu said...

JoAnne,

We are using JIRA to keep track of bugs/issues/defects. The bugs do get scheduled to be fixed during the next iteration. Ideally, you would also write automated tests for the bugs you discover, so you can make sure they don't hit you again. I'm in the process of doing some of that for the deployment infrastructure, but I have a long way to go...

Hope this helps,

Grig

Anonymous said...

Great post!

I have a few questions regarding your DB and LB configurations.

As for your primary DB (the one conducting all of the writes), is this on dedicated infrastructure or in the cloud? I would be concerned with losing a read/write MySQL instance in mid-stream. The same goes for the LB pair. Are these dedicated boxes or in the cloud?

Frank Febbraro said...

To handle the automated deployment and configuration for various environments we used RightScale instead of writing our own app for generating config files and infrastructure for launching/running. They both do the same thing though.

The added benefit is that RightScale can scale automatically for you on alerts based on various resource usages. You can also easily clone entire deployments to setup identical dev/test/stag/prod environments. It is not free, but neither is your time.

Great article, the lessons learned resonated with me and my experiences.

Frank Ruscica said...

thanks much for this. what server types are you running in ec2? didn't seem clear from the write-up. thanks for any insight.

best,

Grig Gheorghiu said...

We're running pretty much all the server types out there, depending on the specific needs of a server class. Web servers seem to do fine on c1.medium instances (or m1.large if you need more RAM), while DB servers are OK on m1.large and of course m1.xlarge. We run the databases out of EBS volumes, and the performance is suitable for our needs.

Grig

Grig Gheorghiu said...

Frank Febbraro: all our servers/lbs are EC2 instances. As I said in the article, we have redundancy in place at every layer. The write DB is replicated, so if we lose it we can turn the slave into a master.

Grig

Sebastian Stadil said...

Hi Grig,

Interesting post. Would you care to present this to an audience? The Silicon Valley Cloud Computing Group would love to hear you present this story.

Grig Gheorghiu said...

Sebastian -- thanks for the interest! Yes, I'd love to present to your group. Let's schedule something via email. My email is grig at gheorghiu dot net.

Grig

Unknown said...

Way to go Grig. Saw the post on HighScalability! Hope all is well - we miss you over here! ;-)

Niklas Bivald said...

Great article!

Richard said...

Great post. I'd be interested to know whether most of your monitoring occurs on server instances that are actually providing your service as opposed to administrative instances (such as your slack servers). How much overhead does monitoring incur?

Grig Gheorghiu said...

Richard -- we run monitoring agents (Hyperic for the moment, but this may change) on all the instances that we want monitored. The monitoring management node itself (the Hyperic server) is on its own dedicated instance. Turns out alerts are almost impossible to manage with the free version of Hyperic, so for that we're going with good old Nagios, with checks done over ssh as opposed to NRPE.

Grig

Stefan said...

Great write up.

What is the slack product you refer to? Could you provide a link please?

Googling for "slack" is not so helpful. :) I'm finding slackware distro of linux, but what you are talking about sounds like a admin tool.

Thanks!

Grig Gheorghiu said...

Stefan -- I thought I linked to the slack google code page in my post, but here it is:

http://code.google.com/p/slack/

Grig

Dave said...

Umm, you're putting your zone info into MySQL and then printing it out as text files?

http://mysql-bind.sourceforge.net/

-Dave

Grig Gheorghiu said...

Dave -- we're not using a MySQL backend for bind. We're generating the bind zone files from information about instances that we keep in a database.

Grig

Dave said...

Exactly my point. Thanks.

-Dave

kentlangley said...

This is an excellent article. Thank you for sharing.

Some thoughts on the design section...

Regarding Item #2, I'd add that the minor investment in using a 3rd party dynamic DNS service can be pretty helpful. Something like DNSMadeEasy for example. Services like that even have an API so you can force your load balancer role slack scripts to register them as they come on/offline via the API.

Regarding #3, you might like to take a look at OurDelta or Percona MySQL instances since they incorporate useful features like slave auto-promotion, replica setup, various tuning tweaks for performance, etc.

RE: #4, I've been pretty impressed by what I've seen coming out of the memcacheDB work as well. Worth a look. Also, when you run an instance large enough to keep memcached on the application/web server you can avoid a trip over the network to the DB for many requests as well.

RE: Monitoring, I use Munin quite a bit these days. For most Linux distro's it is exceptionally easy to install and configure now. It's also pretty easy to add plugins since you can write them in your scripting language of choice.

Kent Langley
www.productionscale.com
www.nscaled.com

Umut Muhaddisoglu said...

Amazed by the post. About to launch a web app on EC2 for the first time, and this inspired me a lot.

Anonymous said...

Hi Grig
This is an excellent summary.

We also came across many of the challenges that you mentioned in this article:


1) Deploy multiple Web servers
2) Deploy multiple load balancers
3) Deploy several database servers.
4) Another way of dealing with databases is to not use them


Since these seem to be fairly generic problems that anyone looking to build a scalable application on the cloud would deal with, we thought that we could offer a solution on top of Amazon that will take care of those challenges in a generic way. At the same time we came to the same realization that you did, that we don't want to lock the user into some sort of a "blackbox" solution, and tried to keep most of the solution open at the OS, scripting and configuration level.

You can read more about the architecture we used to address those issues here.

I was wondering if you came across our service at gigaspaces.com/mycloud

Once again great and very useful article.

Nati S.

Anonymous said...

Take a look at this article as you might want to rethink round robin with your HAProxy. link

Grig Gheorghiu said...

Kent -- thanks for the pointers to resources, I'll definitely check out the MySQL-related ones.

Grig

Grig Gheorghiu said...

Nati -- I did look at the Gigaspaces products a while ago. We preferred to roll our own, because we have certain needs that might not be fulfilled by a generic management platform as yours. But I'll keep an eye on your offerings.

Grig

Rick Lebherz said...

You should check out OpSource.
We handle all of this for you and behind a firewall.

Load balancing, firewalls, multiple servers, millions of users, 24x7 end user support, app management...the list goes on. And we handle some of the largest and most robust applications on the web today. We deal with this everyday.

Why do it yourself?
Leverage expertise and get rid of the down time.

Anonymous said...

Python rocks!
Thanks for such helpful information on better using Amazon EC2!! :)

Jhon Amstrong said...

Good article,.... i love this one, stay update yeah !!!

Unknown said...

Why don't you use Amazon's Elastic Block Store to guarantee your DB exists if all your DB instances die for some reason? Is it a cost issue? Or do you trust that you will always have a DB instance running?

Grig Gheorghiu said...

Dan -- we do run all our DB servers with the DB files on an EBS. I thought I made that point in my post...

Grig

fredericsidler said...

Grig, I have followed the OpenX project for a long time (since it was called phpadsnew ;-) and I'm very happy you are using Amazon too. A European manager at Amazon sent me the link to your blog post, and after reading it I was wondering why you didn't use scalr.net to manage what you achieved with OpenX and Amazon. We are also heavy users of Amazon WS and scalr.net, and this tool saved us a lot of time ;-)

I have a second question regarding the Amazon experience in regards to your old way of hosting openX. Can you share any info regarding the economic facts. We are a startup and we never used anything else than Amazon and cloud computing, but I'm looking for info about big player like you who were hosting openX traditionnally and are moving to the cloud for flexibility, scalability and maybe to save some money. I really would like to hear about this last thing.

Grig Gheorghiu said...

Frederic -- to answer your first question, we didn't want to use a 3rd party product such as RightScale, scalr.net or GigaSpaces because we have certain requirements that would be hard to achieve using a more general 3rd party tool. We preferred to roll our own infrastructure deployment tools, which give us the very fine grained level of control we need in terms of automated 'elastic' deployments.

As for the financials, I can't comment too much except to say that we are saving money by going to EC2. The fact that we're paying by the hour allows us to save money during off-peak hours by terminating instances that are not needed. Of course, to achieve this you need to have a completely automated way of terminating and deploying instances of different types (web servers, db server, haproxy load balancers etc). This works very well for us. So in this case, extreme flexibility and scalability actually contribute towards financial savings.

Another thing I'd like to say is that after you enjoyed the benefits of deploying EC2 instances automatically, whenever you need them, you find it very restrictive to work in a traditional hosting environment. It's like floating in the clouds versus carrying a heavy terrestrial weight ;-)

Grig

fredericsidler said...

Grig, I think you have just begun to work with the cloud, and as for how much money you are going to save, that will become clear after a few months.

Can you share how you manage to start/stop instances based on the peak. Do you use LA like scalr.net does or do you use some other algorithm?

I like the cloud more and more every day, but some economics journalists are asking us how much a big player could save, and as I have never used anything other than the cloud to host web services, I cannot really compare. This is why Amazon pointed me to your blog post. You will probably be able to talk about the calculable and incalculable effects of migrating to the cloud in the near future.

Grig Gheorghiu said...

Frederic -- we don't currently scale up and down continuously, based on load average. We are working on coming up with an algorithm that makes sense for us, which will include load average, but also other important metrics such as overall latency etc.

For now, we are scripting the bulk termination and deployment of instances, but that operation is triggered manually based on peak vs. non-peak traffic (which usually corresponds to day of week). This is currently sufficient for our needs, because it still automatically modifies the various pieces of infrastructure that are involved, such as load balancers.

fredericsidler said...

Is this system built to host the hosted part of OpenX?

What is the lowest number of instances you are running, and how many instances at most do you start at peak load?

I'm looking for big players in Switzerland that moved to the cloud for an economics article in a Swiss newspaper. Maybe I will find some, but in case I don't find any, are you interested?

Grig Gheorghiu said...

Frederic -- while I'm happy to blog about technologies that we use at OpenX, I don't want to go into details about specific business decisions we've made (such as which exact portions of our infrastructure we're hosting in EC2, or what our peak usage is, etc.).

I prefer to blog here rather than write an article in a newspaper. Thanks for inquiring though.

Grig

A said...

Good stuff - especially because it's based on hands-on experience. Thanks for sharing.

Anonymous said...

This is interesting.

But keep in mind that a hosting service like Amazon can have some pretty large consequences, especially as you reach larger scales. Just because it's cheap when you're small doesn't mean there isn't a more cost-effective solution as you grow. Keep in mind economies of scale, not to mention many other hidden costs.

Take a look at this article today from SaaSblogs.com on what it would look like if Salesforce.com were just single-tenant running in a hosted EC2 environment.

http://www.saasblogs.com/2009/05/05/what-if-salesforcecom-werent-multi-tenant/

Mikayel said...

For EC2 monitoring I will suggest http://www.monitis.com; you can get monitoring up and running in a couple of minutes.

martin van nijnatten said...

Grig -

great story.

"I will blog separately about how exactly slack works for us"

really keen on that story too ;)

-- Martin

PY said...

Awesome article. Thank You.

Shlomo said...

I like the idea of never needing to SSH into an instance. But if the instance needs to be able to attach its own EBS drives and access S3 then you need to put your EC2 credentials on the instance somehow.

Do you burn them into the AMI?
Do you pass them in via the User-Data?
Do you keep them encrypted in the AMI and pass in a decryption key in the User-Data?

Grig Gheorghiu said...

Shlomo -- we are attaching EBS volumes to instances at instance creation time, from a management node. When the instance starts up, it knows via user data what volume to mount (device name) and where to mount it (local directory).

Grig

Theodis Butler said...

Super Great article. Well written. You inspired me to get into blogging :)

Whatever you do, make more time to blog!!!

Ubay Oramas Díaz said...

Great Article! Thank you!

