At OpenX we recently completed a large-scale deployment of one of our server farms to Amazon EC2. Here are some lessons learned from that experience.
Expect failures; what's more, embrace them
Things are bound to fail when you're dealing with large-scale deployments in any infrastructure setup, but especially when you're deploying virtual servers 'in the cloud', outside of your sphere of influence. You must then be prepared for things to fail. This is a Good Thing, because it forces you to think about failure scenarios upfront, and to design your system infrastructure in a way that minimizes single points of failure.
As an aside, I've been very impressed with the reliability of EC2. Like many other people, I didn't know what to expect, but I've been pleasantly surprised. Very rarely does an EC2 instance fail. In fact I haven't yet seen a total failure, only some instances that were marked as 'deteriorated'. When this happens, you usually get a heads-up via email, and you have a few days to migrate your instance, or launch a similar one and terminate the defective one.
Expecting things to fail at any time leads to and relies heavily on the next lesson learned, which is...
Fully automate your infrastructure deployments
There's simply no way around this. When you need to deal with tens and even hundreds of virtual instances, when you need to scale up and down on demand (after all, this is THE main promise of cloud computing!), then you need to fully automate your infrastructure deployment (servers, load balancers, storage, etc.)
The way we achieved this at OpenX was to write our own custom code on top of the EC2 API in order to launch and destroy AMIs and EBS volumes. We rolled our own AMI, which contains enough bootstrap code to make it 'call home' to a set of servers running slack. When we deploy a machine, we specify a list of slack 'roles' that the machine belongs to (for example 'web-server' or 'master-db-server' or 'slave-db-server'). When the machine boots up, it will run a script that belongs to that specific slack role. In this script we install everything the machine needs to do its job -- pre-requisite packages and the actual application with all its necessary configuration files.
I will blog separately about how exactly slack works for us, but let me just say that it is an extremely simple tool. It may seem overly simple, but that's exactly its strength, since it forces you to be creative with your postinstall scripts. I know that other people use puppet, or fabric, or cfengine. Whatever works for you, go ahead and use, just use SOME tool that helps with automated deployments.
The beauty of fully automating your deployments is that it truly allows you to scale infinitely (for some value of 'infinity' of course ;-). It almost goes without saying that your application infrastructure needs to be designed in such a way that allows this type of scaling. But having the building blocks necessary for automatically deploying any type of server that you need is invaluable.
Another thing we do which helps with automating various pieces of our infrastructure is that we keep information about our deployed instances in a database. This allows us to write tools that inspect the database and generate various configuration files (such as the all-important role configuration file used by slack), and other text files such as DNS zone files. This database becomes the one true source of information about our infrastructure. The DRY principle applies to system infrastructure, not only to software development.
Speaking of DNS, specifically in the context of Amazon EC2, it's worth rolling out your own internal DNS servers, with zones that aren't even registered publicly, but for which your internal DNS servers are authoritative. Then all communication within the EC2 cloud can happen via internal DNS names, as opposed to IP addresses. Trust me, your tired brain will thank you. This would be very hard to achieve though if you were to manually edit BIND zone files. Our approach is to automatically generate those files from the master database I mentioned. Works like a charm. Thanks to Jeff Roberts for coming up with this idea and implementing it.
While we're on the subject of fully automated deployments, I'd like to throw an idea out there that I first heard from Mike Todd, my boss at OpenX, who is an ex-Googler. One of his goals is for us never to have to ssh into any production server. We deploy the server using slack, the application gets installed automatically, monitoring agents get set up automatically, so there should really be no need to manually do stuff on the server itself. If you want to make a change, you make it in a slack role on the master slack server, and it gets pushed to production. If the server misbehaves or gets out of line with the other servers, you simply terminate that server instance and launch another one. Since you have everything automated, it's one command line for terminating the instance, and another one for deploying a brand new replacement. It's really beautiful.
Design your infrastructure so that it scales horizontally
There are generally two ways to scale an infrastructure: vertically, by deploying your application on more powerful servers, and horizontally, by increasing the number of servers that support your application. For 'infinite' scaling in a cloud computing environment, you need to design your system infrastructure so that it scales horizontally. Otherwise you're bound to hit limits of individual servers that you will find very hard to get past. Horizontal scaling also eliminates single points of failure.
Here are a few ideas for deploying a Web site with a database back-end so that it uses multiple tiers, with each tier being able to scale horizontally:
1) Deploy multiple Web servers behind one or more load balancers. This is pretty standard these days, and this tier is the easiest to scale. However, you also want to maximize the work done by each Web server, so you need to find the sweet spot of that particular type of server in terms of httpd processes it can handle. Too few processes and you're wasting CPU/RAM on the server, too many and you're overloading the server. You also need to be cognizant of the fact that each EC2 instance costs you money. It can become so easy to launch a new instance that you don't necessarily think of getting the most out of the existing instances. Don't go wild unless absolutely necessary if you don't want to have a sticker shock when you get the bill from Amazon at the end of the month.
2) Deploy multiple load balancers. Amazon doesn't yet offer load balancers, so what we've been doing is using HAProxy-based load balancers. Let's say you have an HAProxy instance that handles traffic for www.yourdomain.com. If your Web site becomes wildly successful, it is imaginable that a single HAProxy instance will not be able to handle all the incoming network traffic. One easy solution for this, which is also useful for eliminating single points of failure, is to use round-robin DNS, pointing www.yourdomain.com to several IP addresses, with each IP address handled by a separate HAProxy instance. All HAProxy instances can be identical in terms of back-end configuration, so your Web server farm will get 1/N of the overall traffic from each of your N load balancers. It worked really well for us, and the traffic was spread out very uniformly among the HAProxies. You do need to make sure the TTL on the DNS record for www.yourdomain.com is low.
3) Deploy several database servers. If you're using MySQL, you can set up a master DB server for writes, and multiple slave DB servers for reads. The slave DBs can sit behind an HAProxy load balancer. In this scenario, you're limited by the capacity of the single master DB server. One thing you can do is to use sharding techniques, meaning you can partition the database into multiple instances that each handle writes for a subset of your application domain. Another thing you can do is to write to local databases deployed on the Web servers, either in memory or on disk, and then periodically write to the master DB server (of course, this assumes that you don't need that data right away; this technique is useful when you have to generate statistics or reports periodically for example).
4) Another way of dealing with databases is to not use them, or at least to avoid the overhead of making a database call each time you need something from the database. A common technique for this is to use memcache. Your application needs to be aware of memcache, but this is easy to implement in all of the popular programming languages. Once implemented, you can have your Web servers first check a value in memcache, and only if it's not there have them hit the database. The more memory you give to the memcached process, the better off you are.
Establish clear measurable goals
The most common reason for scaling an Internet infrastructure is to handle increased Web traffic. However, you need to keep in mind the quality of the user experience, which means that you need to keep the response time of the pages your serve under a certain limit which will hopefully meet and surpass the user's expectations. I found it extremely useful to have a very simple script that measures the response time of certain pages and that graphs it inside a dashboard-type page (thanks to Mike Todd for the idea and the implementation). As we deployed more and more servers in order to keep up with the demands of increased traffic, we always kept an eye on our goal: keep reponse time/latency under N milliseconds (N will vary depending on your application). When we would see spikes in the latency chart, we knew we need to act at some level of our infrastructure. And this brings me to the next point...
Be prepared to quickly identify and eliminate bottlenecks
As I already mentioned in the design section above, any large-scale Internet infrastructure will have different types of servers: web servers, application servers, database servers, memcache servers, and the list goes on. As you scale the servers at each tier/level, you need to be prepared to quickly identify bottlenecks. Examples:
1) Keep track of how many httpd processes are running on your Web servers; this depends on the values you set for MaxClients and ServerLimit in your Apache configuration files. If you're using an HAProxy-based load balancer, this also depends on the connection throttling that you might be doing at the backend server level. In any case, the more httpd processes are running on a given server, the more CPU and RAM they will use up. At some point, the server will run out of resources. At that point, you either need to scale the server up (by deploying to a larger EC2 instance, for example an m1.large with more RAM, or a c1.medium with more CPU), or you need to scale your Web server farm horizontally by adding more Web servers, so the load on each server decreases.
2) Keep track of the load on your database servers, and also of slow queries. A great tool for MySQL database servers is innotop, which allows you to see the slowest queries at a glance. Sometimes all it takes is a slow query to throw a spike into your latency chart (can you tell I've been there, done that?). Also keep track of the number of connections into your database servers. If you use MySQL, you will probably need to bump up the max_connections variable in order to be able to handle an increased number of concurrent connections from the Web servers into the database.
Since we're discussing database issues here, I'd be willing to bet that if you were to discover your single biggest bottleneck in your application, it would be at the database layer. That's why it is especially important to design that layer with scalability in mind (think memcache, and load balanced read-only slaves), and also to monitor the database servers carefully, with an eye towards slow queries that need to be optimized (thanks to Chris Nutting for doing some amazing work in this area!)
3) Use your load balancer's statistics page to keep track of things such as concurrent connections, queued connections, HTTP request or response errors, etc. One of your goals should be never to see queued connections, since that means that some user requests couldn't be serviced in time.
I should mention that a good monitoring system is essential here. We're using Hyperic, and while I'm not happy at all with its limits (in the free version) in defining alerts at a global level, I really like its capabilities in presenting various metrics in both list and chart form: things like Apache bytes and requests served/second, memcached hit ratios, mysql connections, and many other statistics obtained by means of plugins specific to these services.
As you carefully watch various indicators of your systems' health, be prepared to....
Play wack-a-mole for a while, until things get stable
There's nothing like real-world network traffic, and I mean massive traffic -- we're talking hundreds of millions of hits/day -- to exercise your carefully crafted system infrastructure. I can almost guarantee that with all your planning, you'll still feel that a tsunami just hit you, and you'll scramble to solve one issue after another. For example, let's say you notice that your load balancer starts queuing HTTP requests. This means you don't have enough Web server in the pool. You scramble to add more Web servers. But wait, this increases the number of connections to your database pool! What if you don't have enough servers there? You scramble to add more database servers. You also scramble to increase the memcache settings by giving more memory to memcached, so more items can be stored in the cache. What if you still see requests taking a long time to be serviced? You scramble to optimize slow database queries....and the list goes on.
You'll say that you've done lots of load testing before. This is very good....but it still will not prepare you for the sheer amount of traffic that the internets will throw at your application. That's when all the things I mentioned before -- automated deployment of new instances, charting of the important variables that you want to keep track of, quick identification of bottlenecks -- become very useful.
That's it for this installment. Stay tuned for more lessons learned, as I slowly and sometimes painfully learn them :-) Overall, it's been a blast though. I'm really happy with the infrastructure we've built, and of course with the fact that most if not all of our deployment tools are written in Python.