Step 0. If you're fortunate enough to participate in the design of your infrastructure (as opposed to being thrown at the deep end and having to maintain some 'legacy' one), then try to aim for horizontal scalability. It's easier to scale out than to scale up, and failures in this mode will hopefully impact a smaller percentage of your users.
Step 1. Configure a good monitoring and alerting system
This is the single most important thing you need to do for your infrastructure. It's also a great way to learn a new infrastructure that you need to maintain.
I talked about different types of monitoring in another blog post. My preferred approach is to have 2 monitoring systems in place:
- an internal monitoring system which I use to check the health of individual servers/devices
- an external monitoring system used to check the behavior of the application/web site as a regular user would.
My preferred internal monitoring/alerting system is Nagios, but tools like Zabbix, OpenNMS, Zenoss, Monit etc. would definitely also do the job. I like Nagios because there is a wide variety of plugins already available (such as the extremely useful check_mysql_health plugin) and also because it's very easy to write custom plugin for your specific application needs. It's also relatively easy to generate Nagios configuration files automatically.
For external monitoring I use a combination of Pingdom and Akamai alerts. Pingdom runs checks against certain URLs within our application, whereas Akamai alerts us whenever the percentage of HTTP error codes returned by our application is greater than a certain threshold.
I'll talk more about correlating internal and external alerts below.
Step 2. Configure a good resource graphing system
This is the second most important thing you need to do for your infrastructure. It gives you visibility into how your system resources are used. If you don't have this visibility, it's very hard to do proper capacity planning. It's also hard to correlate monitoring alerts with resource limits you might have reached.
I use both Munin and Ganglia for resource graphing. I like the graphs that Munin produces and also some of the plugins that are available (such as the munin-mysql plugin), and I also like Ganglia's nice graph aggregation feature, which allows me to watch the same system resource across a cluster of nodes. Munin has this feature too, but Ganglia was designed from the get-go to work on clusters of machines.
Step 3. Dashboards, dashboards, dashboards
Everybody knows that whoever has the most dashboards wins. I am talking here about application-specific metrics that you want to track over time. I gave an example before of a dashboard I built for visualizing the outgoing email count through our system.
It's very easy to build such a dashboard with the Google Visualization API, so there's really no excuse for not having charts for critical metrics of your infrastructure. We use queuing a lot internally at Evite, so we have dashboards for tracking various queue sizes. We also track application errors from nginx logs and chart them in various ways: by server, by error code, by URL, aggregated, etc.
Dashboards offered by external monitoring tools such as Pingdom/Keynote/Gomez/Akamai are also very useful. They typically chart uptime and response time for various pages, and edge/origin HTTP traffic in the case of Akamai.
Step 4. Correlate errors with resource state and capacity
The combination of internal and external monitoring, resource charting and application dashboards is very powerful. As a rule, whenever you have an external alert firing off, you should have one or more internal ones firing off too. If you don't, then you don't have sufficient internal alerts, so you need to work on that aspect of your monitoring.
Once you do have external and internal alerts firing off in unison, you will be able to correlate external issues (such as increased percentages of HTTP error codes, or timeouts in certain application URLs) with server capacity issues/bottlenecks within your infrastructure. Of course, the fact that you are charting resources over time, and that you have a baseline to go from, will help you quickly identify outliers such as spikes in CPU usage or drops in Akamai traffic.
A typical work day for me starts with me opening a few tabs in my browser: the Nagios overview page, the Munin overview, the Ganglia overview, the Akamai HTTP content delivery dashboard, and various application-specific dashboards.
Let's say I get an alert from Akamai that the percentage of HTTP 500 error codes is over 1%. I start by checking the resource graphs for our database servers. I look at in-depth MySQL metrics in Munin, and at CPU metrics (especially CPU I/O wait time) in Ganglia. If nothing is out of the ordinary, I look at our various application services (our application consists of a multitude of RESTful Web services). The nginx log dashboard may show increased HTTP 500 errors from a particular server, or it may show an increase in such errors across the board. This may point to insufficient capacity at our services layer. Time to deploy more services on servers with enough CPU/RAM capacity. I know which servers those are, because I keep tabs on them with Munin and Ganglia.
As another example, I know that if the CPU I/O wait on my database servers approaches 30%, the servers will start huffing and puffing, and I'll see an increased number of slow queries. In this case, it's time to either identify queries to be optimized, or reduce the number of queries to the database -- or if everything else fails, time to add more database servers. (BTW, if you haven't yet read @allspaw's book "The Art of Capacity Planning", add reading it as a task for Step 0)
My point is that all these alerts and metric graphs are interconnected, and without looking at all of them, you're flying blind.
Step 5. Expect failures and recover quickly and gracefully
It's not a question whether failures will happen, it's WHEN they will happen. When they do happen, you need to be prepared. Hopefully you designed your infrastructure in a way that allows you to bounce back quickly from failures. If a database server goes down, hopefully you have a slave that you can quickly promote to a master, or even better you have another passive master ready to become the active server. Even better, maybe you have a fancy self-healing distributed database -- kudos to you then ;-)
One thing that you can do here is to have various knobs that turn on and off certain features or pieces of functionality within your application (again, John Allspaw has some blog posts and presentations on that from his days at Flickr and his current role at Etsy). These knobs allow you to survive an application server outage, or even (God forbid) a database outage, while still being able to present *something* to your end-users.
To quickly bounce back from a server failure, I recommend you use automated deployment and configuration management tools such as Chef, Puppet, Fabric, etc. (see some posts of mine on this topic). I personally use a combination of Chef (to bootstrap a new machine and do things as file system layout, pre-requisite installation etc) and Fabric (to actually deploy the application code).
Comment on Twitter:
"@griggheo Good stuff. Any thoughts on figuring out symptom's from causes? eg. load balancer is having issues which causes db load to drop?"
Good question, and something similar actually has happened to us. To me, it's a matter of knowing your baseline graphs. In our case, whenever I see an Akamai traffic drop, it's usually correlated to an increase in the percentage of HTTP 500 errors returned by our Web services. If I also see DB traffic dropping, then I know the bottleneck is at the application services layer. If the DB traffic is increasing, then the bottleneck is most likely the DB. Depending on the bottleneck, we need to add capacity at that layer, or to optimize code or DB queries at the respective layer.
Main accomplishment of this post
I'm most proud of the fact that I haven't used the following words in the post above: 'devops', 'cloud', 'noSQL' and 'agile'.