When monitoring a Web site, you need to look at it both from a 'micro' perspective (are the individual devices and servers in your infrastructure running smoothly?) and from a 'macro' perspective (do your customers have a pleasant experience when accessing and using your site? Can they use its functionality to conduct their business online?). You can also think of these two types as 'engineering' monitoring vs. 'business' monitoring. Both are important, but most system administrators understandably focus on the engineering/micro type of monitoring, since it's closer to their skills and interests, and tend not to put much emphasis on the business/macro type.
In this post I'll talk about various types of monitoring we're doing for the Evite Web site. I'll focus on what I call 'deep application monitoring', which to me means gathering and displaying metrics about your Web application's behavior as seen by the users.
System infrastructure monitoring
This is the 'micro' or 'engineering' type of monitoring that is the bread and butter of a sysops team. I need to eat crow here and say that there is a great open source monitoring tool out there, and its name is Nagios. That's what we use to monitor the health of all our internal servers and network devices, and to alert us via email and SMS if anything goes wrong. It works great, it's easy to set up and deploy automatically, it has tons of plugins to monitor pretty much everything under the sun. My mistake in the past was expecting it to also be a resource graphing tool, but that was simply asking too much from one single tool.
For resource graphing and visualization we use a combination of Cacti, Ganglia and Munin. They're all good, but going forward I'll use Munin because it's easy to set up and automate, and also allows you to group devices together and see the same metric (for example system load) across the devices in the group. This makes it easy to spot trends and outliers.
One non-typical aspect of our use of Nagios is that we run passive checks from the servers being monitored to the main Nagios server using NSCA. This avoids the situation where the central Nagios server becomes a bottleneck because it needs to poll all the devices it monitors either via ssh or via NRPE. In our setup, each plugin deployed on a monitored server pushes its own notifications to the central server using send_nsca. It's also fairly easy to write your own Nagios plugins.
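As a sketch of what such a passive check looks like on the wire: send_nsca reads tab-separated host/service/return-code/output lines on stdin and forwards them to the central server. The server name and config path below are placeholders for your own setup.

```python
import subprocess

# standard Nagios plugin return codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def format_nsca_line(host, service, code, message):
    """Build the tab-separated line that send_nsca expects on stdin."""
    return "%s\t%s\t%d\t%s\n" % (host, service, code, message)

def push_result(host, service, code, message,
                nsca_server="nagios.example.com",       # placeholder
                config="/etc/nsca/send_nsca.cfg"):      # placeholder
    """Pipe one passive check result to the central Nagios server."""
    line = format_nsca_line(host, service, code, message)
    proc = subprocess.Popen(
        ["send_nsca", "-H", nsca_server, "-c", config],
        stdin=subprocess.PIPE)
    proc.communicate(line.encode())
    return proc.returncode
```

A plugin deployed on a monitored server would run its local check, decide on OK/WARNING/CRITICAL, and call push_result with the outcome.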
What kinds of things do we monitor and graph? The usual suspects -- system load, CPU, memory, disk, processes -- but also things like number of database connections on each Web/app server, number of Apache and Tomcat/Java processes AND threads, size of the mail queue on mail servers, MySQL queries/second, threads, throughput...and many others.
In general, the more metrics you capture and graph, the easier it is to spot trends and correlations. Because that's what you need to do during normal day-to-day traffic on your Web site -- you need to establish baseline numbers which will tell you right away if something goes wrong, for example when you're being hit with more traffic than usual. Having a baseline makes it easy to spot spikes and flare-ups in the metrics you capture. Noticing correlations between these metrics makes it easier for you to troubleshoot things like site slowness or unresponsiveness.
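The baseline idea above can be sketched very simply: keep a window of recent samples for a metric, and flag the current value if it deviates from the window's mean by more than a few standard deviations. The threshold and sample values here are illustrative, not tuned recommendations.

```python
import statistics

def is_spike(history, current, threshold=3.0):
    """Flag `current` as a spike if it deviates from the baseline
    (mean of recent samples) by more than `threshold` standard deviations."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean
    return abs(current - mean) / stdev > threshold

# e.g. database connections per web server, sampled every minute
baseline = [40, 42, 38, 41, 39, 40, 43, 41]
print(is_spike(baseline, 42))   # normal fluctuation
print(is_spike(baseline, 120))  # flare-up worth alerting on
```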
For example, if you see that the Web servers aren't being pounded with more traffic than usual, yet the number of database connections on each Web server keeps going up, you can quickly look at the database metrics and see that indeed the database is slower than usual. It could be that there's a disk I/O problem, or maybe the database is working hard to reindex a large table...but in any case your monitoring and graphing systems should be your friends in diagnosing the problem.
I compare system infrastructure monitoring with unit testing in a development project. Just as you write a unit test every time you find a bug during development, with monitoring you add another metric or resource to be monitored and graphed every time you discover a problem with your Web site. Both system monitoring and unit testing work at the micro/engineering level.
By the way, one of the best ways to learn a new system architecture is to delve into the monitoring system, or to roll one out if there's none. This will make you deeply familiar with every server and device in your infrastructure.
External HTTP monitoring
This type of monitoring typically involves setting up your own monitoring server outside of your site's infrastructure, or using a third-party service such as Pingdom, Keynote or Gomez. We've been using Pingdom and have been very happy with it. While it doesn't provide the in-depth analysis that Keynote or Gomez do, it costs much less and still gives you a great window into the performance and uptime of your site, as seen by an external user.
In any such system, you choose certain pages within your Web application and you have the external monitor check for a specific character string within the page, to make sure the expected content is there, as opposed to a cryptic error message.
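The check itself is simple enough to sketch: fetch the page and look for the marker string. The URL and marker below are hypothetical; the substring test is split out so it can be exercised without hitting the network.

```python
import urllib.request

def page_contains(html, expected):
    """True if the expected marker string is present in the page body."""
    return expected in html

def check_url(url, expected, timeout=10):
    """Fetch a page and verify the expected content is there,
    rather than a cryptic error message. Returns (ok, detail)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except Exception as exc:
        return False, "fetch failed: %s" % exc
    if page_contains(body, expected):
        return True, "content OK"
    return False, "marker %r not found" % expected

# hypothetical usage:
# ok, detail = check_url("http://www.example.com/", "Welcome")
```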
However, with Pingdom you can go deeper than that. They provide 'custom HTTP checks' which allow you to deploy an internal monitor reachable from the outside via HTTP, which checks whatever resource you need within your internal infrastructure, and sends some well-formatted XML back to Pingdom. This is useful when you have a farm of web/app servers behind a load balancer, and you want to check each one but without exposing an external IP address for each server. Or you can have processes running on various port numbers on your web/app servers, and you don't want to expose those directly to Pingdom checks. Instead, you need to open port 80 to just one server, the one running the custom checks.
We've built a RESTful Web service based on restish for this purpose. We call it from Pingdom with certain parameters in the query string, and based on those we run various checks such as 'is our Web application app1 running on web servers A, B, C and D functioning properly?'. If the application doesn't respond as expected on server C, we send details about the error in the XML payload back to Pingdom and we get alerted on it.
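A minimal sketch of such a check, assuming Pingdom's documented response format for custom HTTP checks (a small XML document with a status and a response time in milliseconds). The check_farm helper and server names are illustrative, not our actual service.

```python
def pingdom_xml(ok, response_time_ms, detail=""):
    """Build the XML payload a Pingdom custom HTTP check expects."""
    status = "OK" if ok else ("DOWN - %s" % detail if detail else "DOWN")
    return (
        "<pingdom_http_custom_check>"
        "<status>%s</status>"
        "<response_time>%.2f</response_time>"
        "</pingdom_http_custom_check>" % (status, response_time_ms)
    )

def check_farm(check, servers):
    """Run `check(server) -> (ok, elapsed_ms)` against each web/app server
    behind the load balancer; fail the whole check if any server fails."""
    worst = 0.0
    for server in servers:
        ok, elapsed = check(server)
        worst = max(worst, elapsed)
        if not ok:
            return pingdom_xml(False, worst,
                               "app not responding on %s" % server)
    return pingdom_xml(True, worst)
```

The internal Web service dispatches to something like check_farm based on the query-string parameters Pingdom sends, and returns the resulting XML.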
We also use Pingdom for monitoring cloud-based resources that we interact with, for example Amazon S3 or SQS. For SQS, we have another custom HTTP check which goes against our restish-based Web service, which then connects to SQS, does some operations and reports back the elapsed time. We want to know if that time/latency exceeds certain thresholds. While we could have done that from our internal Nagios monitoring system, it would have required writing a Nagios plugin, while our restish application already knew how to send the properly formatted XML payload back to Pingdom.
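The latency-measuring part of such a check is generic and worth sketching on its own: time an operation against the external service and compare against a threshold. The callable stands in for whatever SQS round-trip (send a message, receive it back) the real check performs.

```python
import time

def timed_check(operation, threshold_ms):
    """Run an operation against an external service, measure the elapsed
    time, and report whether it stayed under the threshold.
    `operation` is any callable, e.g. an SQS send/receive round-trip."""
    start = time.monotonic()
    operation()
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return elapsed_ms <= threshold_ms, elapsed_ms
```

The (ok, elapsed_ms) result maps directly onto the status and response-time fields of the XML payload sent back to Pingdom.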
This type of monitoring doesn't go very deep into checking the business logic rules of your application. It is, however, a 'macro'-type monitoring technique that should be part of your monitoring strategy, because it mimics the experience of your site's customers. I compare it to doing functional testing for your application. You don't test things end-to-end; instead you choose some specific portion of your functionality (a specific Web page that is hit by many users, for example) and you make sure that your application indeed provides that functionality to its users in a timely manner.
Also, if you use a CDN provider, then they probably offer you a way to see statistics about your HTTP traffic using some sort of Web dashboard. Most providers also give you a breakdown of HTTP codes returned over some period of time. Again trending is your friend here. If you notice that the percentage of non-200 HTTP codes suddenly increases, you know something is not going well somewhere in your infrastructure. Many CDN providers offer a Web service API for you to query and get these stats, so you can integrate that functionality into your monitoring and alerting system.
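Once you pull the status-code breakdown from the CDN's API, the trending check is a one-liner over the counts. The numbers below are made up for illustration.

```python
def error_rate(code_counts):
    """Given a breakdown of HTTP status codes over some period
    (e.g. pulled from a CDN provider's stats API), return the
    percentage of non-200 responses."""
    total = sum(code_counts.values())
    if total == 0:
        return 0.0
    non_200 = sum(n for code, n in code_counts.items() if code != 200)
    return 100.0 * non_200 / total

# hypothetical hourly breakdown from a CDN stats API
counts = {200: 97000, 301: 1000, 404: 1500, 503: 500}
print("%.1f%% non-200" % error_rate(counts))
```

Track this percentage over time and alert when it jumps above its usual baseline.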
Business intelligence monitoring
We've come to the 'deep application monitoring' section of this post. Business intelligence (BI) deals with metrics pertaining to the business of your Web site. Think of all the metrics you can get with Google Analytics -- and yes, Google Analytics in itself is a powerful monitoring tool.
BI is very much concerned with trends. Everybody wants to see traffic and the number of users on the site going up and up. Your site may have very specific metrics that you want to track. Let's say you want users to register and fill in details about themselves, upload a photo, etc. You want to track how many user profiles are created per day, as well as the rate of change from day to day, or even from hour to hour. It would be nice if you could spot a drop in that rate right away, and then identify the root cause of the problem. Well, we do this here at Evite with a combination of tools, the main one being a non-open-source, non-free tool called SAM (for Simple Application Monitoring) written by Randy Wigginton.
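Independently of any particular tool, the rate-of-change check described above amounts to comparing the current period's count against the previous one. The 20% threshold is an arbitrary example, not a recommendation.

```python
def rate_drop(previous, current, max_drop_pct=20.0):
    """Compare e.g. profiles created this hour vs. the previous hour.
    Returns (alert, drop_pct); alert is True when the count dropped
    by more than max_drop_pct."""
    if previous == 0:
        return False, 0.0
    drop_pct = 100.0 * (previous - current) / previous
    return drop_pct > max_drop_pct, drop_pct

# hypothetical counts of new user profiles per hour
alert, drop = rate_drop(previous=1000, current=700)
print(alert, drop)
```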
In a nutshell, SAM uses a modified memcached server to capture URLs and SQL queries as they are processed by your application, and aggregates them by ignoring query strings in URLs and values in SQL statements. It saves their count, average/min/max time of completion and failure count in a database, then displays stats and graphs in a simple Web application that queries that database. The beauty of this approach is that you can see metrics related to specific portions of your application aggregated across all users hitting that functionality. So if a user profile is saved via a POST to a URL with a specific query string and HTTP payload, SAM will aggregate all such requests and will allow you to see exactly how many users have saved their profiles in the last hour, how many failures you had, how long it took, etc. What's more, you can query the database directly and take action based on certain thresholds (for example if the failure rate for a specific action, e.g. a POST to a specific URL, has surpassed a given threshold). We're doing exactly this within our restish-based Web service monitoring tool.
Note that this technique based on aggregation can be implemented using other approaches, for example by mining your Web server logs, or database logs. The nice thing about SAM is that it captures all this information in real time. As far as your application is concerned, it only interacts with memcached using standard memcache client calls, and it doesn't even bother to check the result of the memcache push.
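To make the aggregation idea concrete, here is a minimal sketch of it, not SAM itself: strip the query string from URLs (and literals from SQL) so that structurally identical requests collapse into one key, then keep count, failure count and min/avg/max timings per key.

```python
import re
from collections import defaultdict

def normalize_url(url):
    """Strip the query string so all requests to the same endpoint
    aggregate under one key."""
    return url.split("?", 1)[0]

def normalize_sql(query):
    """Replace literal values so structurally identical queries collapse."""
    query = re.sub(r"'[^']*'", "?", query)
    return re.sub(r"\b\d+\b", "?", query)

class Aggregator:
    """Per-key count, failure count and min/avg/max completion times."""
    def __init__(self):
        self.stats = defaultdict(lambda: {"count": 0, "failures": 0,
                                          "total_ms": 0.0,
                                          "min_ms": float("inf"),
                                          "max_ms": 0.0})

    def record(self, key, elapsed_ms, failed=False):
        s = self.stats[key]
        s["count"] += 1
        s["failures"] += failed
        s["total_ms"] += elapsed_ms
        s["min_ms"] = min(s["min_ms"], elapsed_ms)
        s["max_ms"] = max(s["max_ms"], elapsed_ms)

    def average(self, key):
        s = self.stats[key]
        return s["total_ms"] / s["count"] if s["count"] else 0.0
```

Feed every request (or log line) through record() and you can answer questions like "how many profile saves in the last hour, and how many failed?" directly from the aggregated stats.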
Another advantage of having such a macro/business-type of monitoring and graphing in place is that your business stakeholders (including your CEO) can see these metrics and trends almost in real time, and can alert you (the ops team) if something seems out of whack. This is almost guaranteed to correlate with something going amiss in your infrastructure. Once you figure out what that is, see whether your internal system monitoring caught it or not. If it didn't, it's a great opportunity to add it, so that next time you know about it before your CEO does. Trust me, these things can happen; I speak from experience.
End-to-end application monitoring
Companies such as BrowserMob and Sauce Labs allow you to run Selenium scripts against your Web application using clients hosted by them 'in the cloud'. With BrowserMob, you can also use these scripts to monitor a given path through your application. This is obviously the monitoring equivalent of an end-to-end application test. If you have an e-commerce site for example, you can test and monitor your entire check out process this way. I am still scratching the surface when it comes to this type of monitoring, but it's high on my TODO list.
So there you have it. Similar to having a sound application testing strategy, your best bet is to implement all these types of monitoring, or as many of them as your time and budget allow. If you do, you can truly say you have a holistic monitoring strategy in place. And remember, the more dashboards you have into the health of your application and your systems, the better off you are. It's an uncontested rule that the one with the most dashboards wins.
For further reading and more dashboard ideas, see these resources: