Saturday, October 27, 2012

Monitoring doesn't have to suck

This is a follow-up to my previous post, where I detailed some of the things I want in a modern monitoring tool. A month has passed, and I thought I'd give a quick overview of some of the tools we have started to use as part of our monitoring and graphing strategy.

Several people recommended Sensu, so my colleague Jeff Roberts gave it a shot, and he liked what he saw (a blog post with more technical details is hopefully coming soon from Jeff!). We're still tinkering with it, but we're already using it to monitor our Linux-based machines (both Ubuntu and CentOS). Jeff is working on our Chef infrastructure, and soon we'll be able to deploy the Sensu client via Chef. What's nice about Sensu, and different from other tools, is the queuing mechanism it uses both for client-server communication and for posting events such as 'send this metric to Graphite', 'send this alert to Pager Duty' or 'send this notification to this email address'. It does have a few rough edges, but @portertech seems to be on IRC 24x7, so that helps a lot :-)

Like many other monitoring tools, Sensu doesn't have good support for Windows. We looked around and settled on Server Density for monitoring our Windows machines (which are mostly used for our backend infrastructure). We also monitor the Sensu server itself with Server Density, and since Sensu's queuing is built on RabbitMQ, we're looking into adding RabbitMQ-specific checks for Sensu.

We integrated both Sensu and Server Density with Pager Duty, and we treat every alert sent to Pager Duty as critical (which means the engineer on duty has to acknowledge, resolve or escalate it). Pager Duty is an amazingly useful service, and its most useful features are probably the escalation processes and policies it provides. No more excuses from people on pager that they weren't notified when shit hit the fan!

For non-critical alerts we send email notifications outside of Pager Duty. With Sensu we do this via its mailer handler, and for Server Density by using a profile that goes to an email alias.
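The split between critical and non-critical alerts can be sketched as a small routing function. This is only an illustration in Python, not Sensu's actual handler code. It assumes check results use the Nagios-style exit statuses that Sensu checks follow (0 OK, 1 warning, 2 critical, 3 unknown). The handler names `pagerduty` and `mailer`, and the choice to email unknown results rather than page on them, are assumptions made for the example.

```python
# Nagios-style exit statuses, which Sensu checks also follow.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def route_alert(status):
    """Return the handler names that should receive a check result.

    The handler names and the policy are illustrative assumptions:
    critical results page the on-call engineer, any other non-OK
    result goes out as email, and OK results notify nobody.
    """
    if status == OK:
        return []
    if status == CRITICAL:
        return ["pagerduty"]
    return ["mailer"]


if __name__ == "__main__":
    for status in (OK, WARNING, CRITICAL, UNKNOWN):
        print(status, route_alert(status))
```

Keeping the policy in one place like this makes it easy to see, and to change, which alerts are allowed to wake somebody up.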

Two more monitoring tools we signed up for are Boundary, which we haven't been using in anger yet, but will do so in the next couple of weeks, and New Relic, whose sweet spot seems to be application-level monitoring. We'll deploy some Java app servers soon and we'll try the JVM New Relic plugin at the same time. Boundary will be very useful for looking at network traffic, establishing patterns and baselines, and getting notified when something gets out of whack. It will tell us what we don't know that we don't know.

Update: I forgot to mention Pingdom, which is a very inexpensive but extremely useful external monitoring service. We use it to monitor important user-facing resources (mostly web pages, but Pingdom can also do generic TCP checks and mail-related checks) so we get alerted when something is wrong from the perspective of our users, looking from the outside in, so to speak.
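To make the idea of a generic TCP check concrete, here is a minimal sketch in Python of what such a check boils down to: try to open a connection and report whether it succeeded. This is not how Pingdom is implemented; the function name `tcp_check` and the 5-second default timeout are assumptions for the example.

```python
import socket


def tcp_check(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False
```

Calling `tcp_check("www.example.com", 443)` from a machine inside your own network only tells you half the story, which is why an external service is worth paying for: it runs the same kind of probe from where your users are.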

For graphing, we deployed Graphite. We haven't started to use it very heavily, but again this will change soon, as we'll be sending it various business metrics obtained by querying the critical databases within our infrastructure. Jeff already integrated it with Sensu, so we'll be able to use the same queuing mechanism for Graphite as we are using for the monitoring alerts.
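For readers who haven't fed data to Graphite before, here is a minimal sketch of sending a metric directly, without going through Sensu. It uses Graphite's plaintext protocol, where each metric is a single line of the form `path value timestamp`, sent to Carbon's plaintext listener (port 2003 by default). The metric name `business.orders.count` and the `localhost` default are made up for the example; our real metrics will go through the Sensu queue instead.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Build one plaintext-protocol line: path, value, Unix timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_metric(path, value, host="localhost", port=2003, timestamp=None):
    """Send a single metric to a Carbon plaintext listener over TCP."""
    line = format_metric(path, value, timestamp)
    with socket.create_connection((host, port), timeout=5.0) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # Hypothetical business metric; only formats the line, sends nothing.
    print(format_metric("business.orders.count", 42), end="")
```

A periodic job that queries a database and calls something like `send_metric("business.orders.count", count)` is all it takes to get a business metric onto a graph.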

So there you have it: Sensu, Pingdom, Server Density, Pager Duty, Boundary, New Relic and Graphite. Modern tools that give a good name to monitoring. No, monitoring no longer sucks.

4 comments:

Doug Lane said...

Have you heard of Tracelytics? I recommend you take a look at that.

Bryan said...

If/when Jeff writes a post about it, make sure you tweet it so I can go read that too. Thanks...

Aurelie said...

Would be great to have further details on your Sensu deployment and experience as we are considering it.

Unknown said...

How do you send an ack'd status back to Sensu when you ack the alert on PagerDuty?
