Agile Testing: What I want in a monitoring tool

Sunday, September 30, 2012

What I want in a monitoring tool

I started a new job a few weeks ago, and I'm now at a point where I'm investigating monitoring options. At past jobs I used Nagios, which I know will work, but I would like to look into other, more modern tools. I am aware that #monitoringsucks, and I am pretty sure people have hashed out these topics before, but here are some of the things I want from a modern monitoring tool:

Ideally open source, or if not, affordable per-host, per-month pricing (we already signed up as a paying customer of Boundary, for example)
Installation and configuration should be easily scriptable

server installation, as well as addition/modification of clients should be easily automated so it can be done with Puppet/Chef
API would be ideal

Robust notifications/alerting rules

escalations
service dependencies
event handler scripts
alerts based on subsets of hosts/services

for example alert me only when 2+ servers of the same type are down
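The "2+ servers of the same type" rule above can be sketched in a few lines of Python. This is only an illustration of the kind of rule I have in mind, not how any particular tool implements it; the function name `groups_to_alert`, the `host_status` layout, and the host names are all made up for the example.

```python
from collections import defaultdict


def groups_to_alert(host_status, min_down=2):
    """Return {server_type: [down hosts]} for every server type
    that has at least `min_down` hosts currently down.

    `host_status` maps host name -> (server_type, is_up); this layout
    is an assumption made for the sketch.
    """
    down_by_type = defaultdict(list)
    for host, (server_type, is_up) in host_status.items():
        if not is_up:
            down_by_type[server_type].append(host)
    # Keep only the groups that cross the threshold.
    return {
        server_type: sorted(hosts)
        for server_type, hosts in down_by_type.items()
        if len(hosts) >= min_down
    }


# Hypothetical inventory: two of three web servers down, one of two db servers down.
status = {
    "web01": ("web", False),
    "web02": ("web", False),
    "web03": ("web", True),
    "db01": ("db", False),
    "db02": ("db", True),
}

print(groups_to_alert(status))  # {'web': ['web01', 'web02']}
```

With the default threshold, the single db server being down does not trigger anything, which is exactly the point: one box out of a redundant pool should not page anybody.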

Out-of-the-box plugins

database-specific checks for example

Scalability

the monitoring server shouldn't become a bottleneck as more clients are added

Nagios is OK with 100-200 clients (with passive checks)

hierarchy of servers should be supported
agent-based clients

Reporting/dashboards

Hosts/services status dashboards
Downtime/outages dashboards
Latency (for HTTP checks)

Resource graphing would be great

but in my experience very few tools do both alerting and resource/metrics graphing well
in the past I used Nagios for alerting and Ganglia/Graphite for graphing

Integration with other tools

Send events to graphing tools (Graphite), alerting tools (PagerDuty), notification mechanisms (irc, Campfire), logging tools (Graylog2)
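As a rough idea of what sending an event to Graphite involves, here is a minimal Python sketch that uses Graphite's plaintext protocol (one `metric.path value timestamp` line per data point, sent over TCP, by default to port 2003). The host name `graphite.example.com`, the metric path, and the helper names are assumptions for illustration only; a real monitoring tool would also need retries, batching, and error handling.

```python
import socket
import time


def graphite_line(path, value, timestamp=None):
    """Format one data point in Graphite's plaintext protocol."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


def send_to_graphite(lines, host="graphite.example.com", port=2003, timeout=5.0):
    """Send already formatted lines to a carbon listener over TCP.

    The host is a placeholder; 2003 is the default plaintext port.
    """
    payload = "".join(lines).encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)


# Hypothetical metric: latency of an HTTP check against web01.
line = graphite_line("monitoring.web01.http.latency_ms", 123.4, timestamp=1349049600)
print(repr(line))  # 'monitoring.web01.http.latency_ms 123.4 1349049600\n'
```

Calling `send_to_graphite([line])` would push the data point to the server; it is left out of the example run because it needs a reachable Graphite instance.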

I also asked a question on Twitter about what monitoring tool people would recommend, and here are some of the tools mentioned in the replies:

Sensu
OpenNMS
Icinga
Zenoss
Riemann
Ganglia
Datadog

Several people told me to look into Sensu, and a quick look at its home page tells me it would be worth giving it a whirl. So I think I'll do that next. Stay tuned, and also please leave a comment if you know of other tools that might fit the profile I am looking for.

Update: more tools mentioned in comments or on Twitter after I posted a link to this blog post:

New Relic (which I am actually in the process of evaluating, having paid for 1 host)
Circonus
Zabbix
Server Density
Librato
Comostas
OpsView
Shinken
PRTG
NetXMS
Tracelytics

19 comments:

Anonymous said...: In the same hosted space as Datadog I'll also put forward our service: www.serverdensity.com (in fact DD's agent is a fork of our open source agent).; Sunday, September 30, 2012 at 5:36:00 PM PDT
Grig Gheorghiu said...: Thanks, added Server Density to list of tools in my post.; Sunday, September 30, 2012 at 5:53:00 PM PDT
Anonymous said...: We're using check_mk at work, and quite like both its setup and its look. It builds/works on top of nagios, so I'm not sure if it counts in your list.

Rollout is especially easy on the clients.; Sunday, September 30, 2012 at 6:41:00 PM PDT
Akira (ねこ) Kri said...: Hello there, you can try comostas at comostas.com
There are 2 versions of it, but both support most of the things you mentioned in your post.
The tool is quite new and has never been advertised anywhere yet, though it is already robust and is used for monitoring by most Israeli banks ;-); Sunday, September 30, 2012 at 6:57:00 PM PDT
Doug Napoleone said...: Have you looked at OpsView?
http://www.opsview.com/

There is an Open Source base version with pay-for extensions.; Sunday, September 30, 2012 at 7:24:00 PM PDT
Anonymous said...: Have you looked at Tracelytics? They have more of a cloud based pricing model to scale..; Sunday, September 30, 2012 at 7:37:00 PM PDT
Anonymous said...: Have a look at netxms.; Sunday, September 30, 2012 at 10:49:00 PM PDT
Unknown said...: Since you use Python, have a look at Shinken. It's a rewrite of Nagios in Python, with modern concepts (distributed, integrated Graphite, webui agnostic etc.); Sunday, September 30, 2012 at 11:02:00 PM PDT
Unknown said...: Since you're using Python extensively, have a look at Shinken, a Nagios rewrite in Python with modern technologies!; Sunday, September 30, 2012 at 11:03:00 PM PDT
Radomir Dopieralski said...: Good luck at the new job!; Monday, October 1, 2012 at 2:12:00 AM PDT
Winni said...: I work at a Teleport for B2B Satellite Communication, and we use a mix of the commercial product WhatsUp Gold for Monitoring and the Open Source project Cacti (www.cacti.net) for Graphing.

Personally, if the budget allows, I'd recommend PRTG by Paessler (www.paessler.com) - it can do both graphing and monitoring and works very well. If you need to analyze traffic flows, Scrutinizer by Plixer (www.plixer.com) is an awesome tool.; Monday, October 1, 2012 at 7:00:00 AM PDT
Grig Gheorghiu said...: Thanks everybody for the recommendations, I updated my original post with all the tools mentioned in the comments so far.; Monday, October 1, 2012 at 8:09:00 AM PDT
Scumola said...: Stick with nagios. I still use it and it's great.; Monday, October 1, 2012 at 10:47:00 AM PDT
Anonymous said...: For system monitoring I am using check_mk with pnp4nagios for graphing and NagVis for visualising data.

When it comes to application monitoring, New Relic is a good choice.; Wednesday, October 3, 2012 at 1:27:00 AM PDT
Sandeep Netha said...: Nice posting; Wednesday, October 3, 2012 at 6:50:00 AM PDT
Mark Waite said...: We started with Nagios, tried the check_mk extension over Nagios, and then were delighted to discover the next level of nagios + check_mk, OMD.

OMD (Open Monitoring Distribution) seems to be from the author of check_mk and provides a system that is easy to install, easy to administer (web UI for many operations), and configured with Python scripts.

It uses all the Nagios plugins, plus has an additional way of allowing a monitoring agent to discover services on a computer and inform the server of the new services.; Sunday, October 7, 2012 at 4:55:00 PM PDT
Mark Waite said...: Grig, my earlier posting was incorrect in the expansion of OMD. It is the Open Monitoring Distribution. omdistro.org is the web site for the distribution. It is open source and works well in the tests I've run (monitoring 100+ machines from a 5 year old laptop); Sunday, October 7, 2012 at 4:57:00 PM PDT
Unknown said...: I have used several monitoring solutions for more than three years; from among them I recommend IPHost Network Monitor; Monday, October 15, 2012 at 9:15:00 PM PDT
Anonymous said...: So, what solution did you choose and why?; Friday, October 26, 2012 at 2:44:00 AM PDT
