Sunday, September 30, 2012

What I want in a monitoring tool

I started a new job a few weeks ago, and I'm now at a point where I'm investigating monitoring options. At past jobs I used Nagios, which I know will work, but I would like to look into other more modern tools. I am aware that #monitoringsucks, and I am pretty sure people have hashed these topics before, but here are some of the things I want from a modern monitoring tool:
  • Ideally open source, or if not, affordable per-host, per-month pricing (we already signed up as a paying customer of Boundary, for example)
  • Installation and configuration should be easily scriptable
    • server installation, as well as addition/modification of clients should be easily automated so it can be done with Puppet/Chef
    • API would be ideal
  • Robust notifications/alerting rules
    • escalations
    • service dependencies
    • event handler scripts
    • alerts based on subsets of hosts/services
      • for example alert me only when 2+ servers of the same type are down
  • Out-of-the-box plugins
    • database-specific checks for example
  • Scalability
    • the monitoring server shouldn't become a bottleneck as more clients are added
      • Nagios is OK with 100-200 clients (with passive checks)
    • hierarchy of servers should be supported
    • agent-based clients
  • Reporting/dashboards
    • Hosts/services status dashboards
    • Downtime/outages dashboards
    • Latency (for HTTP checks)
  • Resource graphing would be great
    • but in my experience very few tools do both alerting and resource/metrics graphing well
    • in the past I used Nagios for alerting and Ganglia/Graphite for graphing
  • Integration with other tools
    • Send events to graphing tools (Graphite), alerting tools (PagerDuty), notification mechanisms (IRC, Campfire), logging tools (Graylog2)
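
To make the "alert me only when 2+ servers of the same type are down" item above concrete, here is a minimal sketch of that rule in Python. It is not taken from any of the tools discussed here: the host dictionaries, the `role` field, and the `groups_to_alert` function are all hypothetical names chosen for illustration, and a real tool would feed in live check results instead of a hard-coded list.

```python
from collections import defaultdict


def groups_to_alert(hosts, threshold=2):
    """Return {role: [names of down hosts]} for every role that has
    at least `threshold` hosts down. All field names are hypothetical."""
    down_by_role = defaultdict(list)
    for host in hosts:
        if not host["up"]:
            down_by_role[host["role"]].append(host["name"])
    # Keep only the roles that crossed the threshold.
    return {
        role: names
        for role, names in down_by_role.items()
        if len(names) >= threshold
    }


# Made-up sample data: two of three web servers down, one of two db servers down.
hosts = [
    {"name": "web1", "role": "web", "up": False},
    {"name": "web2", "role": "web", "up": False},
    {"name": "web3", "role": "web", "up": True},
    {"name": "db1", "role": "db", "up": False},
    {"name": "db2", "role": "db", "up": True},
]

print(groups_to_alert(hosts))  # {'web': ['web1', 'web2']}
```

With the sample data, only the web group triggers an alert; the single database failure stays below the threshold and would be left to a lower-priority notification.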
I also asked a question on Twitter about what monitoring tool people would recommend, and here are some of the tools mentioned in the replies:
  • Sensu
  • OpenNMS
  • Icinga
  • Zenoss
  • Riemann
  • Ganglia
  • Datadog
Several people told me to look into Sensu, and a quick look at its home page tells me it would be worth giving it a whirl, so I think I'll do that next. Stay tuned, and please leave a comment if you know of other tools that might fit the profile I am looking for.

Update: more tools mentioned in comments or on Twitter after I posted a link to this blog post:
  • New Relic (which I am actually in the process of evaluating, having paid for 1 host)
  • Circonus
  • Zabbix
  • Server Density
  • Librato
  • Comostas
  • OpsView
  • Shinken
  • PRTG
  • NetXMS
  • Tracelytics

19 comments:

Anonymous said...

In the same hosted space as Datadog I'll also put forward our service: www.serverdensity.com (in fact DD's agent is a fork of our open source agent).

Grig Gheorghiu said...

Thanks, added Server Density to list of tools in my post.

Anonymous said...

We're using check_mk at work, and quite like both its setup and its look. It builds/works on top of nagios, so I'm not sure if it counts in your list.

Rollout is especially easy on the clients.

Akira (ねこ) Kri said...

Hello there, you can try comostas at comostas.com
There are 2 versions of it, but both support most of the things you mentioned in your post.
The tool is quite new and has not been advertised anywhere yet, though it is already robust and is used for monitoring by most Israeli banks ;-)

Doug Napoleone said...

Have you looked at OpsView?
http://www.opsview.com/

There is an Open Source base version with pay-for extensions.

Anonymous said...

Have you looked at Tracelytics? They have more of a cloud based pricing model to scale..

Anonymous said...

Have a look at netxms.

Unknown said...

Since you use Python, have a look at Shinken. It's a rewrite of Nagios in Python, with modern concepts (distributed, integrated Graphite, web UI agnostic, etc.)

Unknown said...

Since you're using Python extensively, have a look at Shinken, a Nagios rewrite in Python with modern technologies!

Radomir Dopieralski said...

Good luck at the new job!

Winni said...

I work at a Teleport for B2B Satellite Communication, and we use a mix of the commercial product WhatsUp Gold for Monitoring and the Open Source project Cacti (www.cacti.net) for Graphing.

Personally, if the budget allows, I'd recommend PRTG by Paessler (www.paessler.com) - it can do both graphing and monitoring and works very well. If you need to analyze traffic flows, Scrutinizer by Plixer (www.plixer.com) is an awesome tool.

Grig Gheorghiu said...

Thanks everybody for the recommendations, I updated my original post with all the tools mentioned in the comments so far.

Scumola said...

Stick with nagios. I still use it and it's great.

Anonymous said...

For system monitoring I am using check_mk with pnp4nagios for graphing and NagVis for visualising data.

When it comes to application monitoring, New Relic is a good choice.

Sandeep Netha said...

Nice posting

Mark Waite said...

We started with Nagios, tried the check_mk extension over Nagios, and then were delighted to discover the next level of nagios + check_mk, OMD.

OMD (Open Monitoring Distribution) seems to be from the author of check_mk and provides a system that is easy to install, easy to administer (web UI for many operations), and configured with Python scripts.

It uses all the Nagios plugins, plus has an additional way of allowing a monitoring agent to discover services on a computer and inform the server of the new services.

Mark Waite said...

Grig, my earlier posting was incorrect in the expansion of OMD. It is the Open Monitoring Distribution. omdistro.org is the web site for the distribution. It is open source and works well in the tests I've run (monitoring 100+ machines from a 5 year old laptop)

Unknown said...

I have used several monitoring solutions for more than three years; from among them I recommend IPHost Network Monitor.

Anonymous said...

So, what solution did you choose and why?
