Thursday, January 05, 2012

Graphing, alerting and mission control with Graphite and Nagios


We’ve been using Graphite more and more to graph OS- and application-related metrics (here are some old-ish notes of mine on installing and configuring Graphite). We measure and graph variables as diverse as:
  • relative and absolute state of charge of the LSI MegaRAID controller battery (why? because we’ve been burned by battery issues before)
  • database server I/O wait time (critical for EC2 instances, which are notorious for their poor I/O performance; at this point we only run MySQL slaves in EC2, and we do not, repeat, DO NOT use EBS volumes for DB servers; instead we stripe the local disks into a RAID 0 array with LVM)
  • memcached stats such as current connections, get hits and misses, delete hits and misses, evictions
  • percentage and absolute number of HTTP return codes as served by nginx and haproxy
  • count of messages in various queues (our own and Amazon SQS)
  • count of outgoing mail messages

We do have a large Munin install base, so we found Adam Jacob’s munin-graphite.rb script very useful for sending all the data captured by Munin to Graphite.

Why Graphite and not Munin or Ganglia? Mainly because it’s so easy to send arbitrarily named metrics to Graphite, but also because we can capture measurements at 1-second granularity (although this is possible with some tweaking in RRD-based tools as well).

On the Graphite server side, we set up different retention policies depending on the type of data we capture. For example, for app server logs (nginx and haproxy) we have the following retention policy specified in /opt/graphite/conf/storage-schemas.conf:

[appserver]
pattern = ^appserver\.
retentions = 60s:30d,10m:180d,60m:2y


This tells Graphite we want to keep data aggregated at 60-second intervals for 30 days, 10-minute data for 6 months, and hourly data for 2 years.

The main mechanism we use for sending data to Graphite is tailing various log files at different intervals, parsing the entries in order to extract the metrics we’re interested in, and sending those metrics to Graphite by the tried-and-true method called ‘just open a socket’.
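
The ‘just open a socket’ part really is just a few lines of code. Here is a minimal sketch using Graphite’s plaintext protocol on port 2003 (the carbon host name is an assumption, and send_metric is a hypothetical helper, not our actual code):

import socket
import time

CARBON_HOST = 'graphite.yourdomain.com'   # assumed: carbon runs on the Graphite box
CARBON_PORT = 2003                         # default carbon plaintext listener port

def send_metric(path, value, timestamp=None):
    """Send a single metric using Graphite's plaintext protocol:
    '<metric path> <value> <epoch timestamp>' followed by a newline."""
    timestamp = timestamp or int(time.time())
    line = '%s %s %d\n' % (path, value, timestamp)
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.connect((CARBON_HOST, CARBON_PORT))
    sock.sendall(line.encode('ascii'))
    sock.close()

# Example: report 3 HTTP 500s seen on app01 during the last collection interval
send_metric('appserver.app01.500.reqs', 3)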

For example, we tail the nginx access log file via a log tracker script written in Python (run as a daemon with supervisor), and we extract values such as the timestamp, the request URL, the HTTP return code, the bytes sent and the request time in milliseconds. The default interval for collecting these values is 1 minute. We group the HTTP return codes by class (2xx, 3xx, 4xx, 5xx) so we can report on each class of return code. We aggregate the values per collection interval, then send the counts to Graphite under names such as appserver.app01.500.reqs, which represents the HTTP 500 error count on server app01.
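
Our actual log tracker does more than this (request URLs, bytes sent, request times), but a stripped-down sketch of the status-code part looks roughly like the following. The log regex, the tailing logic and the metric names are simplified assumptions, and it reuses the hypothetical send_metric helper from the previous sketch:

import re
import time
from collections import defaultdict

# Assumed combined-style access log: the status code follows the quoted request,
# e.g. '... "GET /foo HTTP/1.1" 500 1234 ...'
STATUS_RE = re.compile(r'" (\d{3}) ')

def tail_status_counts(path, interval=60):
    """Follow the access log and yield {status_code: count} once per interval."""
    with open(path) as logfile:
        logfile.seek(0, 2)                      # start at the current end of the file
        counts = defaultdict(int)
        deadline = time.time() + interval
        while True:
            line = logfile.readline()
            if line:
                match = STATUS_RE.search(line)
                if match:
                    counts[match.group(1)] += 1
            else:
                time.sleep(0.5)                 # no new data yet, wait a bit
            if time.time() >= deadline:
                yield counts
                counts = defaultdict(int)
                deadline = time.time() + interval

# Example usage, reusing send_metric from the previous sketch:
# for counts in tail_status_counts('/var/log/nginx/access.log'):
#     for code, count in counts.items():
#         send_metric('appserver.app01.%s.reqs' % code, count)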

A more elegant way would be to use a tool such as logster to capture various log entries, but we haven’t had the time to write logster plugins for the 2 main services we’re interested in, nginx and haproxy. Our solution is deemed temporary, but as we all know, there’s nothing more permanent than a temporary solution.

For some of the more unusual metrics that we measure ourselves, such as the LSI MegaRAID battery charge state, we run a shell script in an infinite loop that produces a value every second and sends it to Graphite. To obtain the value, we run something that resembles line noise:

$ MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 | grep -i "Relative State of Charge" | awk '{print $5}'

(thanks to my colleague Marco Garcia for coming up with this)
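
The collector itself is a shell script, but the idea is simple enough that a Python sketch of the loop fits in a few lines (the metric name below is made up, and send_metric is again the hypothetical helper from the first sketch):

import subprocess
import time

# The same pipeline shown above, run once per second
MEGACLI_CMD = ('MegaCli64 -AdpBbuCmd -GetBbuStatus -a0 | '
               'grep -i "Relative State of Charge" | awk \'{print $5}\'')

while True:
    output = subprocess.check_output(MEGACLI_CMD, shell=True)
    charge = output.decode('ascii').strip()
    # Hypothetical metric name; adjust to your own naming scheme
    send_metric('db01.megaraid.bbu.relative_charge', charge)
    time.sleep(1)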

Once the data is captured in Graphite, we can do several things with it:
  • visualize it using the Graphite dashboards
  • alert on it using custom Nagios plugins
  • capture it in our own more compact dashboards, so we can have multiple graphs displayed on one ‘mission control’ page

Nagios plugins

My colleague Marco Garcia wrote a Nagios plugin in Python to alert on HTTP 500 errors. To obtain the data from Graphite, he queries a special ‘rawData’ URL of this form:

http://graphite.yourdomain.com/render/?from=-5minutes&until=-1minutes&target=asPercent(sum(appserver.app01.500.*.reqs),sum(appserver.app01.*.*.reqs))&rawData

which returns something like

asPercent(sumSeries(appserver.app01.500.*.reqs),sumSeries(appserver.app01.*.*.reqs)),1325778360,1325778600,60|value1,value2,value3,value4

where the 4 values represent the 4 data points, one per minute, from 5 minutes ago to 1 minute ago. Each data point is the percentage of HTTP 500 errors calculated against the total number of HTTP requests.

The script then compares the values returned with a warning threshold (0.5) and a critical threshold (1.0). If all values are greater than the respective threshold, we issue a warning or a critical Nagios alert.
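
This isn’t Marco’s actual plugin, but a minimal sketch of the same logic is easy to reconstruct from the description above (the handling of ‘None’ datapoints is an assumption, and the thresholds and URL are the ones described in this post):

import sys

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

WARN_THRESHOLD = 0.5   # percent of requests that are HTTP 500s
CRIT_THRESHOLD = 1.0

RAWDATA_URL = ('http://graphite.yourdomain.com/render/?from=-5minutes&until=-1minutes'
               '&target=asPercent(sum(appserver.app01.500.*.reqs),'
               'sum(appserver.app01.*.*.reqs))&rawData')

def get_datapoints(url):
    """Parse Graphite rawData output of the form 'target,start,end,step|v1,v2,...'."""
    raw = urlopen(url).read().decode('ascii').strip()
    values = raw.split('|')[1].split(',')
    # Assumption: treat missing datapoints (returned as 'None') as 0
    return [float(v) if v != 'None' else 0.0 for v in values]

def main():
    points = get_datapoints(RAWDATA_URL)
    if all(p > CRIT_THRESHOLD for p in points):
        print('CRITICAL: HTTP 500s at %.2f%% of requests' % points[-1])
        sys.exit(2)
    if all(p > WARN_THRESHOLD for p in points):
        print('WARNING: HTTP 500s at %.2f%% of requests' % points[-1])
        sys.exit(1)
    print('OK: HTTP 500s at %.2f%% of requests' % points[-1])
    sys.exit(0)

if __name__ == '__main__':
    main()

Nagios interprets the exit codes in the usual way: 0 for OK, 1 for WARNING, 2 for CRITICAL.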

Mission control dashboard

My colleague Josh Frederick came up with the idea of presenting all relevant graphs based on data captured in Graphite in a single page which he dubbed ‘mission control’. This enables us to make correlations at a glance between things such as increased I/O wait on DB servers and spikes in HTTP 500 errors.

To generate a given graph, Josh uses some JavaScript magic to come up with a URL such as:

http://graphite.yourdomain.com/render/?width=500&height=200&hideLegend=1&from=-60minutes&until=-0minutes&bgcolor=FFFFFF&fgcolor=000000&areaMode=stacked&title=500s%20as%20%&target=asPercent(sum(appserver.app*.500.*.reqs),%20sum(appserver.app*.*.*.reqs))&_ts=1325783046438

which queries Graphite for the percentage of HTTP 500 errors out of all HTTP requests across all app servers for the last 60 minutes. The resulting graph looks like this:

[Graph: HTTP 500 errors as a percentage of all requests across the app servers, stacked area, last 60 minutes]

Our mission control page currently has 29 such graphs.
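
With that many graphs, nobody writes the render URLs by hand. The dashboard assembles them in JavaScript; the Python sketch below just illustrates the idea of building a render URL from a target and a title, with parameter values copied from the example URL above (this is not Josh’s actual code):

try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

GRAPHITE_BASE = 'http://graphite.yourdomain.com/render/'

def render_url(target, title, minutes=60, width=500, height=200):
    """Build a Graphite render URL suitable for embedding as an <img> in a dashboard."""
    params = [
        ('width', width),
        ('height', height),
        ('hideLegend', 1),
        ('from', '-%dminutes' % minutes),
        ('until', '-0minutes'),
        ('bgcolor', 'FFFFFF'),
        ('fgcolor', '000000'),
        ('areaMode', 'stacked'),
        ('title', title),
        ('target', target),
    ]
    return GRAPHITE_BASE + '?' + urlencode(params)

print(render_url(
    'asPercent(sum(appserver.app*.500.*.reqs), sum(appserver.app*.*.*.reqs))',
    '500s as %'))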

We also have (courtesy of Josh again) a different set of charts based on the Google Visualization Python API. We get the data from Graphite in CSV format (by adding format=csv to the Graphite query URLs), then we display the data using the Google Visualization API.
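
The fetching side of that is straightforward; here is a rough sketch (the query target is just an example, and the row layout assumes Graphite’s usual ‘series,timestamp,value’ CSV output):

import csv

try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

# Example query; format=csv makes Graphite return one 'series,timestamp,value'
# row per datapoint
CSV_URL = ('http://graphite.yourdomain.com/render/?from=-60minutes'
           '&target=sum(appserver.app*.500.*.reqs)&format=csv')

for row in csv.reader(urlopen(CSV_URL).read().decode('ascii').splitlines()):
    if not row:
        continue
    series, timestamp, value = row
    # Each (timestamp, value) pair can then be loaded into a Google Visualization DataTable
    print('%s %s %s' % (series, timestamp, value))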

If you don’t want to roll your own JS-based dashboard talking to Graphite, there’s a tool called statsdash that may be of help.

4 comments:

Mark Seger said...

I've only recently heard about graphite and it does look pretty cool. In fact I wrote an interface to allow collectl to send performance stats to it. Rather than repeat myself, have a look here - https://answers.launchpad.net/graphite/+question/183898 and let me know if you'd like to try out a pre-release.

Bryan said...

Nice post. I appreciate this stuff and learn a lot from it. We're getting some stuff in place right now using Cube (http://square.github.com/cube/) and Ostrich (https://github.com/twitter/ostrich). At the moment, we're not dealing with nuts and bolts type stuff like "LSI MegaRAID controller batteries", rather application specific stuff. IE... counts/failure rates/durations for executions of particular strategically important bits of code (calls to backend subsystems we depend on that are out of our control, DB calls, etc.) I'd describe it as more developer-centric, but definitely overlaps into that Dev/Ops grey area.

Anyway, good reading, thanks for the post.

Anonymous said...

Thanks for the post. I've used a similar approach, but using monit instead of nagios for firing alerts. I also 'borrowed' your graphite query for alerting on 500's on my nginx. Great idea.

If you're interested, I posted about graphite alerting with monit on http://blog.gingerlime.com/2013/graphite-alerts-with-monit/

trobrock said...

Great post, don't personally use nagios, but would be curious how you think it compares to something like http://fisherapp.com <- New tool I just built, hopefully this will make monitoring graphite possible for those who aren't using monit or nagios as well.
