Monday, September 27, 2010

Getting detailed I/O stats with Munin

Ever since Vladimir Vuksan pointed me to his Ganglia script for getting detailed disk stats, I've been looking for something similar for Munin. The iostat and iostat_ios Munin plugins, which are enabled by default when you install Munin, do show disk stats across all devices detected on the system. I wanted more in-depth stats per device though. In my case, the devices I'm interested in are actually Amazon EBS volumes mounted on my database servers.

I finally figured out how to achieve this, using the diskstat_ Munin plugin which gets installed by default when you install munin-node.

If you run

/usr/share/munin/plugins/diskstat_ suggest

you will see the various symlinks you can create for the devices available on your server.

In my case, I have 2 EBS volumes on each of my database servers, attached as /dev/sdm and /dev/sdn. I created the following symlinks for /dev/sdm (and similar ones for /dev/sdn):


ln -snf /usr/share/munin/plugins/diskstat_ /etc/munin/plugins/diskstat_latency_sdm
ln -snf /usr/share/munin/plugins/diskstat_ /etc/munin/plugins/diskstat_throughput_sdm
ln -snf /usr/share/munin/plugins/diskstat_ /etc/munin/plugins/diskstat_iops_sdm
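
With more than one device, the same symlinks can be created in a loop. Here is a minimal sketch; the function name link_diskstat is made up for illustration, and the paths in the usage comment are the ones from this post, which may differ on other distributions.

```shell
# Sketch: create the latency, throughput and iops symlinks for each device.
# link_diskstat is a hypothetical helper name, not part of Munin.
# Usage: link_diskstat PLUGIN_SOURCE PLUGIN_DIR DEVICE...
link_diskstat() {
    src=$1
    dest=$2
    shift 2
    for dev in "$@"; do
        for graph in latency throughput iops; do
            # -s symbolic, -n do not follow an existing link, -f overwrite
            ln -snf "$src" "$dest/diskstat_${graph}_${dev}"
        done
    done
}

# Example call (run as root), using the paths and devices from this post:
# link_diskstat /usr/share/munin/plugins/diskstat_ /etc/munin/plugins sdm sdn
```

munin-node normally needs a restart before it picks up newly linked plugins.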

Here are the metrics you get from these plugins:

  • from diskstat_iops: Read I/O Ops/sec, Write I/O Ops/sec, Avg. Request Size, Avg. Read Request Size, Avg. Write Request Size
  • from diskstat_latency: Device Utilization, Avg. Device I/O Time, Avg. I/O Wait Time, Avg. Read I/O Wait Time, Avg. Write I/O Wait Time
  • from diskstat_throughput: Read Bytes, Write Bytes

My next step is to follow the advice of Mark Seger (the author of collectl) and graph the output of collectl in real time, so that the stats are displayed in fine-grained intervals of 5-10 seconds instead of the 5-minute averages that RRD-based tools offer.
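
To give an idea of what short-interval numbers look like, here is a rough sketch that does not use collectl at all: it compares two snapshots of /proc/diskstats, where Linux keeps the per-device counters, and prints read and write I/O ops per second for one device. The function name diskstats_iops, the temp file names and the 5-second interval are my own assumptions for illustration.

```shell
# Sketch: read/write IOPS for one device from two /proc/diskstats snapshots.
# diskstats_iops is a hypothetical helper name.
# Usage: diskstats_iops DEVICE BEFORE_FILE AFTER_FILE SECONDS
diskstats_iops() {
    awk -v dev="$1" -v secs="$4" '
        $3 != dev { next }                    # field 3 is the device name
        NR == FNR { r0 = $4; w0 = $8; next }  # first file: reads/writes completed
        {
            printf "%s read_iops=%.1f write_iops=%.1f\n",
                   dev, ($4 - r0) / secs, ($8 - w0) / secs
        }
    ' "$2" "$3"
}

# Example loop, sampling sdm every 5 seconds (values are illustrative):
# while :; do
#     cat /proc/diskstats > /tmp/ds.before
#     sleep 5
#     cat /proc/diskstats > /tmp/ds.after
#     diskstats_iops sdm /tmp/ds.before /tmp/ds.after 5
# done
```

A dedicated tool like collectl handles intervals, timestamps and many more counters, so this is only meant to show where the numbers come from.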

5 comments:

Anonymous said...

With stats like those, "fine-grained" should be 1 second or less. You can see a lot of interesting information about how operations collate.

I'd suggest looking at that stuff at 250ms intervals - neat stuff can be witnessed on a busy server, often resulting in obvious tuning to be had.

You can do that with Reconnoiter.

Grig Gheorghiu said...

postwait -- thanks! I gave reconnoiter a cursory try a couple of weeks ago but I was discouraged at the lack of tutorial-like documentation. I'll try again at some point though ;-)

Mark Seger said...

not a problem getting sub-second stats with collectl:

collectl -i.250 will give you 1/4 second stats. I've played with intervals in the 0.01 range. It just depends on which stats you're collecting and how heavily you want to beat up your system. Just doing disk stats shouldn't be a problem. And being able to graph them with colplot is an added bonus.

Grig Gheorghiu said...

Mark -- speaking of rendering collectl plot files with colplot, I haven't had a lot of success doing that from the cmdline :-( What's a good email to send you an example?

Mark Seger said...

Sorry Grig, I just saw your comment about getting colplot to work, as I wasn't notified there was a reply to this thread. You can always email me at mark.seger@hp.com or, better yet, send something to the collectl-utils mailing list on SourceForge or post something in the forum there.
-mark
