Friday, December 11, 2009

NetApp SNMP monitoring with Nagios

Here are some tips regarding the monitoring of NetApp filers with Nagios. First off, the Nagios Exchange includes many NetApp-specific monitoring scripts, all based on SNMP. I ended up using check_netapp3.pl, but I hit some roadblocks when it came to checking disk space on NetApp volumes (the SNMP queries were timing out in that case).

The check_netapp3.pl script works fine for things such as CPU load. For example, I created a new command called check_netapp_cpu in /usr/local/nagios/etc/objects/commands.cfg on my Nagios server:

define command {
  command_name check_netapp_cpu
  command_line $USER1$/check_netapp3.pl -H $HOSTADDRESS$ -C mycommunity -v CPULOAD -w 50 -c 80
}

However, for things such as the percentage of disk used on a given NetApp volume, I had to use good old SNMP checks directly against the NetApp. Any time you use SNMP, you need to know which OIDs to hit. In this case, the task is a bit easier because you can look inside the check_netapp3.pl script to see some examples of NetApp-specific OIDs. But let's assume you have no clue where to start. Here's a step-by-step procedure:

1) Find the NetApp MIB -- I found one online here.

2) Do an snmpwalk against the top-level OID, which in this case is 1.3.6.1.4.1.789. Save the output in a file.
Example: snmpwalk -v 1 -c mycommunity IP_OF_NETAPP_FILER 1.3.6.1.4.1.789 > myfiler.out

3) Search for a volume name that you know about in myfiler.out. I searched for /vol/vol0 and found this line:
SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/vol0/"
This will give you a clue as to the OID range that corresponds to volume information. If you search for "1.5.4.1.2" in the NetApp MIB, you'll see that it corresponds to dfTable.dfEntry.dfFileSys. So the entries 1.5.4.1.2.1 through 1.5.4.1.2.N will show the N file systems available on that particular filer.

4) I was interested in percentage of disk used on those volumes, so I found the variable dfPerCentKBytesCapacity in the MIB, corresponding to the OID 1.3.6.1.4.1.789.1.5.4.1.6. This means that for /vol/vol0 (which is the 5th entry in my file system table), I need to use 1.3.6.1.4.1.789.1.5.4.1.6.5 to get the percentage of disk used.
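
Steps 3 and 4 can be scripted once the snmpwalk output is saved. What follows is only a minimal Python sketch, assuming the output lines look like the dfFileSys line quoted in step 3. The helper names (volume_indexes, percent_used_oid) are made up for illustration, and the second sample line is invented to show that non-matching lines are skipped.

```python
import re

# OIDs taken from the post: dfFileSys and dfPerCentKBytesCapacity.
DF_FILESYS_SUFFIX = "789.1.5.4.1.2"
DF_PERCENT_USED_OID = "1.3.6.1.4.1.789.1.5.4.1.6"

# Matches lines such as:
#   SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/vol0/"
LINE_RE = re.compile(
    r"enterprises\." + re.escape(DF_FILESYS_SUFFIX) + r'\.(\d+) = STRING: "([^"]*)"'
)


def volume_indexes(walk_output):
    """Map each file system name found in the walk output to its table index."""
    indexes = {}
    for line in walk_output.splitlines():
        match = LINE_RE.search(line)
        if match:
            indexes[match.group(2)] = int(match.group(1))
    return indexes


def percent_used_oid(walk_output, volume):
    """Build the percent-used OID for one volume (hypothetical helper)."""
    index = volume_indexes(walk_output)[volume]
    return f"{DF_PERCENT_USED_OID}.{index}"


# First line is the one quoted in step 3; the second is invented sample data.
sample = (
    'SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/vol0/"\n'
    "SNMPv2-SMI::enterprises.789.1.5.4.1.6.5 = INTEGER: 42\n"
)
print(percent_used_oid(sample, "/vol/vol0/"))
```

In practice you would pass in the contents of myfiler.out instead of the sample string. Since volume indexes can change on a filer, rerunning something like this after volume changes is probably a good idea.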

So, to put all this detective work together, it's easy to create specific commands that query a particular filer for the percentage of disk used on a particular volume. Here's an example that uses the check_snmp Nagios plugin:

define command {
  command_name check_netapp_percent_diskused_myfiler_vol0
  command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C mycommunity -o .1.3.6.1.4.1.789.1.5.4.1.6.5 -w 75 -c 90
}

Then I defined a service corresponding to that filer, similar to this:

define service {
  use active-service
  host_name myfiler
  check_command check_netapp_percent_diskused_myfiler_vol0
  service_description PERCENT DISK USED VOL0
  is_volatile 0
  check_period 24x7
  max_check_attempts 3
  normal_check_interval 5
  retry_check_interval 1
  contact_groups admins
  notification_interval 1440
  notification_period 24x7
  notification_options w,c,r
}
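
Since each command hard-codes one filer, one volume and one table index, writing these definitions by hand gets tedious with many volumes. Here is a rough Python sketch of generating the command definition text. The function names and default arguments are my own illustration rather than part of Nagios, and index 5 is the one found for /vol/vol0 in step 3.

```python
# Percent-used OID from the post (dfPerCentKBytesCapacity), with leading dot
# as used in the check_snmp example.
PERCENT_USED_OID = ".1.3.6.1.4.1.789.1.5.4.1.6"


def command_name(filer, volume):
    """Derive a command name such as check_netapp_percent_diskused_myfiler_vol0."""
    label = volume.strip("/").split("/")[-1]  # "/vol/vol0/" -> "vol0"
    return f"check_netapp_percent_diskused_{filer}_{label}"


def render_command(filer, volume, index, warn=75, crit=90, community="mycommunity"):
    """Return a Nagios command definition for one volume (hypothetical helper)."""
    oid = f"{PERCENT_USED_OID}.{index}"
    return "\n".join([
        "define command {",
        f"  command_name {command_name(filer, volume)}",
        f"  command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C {community}"
        f" -o {oid} -w {warn} -c {crit}",
        "}",
    ])


print(render_command("myfiler", "/vol/vol0/", 5))
```

The output can be pasted into commands.cfg, and the service definition then refers to the generated command name in check_command.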

Hope this helps somebody out there!

7 comments:

Steve Francis said...

Nice post.
There are quite a few gotchas with monitoring NetApps via SNMP:
- NetApp changes the OIDs used to report the same thing between minor releases. (e.g. fans failed is .1.3.6.1.4.1.789.1.21.1.2.1.20 post OnTap 7.3.2, but .1.3.6.1.4.1.789.1.21.1.2.1.18 before it.)
- NetApps will renumber the indexes of volumes quite regularly, especially if you use SnapMirror.
- Some stuff (volume latency and operations per second) is simply not available via SNMP, only the API. And the units reported by the API also change with releases.
- If you are trying to show actual disk space per volume collected by SNMP, NetApp doesn't use 64-bit counters, but two 32-bit counters, requiring some careful math.

As an alternative to rolling your own, LogicMonitor's NetApp Monitoring can automate it all for you. Of course, the software is not free, as an open source system is, but the time saved is usually well worth it.

Nadeem said...

Thanks a lot,

It was very helpful.

Robin said...

Very interesting, thx!

The latest versions of the Network Appliance MIBs and traps.dat files are available online on the NetApp On the Web (NOW) site at

http://now.netapp.com/NOW/download/tools/mib/filer.shtml
(Login needed)

Ingo Lantschner said...

And if you are looking for an alternative to SNMP, have a look at the API from NetApp. It can be addressed directly from Perl and lets you write nice monitoring scripts, once you have learned the basics.

A whole suite of such plugins can be found here: www.netapp-monitoring.info/en. They are not freeware, but they are quite easy to integrate into Nagios, well tested and actively developed.

Anonymous said...

I wanted to know how you got the correct OID from your dfTable.dfEntry.dfFileSys? ... If I insert this OID 1.3.6.1.4.1.789.1.5.4.1.6, Nagios tells me "no validated data returned".
Please help me.

Amit said...

I am working on supporting NetApp SNMP monitor for our product. Your post really saved my day. Thanks for crafting all the useful info.

Anonymous said...

Thanks a lot!
It just works :)

cheers
Jan
