Tuesday, November 10, 2009

NFS troubleshooting with iostat and lsof

Scenario: you mount a volume exported from a NetApp on several Linux clients via NFS

Problem: you see constant high CPU usage on the NetApp, and some of the Linux clients become sluggish, primarily in terms of I/O

Troubleshooting steps:

1) If iostat is not already on the clients, install the sysstat utilities.

2) On each client mounting from the filer, or on a representative sample of the clients, run iostat with -n so that it shows NFS-related statistics. The following command will run iostat every 5 seconds and show NFS stats in a nicely tabulated output:

# iostat -nh 5

3) Note which client shows the most NFS operations per second, then identify the NFS volume on that client with the most NFS reads and/or writes per second.

4) At this point you have found the most likely culprit in terms of sending NFS traffic to the filer (there could be several client machines in this position, for example if they are part of a cluster).

5) If not already installed, download and install lsof.

6) Run lsof on the client(s) discovered in step 4, and grep for the directory representing the mount point of the NFS volume with the most reads and/or writes. For example:

# lsof | grep /var/log

This will show you, among other things, which processes are accessing which files under that directory. Usually something out of the ordinary will jump out at you. In my case, it was logrotate kicking off from a daily cron and compressing a huge log file -- since the log file was on a volume NFS-mounted from the filer, this caused the filer to do extra work, hence its increased CPU usage.
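
When the lsof output is long, it can help to condense it into a count of open files per process under the mount point. The following is only a rough sketch of that idea, not something from my original troubleshooting session: the helper name summarize_lsof is made up for illustration, and /var/log is just the example mount point from step 6. It assumes lsof's -F option, which prints one field per line (p for the PID, c for the command name, n for the file name).

```shell
#!/bin/sh
# Sketch: count open files per process under one mount point.
# Reads `lsof -F pcn` output on stdin (p<PID>, c<COMMAND>, n<NAME> lines).
# summarize_lsof is a hypothetical helper name, not part of lsof.
summarize_lsof() {
    awk -v mp="$1" '
        /^p/ { pid = substr($0, 2); next }   # start of a new process set
        /^c/ { cmd = substr($0, 2); next }   # command name for that PID
        /^n/ {
            name = substr($0, 2)
            # count only paths at or below the mount point
            if (name == mp || index(name, mp "/") == 1)
                count[cmd " " pid]++
        }
        END { for (k in count) print count[k], k }
    ' | sort -rn
}

# Typical use (the mount point is an example):
#   lsof -F pcn | summarize_lsof /var/log
```

Each output line has the form "count command PID", busiest process first, so a process holding an unusual number of files on the NFS volume should end up near the top.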

That's about it. Of course, these steps can be refined/modified/added to -- but even in this simple form, they can help you pinpoint NFS issues fairly quickly.


rick tait said...

What Linux distro are you using here?

Note that RHEL5 and CentOS5 do NOT have the -n option in their released versions of iostat.

Perhaps you're using ubuntu?

Grig Gheorghiu said...

Rick -- I was using Gentoo and I compiled the sysstat utilities from source.

Anonymous said...

lsof doesn't work for me, e.g. lsof -N outputs nothing. The server is using nfs-kernel-server and lsof gives the error:

lsof: WARNING: can't stat() tmpfs file system /dev/shm/var.run
Output information may be incomplete.
lsof: WARNING: can't stat() tmpfs file system /dev/shm/var.lock
Output information may be incomplete.

Might that be relevant?

I've not found any suggestion that nfs-kernel-server prevents lsof from working per se yet...
