Thursday, December 17, 2009

Deploying Tornado in production

We've been using Tornado at Evite very successfully for a while, for a subset of functionality on our web site. While the instructions in the official documentation make it easy to get started with Tornado and get an application up and running, they don't go very far in explaining the issues you face when deploying to a production environment -- things such as running the Tornado processes as daemons, logging, and automated deployments. Here are some ways we found to solve these issues.

Running Tornado processes as Unix daemons

When we started experimenting with Tornado, we ran the Tornado processes via nohup. It did the job, but it was neither solid nor elegant. I ended up using the grizzled Python library, specifically the os module, which encapsulates best practices in Unix system programming. This module offers a function called daemonize, which converts the calling process into a Unix daemon. The function's docstring says it all:

Convert the calling process into a daemon. To make the current Python
process into a daemon process, you need two lines of code:

from grizzled.os import daemon
daemon.daemonize()

I combined this with the standard library's os.execv, which replaces the current (already daemonized) process with another program -- in my case, the Tornado process.

I'll show some code in a second, but first I also want to mention...

Logging

We use the standard Python logging module and send the log output to stdout/stderr. However, we wanted to also rotate the log files using the rotatelogs utility, so we looked for a way to pipe stdout/stderr to the rotatelogs binary, while also daemonizing the Tornado process.

Here's what I came up with (the stdout/stderr redirection was inspired by this note on StackOverflow):


import os
import sys
from socket import gethostname
from grizzled.os import daemonize
PYTHON_BINARY = "python2.6"
PATH_TO_PYTHON_BINARY = "/usr/bin/%s" % PYTHON_BINARY
ROTATELOGS_CMD = "/usr/sbin/rotatelogs"
LOGDIR = "/opt/tornado/logs"
LOGDURATION = 86400

logdir = LOGDIR
logger = ROTATELOGS_CMD
hostname = gethostname()
# service is the name of the Python module pointing to your Tornado web server
service = "myapp.web"  # example value; replace with your own module
execve_args = [PYTHON_BINARY, "-m", service]
logfile = "%s_%s_log.%%Y-%%m-%%d" % (service, hostname)
pidfile = "%s/%s.pid" % (logdir, service)
logpipe = "%s %s/%s %d" % (logger, logdir, logfile, LOGDURATION)
execve_path = PATH_TO_PYTHON_BINARY

# open the pipe to ROTATELOGS
so = se = os.popen(logpipe, 'w')

# re-open stdout without buffering
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

# redirect stdout and stderr to the log file opened above
os.dup2(so.fileno(), sys.stdout.fileno())
os.dup2(se.fileno(), sys.stderr.fileno())

# daemonize the calling process and replace it with the Tornado process
daemonize(no_close=True, pidfile=pidfile)
os.execv(execve_path, execve_args)


The net result of all this is that our Tornado processes run as daemons, and the logging output is captured in files managed by the rotatelogs utility. Note that it is easy to switch out rotatelogs and use scribe instead (which is something we'll do very soon).

Automated deployments

I already wrote about our use of Fabric for automated deployments. We use Fabric to deploy Python eggs containing the Tornado code to each production server in turn. Each server runs N Tornado processes, and we run nginx to load balance between all M x N tornados (M servers with N processes each).
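For context, on the nginx side the load balancing is just an upstream block that lists every Tornado backend. A minimal sketch (host names, ports and file layout are made up for illustration, not our actual configuration) looks something like this:

# /etc/nginx/conf.d/upstream_tornado.conf (hypothetical)
upstream tornados {
    server web1:8001;
    server web1:8002;
    server web2:8001;
    server web2:8002;
}

server {
    listen 80;
    location / {
        proxy_pass http://tornados;
    }
}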

Here's how easy it is to run our Fabric-based deployment scripts:

fab -f fab_myapp.py nginx disable_tornado_in_lb:web1
fab -f fab_myapp.py web1 deploy
# run automated tests, then:
fab -f fab_myapp.py nginx enable_tornado_in_lb:web1

We first disable web1 in the nginx configuration file (as detailed here), then we deploy the egg to web1, then we run a battery of tests against web1 to make sure things look good, and finally we re-enable web1 in nginx. Rinse and repeat for all the other production web servers.
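For reference, here is a minimal sketch of what a fabfile like fab_myapp.py might look like. This is not our actual deployment script -- the host names, file paths, egg name and the sed-based enable/disable trick are all assumptions -- but it shows the overall shape of the nginx and web1 tasks used above (old-style Fabric, where a task such as nginx simply sets env.hosts for the tasks that follow it):

# fab_myapp.py -- hypothetical sketch, not the real deployment script
from fabric.api import env, sudo, put

UPSTREAM_CONF = "/etc/nginx/conf.d/upstream_tornado.conf"  # assumed path

def nginx():
    # run the following tasks on the nginx load balancer
    env.hosts = ["nginx.example.com"]

def web1():
    # run the following tasks on the first web server
    env.hosts = ["web1.example.com"]

def disable_tornado_in_lb(host):
    # comment out every 'server <host>:<port>;' line in the upstream block,
    # then reload nginx so it stops sending traffic to that server
    sudo("sed -i 's/^\\( *server %s:\\)/#\\1/' %s" % (host, UPSTREAM_CONF))
    sudo("/etc/init.d/nginx reload")

def enable_tornado_in_lb(host):
    # uncomment those lines again and reload nginx
    sudo("sed -i 's/^#\\( *server %s:\\)/\\1/' %s" % (host, UPSTREAM_CONF))
    sudo("/etc/init.d/nginx reload")

def deploy():
    # push the new egg, install it, and restart the Tornado daemons
    put("dist/myapp-1.0-py2.6.egg", "/tmp/myapp.egg")
    sudo("easy_install /tmp/myapp.egg")
    sudo("/etc/init.d/tornado-myapp restart")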

Friday, December 11, 2009

NetApp SNMP monitoring with Nagios

Here are some tips regarding the monitoring of NetApp filers with Nagios. First off, the Nagios Exchange includes many NetApp-specific monitoring scripts, all based on SNMP. I ended up using check_netapp3.pl, but I hit some roadblocks when it came to checking disk space on NetApp volumes (the SNMP queries were timing out in that case).

The check_netapp3.pl script works fine for things such as CPU load. For example, I created a new command called check_netapp_cpu in /usr/local/nagios/etc/objects/commands.cfg on my Nagios server:

define command {
  command_name check_netapp_cpu
  command_line $USER1$/check_netapp3.pl -H $HOSTADDRESS$ -C mycommunity -v CPULOAD -w 50 -c 80
}
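Before wiring a command like this into a service definition, it can be useful to run the plugin by hand from the Nagios server to verify SNMP connectivity (the path below assumes the usual plugin directory that $USER1$ points to; adjust it for your install):

/usr/local/nagios/libexec/check_netapp3.pl -H IP_OF_NETAPP_FILER -C mycommunity -v CPULOAD -w 50 -c 80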

However, for things such as the percentage of disk used on a given NetApp volume, I had to use good old SNMP checks directly against the NetApp. Any time you use SNMP, you need to know which OIDs to hit. In this case, the task is a bit easier because you can look inside the check_netapp3.pl script to see some examples of NetApp-specific OIDs. But let's assume you have no clue where to start. Here's a step-by-step procedure:

1) Find the NetApp MIB -- I found one online here.

2) Do an snmpwalk against the top-level OID, which in this case is 1.3.6.1.4.1.789. Save the output in a file.
Example: snmpwalk -v 1 -c mycommunity IP_OF_NETAPP_FILER 1.3.6.1.4.1.789 > myfiler.out

3) Search for a volume name that you know about in myfiler.out. I searched for /vol/vol0 and found this line:
SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/vol0/"
This will give you a clue as to the OID range that corresponds to volume information. If you search for "1.5.4.1.2" in the NetApp MIB, you'll see that it corresponds to dfTable.dfEntry.dfFileSys. So the entries
1.5.4.1.2.1 through 1.5.4.1.2.N will show the N file systems available on that particular filer.

4) I was interested in percentage of disk used on those volumes, so I found the variable dfPerCentKBytesCapacity in the MIB, corresponding to the OID 1.3.6.1.4.1.789.1.5.4.1.6. This means that for /vol/vol0 (which is the 5th entry in my file system table), I need to use 1.3.6.1.4.1.789.1.5.4.1.6.5 to get the percentage of disk used.
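A quick sanity check before putting that OID into Nagios is to query it directly with snmpget (same community string and filer IP as in the snmpwalk above); it should return the current percentage used for /vol/vol0:
Example: snmpget -v 1 -c mycommunity IP_OF_NETAPP_FILER 1.3.6.1.4.1.789.1.5.4.1.6.5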

So, to put all this detective work together, it's easy to create commands that query a particular filer for the percentage of disk used on a particular volume. Here's an example that uses the check_snmp Nagios plugin:

define command {
  command_name check_netapp_percent_diskused_myfiler_vol0
  command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C mycommunity -o .1.3.6.1.4.1.789.1.5.4.1.6.5 -w 75 -c 90
}

Then I defined a service corresponding to that filer, similar to this:

define service {
  use active-service
  host_name myfiler
  check_command check_netapp_percent_diskused_myfiler_vol0
  service_description PERCENT DISK USED VOL0
  is_volatile 0
  check_period 24x7
  max_check_attempts 3
  normal_check_interval 5
  retry_check_interval 1
  contact_groups admins
  notification_interval 1440
  notification_period 24x7
  notification_options w,c,r
}

Hope this helps somebody out there!
