.comment-link {margin-left:.6em;}

Agile Testing

Thursday, February 04, 2010

Web site monitoring techniques and tools

When monitoring a Web site, you need to look at it both from a 'micro' perspective (i.e. are the individual devices and servers in your infrastructure running smoothly?) and from a 'macro' perspective (i.e. do your customers have a pleasant experience when accessing and using your site?; can they use your site's functionality to conduct their business online?). You can also think about these types of monitoring as 'engineering' monitoring vs. 'business' monitoring. They're both important, but most system administrators obviously focus on the engineering/micro-type monitoring, since it's closer to their skills and interests, and tend to not put much emphasis on the business/macro-type monitoring.

In this post I'll talk about various types of monitoring we're doing for the Evite Web site. I'll focus on what I call 'deep application monitoring', which to me means gathering and displaying metrics about your Web application's behavior as seen by the users.

System infrastructure monitoring

This is the 'micro' or 'engineering' type of monitoring that is the bread and butter of a sysops team. I need to eat crow here and say that there is a great open source monitoring tool out there, and its name is Nagios. That's what we use to monitor the health of all our internal servers and network devices, and to alert us via email and SMS if anything goes wrong. It works great, it's easy to set up and deploy automatically, it has tons of plugins to monitor pretty much everything under the sun. My mistake in the past was expecting it to also be a resource graphing tool, but that was simply asking too much from one single tool.

For resource graphing and visualization we use a combination of cacti, ganglia and munin. They're all good, but going forward I'll use munin because it's easy to set up and automate, and also allows you to group devices together and see the same metric (for example system load) across the devices in the group. This makes it easy to spot trends and outliers.

One non-typical aspect of our use of Nagios is that we run passive checks from the servers being monitored to the main Nagios server using NSCA. This avoids the situation where the central Nagios server becomes a bottleneck because it needs to poll all the devices it monitors either via ssh or via NRPE. In our setup, each plugin deployed on a monitored server pushes its own notifications to the central server using send_nsca. It's also fairly easy to write your own Nagios plugins.

What kinds of things do we monitor and graph? The usual suspects -- system load, CPU, memory, disk, processes -- but also things like number of database connections on each Web/app server, number of Apache and Tomcat/Java processes AND threads, size of the mail queue on mail servers, MySQL queries/second, threads, throughput...and many others.

In general, the more metrics you capture and graph, the easier it is to spot trends and correlations. Because that's what you need to do during normal day-to-day traffic on your Web site -- you need to establish baseline numbers which will tell you right away if something goes wrong, for example when you're being hit with more traffic than usual. Having a baseline makes it easy to spot spikes and flare-ups in the metrics you capture. Noticing correlations between these metrics makes it easier for you to troubleshoot things like site slowness or unresponsiveness.

For example, if you see that the Web servers aren't being pounded with more traffic than usual, yet the number of database connections on each Web server keeps going up, you can quickly look at the database metrics and see that indeed the database is slower than usual. It could be that there's a disk I/O problem, or maybe the database is working hard to reindex a large table...but in any case your monitoring and graphing systems should be your friends in diagnosing the problem.

I compare system infrastructure monitoring with unit testing in a development project. Just like when you develop you write a unit test every time you find a bug, with monitoring you add another metric or resource to be monitored and graphed every time you discover a problem with your Web site. Both system monitoring and unit testing work at the micro/engineering level.

By the way, one of the best ways to learn a new system architecture is to delve into the monitoring system, or to roll one out if there's none. This will make you deeply familiar with every server and device in your infrastructure.

External HTTP monitoring

This type of monitoring typically involves setting up your own monitoring server outside of your site's infrastructure, or using a 3rd party service such as Pingdom, Keynote or Gomez. We've been using Pingdom and have been very happy with it. While it doesn't provide the in-depth analysis that Keynote or Gomez do, it costs much less and still gives you a great window into the performance and uptime of your site, as seen by an external user.

In any such system, you choose certain pages within your Web application and you have the external monitor check for a specific character string within the page, to make sure the expected content is there, as opposed to a cryptic error message.

However, with Pingdom you can go deeper than that. They provide 'custom HTTP checks' which allow you to deploy an internal monitor reachable from the outside via HTTP, which checks whatever resource you need within your internal infrastructure, and sends some well-formatted XML back to Pingdom. This is useful when you have a farm of web/app servers behind a load balancer, and you want to check each one but without exposing an external IP address for each server.  Or you can have processes running on various port numbers on your web/app servers, and you don't want to expose those directly to Pingdom checks. Instead, you need to open port 80 to just one server, the one running the custom checks.

We've built a RESTful Web service based on restish for this purpose. We call it from Pingdom with certain parameters in the query string, and based on those we run various checks such as 'is our Web application app1 running on web servers A, B, C and D functioning properly?'. If the application doesn't respond as expected on server C, we send details about the error in the XML payload back to Pingdom and we get alerted on it.

We also use Pingdom for monitoring cloud-based resources that we interact with, for example Amazon S3 or SQS. For SQS, we have another custom HTTP check which goes against our restish-based Web service, which then connects to SQS, does some operations and reports back the elapsed time. We want to know if that time/latency exceeds certain thresholds. While we could have done that from our internal Nagios monitoring system, it would have required writing a Nagios plugin, while our restish application already knew how to send the properly formatted XML payload back to Pingdom.

This type of monitoring doesn't go very deep into checking the business logic rules of your application. It is however a 'macro'-type monitoring technique that should be part of your monitoring strategy because it mimics the user experience of your site's customers. I compare it to doing functional testing for your application. You don't test things end-to-end, but you choose some specific portion of your functionality (a specific Web page that is being hit by many users for example) and you make sure that your application provides indeed that functionality to its users in a timely manner.

Also, if you use a CDN provider, then they probably offer you a way to see statistics about your HTTP traffic using some sort of Web dashboard. Most providers also give you a breakdown of HTTP codes returned over some period of time. Again trending is your friend here. If you notice that the percentage of non-200 HTTP codes suddenly increases, you know something is not going well somewhere in your infrastructure. Many CDN providers offer a Web service API for you to query and get these stats, so you can integrate that functionality into your monitoring and alerting system.

Business intelligence monitoring

We've come to the 'deep application monitoring' section of this post. Business intelligence (BI) deals with metrics pertaining to the business of your Web site. Think of all the metrics you can get with Google Analytics -- and yes, Google Analytics in itself is a powerful monitoring tool.

BI is very much concerned with trends. Everybody wants to see traffic and number of users to the site going up and up. Your site may have very specific metrics that you want to keep track of. Let's say you want users to register and fill in details about themselves, upload a photo, etc. You want to keep track of how many user profiles are created per day, and you want to keep track of the rate of change from day to day, or even from hour to hour. It would be nice if you could spot a drop in that rate right away, and then identify the root cause of the problem. Well, we do this here at Evite with a combination of tools, the main one being a non-open-source non-free tool called SAM (for Simple Application Monitoring) written by Randy Wigginton.

In a nutshell, SAM uses a modified memcached server to capture URLs and SQL queries as they are processed by your application, and aggregates them by ignoring query strings in URLs and values in SQL statements. It saves their count, average/min/max time of completion and failure count in a database, then displays stats and graphs in a simple Web application that queries that database. The beauty of this approach is that you can see metrics related to specific portions of your application aggregated across all users hitting that functionality. So if a user profile is saved via a POST to a URL with a specific query string and HTTP payload, SAM will aggregate all such requests and will allow you to see exactly how many users have saved their profiles in the last hour, how many failures you had, how long it took, etc. What's more, you can query the database directly and take action based on certain thresholds (for example if the failure rate for a specific action, e.g. a POST to a specific URL, has surpassed a given threshold). We're doing exactly this within our restish-based Web service monitoring tool.

Note that this technique based on aggregation can be implemented using other approaches, for example by mining your Web server logs, or database logs. The nice thing about SAM is that it captures all this information in real time. As far as your application is concerned, it only interacts with memcached using standard memcache client calls, and it doesn't even bother to check the result of the memcache push.

Another advantage of having such a macro/business-type of monitoring and graphing in place is that your business stakeholders (including your CEO) can see these metrics and trends almost in real time, and can alert you (the ops team) if something seems out of whack. This is almost guaranteed to correlate with something going amiss in your infrastructure. Once you figure out what that is, see if your internal system monitoring system caught it or not. If it didn't, it's a great opportunity for you to add it in so as to know about it next time before your CEO does. Trust me, these things can happen, I speak from experience.

End-to-end application monitoring

Companies such as BrowserMob and Sauce Labs allow you to run Selenium scripts against your Web application using clients hosted by them 'in the cloud'. With BrowserMob, you can also use these scripts to monitor a given path through your application. This is obviously the monitoring equivalent of an end-to-end application test. If you have an e-commerce site for example, you can test and monitor your entire check out process this way. I am still scratching the surface when it comes to this type of monitoring, but it's high on my TODO list.

So there you have it. Similar to having a sound application testing strategy, your best bet is to implement all these types of monitoring, or as many of them as your time and budget allow. If you do, you can truly say you have a holistic monitoring strategy in place. And remember, the more dashboards you have into the health of your application and your systems, the better off you are. It's an uncontested rule that the one with the most dashboards wins.

For further reading and more dashboard ideas, see these resources:

Labels:

Thursday, January 28, 2010

Bug in tornado-0.2

If you're using the latest packaged tornado distribution, tornado-0.2.tar.gz, then you might hit on the following bug when using AsyncHTTPClient:

ERROR:root:Exception in callback >
Traceback (most recent call last):
  File "/usr/local/lib/python2.6/dist-packages/tornado/ioloop.py", line 238, in _run_callback
    callback()
  File "/usr/local/lib/python2.6/dist-packages/tornado/httpclient.py", line 214, in _perform
    self.io_loop.remove_handler(fd)
  File "/usr/local/lib/python2.6/dist-packages/tornado/ioloop.py", line 133, in remove_handler
    self._impl.unregister(fd)
IOError: [Errno 2] No such file or directory


For me, this was in the context of using a load testing script written by my colleague Dan Mesh. The script makes use of AsyncHTTPClient and runs several such objects in parallel, hitting a number of URLs. When the number of concurrent HTTP requests reaches around 100, I start seeing this error.

Solution? Install tornado from the latest source code on GitHub.

Friday, January 15, 2010

Twittering

I've been holding back from using Twitter, which I generally still consider a waste of time. However, Bryan Landers convinced me that it's a good idea to tweet occasionally, for example when I post to my blog. Apparently many people (Bryan included) are past using RSS feed readers and other such antiquated tools that used to be popular in the first decade of this millennium. Also, there have been discussions within the SoCal Piggies group about increasing the communication and collaboration within the group, and with the Python community at large.

So...I created 2 twitter accounts, one for the SoCal Piggies (@socalpiggies) and one for myself (@griggheo). I'll generally post links to stuff I come across that might be of interest to people who share my interests.

Friday, January 08, 2010

Ian Bicking does it again with toppcloud

I haven't seen this on Planet Python yet, so I'll give it a shout out in the hope that more people will benefit from it. Via Ben Bangert, I heard yesterday about Ian Bicking's new brainchild, toppcloud -- a tool for automated deployments of cloud instances AND Python web apps. It's based on libcloud, and it currently only supports Rackspace's cloud, but it seems very promising already. The Python deployment seems to be inspired by Google AppEngine's deployment philosophy. I haven't played with it yet, but I'll do so soon and report back here. I've been using Fabric (and Puppet) successfully for automated application deployments, but toppcloud seems to be a very valuable tool to add to my arsenal.

Thursday, December 17, 2009

Deploying Tornado in production

We've been using Tornado at Evite very successfully for a while, for a subset of functionality on our web site. While the instructions in the official documentation make it easy to get started with Tornado and get an application up and running, they don't go a long way towards explaining issues that you face when deploying to a production environment -- things such as running the Tornado processes as daemons, logging, automated deployments. Here are some ways we found to solve these issues.

Running Tornado processes as Unix daemons

When we started experimenting with Tornado, we ran the Tornado processes via nohup. It did the job, but it was neither solid nor elegant. I ended up using the grizzled Python library, specifically the os module, which encapsulates best practices in Unix system programming. This module offers a function called daemonize, which converts the calling process into a Unix daemon. The function's docstring says it all:

Convert the calling process into a daemon. To make the current Python
process into a daemon process, you need two lines of code:

from grizzled.os import daemon
daemon.daemonize()

I combined this with standard library's os.execve, which replaces the current (already daemonized) process with another program -- in my case with the Tornado process.

I'll show some code in a second, but first I also want to mention...

Logging

We use the standard Python logging module and send the log output to stdout/stderr. However, we wanted to also rotate the log files using the rotatelogs utility, so we looked for a way to pipe stdout/stderr to the rotatelogs binary, while also daemonizing the Tornado process.

Here's what I came up with (the stdout/stderr redirection was inspired by this note on StackOverflow):


from socket import gethostname
from grizzled.os import daemonize
PYTHON_BINARY = "python2.6"
PATH_TO_PYTHON_BINARY = "/usr/bin/%s" % PYTHON_BINARY
ROTATELOGS_CMD = "/usr/sbin/rotatelogs"
LOGDIR = "/opt/tornado/logs"
LOGDURATION = 86400

logdir = LOGDIR
logger = ROTATELOGS_CMD
hostname = gethostname()
# service is the name of the Python module pointing to your Tornado web server
# for example myapp.web
execve_args = [PYTHON_BINARY, "-m", service]
logfile = "%s_%s_log.%%Y-%%m-%%d" % (service, hostname)
pidfile = "%s/%s.pid" % (logdir, service)
logpipe ="%s %s/%s %d" % (logger, logdir, logfile, LOGDURATION)
execve_path = PATH_TO_PYTHON_BINARY

# open the pipe to ROTATELOGS
so = se = os.popen(logpipe, 'w')

# re-open stdout without buffering
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)

# redirect stdout and stderr to the log file opened above
os.dup2(so.fileno(), sys.stdout.fileno())
os.dup2(se.fileno(), sys.stderr.fileno())

# daemonize the calling process and replace it with the Tornado process
daemonize(no_close=True, pidfile=pidfile)
os.execv(execve_path, execve_args)


The net result of all this is that our Tornado processes run as daemons, and the logging output is captured in files managed by the rotatelogs utility. Note that it is easy to switch out rotatelogs and use scribe instead (which is something we'll do very soon).

Automated deployments

I already wrote about our use of Fabric for automated deployments. We use Fabric to deploy Python eggs containing the Tornado code to each production server in turn. Each server runs N Tornado processes, and we run nginx to load balance between all M x N tornados (M servers with N processes each).

Here's how easy it is to run our Fabric-based deployment scripts:

fab -f fab_myapp.py nginx disable_tornado_in_lb:web1
fab -f fab_myapp.py web1 deploy
# run automated tests, then:
fab -f fab_myapp.py nginx enable_tornado_in_lb:web1

We first disable web1 in the nginx configuration file (as detailed here), then we deploy the egg to web1, then we run a battery of tests against web1 to make sure things look good, and finally we re-enable web1 in nginx. Rinse and repeat for all the other production web servers.

Friday, December 11, 2009

NetApp SNMP monitoring with Nagios

Here are some tips regarding the monitoring of NetApp filers with Nagios. First off, the Nagios Exchange includes many NetApp-specific monitoring scripts, all based on SNMP. I ended up using check_netapp3.pl, but I hit some roadblocks when it came to checking disk space on NetApp volumes (the SNMP queries were timing out in that case).

The check_netapp3.pl script works fine for things such as CPU load. For example, I created a new command called check_netapp_cpu in /usr/local/nagios/etc/objects/commands.cfg on my Nagios server:

define command {
command_name check_netapp_cpu
command_line $USER1$/check_netapp3.pl -H $HOSTADDRESS$ -C mycommunity -v CPULOAD -w 50 -c 80
}

However, for things such as percent of disk used for a given NetApp volume, I had to use good old SNMP checks directly against the NetApp. Any time you use SNMP, you need to know which OIDs to hit. In this case, the task is a bit easier because you can look inside the check_netapp3.pl script to see some example of NetApp-specific OIDs. But let's assume you have no clue where to start. Here's a step-by-step procedure:

1) Find the NetApp MIB -- I found one online here.

2) Do an snmpwalk against the top-level OID, which in this case is 1.3.6.1.4.1.789. Save the output in a file.
Example: snmpwalk -v 1 -c mycommunity IP_OF_NETAPP_FILER 1.3.6.1.4.1.789 > myfiler.out

3) Search for a volume name that you know about in myfiler.out. I searched for /vol/vol0 and found this line:
SNMPv2-SMI::enterprises.789.1.5.4.1.2.5 = STRING: "/vol/vol0/"
This will give you a clue as to the OID range that corresponds to volume information. If you search for "1.5.4.1.2" in the NetApp MIB, you'll see that it corresponds to dfTable.dfEntry.dfFileSys. So the entries
1.5.4.1.2.1 through 1.5.4.1.2.N will show the N file systems available on that particular filer.

4) I was interested in percentage of disk used on those volumes, so I found the variable dfPerCentKBytesCapacity in the MIB, corresponding to the OID 1.3.6.1.4.1.789.1.5.4.1.6. This means that for /vol/vol0 (which is the 5th entry in my file system table), I need to use 1.3.6.1.4.1.789.1.5.4.1.6.5 to get the percentage of disk used.

So, to put all this detective work together, it's easy to create specific commands that query a particular filer for the percentage disk used for a particular volume. Here's an example that uses the check_snmp Nagios plugin:

define command {
  command_name check_netapp_percent_diskused_myfiler_vol0
  command_line $USER1$/check_snmp -H $HOSTADDRESS$ -C mycommunity -o .1.3.6.1.4.1.789.1.5.4.1.6.3 -w 75 -c 90
}

Then I defined a service corresponding to that filer, similar to this:

define service{
use active-service
host_name myfiler
check_command check_netapp_percent_diskused_myfiler_vol0
service_description PERCENT DISK USED VOL0
is_volatile 0
check_period 24x7
max_check_attempts 3
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_interval 1440
notification_period 24x7
notification_options w,c,r
}

Hope this helps somebody out there!

Monday, November 23, 2009

Compiling Python 2.6 with sqlite3 support

Quick note to self, hopefully useful to others too:

If you compile Python 2.6 (or 2.5) from source, and you want to enable sqlite3 support (which is included in the stdlib for 2.5 and above), then you need to pass a special USE flag to the configuration command line, like this:

./configure USE="sqlite"

(note "sqlite" and not "sqlite3")

Thursday, November 19, 2009

5 years of blogging

Today marks the 5th anniversary of my blog. It's been a fun and rewarding experience, and I hope to never run out of interesting topics to post about ;-)

As a sort of retrospective, I was curious to see which of my blog posts have been getting the most traffic. Here's the top 10 over the last 9 months, according to Google Analytics:

1. Performance vs. load vs. stress testing (as an aside, I think this has been wildly popular because I inadvertently hit on a lot of keywords in the title)
2. Experiences deploying a large-scale infrastructure in Amazon EC2
3. Ajax testing with Selenium using waitForCondition
4. Useful tools for writing Selenium tests
5. Load balancing in EC2 with HAProxy
6. Python unit testing part 1: the unittest module
7. HTTP performance testing with httperf, autobench and openload
8. Running a Python script as a Windows service
9. Apache virtual hosting with Tomcat and mod_jk
10. Configuring Apache 2 and Tomcat 5.5 with mod_jk

It's interesting that 2 of the top 5 posts are Selenium-related. I think Selenium documentation is not where it needs to be generally speaking, hence people find my old posts on this topic. Adam, you really need to write a Selenium RC book!

Tuesday, November 17, 2009

Monitoring multiple MySQL instances with Munin

I've been using Munin for its resource graphing capabilities. I especially like the fact that you can group servers together and watch a common metric (let's say system load) across all servers in a group -- something that is hard to achieve with other similar tools such as Cacti and Ganglia.

I did have the need to monitor multiple MySQL instances running on the same server. I am using mysql-sandbox to launch and manage these instances. I haven't found any pointers on how to use Munin to monitor several MySQL instances, so I rolled my own solution.

First of all, here is my scenario:
  • server running Ubuntu 9.04 64-bit
  • N+1 MySQL instances installed as sandboxes rooted in /mysql/m0, /mysql/m1,..., /mysql/mN
  • munin-node package version 1.2.6-8ubuntu3 (installed via 'apt-get install munin-node')
Step 1

Locate mysql_* plugins already installed by the munin-node package in /usr/share/munin/plugins. I have 5 such plugins: mysql_bytes, mysql_isam_space_, mysql_queries, mysql_slowqueries and mysql_threads. I don't use ISAM, so I am ignoring mysql_isam_space_.


Step 2

Make a copy of each plugin for each MySQL instance you want to monitor. I know this contradicts the DRY principle, but I just wanted something quick that worked. The alternative is to modify the plugins and add extra parameters so they refer to specific MySQL instances.

For example, I made N + 1 copies of mysql_bytes and called them mysql_m0_bytes, mysql_m1_bytes,..., mysql_mN_bytes. In each copy, I modified the line "echo 'graph_title MySQL throughput'" to say "echo 'graph_title MySQL throughput for mN'". I did the same for mysql_threads, mysql_queries and mysql_slowqueries. So at the end of this step I have 4 x (N+1) new plugins in /usr/share/munin/plugins.


As I said, the alternative is to modify for example mysql_bytes and add new parameters, e.g. a parameter for the title of the graph. However, I don't know exactly how the plugin is called from within Munin, and I don't want to fiddle with the number and order of parameters it's called with -- which is why I chose the easy way out.

Step 3

Create symlinks in /etc/munin/plugins to the newly created plugins. Example:

ln -s /usr/share/munin/plugins/mysql_m0_bytes /etc/munin/plugins/mysql_m0_bytes

(and similar for all the other plugins).


Step 4

Specify the path to msyqladmin for the newly defined plugins. You do this by editing the plugin configuration file /etc/munin/plugin-conf.d/munin-node.

Here's what I have in this file related to MySQL:

[mysql_m0*]
user m0
env.mysqladmin /mysql/m0/my sqladmin

[mysql_m1*]
user m1
env.mysqladmin /mysql/m1/my sqladmin

[mysql_m2*]
user m2
env.mysqladmin /mysql/m2/my sqladmin

[mysql_m3*]
user m3
env.mysqladmin /mysql/m3/my sqladmin

What the above lines say is that for each class of plugins starting with mysql_mN, I want to use the mysqladmin utility for that particular MySQL instance. The way mysql-sandbox works, mysqladmin is actually available per instance as "/mysql/mN/my sqladmin".

Note that the naming convention is important. The syntax of the munin-node plugin configuration file says that the plugin name "May include one wildcard ('*') at the start or end of the plugin-name, but not both, and not in the middle." Trust me, I haven't read this fine print initially, and I named my new plugins something like mysql_bytes_mN, then tried to configure the plugins as mysql_*mN. Pulling hair time ensued.

Step 5

Restart munin-node via 'service munin-node restart'. At this point you're supposed to see the new graphs under the Mysql link corresponding to the munin node where you set all this up. You should see N+1 graphs for each type of plugin (mysql_bytes, mysql_threads, mysql_queries and mysql_slowqueries). The graphs can be easily differentiated by their titles, e.g. 'MySQL throughput for m0' or 'MySQL queries for m1', etc.

One other quick tip: if you want to easily group nodes together, come up with some domain name which doesn't need to correspond to a real DNS domain name. For example, I called my MySQL servers something like mysqlN.myproject.mydomain.com in /etc/munin/munin.conf on the Munin server side. This allows me to see the myproject.mydomain.com group at a glance, with all the metrics for the nodes in that group shown side by side.

Here's how I defined each node in munin.conf:

[mysqlN.myproject.mydomain.com]
 address 192.168.0.N
 use_node_name yes

(where N is 1, 2, etc)

Behaviour Driven Infrastructure

I just read a post by Matthew Flanagan on Behaviour Driven Infrastructure or BDI, a concept that apparently originates with Martin Englund's post on this topic. The idea is that you describe what you need your system to do in natural language, using for example a tool such as Cucumber. What's more, you can then use the cucumber-nagios plugin to express the desired behaviour of the new system as a series of Nagios checks. The checks will initiall fail (just like in a TDD or BDD development cycle), but you will make them pass by deploying the appropriate packages and applications to the system.

I also expressed the need for automated testing of production deployments in one of my blog posts. However, BDI goes one step further, by describing a test plan for production deployments in natural language. Pretty cool, and again I can only wish that the Python testing tools kept up with Ruby-based tools such as Cucumber and friends....

Friday, November 13, 2009

Great series of posts on Tokyo Tyrant

Matt Yonkovit has started a series of posts on Tokyo Tyrant at Percona's MySQL Performance Blog. Great in-depth analysis of the reliability and performance of TT.

Part 1: Tokyo Tyrant -- is it durable?
Part 2: Tokyo Tyrant -- the performance wall
Part 3: Tokyo Tyrant -- write bottleneck

(parts 4 and 5, about replication and scaling, are hopefully coming soon)

Tuesday, November 10, 2009

Google using buildbot for Chromium continuous integration

Via Ben Bangert, this gem of a page showing the continuous integration status for the Chromium project at Google. It's cool to see that they're using buildbot. But just like Ben says -- I wish they open sourced the look and feel of that buildbot status page ;-)

NFS troubleshooting with iostat and lsof

Scenario: you mount a volume exported from a NetApp on several Linux clients via NFS

Problem: you see constant high CPU usage on the NetApp, and some of the Linux clients become sluggish, primarily in terms of I/O

Troubleshooting steps:

1) If iostat is not already on the clients, install the sysstat utilities.

2) On each client mounting from the filer, or on a representative sample of the clients, run iostat with -n so that it shows NFS-related statistics. The following
command will run iostat every 5 seconds and show NFS stats in a nicely tabulated output:

# iostat -nh 5

3) Notice which client exhibits the most NFS operations per second, and correlate it with the NFS volume on that client which shows the most NFS reads and/or writes per second.

At this point you found the most likely culprit in terms of sending NFS traffic to the filer (there could be several client machines in this position, for example if they are part of a cluster).

5) If not already installed, download and install lsof.

6) Run lsof on the client(s) discovered in step 4, and grep for the directory representing the mount point of the NFS volume with the most reads and/or writes. For example:
 
# lsof | grep /var/log

This will show you, among other things, which processes are accessing which files under that directory. Usually something will jump out at you in terms of things that are going on outside of the ordinary. In my case, it was logrotate kicking off from a daily cron and compressing a huge log file -- since the log file was on a volume NFS-mounted from the filer, this caused the filer to do extra work, hence its increased CPU usage.

That's about it. Of courser these steps can be refined/modified/added to -- but even in this simple form, they can help you pinpoint NFS issues fairly quickly.

Thursday, November 05, 2009

Automated deployments with Puppet and Fabric

I've been looking into various configuration management/automated deployment tools lately. At OpenX we used slack, but I wanted something with a bit more functionality than that (although I'm not badmouthing slack by any means -- it can definitely be bent to your will to do pretty much whatever you need in terms of automating your deployments).

From what I see, there are 2 types of configuration management tools:
  1. The first type I call 'pull', which means that the servers pull their configurations and their marching orders in terms of applying those configurations from a centralized location -- both slack and Puppet are in this category. I think this is great for initial configuration of a server. As I described in another post, you can have a server bootstrap itself by installing Puppet (or slack) and then 'call home' to the central Puppet master (or slack repository) and get all the information it needs to configure itself
  2. The second type I call 'push', which means that you send configurations and commands to a list of servers from a centralized location -- Fabric is in this category. I think this is a more appropriate mode for application-specific deployments, where you might want to deploy first to a subset of servers, then push it to all servers.
So, as a rule of thumb, I think it makes sense to use a tool like Puppet for the initial configuration of the OS and of the packages required by your application (things like MySQL, Apache, Tomcat, Tornado, Nginx, or whatever your application relies on). When it comes time to deploy your application, I think a tool like Fabric is more appropriate, since it gives you more immediate and finer-grained control over what you want to do.

I also like the categorization of these tools done by the people at ControlTier. Check out their blog post on Achieving Fully Automated Provisioning (which also links to a white paper PDF) for a nice diagram of hierarchy of deployment tools:
  • at the bottom you have tools that install or launch the initial OS on physical servers (via Kickstart/Jumpstart/Cobbler) or on virtual machines/cloud instances (via various vendor tools, or by rolling your own)
  • in the middle you have what they call 'system configuration' tools, such as Puppet/Chef/SmartFrog/cfengine/bcfg2
  • at the top you have what they call 'application service deployment' tools, such as Fabric/Capistrano/Func -- and of course their own ControlTier tool
In a comment on one of my posts,  Damon Edwards from ControlTier calls Fabric a "command dispatching tool", as opposed to Puppet, which he calls a "configuration management tool". I think this relates to the 2 types of tools I described above, where you 'push' or 'dispatch' commands with Fabric, and you 'pull' configurations and actions with Puppet.

Before I go on, let me just say that in my evaluation of different deployment tools, I quickly eliminated the ones that use XML as their configuration language. In my experience, many tools that aim to be language-neutral end up using XML as their configuration language, and then they try to bend XML into a 'real' programming language, thus ending up reinventing the wheel badly. I'd rather use a language I like (Python in my case) as the glue around the various tools in my toolchain. Your mileage may vary of course.

OK, enough theory, let's see some practical examples of Puppet and Fabric in action. While Fabric is very easy to install and has a minimal learning curve, I can't say the same about Puppet. It takes a while to get your brain wrapped around it, and there isn't a lot of great documentation online, so for this reason I warmly recommend that you go buy the book.

Puppet examples

The way I organize things in Puppet is by creating a module for each major package I need to configure. On my puppetmaster server, under /etc/puppet/modules, I have directories such as apache2, mysqlserver, nginx, scribe, tomcat, tornado. Under each such directory I have 2 directories, one called files and one called manifests. I keep files and directories that I need downloaded to the puppet clients under files, and I create manifests (series of actions to be taken on the puppet clients) under manifests. I usually have a single manifest file called init.pp.

Here's an example of the init.pp manifest file for my tornado module:

class tornado {
 $tornado = "tornado-0.2"
 $url = "http://mydomain.com/download"

 $tornado_root_dir = "/opt/tornado"
 $tornado_log_dir = "/opt/tornado/logs"
 $tornado_src_dir = "/opt/tornado/$tornado"

 Exec {
  logoutput => on_failure,
  path => ["/bin", "/sbin", "/usr/bin", "/usr/sbin", "/usr/local/bin",  "/usr/local/sbin"]
 }

 file { 
  "$tornado_root_dir":
  ensure => directory,
  recurse => true,
  source =>  "puppet:///tornado/bin";
 }

 file { 
  "$tornado_log_dir":
  ensure => directory,
 }

 package {
  ["curl", "libcurl3", "libcurl3-gnutls", "python-setuptools", "python-pycurl", "python-simplejson", "python-memcache", "python-mysqldb", "python-imaging"]:
  ensure => installed;
 }

 define install_pkg ($pkgname, $extra_easy_install_args = "", $module_to_test_import) {
  exec {
   "InstallPkg_$pkgname":
   command => "easy_install-2.6 $extra_easy_install_args $pkgname",
   unless => "python2.6 -c 'import $module_to_test_import'",
   require => Package["python-setuptools"];
  }
 }

 install_pkg {
  "virtualenv":
  pkgname => "virtualenv",
  module_to_test_import => "virtualenv";

  "boto":
  pkgname => "boto",
  module_to_test_import => "boto";

  "grizzled":
  pkgname => "grizzled",
  module_to_test_import => "grizzled.os";
 }

 $oracle_root_dir = "/opt/oracle"
 
 case $architecture {
  i386, i686: { 
   $oracle_instant_client_pkg = "instantclient_11_2-linux-i386"
   $oracle_instant_client_dir = "instantclient_11_2"
  }
  x86_64: { 
   $oracle_instant_client_pkg = "instantclient_11_1-linux-x86_64"
   $oracle_instant_client_dir = "instantclient_11_1"
  }
 }

 package {
  ["libaio-dev", "gcc"]:
  ensure => installed;
 }

 file { 
  "$oracle_root_dir":
  ensure => directory;
 }

 exec {
  "InstallOracleInstantclient":
  command => "(cd $oracle_root_dir; wget $url/$oracle_instant_client_pkg.tar.gz; tar xvfz $oracle_instant_client_pkg.tar.gz; rm $oracle_instant_client_pkg.tar.gz; 
cd $oracle_instant_client_dir; ln -s libclntsh.so.11.1 libclntsh.so); echo $oracle_root_dir/$oracle_instant_client_dir > /etc/ld.so.conf.d/oracleinstantclient.conf; ldconfig",
  creates => "$oracle_root_dir/$oracle_instant_client_dir",
  require => File[$oracle_root_dir];
 }

 $cx_oracle = "cx_Oracle-5.0.2"
 exec {
  "InstallCxOracle":
  command => "(cd $oracle_root_dir; wget $url/$cx_oracle.tar.gz; tar xvfz $cx_oracle.tar.gz; rm $cx_oracle.tar.gz; cd $oracle_root_dir/$cx_oracle; export ORACLE_HO
ME=$oracle_root_dir/$oracle_instant_client_dir; python2.6 setup.py install)",
  unless => "python2.6 -c 'import cx_Oracle'",
  require => [Package["libaio-dev"], Package["gcc"], Exec["InstallOracleInstantclient"]];
 }

 exec {
  "InstallTornado":
  command => "(cd $tornado_root_dir; wget $url/$tornado.tar.gz; tar xvfz $tornado.tar.gz; rm $tornado.tar.gz; cd $tornado; python2.6 setup.py install)",
  creates => $tornado_src_dir,
  unless => "python2.6 -c 'import tornado.web'",
  require => [File[$tornado_root_dir], Package["python-pycurl"], Package["python-simplejson"], Package["python-memcache"], Package["python-mysqldb"]];
 }
}

I'll go through this file from the top down. At the very top I declare some variables that are referenced throughout the file. In particular, $url points to the location where I keep large files that I need every puppet client to download. I could have kept the files inside the tornado module's files directory, and they would have been served by the puppetmaster process, but I prefered to use Apache for better performance and scalability. Note that I do this only for relatively large files such as tar.gz archives.

The Exec stanza (note upper case E) defines certain parameters that will be common to all 'exec' actions that follow. In my case, I specify that I only want to log failures, and I also specify the path for the binaries called in the various 'exec' actions -- this is so I don't have to specify that path each and every time I call 'exec' (alternatively, you can specify the full path to each binary that you call).

The next 2 stanzas define files and directories that I want created on the puppet client nodes. Both 'exec' and 'file' are what is called 'types' in Puppet lingo. I first specify that I wanted the directory /opt/tornado created on each node, and by setting 'recurse=>true' I'm saying that the contents of that directory should be taken from a source which in my case is "puppet:///tornado/bin". This translates to a directory called bin which I created under /etc/puppet/modules/tornado/files. The contents of that directory will be copied over via the puppet internal communication protocol to the destination /opt/tornado by each Puppet client node.

The 'package' type that follows specifies the list of packages I want installed on the client nodes. Note that I don't need to specify how I want those packages installed, only what I want installed. Puppet's language is mostly declarative -- you tell Puppet what you want done, and it does it for you, using OS-specific commands that can vary from one client node to another. It so happens in my case that I know my client nodes all run Ubuntu, so I did specify Ubuntu/Debian-specific package names.

Next in my manifest file is a function definition. You can have these definitions inline, or in a separate manifest file. In my case, I declare a function called 'install_pkg' which takes 3 arguments: the package name, any extra arguments to be passed to the installer, and a module name to test the installation with. The function runs the easy_install command via the 'exec' type, but only if the specified module wasn't already installed on the system.

A paranthesis: the Puppet docs don't recommend the overuse of the 'exec' type, because it strays away from the declarative nature of the Puppet language. With exec, you specifically tell the remote node how to run a specific command, not merely what to do. I find myself using exec very heavily though. I means that I don't grokk Puppet fully yet, but it also means that Puppet doesn't have enough native types yet that can hide OS-specific commands.

One important thing to keep in mind is that for every exec action that you write, you need to specify a condition which becomes true after the successful completion of the action. Otherwise exec will be called each and every time the manifest will be inspected by the puppet nodes. Examples of such conditions:
  • 'creates' -- specifies a file or directory that gets created by the exec action; if the file or directory is already there, exec won't be called
  • 'unless' -- specifies a condition that, if true, results in exec not being called. In my case, this condition is the import of a given Python module, but it can be any shell command that returns 0
Another thing to note in the exec action is the 'require' parameter. You'll find yourself using 'require' over and over again. It is a critical component of Puppet manifests, and it is so important because it allows you to order the actions in the manifest. Without it, actions would be executed in random order, which is most likely something you don't want. In my function definition, I require the existence of the package python-setuptools, and I do it because I need the easy_install command to be present on the remote node.

After defining the function 'install_pkg', I call it 3 times, with various parameters, thus installing 3 Python packages -- virtualenv, boto and grizzled. Note that the syntax for calling a function is funky; it's one of the many things I don't necessarily like about Puppet, but it's an evil you learn to deal with.

Next up in my manifest file is a case statement based on the $architecture variable. Puppet makes several such variables available to your manifests, based on facts gathered from the remote nodes via Facter (which comes with Puppet).

Moving along, we have a package definition, a file definition -- both should be familiar by now -- followed by 3 exec actions:
  • InstallOracleInstantclient performs the download and unpacking of this package, followed by some ldconfig incantations to actually make it work
  • InstallCxOracle downloads and installs the cx_Oracle Python package (not a trivial feat at all in and of itself); note that for this action, the require parameter contains Package["libaio-dev"], Package["gcc"], Exec["InstallOracleInstantclient"] -- so we're saying that these 2 packages, and the Instantclient Oracle libraries need to be installed before attempting to even install cx_Oracle
  • InstallTornado -- pretty self-explanatory, with the observation that the require parameter again points to a directory and several packages that need to be on the remote node before the installation of Tornado is attempted
Whew. Nobody said Puppet is easy. But let me tell you, when you get everything working smoothly (after much pulling of hair), it's a great feeling to let a node 'phone home' to the puppetmaster server and configure itself unattended in a matter of minutes. It's worth the effort and the pain.

One more thing here: once you have a module with manifests and files defined properly, you need to define the set of nodes that this module will apply to. The way I do it is to have the following files on the puppet master, in /etc/puppet/manifests:

1) A file called modules.pp which imports the modules I have defined, for example:
import "common" 
import "tornado"
('common' can be a module where you specify actions that are common across all types of nodes)

2) A file called nodetemplates.pp which contains definitions for 'node templates', i.e. classes of nodes that have the same composition in terms of modules they import and actions they perform. For example:
node basenode {
    include common
}

node default inherits basenode {
}

node webserver inherits basenode {
    include scribe
    include apache2
    $required_apache2_modules = ["rewrite", "proxy", "proxy_http", "proxy_balancer", "deflate", "headers", "expires"]
    apache2::module {
        $required_apache2_modules:
        ensure => 'present',
    }
    include tomcat
    include tornado
}

Here I defined 3 types of nodes: basenode (which includes the 'common' module), default (which applies to any machine not associated with a specific node definition) and webserver (which includes modules such as apache2, tomcat, tornado, and also requires that certain apache modules be enabled).

3) A file called nodes.pp which maps actual machine names of the Puppet clients to node template definitions. For example:
node "web1.mydomain.com" inherits webserver {}
4) A file called site.pp which ties together all these other files. It contains:
import "modules"
import "nodetemplates"
import "nodes" 

Much more documentation on node definition and node inheritance can be found on the Puppet wiki, especially in the Language Tutorial.

Fabric examples

In comparison with Puppet, Fabric is a breeze. I wanted to live on the cutting edge, so I installed the latest version (alpha, pre-1.0) from github via:

git clone git://github.com/bitprophet/fabric.git

I also easy_install'ed paramiko, which at this time brings down paramiko-1.7.6 (the Fabric documentation warns against using 1.7.5, but I assume 1.7.6 is OK).

Then I proceeded to create a so-called 'fabfile', which is a Python module containing fabric-specific functions. Here is a fragment of a file I called fab_nginx.py:

from __future__ import with_statement
import os
from fabric.api import *
from fabric.contrib.files import comment, sed

# Globals

env.user = 'myuser'
env.password = 'mypass'
env.nginx_conf_dir = '/usr/local/nginx/conf'
env.nginx_conf_file = '%(nginx_conf_dir)s/nginx.conf' % env

# Environments


def prod():
    """Nginx production environment."""
    env.hosts = ['nginx1', 'nginx2']

def test():
    """Nginx test environment."""
    env.hosts = ['nginx3']

# Tasks

def disable_server_in_lb(hostname):
    require('hosts', provided_by=[nginx,nginxtest])
    comment(env.nginx_conf_file, "server %s" % hostname, use_sudo=True)
    restart_nginx()

def enable_server_in_lb(hostname):
    require('hosts', provided_by=[nginx,nginxtest])
    sed(env.nginx_conf_file, "#server %s" % hostname, "server %s" % hostname, use_sudo=True)
    restart_nginx()

def restart_nginx():
    require('hosts', provided_by=[nginx,nginxtest])
    sudo('/etc/init.d/nginx restart')
    is_nginx_running()

def is_nginx_running(warn_only=False):
    with settings(warn_only=warn_only):
        output = run('ps -def|grep nginx|grep -v grep')
        if warn_only:
            print 'output:', output
            print 'failed:', output.failed
            print 'return_code:', output.return_code

Note that in its 0.9 and later versions, Fabric uses the 'env' environment dictionary for configuration purposes (it used to be called 'config' pre-0.9).

My file starts by defining or assigning global env configuration variables, for example env.user and env.password (which are special pre-defined variables that I assign to, and which are used by Fabric when connecting to remote hosts via the ssh functionality provided by paramiko). I also define my own variables, for example env.nginx_conf_dir and env.nginx_conf_file. This makes it easy to pass the env dictionary as a whole when I need to format a string. Here's an example from another fab file:

cmd = 'mv -f %(crt_egg)s %(backup_dir)s' % env

I then have 2 function definitions in my fab file: one called prod, which sets env.hosts to a list of production nginx servers, and one called test, which does the same but sets env.hosts to test nginx servers.

Next I have the actions or tasks that I want performed on the remote hosts. Note the require function (similar in a way to the parameter used in Puppet manifests), which says that the function will only be executed if the given variable in the env dictionary has been assigned to (in my case, the variable is hosts, and I require that the value need to have been provided by either the prod or the test function). This is a useful mechanism to ensure that certain things have been defined before attempting to run commands on the remote servers.

The first task is called disable_server_in_lb. It takes a host name as a parameter, which is the server that I want disabled in the nginx configuration file. I use the handy 'comment' function available in fabric.contrib.files to comment out the lines that contain 'server HOSTNAME' in the nginx configuration. The comment function can be invoked with sudo rights on the remote host by passing use_sudo=True.

The task also calls another function defined in my fab file, restart_nginx. This taks simply calls '/etc/init.d/nginx restart' on the remote host, then verifies that nginx is running by calling is_nginx_running.

By default, when running a command on the remote host, if the command returns a non-zero code, it is considered to have failed by Fabric, and execution stops. In most cases, this is exactly what you want. In case you just want to run a command to get the output, and you don't care if it fails, you can set warn_only=True before running the command. I show an example if this in the is_nginx_running function.

The other main task in my fabfile is enable_server_in_lb. Here I use another handy function offered by Fabric -- the sed function. I substitute '#server  HOSTNAME' with 'server HOSTNAME' in the nginx configuration file, then I restart nginx.
So now that we have the fabfile, how do we actually perform the tasks we defined? Let's assume we have a server called 'web1.mydomain.com' that we want disabled in nginx. We want to test our task first in a test environment, so we would call:
fab -f fab_nginx.py test disable_server_in_lb:web1.mydomain.com
(note the syntax for passing parameters to a function/task)

By specifying test on the command line before specifying the task, I ensure that Fabric first calls the function named 'test' in the fabfile, which sets the hosts to the test nginx servers.

Once I'm satisfied that this works well in the test environment, I call:

fab -f fab_nginx.py prod disable_server_in_lb:web1.mydomain.com

For a real deployment procedure, let's say for deploying tornado-based servers that are behind one or more nginx load balancer, I would do something like this:

fab -f fab_nginx.py prod disable_server_in_lb:web1.mydomain.com
fab -f fab_tornado.py prod deploy
fab -f fab_nginx.py prod enable_server_in_lb:web1.mydomain.com

This will deploy my new application code to web1.mydomain.com. Of course I can script this and call the above sequence for all my production servers. I assume here that I have another fabfile called fab_tornado.py and a task defined in in which does the actual deployment of the application code (most likely by downloading and easy_install'ing an egg).

That's it for today. It's been more like a whirlwind through two types of automated deployment tools -- Puppet/pull and Fabric/push. I didn't do justice to either of these tools in terms of their full capabilities, but I hope this will still be useful for some people as a starting point into their own explorations.

Tuesday, October 13, 2009

Thierry Carrez on running your own Ubuntu Enterprise Cloud

Thierry Carrez, who works in the Ubuntu Server team, has a great series of blog posts on how to run your own Ubuntu Enterprise Cloud. I haven't had a chance to  try this yet, but it's high on my TODO list. Thierry uses the Ubuntu Enterprise Cloud product (which has been part of Ubuntu server starting with 9.04) together with Eucalyptus. Here are the links to Thierry's posts:

Friday, October 09, 2009

Compiling, installing and test-running Scribe

I went to the Hadoop World conference last week and one thing I took away was how Facebook and other companies handle the problem of scalable logging within their infrastructure. The solution found by Facebook was to write their own logging server software called Scribe (more details on the FB blog).

Scribe is mentioned in one of the best presentations I attended at the conference -- 'Hadoop and Hive Development at Facebook' by Dhruba Borthakur and Zheng Shao. If you look at page 4, you'll see the enormity of the situation they're facing: 4 TB of compressed data (mostly logs) handled every day, and 135 TB of compressed data scanned every day. All this goes through Scribe, so that gives me a warm fuzzy feeling that it's indeed scalable and robust. For more details on Scribe, see the wiki page of the project. It's my intention here to detail the steps needed for compiling and installing it, since I found that to be a non-trivial process to say the least. I'm glad Facebook open-sourced Scribe, but its packaging could have been a bit more straightforward. Anyway, here's what I did to get it to run. I followed roughly the same steps on Ubuntu and on Gentoo.

1) Install pre-requisite packages

On Ubuntu, I had to install the following packages via apt-get: g++, make, build-essential, flex, bison, libtool, mono-gmcs, libevent-dev.

2) Install the boost libraries

Very important: scribe needs boost 1.36 or newer, so make sure you don't have older boost libraries already installed. If you install libboost-* in Ubuntu, it tries to bring down 1.34 or 1.35, which will NOT work with scribe. If you have libboost-* already installed, you need to uninstall them. Now. Trust me, I spent several hours pulling my hair on this one.

- download the latest boost source code from SourceForge (I got boost 1.40 from here)

- untar it, then cd into the boost directory and run:

$ ./boostrap.sh
$ ./bjam
$ sudo ./bjam install

3) Install thrift and fb303

- get thrift source code with git, compile and install:

$ git clone git://git.thrift-rpc.org/thrift.git
$ cd thrift
$ ./bootstrap.sh
$ ./configure
$ make
$ sudo make install

- compile and install the Facebook fb303 library:

$ cd contrib/fb303
$ ./bootstrap.sh
$ make
$ sudo make install

- install the Python modules for thrift and fb303:

$ cd TOP THRIFT DIRECTORY
$ cd lib/py
$ sudo python setup.py install
$ cd TOP THRIFT DIRECTORY
$ cd contrib/fb303/py
$ sudo python setup.py install

To check that the python modules have been installed properly, run:

$ python -c 'import thrift' ; python -c 'import fb303'

4) Install Scribe

- download latest source code from SourceForge (I got it from here)

- untar, then run:

$ cd scribe
$ ./bootstrap.sh
$ make
$ sudo make install
$ sudo ldconfig (this is necessary so that the boost shared libraries are loaded)

- install Python modules for scribe:

$ cd lib/py
$ sudo python setup.py install

- to test that scribed (the scribe server process) was installed correctly, just run 'scribed' at a command line; you shouldn't get any errors
- to test that the scribe Python module was installed correctly, run
$ python -c 'import scribe'

5) Initial Scribe configuration

- create configuration directory -- in my case I created /etc/scribe
- copy one of the example config files from TOP_SCRIBE_DIRECTORY/examples/example*conf to /etc/scribe/scribe.conf -- a good one to start with is example1.conf
- edit /etc/scribe/scribe.conf and replace file_path (which points to /tmp) to a location more suitable for your system
- you may also want to replace max_size, which dictates how big the local files can be before they're rotated (by default it's 1 MB, which is too small -- I set it to 100 MB)
- run scribed either with nohup or in a screen session (it doesn't seem to have a daemon mode):

$ scribed -c /etc/scribe/scribe.conf

6) Test run

To test Scribe, you can install it on a remote machine, configure scribed on that machine to use a configuration file similar to examples/example2client.conf, then change remote_host in the config file to point to the central scribe server configured in step 5.

Once scribed is configured and running on the remote machine, you can test it with a nice utility written by Silas Sewell, called scribe_pipe. For example, you can pipe an Apache log file from the remote machine to the central scribe server by running:

cat apache_access_log | ./scribe_pipe apache.access

On the scribe server, you should see at this point a directory called apache.access under the main file_path directory, and files called apache.access_00000, apache.access_00001 etc (in chunks of max_size bytes).

I'll post separately about actually using Scribe in production. I hope this post will at least get you started on using Scribe and save you some headaches during its installation process.

Tuesday, October 06, 2009

Brandon Burton on 'Automation is the cloud'

Great post from Brandon Burton, my ex-colleague at RIS/Reliam, on why automation is the foundation of cloud computing. Brandon discusses automation at various levels, starting with virtualization and networking, then moving up the layers and covering OS, configuration management and application deployment. Highly recommended.

Thursday, September 24, 2009

Pybots success stories and a call for help

Any report of the death of the Pybots project is an exaggeration. But not by much. First, some history.

Some history

The idea behind the Pybots project is to allow people to run automated tests for their Python projects, while using Python binaries built from the very latest source code from the Python subversion repository.

The idea originated from Glyph, of Twisted fame. He sent out a message to the python-dev mailing list in which he said:
"I would like to propose, although I certainly don't have time to implement, a program by which Python-using projects could contribute buildslaves which would run their projects' tests with the latest Python trunk. This would provide two useful incentives: Python code would gain a reputation as generally well-tested (since there is a direct incentive to write tests for your project: get notified when core python changes might break it), and the core developers would have instant feedback when a "small" change breaks more code than it was expected to."

This was back in July 2006. I volunteered to maintain a buildbot master (running on a server belonging to the PSF) and also to rally a community of people interested in running this type of tests. The hard part was (and still is) to find people willing to donate client machines to act as build slaves for a particular project, and even more so people willing to keep up with the status of their build slaves. The danger here, as in any continuous integratin system, is that once the status turns to red and doesn't go back to green, people start to ignore the failed steps. Even if those steps exhibit new and interesting failures, it's too late at this point (this is related to the broken windows theory).

The project starting fairly strong, gained some momentum, but then slowly ran out of steam. It was a combination of me not having the time to do the rallying, and of people not being interested in participating in the project anymore. At the height of its momentum, in early 2007, the Pybots farm consisted of 11 buildslaves running automated tests for more than 20 Python projects, including Twisted, Django, SQLAlchemy, MySQLdb, Bazaar, nose, twill, Storm, Trac, CherryPy, Genshi, Roundup. Pretty much a who's who of the Python project world.

Early success stories

Here are some examples of bugs discovered by the buildslaves in the Python farm:
  • new keywords 'as' and 'with' in Python 2.6 causing problems for projects that had variables with those names
  • Python install step failing even though all unit tests were passing (this underscores the importance of functional testing)
  • platform-specific issues -- for example Bazaar issues on Windows due to TCP client behavior, Twisted issues on Red Hat 9 due to multicast behavior, Python core issues on OS X due to string formatting errors
(for a more thorough overview of the Pybots project, including lessons learned, see also my PyCon07 presentation)

Recent signs of life and more success stories

In the last month or so there has been a flurry of activity related to the Pybots farm. It all started with an upgrade of the buildbot version on the machine hosting the Pybots buildmaster. This broke the master's configuration file, so the Pybots status page went completely dark.

As a result, Steve Holden posted a plea for help answered by a few people who showed interest in adding build slaves to the project. In parallel, Jean-Paul Calderone jumped in to help on the buildmaster side and he managed to fix the buildbot upgrade issue (thanks, JP!) David Stanek also expressed interest in taking a more active role on the buildmaster side.

Jean-Paul also sent more success stories to the Pybots mailing list. Here they are, verbatim, with his permission:

"The skip story:

The Twisted pybots slave started skipping every Twisted test one day. I noticed and filed http://twistedmatrix.com/trac/ticket/3703 (which goes into a bit of detail about why this happened). This happened to come up during the PyCon language summit, so there was some real-time discussion about it, resulting in a Python bug being filed, http://bugs.python.org/issue5571. Then, as that ticket shows, Benjamin Peterson was nice enough to fix the incompatibility.

The array/buffer story:

The Twisted pybots slave started to fail some Twisted tests one day. ;) The tests in question were actually calling into some PyCrypto code, so this failure wasn't in Twisted directly. PyCrypto loads some bytes into an array.array and then tries to hash them (for some part of its random pool API). I filed http://bugs.python.org/issue6071 on which someone explained that hashlib switched over to the new buffer API, lost support for hashing anything that only provides the old buffer API, and that array.array still only supports the old buffer API. This one hasn't been fixed yet, but it sounds like Gregory Smith plans to fix it before 2.7 is released.

There are other success stories too, incompatible changes that are more like bugs on the Twisted side than on the Python side (assuming one is generous and believes that incompatible changes in Python can actually be Twisted bugs ;). Things like typos that didn't result in syntax errors in an older version of Python but became syntax errors in newer versions (in particular, a variable was defined as 0x+80000000 instead of 0x80000000 - the former actually being valid syntax in 2.5 but became illegal in 2.6)."


My hope is that stories like these will convince more people about the usefulness of running tests for their projects against 'live' changes in the Python trunk (or other Python branches). I am not aware of any other testing project that accomplishes this for other programming languages.

In particular, if there is enough interest, we can also configure the Pybots master to trigger test runs for your project of choice using Py3k binaries! Think how cool you'll appear to your grandchildren!

How you can help

If you want to be involved in the Pybots project, please subscribe to the Pybots mailing list and show your interest by sending a message to the list. Here are some resources to get you started:

Jeff Roberts on a scalable DNS scheme for EC2

My ex-colleague from OpenX, Jeff Roberts, has another great blog post on 'A Scalable DNS Scheme for Amazon's EC2 Cloud'. If you need to deploy an internal DNS infrastructure in EC2, you have to read this post. It's based on battle-tested experience.

Monday, September 14, 2009

A/B testing and online experimentation at Microsoft

Via Greg Linden, I found a great presentation from Ronny Kohavi on "Online experimentation at Microsoft". All kinds of juicy nuggets of information on how to conduct meaningful A/B testing and other types of controlled online experiments.

One of my favorite slides is 'Key Lessons', from which I quote:

  • "Avoid the temptation to try and build optimal features through extensive planning without early testing of ideas"
  • "Experiment often"
  • "Try radical ideas. You may be surprised"

The entire presentation is highly recommended. You can tell that this wisdom was earned in the school of hard knocks, which is the best school there is in my experience, at least for software engineering.

Wednesday, September 02, 2009

Bootstrapping EC2 images as Puppet clients

I've been looking at Puppet lately as an alternative to slack for automated deployment and configuration management. I can't say I love it, but I think it's good enough that it warrants banging your head against the wall repeatedly until you learn how to use it. I do wish it was written in Python, but hey, you do what you need to do. I did look at Fabric, and I might still use it for 'push'-type deployments, but it has nowhere near the features that Puppet has (and its development and maintenance just changed hands, which makes it too cutting edge for me at this point.)

But this is not a post about Puppet -- although I promise I'll blog about that too. This is a post on how to get to the point of using Puppet in an EC2 environment, by automatically configuring EC2 instances as Puppet clients once they're launched.

While the mechanism I'll describe can be achieved by other means, I chose to use the Ubuntu EC2 AMIs provided by alestic. As a parenthesis, if you're thinking about using Ubuntu in EC2, do yourself a favor and read Eric Hammond's blog (which can be found at alestic.com) He has a huge number of amazingly detailed posts related to this topic, and they're all worth your while to read.

Unsurprisingly, I chose a mechanism provided by the alestic AMIs to bootstrap my EC2 instances -- specifically, passing user-data scripts that will be automatically run on the first boot of the instance. You can obviously also bake this into your own custom AMI, but the alestic AMIs already have this hook baked in, which I LIKE (picture Borat's voice). What's more, Eric kindly provides another way to easily run custom scripts within the main user-data script -- I'm referring to his runurl script, detailed in this blog post. Basically you point runurl at a URL that contains the location of another script that you wrote, and runurl will download and run that script. You can also pass parameters to runurl, which will in turn be passed to your script.

Enough verbiage, let's see some examples.

Here is my user-data file, whose file name I am passing along as a parameter when launching my EC2 instances:


#!/bin/bash -ex

cat <<EOL > /etc/hosts
127.0.0.1 localhost.localdomain localhost
10.1.1.1 puppetmaster

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
EOL

wget -qO/usr/bin/runurl run.alestic.com/runurl
chmod 755 /usr/bin/runurl
runurl ec2web.mycompany.com/upgrade/apt
runurl ec2web.mycompany.com/customize/ssh
runurl ec2web.mycompany.com/customize/vim
runurl ec2web.mycompany.com/install/puppet


The first thing I do in this script is to add an entry to the /etc/hosts file pointing at the IP address of my puppetmaster server. You can obviously do this with an internal DNS server too, but I've chosen not to maintain my own internal DNS servers in EC2 for now.

My script then retrieves the runurl utility from alestic.com, puts it in /usr/bin and chmod's it to 755. Then the script uses runurl and points it at various other scripts I wrote, all hosted on an internal web server.

For example, the contents of upgrade/apt are:


#!/bin/bash
apt-get update
apt-get -y upgrade
apt-get -y autoremove


For ssh customizations, my scripts downloads a specific .ssh/authorized_keys file, so I can ssh to the new instance using certain ssh keys.

To install and customize vim, I have customize/vim:


#!/bin/bash
apt-get -y install vim
wget -qO/root/.vimrc http://ec2web.mycompany.com/configs/os/.vimrc
echo 'alias vi=vim' >> /root/.bashrc


...where .vimrc is a customized file that I keep under the document root of the same web server where I keep my scripts.

Finally, install/puppet looks like this:


#!/bin/bash
apt-get -y install puppet
wget -qO/etc/puppet/puppetd.conf http://ec2web.mycompany.com/configs/puppet/puppetd.conf
/etc/init.d/puppet restart


Here I am installing puppet via apt-get, then I'm downloading a custom puppetd.conf configuration, which points at puppetmaster as its server name (instead of the default, which is puppet). Finally, I restart puppet so that the new configuration takes effect.

Note that I want to keep these scripts to the bare minimum that allows me to:

1) ssh into the instance in case anything goes wrong
2) install and configure puppet so the instance can talk to the puppetmaster

The actual package and application installations and customizations on my newly launched image will be done through puppet, by associating the instance hostname with a node that is defined on the puppetmaster; I am also adding more entries to /etc/hosts as needed using puppet-specific mechanisms such as the 'host' type (as promised, blog post on this forthcoming...)

Note that you need to make sure you have good security for the web server instance which is serving your scripts to runurl; Eric Hammond talks about using S3 for that, but it's too complicated IMO (you need to sign URL and expire them, etc.) In my case, I preferred to use an internal Apache instance with basic HTTP authentication, and to only allow traffic on port 80 from certain security groups within EC2 (my Apache server doubles as the puppetmaster BTW).

Is your hosting provider Reliam?

Some exciting news from RIS Technology, the hosting company I used to work for. They changed their name to Reliam, which stands for Reliable Internet Application Management. And I think it's an appropriate name, because RIS has always been much more involved into the application stack than your typical hosting provider. When I was there, we rolled out Django applications, Tomcat instances, MySQL, PostgreSQL and Oracle installations, and we maintained them 24x7, which required a deep understanding of the applications. We also provided the glue that tied all the various layers together, from deployment to monitoring.

Since I left, RIS/Reliam has invested heavily in a virtual infrastructure that can be combined where it makes sense with physical dedicated servers. The DB layer is usually dedicated, since the closest you are to bare metal, the better off you are in terms of database access. But the application layer can easily be virtualized and scaled on demand. So you can the scaling benefit of cloud computing, and the performance benefit of dedicated servers.

Here are some stats for the infrastructure that RIS/Reliam used for supporting traffic during the recent Miss Universe event (they host missuniverse.com and missusa.com):

* 135 virtual servers running the Web application
* 9 virtual servers running mysql-proxy
* 1 master DB server and 5 read-only slave DB servers running MySQL
* 301.52 Mbps bandwidth
* 33,750 concurrent users
* over 150K concurrent sessions per second

An interesting note is that they used round robin DNS to load balance between the mysql proxies and had all proxies configured to use the master and all five slaves. They managed to get mysql-proxy 0.7.2 running with this patch.

So...what's the point of this note? It's a shout-out to my friends at RIS/Reliam, and a warm recommendation for them in case you need a hosting provider with strong technical capabilities that cover cloud/hybrid computing, system architecture design, application deployment and deep application monitoring and graphing.

Tuesday, August 25, 2009

New presentation and new job

I gave a presentation last night on 'Agile and Automated Testing Techniques and Tools' to the Pasadena Java User Group. It was a version of the talk I gave earlier this year to the XP/Agile SoCal User Group. This time I posted the slides to Slideshare.

If you look attentively at the first slide, you'll notice I have a new job at Evite, as a Sr. Systems Architect. I started a couple of weeks ago, and it's been great. Expect more blog posts on automated deployments to the cloud (using Ubuntu images from Alestic), on continuous integration/build/release management processes, on Hadoop, and on any interesting stuff that comes my way.

Tuesday, July 28, 2009

noSQL databases? map-reduce? Erlang? it's all in this cartoon

Hilarious cartoon (not sure why it's titled 'Fault Tolerance' though) seen on the High Scalability blog. Captures very well the spirit and hype of our times in the IT world.

Monday, July 27, 2009

Python well represented in NASA's Nebula cloud

I found out today from the cloud-computing mailing list about NASA's Nebula project. Here's what the 'About' page of the project's web site says:

"NEBULA is a Cloud Computing environment developed at NASA Ames Research Center, integrating a set of open-source components into a seamless, self-service platform. It provides high-capacity computing, storage and network connectivity, and uses a virtualized, scalable approach to achieve cost and energy efficiencies."

The Services page has some nice architectural diagrams. I wasn't surprised to see that their VM enviroment is managed via Eucalyptus. I also shouldn't have been surprised by the large number of Python modules and applications they're using, especially on the client side. Pretty much all the frontend applications are Python bindings for the various backend technologies they're using (such as LUSTRE, RabbitMQ, Subversion). Of course Trac is there too.

But the most interesting thing for Python fans will be undoubtedly their selection for the Web application framework. Maybe again unsurprisingly, they chose...Django:

"After an extensive trade study, the NEBULA team selected Django, a python-based web application framework, as the first and primary application environment for the Cloud. NEBULA users have access to an extensive collection of open-source django "apps", providing features ranging from simple blogs, wikis, and discussion forums, to more advanced collaboration suites, image processing, and more."

Other interesting tidbits from those diagrams:

  • deployments are automated with Fabric
  • distributed automated testing is done with Selenium Grid
  • continuous integration is done with CruiseControl
  • for the database backend, they use a MySQL cluster with DRBD
  • for the file system they use LUSTRE
  • queuing is done with RabbitMQ (which is written in Erlang)
  • search and indexing is done with SOLR
All in all, an interesting mix of technologies. Besides Python, Java and Erlang are well represented, as expected. Not a bad model to follow if you want to build your own private cloud environment.

Sunday, July 26, 2009

How to roll your own Amazon EC2 image

Jeff Roberts, the vim-fu guru, does it again with a great post on "Bundling versioned AMIs rapidly in Amazon's EC2". It's a step-by-step guide on how to roll your own AMI, bundle it and upload it to S3, while keeping it versioned at the same time. Highly recommended.

Tuesday, July 21, 2009

Automated testing of production deployments

When you work as a systems engineer at a company that has a large scale system infrastructure, sooner or later you realize that you need to automate pretty much everything you do. You can't afford not to, if you want to keep up with the ever-present demands of scaling up and down the infrastructure.

The main promise of cloud computing -- infinite elastic scaling based on demand -- is real, but you can only achieve it if you automate your deployments. It's fairly safe to say that most teams that are involved in such infrastructures have achieved high levels of automation. Some fearless teams practice continuous deployment, others do frequent dark launches. All these practices are great, but my thesis is that in order to achieve fearlessness you need automated tests of your production deployments.

Note the word 'production' -- I believe it is necessary to go one step beyond running automated tests in an isolated staging environment (although that is a very good thing to do, especially if staging mirrors production at a smaller scale). That next step is to run your test harness in production, every time you deploy. And deployment, at a fast moving Web company these days, can happen multiple times a day. Trust me, with no automated tests in place, you'll never get rid of that nagging feeling in the pit of your stomach that you might have broken things horribly, in production.

So how do you go about writing automated tests for your deployments? I wrote a while ago about automating and testing your system setup checklists. Even testing small things such as 'is httpd/mysqld/postfix setup to run at boot time' will go a long way in achieving peace of mind.

Assuming you have a list of things to test (it can be just a couple of critical things for starters), how and when do you run the tests? Again, you can do the simplest thing that works -- a bash shell that iterates through your production servers and runs the test scripts remotely on the servers via ssh. Some things I test this way these days are:

* do local MySQL databases on servers in a particular cluster contain the same data in certain tables? (this shows me that things are in sync across servers)
* is MySQL replication working as expected across the cluster of read-only slaves?
* are periodic operations happening as expected (here I can do a simple tail of a log file to figure it out)
* are certain PHP modules correctly installed?
* is Apache serving a number of requests per second that is not too high, but not too low either (where high and low are highly dependent on your traffic and application obviously)

I run these tests (and many others) each time I push a change to production. No matter how small the change can seem, it can have unanticipated side effects. I found that having tests that probe the system from as many angles as possible are the most efficient -- the angles in my case being Apache, MySQL, PHP, memcached for example. I also found that this type of testing (push-based if you want) is very good at showing discrepancies between servers. If you see a server being out of wack this way, then you know you need to attempt to fix it, or even terminate it and deploy a new one.

Another approach in your automated testing strategy is to run your test harness periodically (via cron for example) and also to write the harness in a proper language (Python comes to mind), integrated into a test framework. You can have the results of the tests emailed to you in case of failure. The advantage of this approach is that you can have things run automatically without your intervention (in the first approach, you still have to remember to run the test suite!).

The ultimate in terms of automated testing is to integrate it with your monitoring infrastructure. If you use Nagios for example, you can easily write plugins that essentialy probe for the same things that your tests probe for. The advantage of this approach is that the tests will run every time Nagios runs, and you can set up alerts easily. One disadvantage is that it can slow down your monitoring, depending on the number of tests you need to run on each server. Monitoring typically happens very often (every 5 minutes is a common practice), so it may be overkill to run all the tests every 5 minutes. Of course, this should be configurable in your monitoring tool, so you can have a separate class of checks that only happen every N hours for example.

In any case, let me assure you that even if you take the first approach I mentioned (ssh into all servers and run commands remotely that way), you'll reap the rewards very fast. In fact, you'll like it so much that you'll want to keep adding more tests, so you can achieve more inner peace. It's a sure way to becoming test infected, but also to achieve deployment nirvana.

Friday, July 17, 2009

Managing multiple MySQL instances with MySQL Sandbox

MySQL doesn't support multi-master replication, i.e. you can't have one MySQL instance acting as a replication slave to more than one master. There are times when you need this functionality, for example for disaster recovery purposes, where you have a machine with tons of CPU, RAM and disk running several MySQL instances, each being a replication slave to a different MySQL master.

One tool I've used for easy management of multiple MySQL instances on the same box is MySQL Sandbox. It's nothing fancy -- a Perl module which offers a collection of scripts -- but it does make your life much easier.

To install MySQL Sandbox, download it from its Launchpad page, then run 'perl Makefile.PL; make; make install'. You also need to download a MySQL binary tarball which will serve as a common base used by your MySQL instances.

Here's an example of a script I wrote which creates a new MySQL Sandbox instance under a common directory (/var/mysql_slaves in my case). The script takes 2 arguments: a database name, and the name of the MySQL master from which that database is replicated. The script automatically increments the port number that the new sandbox instance will listen on, then creates the instance via a call like this:


/usr/bin/make_sandbox /usr/local/src/mysql-5.1.32-linux-x86_64-glibc23.tar.gz \
        --upper_directory=/var/mysql_slaves \
 --sandbox_directory=$SLAVEDB_NAME --sandbox_port=$LAST_PORT_NUMBER \
 --db_user=$SLAVEDB_NAME --db_password=PASSWORD \
 --no_confirm

As a result, there will be a new directory called $SLAVEDB_NAME under /var/mysql_slaves, which serves as the sandbox for the newly created MySQL instance. The script also adds some lines related to replication to the new MySQL instance configuration file (which is /var/mysql_slaves/my.sandbox.cnf).

To start the instance, run
/var/mysql_slaves/$SLAVEDB_NAME/start

To stop the instance, run
/var/mysql_slaves/$SLAVEDB_NAME/start

To go to a MySQL prompt for this instance, run
/var/mysql_slaves/$SLAVEDB_NAME/use

At this point, you still don't have a functioning slave. You need to load the data from the master. One way to do this is to run mysqldump on the master with options such as ' --single-transaction --master-data=1'. This will include the master information (binlog name and position) in the DB dump.

The next step is to transfer the DB dump over to the box running MySQL Sandbox, and load it into the MySQL instance. I use a script similar to this.

You should now have a MySQL instance that acts as a replication slave to a specific master server. Repeat this process to set up other sandboxed MySQL instances that are slaves to other masters.

Note that MySQL Sandbox already includes some replication-related utilities (which I haven't used) and also an admin-type tool called sbtool. The documentation is pretty good.

Tuesday, July 14, 2009

Kent Langley's '10 rules for launching a web site'

The advice in this blog post by Kent Langley resonates with my experiences launching Web infrastructures of all types, large and small. Deploy early and often, automate your deployments, use version control, create checklists, have a rollback plan -- these are all very sensible things to do.

I would add one more very important thing that seems to be missing from the list: have an extensive suite of automated tests to check that your deployment steps did the right thing. Many people just stop at the automation step, and don't go beyond that to the testing step. It will come back to haunt them in the long run. But this is fodder for another blog post, which will be coming real soon now ;-)

Monday, July 13, 2009

Greatest invention since sliced bread: vimdiff

If you work at the ssh command prompt all day long (like I do), and if you need to compare text files and merge differences (like I do), then make sure you check out vimdiff (thanks to Jeff Roberts for bringing it to my attention).

If you run 'vimdiff file1 file2', the tool will split your screen vertically, with file1 displayed in a vim session on the left and file2 on the right. The differences between the 2 files will be highlighted. To jump from difference to difference, use ]c (forward) and [c (backward). When the cursor is on a difference block, use :diffget or do to merge the difference from the other file into the file where the cursor is; use :diffput or dp to merge the other way. To jump from one file's window to the other, use Ctrl-w-w. Google vimdiff for other tips and tricks. Definitely a good tool to have in your arsenal.

If you have the luxury of a graphical enviroment, I also recommend meld (thanks to Chris Nutting for the tip).

Recommended blog: Elastician

Elastician is the blog of Mitch Garnaat, the author of the amazingly useful boto Python library -- a collection of modules for managing AWS resources (EC2, S3, SQS,  SimpleDB and more recently CloudWatch).

Mitch has a great picture on what he calls the 'Cloud Computing Hierarchy of Needs' (in a reference to Maslow's self-actualization hierarchy). Very insightful.

Friday, July 10, 2009

Python mock testing techniques and tools

This is an article I wrote for Python Magazine as part of the 'Pragmatic Testers' column. Titus and I have taken turns writing the column, although we haven't produced as many articles as we would have liked.

Here is the content of my article, which appeared in the February 2009 issue of PyMag:

Mock testing is a controversial topic in the area of unit testing. Some people swear by it, others swear at it. As always, the truth is somewhere in the middle.

Let's get some terminology clarified: when people say they use mock objects in their testing, in most cases they actually mean stubs, not mocks. The difference is expanded upon with his usual brilliance by Martin Fowler in his article "Mocks aren't stubs".

In his revised version of the article, Fowler uses the terminology from Gerard Meszaros's 'xUnit Test Patterns' book. In this nomenclature, both stubs and mocks are special cases of 'test doubles', which are 'pretend' objects used in place of real objects during testing.  Here is Meszaros's definition of a test double:


Sometimes it is just plain hard to test the system under test (SUT) because it depends on other components that cannot be used in the test environment. This could be because they aren't available, they will not return the results needed for the test or because executing them would have undesirable side effects. In other cases, our test strategy requires us to have more control or visibility of the internal behavior of the SUT.

When we are writing a test in which we cannot (or chose not to) use a real depended-on component (DOC), we can replace it with a Test Double. The Test Double doesn't have to behave exactly like the real DOC; it merely has to provide the same API as the real one so that the SUT thinks it is the real one! 

These 'other components' that cannot be used in a test environment, or can only be used with a high setup cost, are usually external resources such as database servers, Web servers, XML-RPC servers. Many of these resources may not be under your control, or may return data that often contains some randomness which makes it hard or impossible for your unit tests to assert things about it.

So what is the difference between stubs and mocks? Stubs are used to return canned data to your SUT, so that you can make some assertions on how your code reacts to that data. This eliminates randomness from the equation, at least in the test environment. Mocks, on the other hand, are used to specify expectations on the behavior of the object called by your SUT. You indicate your expectations by specifying that certain methods of the mock object need to be called by the SUT in a certain order and with certain arguments.

Fowler draws a further distinction between stubs and mocks by saying that stubs are used for “state verification”, while mocks are used for “behavior verification”. When we use state verification, we assert things about the state of the SUT after the stub returned the canned data back to the SUT. We don't care how the stub obtained that data, we just care about the final result (the data itself) and about how our SUT processed that data. When we use behavior verification, not only do we care about the data, but we also make sure that the SUT made the correct calls, in the correct order, and with the correct parameters, to the object representing the external resource.

If readers are still following along after all this theory, I'm fairly sure they have at least two questions:

1) when exactly do I use mock testing in my overall testing strategy?; and
2) if I do use mock testing, should I use mocks or stubs?

I already mentioned one scenario when you might want to use mock testing: when your SUT needs to interact with external resources which are either not under your control, or which return data with enough randomness to make it hard for your SUT to assert anything meaningful about it (for example external weather servers, or data that is timestamped). Another area where mock testing helps is in simulating error conditions which are not always under your control, and which are usually hard to reproduce. In this case, you can mock the external resource, simulate any errors or exceptions you want, and see how your program reacts to them in your unit tests (for example, you can simulate various HTTP error codes, or database connection errors).

Now for the second question, should you use mocks or stubs? In my experience, stubs that return canned data are sufficient for simulating the external resources and error conditions I mentioned. However, if you want to make sure that your application interacts correctly with these resources, for example that all the correct connection/disconnection calls are made to a database, then I recommend using mocks. One caveat of using mocks: by specifying expectations on the behavior of the object you're mocking and on the interaction of your SUT with that object, you couple your unit tests fairly tightly to the implementation of that object. With stubs, you only care about the external interface of the object you're mocking, not about the internal implementation of that object.

Enough theory, let's see some practical examples. I will discuss some unit tests I wrote for an application that interacts with an external resource, in my case a SnapLogic server. I don't have the space to go into detail about SnapLogic, but it is a Python-based Open Source data integration framework. It allows you to unify the access to the data needed by your application through a single API. Behind the scenes, SnapLogic talks to database servers, CSV files, and other data sources, then presents the data to your application via a simple unified API. The main advantage is that your application doesn't need to know the particular APIs for accessing the various external data sources.

In my case, SnapLogic talks to a MySQL database and presents the data returned by a SELECT SQL query to my application as a list of rows, where each row is itself a list. My application doesn't know that the data comes from MySQL, it just retrieves the data from the SnapLogic server via the SnapLogic API. I encapsulated the code that interacts with the SnapLogic server in its own class, which I called SnapLogicManager. My main SUT is passed a SnapLogicManager object in its __init__ method, then calls its methods to retrieve the data from the SnapLogic server.

I think you know where this is going – SnapLogic is an external resource as far as my SUT is concerned. It is expensive to set up and tear down, and it could return data with enough randomness so I wouldn't be able to make meaningful assertions about it. It would also be hard to simulate errors using the real SnapLogic server. All this indicates that the SnapLogicManager object is ripe for mocking.

My application code makes just one call to the SnapLogicManager object, to retrieve the dataset it needs to process:


rows = self.snaplogic_manager.get_attrib_values()


Then the application processes the rows (list of lists) and instantiates various data structures based on the values in the rows. For the purpose of this article, I'll keep it simple and say that each row has an attribute name (element #0), and attribute value (element #1) and an attribute target (element #2). For example, an attribute could have the name “DocumentRoot”, the value “/var/www/mydocroot” and the target “apache”. The application expects that certain attributes are there with the correct target. If they're not, it raises an exception.

How do we test that the application correctly instantiates the data structure, and correctly reacts to the presence or absence of certain attributes? You guessed it, we use a mock SnapLogicManager object, and we return canned data to our application.

I will show here how to achieve this using two different Python mock testing frameworks: Mox, written by Google engineers, and Mock, written by Michael Foord.

Mox is based on the Java EasyMock framework, and it does have a Java-esque feel to it, down to the CamelCase naming convention. Mock feels more 'pythonic' – more intuitive and with cleaner APIs. The two frameworks also differ in the way they set up and verify the mock objects: Mox uses a record/replay/verify pattern, whereas Mock uses an action/assert pattern. I will go into these differences by showing actual code below.

Here is a unit test that uses Mox:


    def test_get_attrib_value_with_expected_target(self):

        # We return a SnapLogic dataset which contains attributes with correct targets
        canned_snaplogic_rows = [
        [u'DocumentRoot', u'/var/www/mydocroot', u'apache'],
        [u'dbname', u'some_dbname', u'database'],
        [u'dbuser', u'SOME_DBUSER', u'database'],
        ]

        # Create a mock SnapLogicManager
        mock_snaplogic_manager = mox.MockObject(SnapLogicManager)

        # Return the canned list of rows when get_attrib_values is called
        mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndReturn(canned_snaplogic_rows)

        # Put all mocks created by mox into replay mode
        mox.Replay(mock_snaplogic_manager)

        # Run the test
        myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager)
        myapp.get_attr_values_from_snaplogic()

        # Verify all mocks were used as expected
        mox.Verify(mock_snaplogic_manager)

        # We test that attributes with correct targets are retrieved correctly
        assert '/var/www/mydocroot' == myapp.get_attrib_value_with_expected_target("DocumentRoot", "apache")
        assert 'some_dbname' == myapp.get_attrib_value_with_expected_target("db_name", "database")
        assert 'SOME_DBUSER' == myapp.get_attrib_value_with_expected_target("db_user", "database")


Some explanations are in order. With the Mox framework, when you instantiate a MockObject, it is in 'record' mode, which means it's waiting for you to specify expectations on its behavior. You specify these expectations by telling the mock object what to return when called with a certain method. In my example, I tell the mock object that I want the list of canned rows to be returned when I call its 'get_attrib_values' method: mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndReturn(canned_snaplogic_rows)

I only have one method that I am recording the expectations for in my example, but you could have several. When you are done recording, you need to put the mock object in 'replay' mode by calling mox.Replay(mock_snaplogic_manager). This means the mock object is now ready to be called by your application, and to verify that the expectations are being met.

Then you call your application code, in my example by passing the mock object in the constructor of MyApp: myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager). My test then calls myapp.get_attr_values_from_snaplogic(), which in turn interacts with the mock object by calling its get_attrib_values() method.

At this point, you need to verify that the expectations you set happened correctly. You do this by calling the Verify method of the mock object: mox.Verify(mock_snaplogic_manager).

If any of the methods you recorded were not called, or where called in the wrong order, or with the wrong parameters, you would get an exception at this point and your unit tests would fail.

Finally, you also assert various things about your application, just as you would in any regular unit test. In my case, I assert that the get_attrib_value_with_expected_target method of MyApp correctly retrieves the value of an attribute.

This seems like a lot of work if all you need to do is to return canned data to your application. Enter the other framework I mentioned, Mock, which lets you specify canned return values very easily, and also allows you to assert certain things about the way the mock objects were called without the rigorous record/replay/verify pattern.

Here's how I rewrote my unit test using Mock:


    def test_get_attrib_value_with_expected_target(self):
        # We return a SnapLogic dataset which contains attributes with correct targets
        canned_snaplogic_rows = [
        [u'DocumentRoot', u'/var/www/mydocroot', u'apache'],
        [u'dbname', u'some_dbname', u'database'],
        [u'dbuser', u'SOME_DBUSER', u'database'],
        ]

        # Create a mock SnapLogicManager
        mock_snaplogic_manager = Mock()

        # Return the canned list of rows when get_attrib_values is called
        mock_snaplogic_manager.get_attrib_values.return_value = canned_snaplogic_rows

        # Run the test
        myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager)
        myapp.get_attr_values_from_snaplogic()

        # Verify that mocks were used as expected
        mock_snaplogic_manager.get_attrib_values.assert_called_with(self.appname, self.hostname)

        # We test that attributes with correct targets are retrieved correctly
        assert '/var/www/mydocroot' == myapp.get_attrib_value_with_expected_target("DocumentRoot", "apache")
        assert 'some_dbname' == myapp.get_attrib_value_with_expected_target("db_name", "database")
        assert 'SOME_DBUSER' == myapp.get_attrib_value_with_expected_target("db_user", "database")


As you can see, Mock allows you to specify the return value for a given method of the mock object, in my case for the 'get_attrib_values' method. Mock also allows you to verify that the method has been called with the correct arguments. I do that by calling assert_called_with on the mock object. If you just want to verify that the method has been called at all, with no regard to the arguments, you can use assert_called.

There are many other things you can do with both Mox and Mock. Space doesn't permit me to go into many more details here, but I strongly encourage you to read the documentation and try things out on your own.

Another technique I want to show is how to simulate exceptions using the Mox framework. In my unit tests, I wanted to verify that my application reacts correctly to exceptions thrown by the SnapLogicManager class. Those exception are thrown for example when the SnapLogic server is not running. Here is the unit test I wrote:


    def test_get_attr_values_from_snaplogic_when_errors(self):
        # We simulate a SnapLogicManagerError and verify that it is caught properly

        # Create a mock SnapLogicManager
        mock_snaplogic_manager = mox.MockObject(SnapLogicManager)

        # Simulate a SnapLogicManagerError when get_attrib_values is called
        mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndRaise(SnapLogicManagerError('Boom!'))

        # Put all mocks created by mox into replay mode
        mox.Replay(mock_snaplogic_manager)

        # Run the test
        myapp = MyApp(self.appname, self.hostname, mock_snaplogic_manager)
        myapp.get_attr_values_from_snaplogic()


        # Verify all mocks were used as expected
        mox.Verify(mock_snaplogic_manager)

        # Verify that MyApp caught and logged the exception
        line = get_last_line_from_log(self.logfile)
        assert re.search('myapp - CRITICAL - get_attr_values_from_snaplogic --> SnapLogicManagerError: \'Boom!\'', line)


I used the following Mox API for simulating an exception: mock_snaplogic_manager.get_attrib_values(self.appname, self.hostname).AndRaise(SnapLogicManagerError('Boom!')).


To verify that my application reacted correctly to the exception, I checked the application log file, and I made sure that the last line logged contained the correct exception type and value.


Space does not permit me to show a Python-specific mock testing technique which for lack of a better name I call 'namespace overriding' (actually this is a bit like monkey patching, but for testing purposes; so maybe we can call it monkey testing?). I refer the reader to my blog post on 'Mock testing examples and resources' and I just quickly describe here the technique. Imagine that one of the methods of your application calls urllib.urlretrieve in order to download a file from an external Web server. Did I say external Web server, as in 'external resource not under your control'? I did, so you know that mock testing will help. My blog post shows how you can write a mocked_urlretrieve function, and override the name urllib.urlretrieve in your unit tests with your mocked version mocked_urlretrieve. Simple and elegant. The blog post also shows how you can return various canned valued from the mocked version of urlretrieve, based on different input values.

I started this article by saying that mock testing is a controversial topic in the area of unit testing. Many people feel that you should not use mock testing because you are not testing your application in the presence of the real objects on which it depends, so if the code for these objects changes, you run the risk of having your unit tests pass even though the application will break when it interacts with the real objects. This is a valid objection, and I don't recommend you go overboard with mocking every single interaction in your application. Instead, limit your mock testing, as I said in this article, to resources whose behavior and returned data are hard to control.

Another important note: whatever your unit testing strategy is, whether you use mock testing techniques or not, do not forget that you also need to have functional test and integration tests for your application. Integration tests especially do need to exercise all the resources that your application needs to interact with. For more information on different types of testing that you need to consider, please see my blog posts 'Should acceptance tests be included in the continuous build process?' and 'On the importance of functional testing'.

Thursday, July 09, 2009

Setting system-wide environment variables on RedHat-based machines

I keep forgetting this, so I'm committing it to long-term memory. If you have a RedHat-based operating system (RH, CentOS etc) and you need to set certain environment variables so they're available to every user, one good place to do it is to drop a script ending in .sh in /etc/profile.d. Then export your desired environment variables there.

Here's an example from a CentOS machine I have:

# cd /etc/profile.d
# cat java.sh 
export JAVA_HOME=/usr/java/default

Note that you can do whatever else you need in these scripts -- for example you can set up aliases etc. Every script in /etc/profile.d which ends in .sh gets sourced in /etc/profile.

Monday, July 06, 2009

Resource monitoring and graphing with Cacti in EC2

My colleague Jeff Roberts just posted a blog entry on 'Using Cacti to monitor a large scale infrastructure in Amazon’s EC2'. Highly recommended for people interested in monitoring and graphing their system resources across hundreds of nodes in Amazon EC2.

Thursday, July 02, 2009

Dark launching and other lessons from Facebook on massive deployments

I came across this note from the Engineering team at Facebook which talks about how they managed to smoothly launch their recent 'pick a username' feature. The title of the note is, appropriately enough, 'Hammering usernames' -- this is of course because they were expecting their infrastructure to be hammered.

In the note I saw for the first time a name for a strategy that teams I've been involved with have applied before: 'dark launching'. Essentially, dark launching is releasing a new feature to a subset of your users, mostly with no UI changes, but otherwise exercising all the parts of your infrastructure involved in serving that feature. A good strategy to apply when you're dealing with massive, large-scale deployments, and when you want to see how your infrastructure behaves in conditions that are as close to production as possible. Because remember, there's nothing like production! Your careful load/stress testing exercises in a lab environment ain't gonna cut it.

The note from Facebook has all sorts of other nuggets of wisdom related to massive infrastructure deployments. I recommend subscribing to the RSS feed for 'Engineering @ Facebook's Notes'.

While googling for 'dark launching', I also came across this very good post by Dare Obasanjo. Recommended reading.

Tuesday, June 30, 2009

A redbot from mnot

mnot == Mark Nottingham (RESTful Web Services guru and blogger)

redbot == Resource Expert Droid == a testing tool written in Python that "checks HTTP resources to see how they use HTTP, makes suggestions, and finds common protocol mistakes"

Use this tool if you want to see how well your Web application does in terms of HTTP connections, content negotiation, caching and validation. Very useful both for traditional Web sites and for Web services.

Monday, June 29, 2009

Presentations from AWS Start-Up Event

OpenX was invited to present as part of the AWS Start-Up Tour 2009, which stopped in Los Angeles on June 15th. John Linden, our CTO, gave a presentation on the ways we use EC2 here at OpenX. He got a lot of questions, which means people were awake and alert ;-)

Until John's slides are posted to Slideshare, here are 2 other presentations, one from eHarmony on the way they use Elastic MapReduce, and one from Amazon's own Jinesh Varia on 'Architecting for the AWS cloud'. Interesting stuff. Check out this Slideshare page for more links to other AWS-related presentations.

Thursday, May 21, 2009

The Second Law of Automated Testing

I was invited by Paul Moore and Paul Hodgetts to give a presentation at the Agile/XPSoCal monthly evening meeting, which happened last night in Irvine, at the Capital Group offices. The topic of my presentation was 'How to Get to "Done" - Agile and Automated Testing Techniques and Tools'. I think it went pretty well, there were 30+ people in attendance and I got a lot of questions at the end, which is always a good sign. Here are my slides in PDF format. I presented a lot of tools as live demos outside of the slides, but I hope that the points I made in the slides will still be useful to some people.

In particular, I want to present here what I claim to be...

The Second Law of Automated Testing

"If you ship versioned software, you need automated tests."

At the talk last night I was waiting to be asked about the first law of automated testing, but nobody ventured to ask that question ;-) (for the record, my answer would have been 'you need to buy me a beer to find that out').

But I strongly believe that if you have software that SHIPS and that is VERSIONED, then you need automated tests for it. Why? Because how would you know otherwise that version 1.4 didn't break things horribly compared to version 1.3? You either employ an army of testers to manually test each and every 1.3 feature that is present in 1.4, or you use a strong suite of automated regression tests that cover all major features of 1.3 and that show you right away if any were broken in 1.4. Your choice.

Notice that I also qualify the software as 'software that ships'. This implies that you hopefully use sound software engineering processes and techniques to build that software. I am not referring to toy projects, or 1-page Web sites for temporary events, or even academic projects that are never shipped widely to users. All these can probably survive with no automated tests.

If you think you have some software that ships and is versioned, but you found that you're doing very well with no automated tests, I'd like to hear about it, so please leave a comment.

Friday, May 15, 2009

MySQL fault-tolerance and disaster recovery techniques

Any non-trivial MySQL installation needs to be protected against failures, and especially so in a 'cloud' environment, where failure should be expected. I've had bad experiences with MySQL clustering (I tenderly refer to it as MySQL clusterf**k), so I'm going to talk about MySQL replication in this post.

The most common fault-tolerance scenario in a MySQL environment is to have a master database server and a pool of load-balanced slave database servers. Hopefully your application is configurable so it can write to the master DB and read from the slave DB pool. If it is not, you can still use this technique (with some limitations) by going through MySQL Proxy, as detailed in another blog post of mine.

There is plenty of documentation available on setting up MySQL replication. I will jot down here some notes on things I find myself doing over and over again, in a condensed format that hopefully will benefit others too.

Step 0 is to enable binary logging on the master database. That's all you need to do for a MySQL DB server to be able to function as a master. To achieve this, you can add lines like these in /etc/my.cnf and restart mysqld:
 
server-id = 1log-bin = /var/lib/mysql/mysql-bin

One other option you might want to set up is the binlog format. For recent MySQL versions, the default is STATEMENT. For some types of updates to the master, I found it is better to specify ROW as the binlog format (for an explanation of the differences between the 2 types, and for more info that you ever wanted about binary logging, see the official documentation):
binlog_format = ROW
You also need to create a MySQL user on the master DB and grant it REPLICATION SLAVE rights. You can use a statement like this:
GRANT REPLICATION SLAVE ON *.* TO 'replicant'@'IP_of_slave_DB' IDENTIFIED BY 'somepassword';


Setting up a MySQL slave when you can lock tables on the master

This is the recommended way of setting up a MySQL slave DB machine. It requires locking the tables for writes on the master DB, which is something you may or may not afford to do. Here are the steps you need to go through:

1) Connect to the master DB server and issue this command:

FLUSH TABLES WITH READ LOCK;

2) Note the binlog file name and position on the master by running this command:

SHOW MASTER STATUS;
| File| Position | Binlog_Do_DB | Binlog_Ignore_DB
| mysql-bin.000004 | 87547369 || |
1 row in set (0.01 sec)


3) Leave the current mysql session open so that the tables are still locked on the master, and in a different session take a database dump of the mysql database and of the application database on the master. You can use a command line such as:


mysqldump -u root -p$MY_ROOT_PW --database mysql \
--lock-all-tables | /bin/gzip > mysql.sql.gz

mysqldump -u root -p$MY_ROOT_PW --database $MYDB \
--lock-all-tables | /bin/gzip > $MYDB.sql.gz

4) Once the dump is done (a process which on a very large database can take hours), go ahead and unlock the tables in the first MySQL session:

UNLOCK TABLES;

5) Now you're ready to set up a MySQL slave database. It's a good idea to set up binary logging on all your slaves, so that if your master DB fails, any slave can be promoted to a master. If you do turn binary logging on, do NOT also enable log-slave-updates (because if you do, and if you promote a slave to a master, then the other slaves might receive some updates twice -- complete explanation available here).

The DB machine you want to set up as a slave should have lines similar to these in its /etc/my.cnf file (server-id needs to be different from the master ID and any other slave IDs that talk to the same master):

server-id = 2
log-bin = /var/lib/mysql/mysql-binbinlog_format = ROW
6) On the machine you want to set up as a slave, load the mysql dumps of the mysql DB and of your application database (the ones you took in step 3). Note that you may need to create the application database before you can load the application DB dump into it.

7) On the slave, fire up a mysql prompt and use the 'CHANGE MASTER TO' command to specify the master DB, the binglog file and the binlog position (you need to use the values from step 2):

STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
MASTER_HOST='master_database_server_name',
MASTER_USER='replicant',
MASTER_PASSWORD='somepassword',
MASTER_LOG_FILE='mysql-bin.000004',
MASTER_LOG_POS=87547369;
START SLAVE;

8) Run the 'SHOW SLAVE STATUS \G' command on the newly created slave DB and make sure that the values for both Slave_IO_Running and Slave_SQL_Running show as YES, and that Seconds_Behind_Master is 0 (it can take a while initially for this value to converge to 0, but it should do so). Here is an example of the output of this command:

*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: my_master_host
Master_User: replicant
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000004
Read_Master_Log_Pos: 157767054
Relay_Log_File: crt-relay-bin.000012
Relay_Log_Pos: 112340434
Relay_Master_Log_File: mysql-bin.000004
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: MYDB.tmp\_%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 112340289
Relay_Log_Space: 112340630
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File: /etc/pki/tls/cert.pem
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
1 row in set (0.00 sec)

Note that I am explicitly excluding from replication tables that start with tmp, which in my case are temporary tables created by certain operations on the master DB which are not needed on the slaves. To do this, I added this line to /etc/my.cnf on the slaves (all replication filtering is done at the slave level):
replicate-wild-ignore-table = MYDB.tmp\_%

Promoting a slave database to master

Let's say disaster strikes and your master DB goes down. At this point, if you have replication set up as above, you can easily turn one of the slave DB machines into a slave, and reconfigure the other slaves to have this newly promoted machine as their master. The official documentation for this scenario is here and it's very good. Let's slave you have master M01 and slaves S01, S02 and S03. Master 01 dies. You want to promote slave S01 to master, and set up S02 and S03 to replicate from S01.

On S01, run these commands at the MySQL prompt:
STOP SLAVE;
RESET MASTER;
CHANGE MASTER TO MASTER_HOST='';
On S02 and S03, run these commands at the MySQL prompt:
STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO MASTER_HOST='S01';
START SLAVE;

Now if you run 'SHOW SLAVE STATUS\G' on the slaves, you should see no errors, and you should also see the master DB hostname shown as 'S01' instead of 'M01'.

While we're on the subject of switching the master DB, it can happen that the slave DBs will get some udpates from the newly promoted master that will conflict with their current view of the database. For example, they can receive from the master a duplicate insert, or a delete on a row that doesn't exist in their database. In these cases, to bring the slave to a sane state, you can issue commands like this one, where N is 1 or 2 (see full explanation here):

STOP SLAVE;
SET GLOBAL SQL_SLAVE_SKIP_COUNTER = N;
START SLAVE;

You can try running the skip command repeatedly until the slave goes back to a successful replication state.

Setting up a slave database from a hot backup of the master

Let's say you have your master database up and running, and you want to set up a new slave without locking the tables for writes on the master. In this case, you can use a product such as InnoDB Hot Backup, which is very much worth its $500/year/host price. What's more, they provide a 30-day free evaluation binary tied to the host name of your DB machine, which is nice if you need something in a critical situation, or if you want to test it before committing to pay.

Here's a procedure for setting up a new slave DB from a hot backup on the master. The InnoDB Hot Backup documentation is very good, and what follows is a subset I used from that documentation.

1) On the master, create two mini configuration files which are tiny subsets of my.cnf. Call one for example my.cnf.source and the other one my.cnf.destination. The source file needs to contain lines similar to these referring to the location of your live MySQL installation:

# cat /etc/my.cnf.source
[mysqld]
datadir = /var/lib/mysql/
innodb_data_home_dir = /var/lib/mysql/
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /var/lib/mysql/
set-variable = innodb_log_files_in_group=2
set-variable = innodb_log_file_size=512M
The destination file needs to contain similar lines, but pointing to a directory where the backup files will be created (that directory needs to be empty). For example:

# cat /etc/my.cnf.destination
[mysqld]
datadir = /var/hot-backups
innodb_data_home_dir = /var/hot-backups
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /var/hot-backups
set-variable = innodb_log_files_in_group=2
set-variable = innodb_log_file_size=512M

2) On the master, run the ibbackup binary and point it to the 2 configuration files:

# /path/to/ibbackup /etc/my.cnf.source /etc/my.cnf.destination 
 
This step can be quite lengthy, depending on the size of your database, but note that you don't need to lock any tables on the master during this time. Upon the completion of this step, you should see an InnoDB data file (its name is the one you specified in the innodb_data_file_path variable in the config files), and an InnoDB transaction log called ibbackup_logfile. Note that this is not identical to the InnoDB logs on the master. To create those logs, you need to go to the next step.

3) On the master, apply the transaction logs created by the hot backup process by running this command:

# /path/to/ibbackup --apply-log /etc/my.cnf.destination

When this is done (again it can take a while), you should see N log files called ib_logfile1, ib_logfile2, ..., ib_logfileN in the destination directory -- where N is the value of the variable innodb_log_files_in_group that you set in the configuration file.

4) On the master, do a tar.gz of all directories in the MySQL datadir which contain MyISAM tables, or .frm tables from InnoDB tables (the main one being of course the mysql directory, containing the MyISAM tables for the mysql database -- assuming of course you've kept the default of MyISAM for the mysql DB).

5) Now you're ready to transfer the data file created in step 2, the log files created in step 3, and the archives created in step 4 to a new machine running MySQL, which you intend to set up as a slave DB. Simply scp the files over. On the target machine, stop mysql, move /var/lib/mysql (or wherever your datadir is) to /var/lib/mysql.bak, create a brand new /var/lib/mysql directory and drop all the files you transferred into that directory (un-tar-ing the tar.gz files appropriately). Also run 'chmod -R mysql.mysql /var/lib/mysql'. Finally, make sure the my.cnf file on the slave has binlog enabled (in case you ever need to promote this slave to a master).

6) Restart the mysqld process on the target machine, and make note of the binlog file and position, which are captured in the mysql log file. You should see a line similar to this:
InnoDB: Last MySQL binlog file position 0 6199825, file name /var/lib/mysql/mysql-bin.000008
Now go to the mysql prompt on the target machine and run:

STOP SLAVE;
RESET SLAVE;
CHANGE MASTER TO
MASTER_HOST='master_database_server_name',
MASTER_USER='replicant',
MASTER_PASSWORD='somepassword',
MASTER_LOG_FILE='mysql-bin.000008',
MASTER_LOG_POS=6199825;
START SLAVE;

At this point, 'SHOW SLAVE STATUS\G' should show no errors, and the new slave should be replicating correctly from the master DB server. It may take a while for the slave to catch up, depending on when you took the hot backup on the master.

Before I finish this post, one word of advice when it comes to mounting EBS volumes in EC2: do not mount /var by itself on an EBS, because if for some reason the EBS becomes unavailable or fails, you won't be able to ssh back into your instance. Why is that? Because sshd (at least in CentOS) needs /var/empty to be available for privilege separation purposes.

If you want to take advantage of an EBS on an EC2 instance functioning as a MySQL database server, it's better to either mount /var/lib/mysql on an EBS, or specify a non-default data directory for MySQL, which you then mount from an EBS.

UPDATE: EC2 backup strategies

An anonymous comment reminded me that I need to also discuss backups. Doh. In an EC2 environment, it's very easy to backup up a whole EBS by means of a snapshot.

Of course, if you do a snapshot with no other backups, the database files will be 'live', but I managed in one case to
1) detach an EBS containing /var/lib/mysql from an instance that was failing, and
2) attach the EBS to another instance and mount it in /var/lib/mysql

I then restarted mysqld on the new instance and everything worked as expected. This is NOT the recommended strategy however. What is recommended is to do a database dump (either a hot backup if you can afford it, or a simple mysqldump) to an EBS, and snapshot the EBS periodically.

Alternatively, you can use various S3 utilities to capture the backups directly to S3. The EBS snapshot solution is better IMO because you can quickly recreate an EBS volume from a snapshot, then mount it to either the original instance, or to a new instance.

However, EBS volumes DO sometimes fail, so another thing to think about is to run your EC2 instances (especially your slave DBs) in different availability zones. We had an issue with 2 of our database servers failing at the same time in zone US-East-1a due to EBS issues, and the thing that saved us is that we had slaves in other availability zones that weren't affected.

Thursday, April 30, 2009

MySQL load balancing and read-write splitting with MySQL Proxy

This is just a quick post which aims to say something positive about MySQL Proxy. I've been reading lots of negative blog and forum posts about it, with people complaining that it doesn't work for them. It works for us, and it works pretty well too. First things first though -- what is MySQL Proxy? Here's what the documentation says:

MySQL Proxy is a simple program that sits between your client and MySQL server(s) that can monitor, analyze or transform their communication. Its flexibility allows for unlimited uses; common ones include: load balancing; failover; query analysis; query filtering and modification; and many more.

Two fairly common usage scenarios for MySQL Proxy are:

1) load balancing across MySQL slaves
2) splitting reads and writes so that reads go to the slave DB servers and writes go to the master DB server

Of course, you don't need MySQL Proxy to accomplish these goals. For slave load balancing, you can use a regular load balancer in front of your slaves. For read-write splitting, you can have your application use different DB servers for reads and writes....but that may require significant changes to your application.

If you want to make things faster in terms of read performance by sending reads to a pool of slave DB servers, while still sending writes to a master DB, AND do all this without modifying your application, then MySQL Proxy might be just the ticket for you. Before you go down that path, let me say that if you make heavy use of MySQL prepared statements, you might be out of luck. In my testing, MySQL Proxy did not support prepared statements well.

Here's a short tutorial on using MySQL Proxy:

1) Download the binary package from the download page. I tried to install it from source, but I ran into some mysterious link issues with Lua libraries. If you didn't know already, MySQL Proxy uses Lua as its scripting language for doing the tricks it's capable of doing; not sure why the authors chose Lua, I suspect it's because of its compactness (the binary version of MySQL Proxy includes the Lua interpreter, so you don't need to install Lua separately.)

In my case I downloaded mysql-proxy-0.6.0-linux-rhas3-x86_64.tar.gz, untar-ed it in ROOT_DIR, then created a symlink called mysql-proxy in ROOT_DIR pointing to mysql-proxy-0.6.0-linux-rhas3-x86_64.

The actual binary is in ROOT_DIR/mysql-proxy/sbin and it's called mysql-proxy. You can run it with --help to see what command-line options it takes.

2) Run mysql-proxy and let it do both slave load balancing and read/write splitting. Load balancing is achieved by specifying the command-line switch --proxy-read-only-backend-addresses, while r/w splitting is achieved by specifying on the command line the script , which is in /mysql-proxy/share/mysql-proxy/rw-splitting.lua

Here is a script that I use to run mysql-proxy with the options I need, and in daemon mode. The master DB server is specified with the --proxy-backend-addresses cmdline switch. An important bit in the script is setting LUA_PATH and pointing it to the directory containing the Lua scripts. If you don't do it, the rw-splitting.lua script won't be found, and you won't know about it until you hit mysql-proxy. You'll then see errors around the script not being found.

Note that LUA_PATH is on the same line as the invocation of the mysql-proxy binary.

#!/bin/bash

MASTERDB=10.1.1.1
SLAVEDB01=10.2.1.1
SLAVEDB02=10.3.1.1
SLAVEDB03=10.4.1.1
ROOT_DIR=/usr/local

LUA_PATH="$ROOT_DIR/mysql-proxy/share/mysql-proxy/?.lua" $ROOT_DIR/mysql-proxy/sbin/mysql-proxy \
--daemon \
--proxy-backend-addresses=$MASTERDB:3306 \
--proxy-read-only-backend-addresses=$SLAVEDB01:3306 \
--proxy-read-only-backend-addresses=$SLAVEDB02:3306 \
--proxy-read-only-backend-addresses=$SLAVEDB03:3306 \
--proxy-lua-script=$ROOT_DIR/mysql-proxy/share/mysql-proxy/rw-splitting.lua


3) Now you have mysql-proxy running on its default port 4040 and ready for you to use. To use it, simply point your web application to 127.0.0.1:4040 instead of MASTER_DB_SERVER:3306. You can also connect to mysql-proxy with the regular mysql command-line client by running:

mysql -uroot -p -h127.0.0.1 -P 4040

4) To start mysql-proxy at boot time, here's a very simple init.d script which assumes you saved the script above in /var/scripts/run_mysql_proxy_rw_splitting.sh:


~# cat /etc/init.d/mysql-proxy
#!/bin/bash
#
# mysql-proxy: Start mysql-proxy in daemon mode
#
# Author: OpenX
#
# chkconfig: - 99 01
# description: Start mysql-proxy in daemon mode with r/w splitting
# processname: mysql-proxy


start(){
echo "Starting mysql-proxy..."
/var/scripts/run_mysql_proxy_rw_splitting.sh
}

stop(){
echo "Stopping mysql-proxy..."
killall mysql-proxy
}

case "$1" in
start)
start
;;
stop)
stop
;;
restart)
stop
start
;;
*)
echo "Usage: mysql-proxy {start|stop|restart}"
exit 1
esac

That's about it in a nutshell. There's much more to explore about the capabilities of MySQL Proxy, and I encourage you to read the main page and the articles linked to on that page. In terms of read/write splitting, the most helpful ones are these two blog posts by the author of rw-splitting.lua, Jan Kneschke. For general usage, this O'Reilly article by Giuseppe Maxia is very good.

Thursday, April 16, 2009

Check out OpenX Market

Techcrunch has an article today about our OpenX Market offering. Check it out if you're a publisher trying to make more money out of the ads you have on your Web site. Also lots of other news items about OpenX today...

Saturday, April 04, 2009

Experiences deploying a large-scale infrastructure in Amazon EC2

At OpenX we recently completed a large-scale deployment of one of our server farms to Amazon EC2. Here are some lessons learned from that experience.

Expect failures; what's more, embrace them

Things are bound to fail when you're dealing with large-scale deployments in any infrastructure setup, but especially when you're deploying virtual servers 'in the cloud', outside of your sphere of influence. You must then be prepared for things to fail. This is a Good Thing, because it forces you to think about failure scenarios upfront, and to design your system infrastructure in a way that minimizes single points of failure.

As an aside, I've been very impressed with the reliability of EC2. Like many other people, I didn't know what to expect, but I've been pleasantly surprised. Very rarely does an EC2 instance fail. In fact I haven't yet seen a total failure, only some instances that were marked as 'deteriorated'. When this happens, you usually get a heads-up via email, and you have a few days to migrate your instance, or launch a similar one and terminate the defective one.

Expecting things to fail at any time leads to and relies heavily on the next lesson learned, which is...

Fully automate your infrastructure deployments

There's simply no way around this. When you need to deal with tens and even hundreds of virtual instances, when you need to scale up and down on demand (after all, this is THE main promise of cloud computing!), then you need to fully automate your infrastructure deployment (servers, load balancers, storage, etc.)

The way we achieved this at OpenX was to write our own custom code on top of the EC2 API in order to launch and destroy AMIs and EBS volumes. We rolled our own AMI, which contains enough bootstrap code to make it 'call home' to a set of servers running slack. When we deploy a machine, we specify a list of slack 'roles' that the machine belongs to (for example 'web-server' or 'master-db-server' or 'slave-db-server'). When the machine boots up, it will run a script that belongs to that specific slack role. In this script we install everything the machine needs to do its job -- pre-requisite packages and the actual application with all its necessary configuration files.

I will blog separately about how exactly slack works for us, but let me just say that it is an extremely simple tool. It may seem overly simple, but that's exactly its strength, since it forces you to be creative with your postinstall scripts. I know that other people use puppet, or fabric, or cfengine. Whatever works for you, go ahead and use, just use SOME tool that helps with automated deployments.

The beauty of fully automating your deployments is that it truly allows you to scale infinitely (for some value of 'infinity' of course ;-). It almost goes without saying that your application infrastructure needs to be designed in such a way that allows this type of scaling. But having the building blocks necessary for automatically deploying any type of server that you need is invaluable.

Another thing we do which helps with automating various pieces of our infrastructure is that we keep information about our deployed instances in a database. This allows us to write tools that inspect the database and generate various configuration files (such as the all-important role configuration file used by slack), and other text files such as DNS zone files. This database becomes the one true source of information about our infrastructure. The DRY principle applies to system infrastructure, not only to software development.

Speaking of DNS, specifically in the context of Amazon EC2, it's worth rolling out your own internal DNS servers, with zones that aren't even registered publicly, but for which your internal DNS servers are authoritative. Then all communication within the EC2 cloud can happen via internal DNS names, as opposed to IP addresses. Trust me, your tired brain will thank you. This would be very hard to achieve though if you were to manually edit BIND zone files. Our approach is to automatically generate those files from the master database I mentioned. Works like a charm. Thanks to Jeff Roberts for coming up with this idea and implementing it.

While we're on the subject of fully automated deployments, I'd like to throw an idea out there that I first heard from Mike Todd, my boss at OpenX, who is an ex-Googler. One of his goals is for us never to have to ssh into any production server. We deploy the server using slack, the application gets installed automatically, monitoring agents get set up automatically, so there should really be no need to manually do stuff on the server itself. If you want to make a change, you make it in a slack role on the master slack server, and it gets pushed to production. If the server misbehaves or gets out of line with the other servers, you simply terminate that server instance and launch another one. Since you have everything automated, it's one command line for terminating the instance, and another one for deploying a brand new replacement. It's really beautiful.

Design your infrastructure so that it scales horizontally

There are generally two ways to scale an infrastructure: vertically, by deploying your application on more powerful servers, and horizontally, by increasing the number of servers that support your application. For 'infinite' scaling in a cloud computing environment, you need to design your system infrastructure so that it scales horizontally. Otherwise you're bound to hit limits of individual servers that you will find very hard to get past. Horizontal scaling also eliminates single points of failure.

Here are a few ideas for deploying a Web site with a database back-end so that it uses multiple tiers, with each tier being able to scale horizontally:

1) Deploy multiple Web servers behind one or more load balancers. This is pretty standard these days, and this tier is the easiest to scale. However, you also want to maximize the work done by each Web server, so you need to find the sweet spot of that particular type of server in terms of httpd processes it can handle. Too few processes and you're wasting CPU/RAM on the server, too many and you're overloading the server. You also need to be cognizant of the fact that each EC2 instance costs you money. It can become so easy to launch a new instance that you don't necessarily think of getting the most out of the existing instances. Don't go wild unless absolutely necessary if you don't want to have a sticker shock when you get the bill from Amazon at the end of the month.

2) Deploy multiple load balancers. Amazon doesn't yet offer load balancers, so what we've been doing is using HAProxy-based load balancers. Let's say you have an HAProxy instance that handles traffic for www.yourdomain.com. If your Web site becomes wildly successful, it is imaginable that a single HAProxy instance will not be able to handle all the incoming network traffic. One easy solution for this, which is also useful for eliminating single points of failure, is to use round-robin DNS, pointing www.yourdomain.com to several IP addresses, with each IP address handled by a separate HAProxy instance. All HAProxy instances can be identical in terms of back-end configuration, so your Web server farm will get 1/N of the overall traffic from each of your N load balancers. It worked really well for us, and the traffic was spread out very uniformly among the HAProxies. You do need to make sure the TTL on the DNS record for www.yourdomain.com is low.

3) Deploy several database servers. If you're using MySQL, you can set up a master DB server for writes, and multiple slave DB servers for reads. The slave DBs can sit behind an HAProxy load balancer. In this scenario, you're limited by the capacity of the single master DB server. One thing you can do is to use sharding techniques, meaning you can partition the database into multiple instances that each handle writes for a subset of your application domain. Another thing you can do is to write to local databases deployed on the Web servers, either in memory or on disk, and then periodically write to the master DB server (of course, this assumes that you don't need that data right away; this technique is useful when you have to generate statistics or reports periodically for example).

4) Another way of dealing with databases is to not use them, or at least to avoid the overhead of making a database call each time you need something from the database. A common technique for this is to use memcache. Your application needs to be aware of memcache, but this is easy to implement in all of the popular programming languages. Once implemented, you can have your Web servers first check a value in memcache, and only if it's not there have them hit the database. The more memory you give to the memcached process, the better off you are.

Establish clear measurable goals

The most common reason for scaling an Internet infrastructure is to handle increased Web traffic. However, you need to keep in mind the quality of the user experience, which means that you need to keep the response time of the pages your serve under a certain limit which will hopefully meet and surpass the user's expectations. I found it extremely useful to have a very simple script that measures the response time of certain pages and that graphs it inside a dashboard-type page (thanks to Mike Todd for the idea and the implementation). As we deployed more and more servers in order to keep up with the demands of increased traffic, we always kept an eye on our goal: keep reponse time/latency under N milliseconds (N will vary depending on your application). When we would see spikes in the latency chart, we knew we need to act at some level of our infrastructure. And this brings me to the next point...

Be prepared to quickly identify and eliminate bottlenecks

As I already mentioned in the design section above, any large-scale Internet infrastructure will have different types of servers: web servers, application servers, database servers, memcache servers, and the list goes on. As you scale the servers at each tier/level, you need to be prepared to quickly identify bottlenecks. Examples:

1) Keep track of how many httpd processes are running on your Web servers; this depends on the values you set for MaxClients and ServerLimit in your Apache configuration files. If you're using an HAProxy-based load balancer, this also depends on the connection throttling that you might be doing at the backend server level. In any case, the more httpd processes are running on a given server, the more CPU and RAM they will use up. At some point, the server will run out of resources. At that point, you either need to scale the server up (by deploying to a larger EC2 instance, for example an m1.large with more RAM, or a c1.medium with more CPU), or you need to scale your Web server farm horizontally by adding more Web servers, so the load on each server decreases.

2) Keep track of the load on your database servers, and also of slow queries. A great tool for MySQL database servers is innotop, which allows you to see the slowest queries at a glance. Sometimes all it takes is a slow query to throw a spike into your latency chart (can you tell I've been there, done that?). Also keep track of the number of connections into your database servers. If you use MySQL, you will probably need to bump up the max_connections variable in order to be able to handle an increased number of concurrent connections from the Web servers into the database.

Since we're discussing database issues here, I'd be willing to bet that if you were to discover your single biggest bottleneck in your application, it would be at the database layer. That's why it is especially important to design that layer with scalability in mind (think memcache, and load balanced read-only slaves), and also to monitor the database servers carefully, with an eye towards slow queries that need to be optimized (thanks to Chris Nutting for doing some amazing work in this area!)

3) Use your load balancer's statistics page to keep track of things such as concurrent connections, queued connections, HTTP request or response errors, etc. One of your goals should be never to see queued connections, since that means that some user requests couldn't be serviced in time.

I should mention that a good monitoring system is essential here. We're using Hyperic, and while I'm not happy at all with its limits (in the free version) in defining alerts at a global level, I really like its capabilities in presenting various metrics in both list and chart form: things like Apache bytes and requests served/second, memcached hit ratios, mysql connections, and many other statistics obtained by means of plugins specific to these services.

As you carefully watch various indicators of your systems' health, be prepared to....

Play wack-a-mole for a while, until things get stable

There's nothing like real-world network traffic, and I mean massive traffic -- we're talking hundreds of millions of hits/day -- to exercise your carefully crafted system infrastructure. I can almost guarantee that with all your planning, you'll still feel that a tsunami just hit you, and you'll scramble to solve one issue after another. For example, let's say you notice that your load balancer starts queuing HTTP requests. This means you don't have enough Web server in the pool. You scramble to add more Web servers. But wait, this increases the number of connections to your database pool! What if you don't have enough servers there? You scramble to add more database servers. You also scramble to increase the memcache settings by giving more memory to memcached, so more items can be stored in the cache. What if you still see requests taking a long time to be serviced? You scramble to optimize slow database queries....and the list goes on.

You'll say that you've done lots of load testing before. This is very good....but it still will not prepare you for the sheer amount of traffic that the internets will throw at your application. That's when all the things I mentioned before -- automated deployment of new instances, charting of the important variables that you want to keep track of, quick identification of bottlenecks -- become very useful.

That's it for this installment. Stay tuned for more lessons learned, as I slowly and sometimes painfully learn them :-) Overall, it's been a blast though. I'm really happy with the infrastructure we've built, and of course with the fact that most if not all of our deployment tools are written in Python.

Tuesday, March 17, 2009

HAProxy and Apache performance tuning tips

I want to start my post by a big shout-out to Willy Tarreau, the author of HAProxy, for his help in fine-tuning one of our HAProxy installations and working with us through some issues we had. Willy is amazingly responsive and obviously lives and breathes stuff related to load balancing, OS and TCP stack tuning, and other arcane subjects ;-)

Let's assume you have a cluster of Apache servers behind an HAProxy and you want to sustain 500 requests/second with low latency per request. First of all, you need to bump up MaxClients and ServerLimit in your Apache configuration, as I explained in another post. In this case you would set both variables to 500. Note that you actually need to stop and start the httpd service, because simply restarting it won't change the built-in limit (which is 256). Also ignore the warning that Apache gives you on startup:

WARNING: MaxClients of 500 exceeds ServerLimit value of 256 servers,
lowering MaxClients to 256. To increase, please see the ServerLimit
directive.

Note that the more httpd processes you have, the more CPU and RAM will be consumed on the server. You need to decide how much to push the envelope in terms of concurrent httpd processes you can sustain on a given server. A good measure is the latency / responsiveness you expect from your Web application. At some point, it will start to suffer, and that will be a sign that you need to add a new Web server to your server farm (of course, this over-simplifies things a bit, since there's always the question of the database layer; I'm assuming you can use memcache to minimize database access.) Here's a good overview of the trade-offs related to MaxClients.

Other Apache configuration variables I've tweaked are StartServers, MinSpareServers and MaxSpareServers. It sometimes pays to bump up the values for these variables, so you can have spare httpd processes waiting around for those peak times when the requests hitting your server suddenly increase. Again, there's a trade-off here between server resources and number of spare httpd processes you want to maintain.

Assuming you fine-tuned your Apache servers, it's time to tweak some variables in the HAProxy configuration. Perhaps the most important ones for our discussion are the number of maximum connections per server (maxconn), httpclose and abortonclose.

It's a good idea to throttle the maximum number of connections per server and set it to a number related to the request/second rate you're shooting for. In our case, that number is 500. Since HAProxy itself needs some connections for healthchecking and other internal bookkeeping, you should set the maxconn per server to something slightly lower than 500. In terms of syntax, I have something similar to this in the backend section of haproxy.cfg:

server server1 10.1.1.1:80 check maxconn 500

I also have the following 2 lines in the backend section:


option abortonclose
option httpclose

According to the official HAProxy documentation, here's what these options do:

option abortonclose

In presence of very high loads, the servers will take some time to respond.
The per-instance connection queue will inflate, and the response time will
increase respective to the size of the queue times the average per-session
response time. When clients will wait for more than a few seconds, they will
often hit the "STOP" button on their browser, leaving a useless request in
the queue, and slowing down other users, and the servers as well, because the
request will eventually be served, then aborted at the first error
encountered while delivering the response.

As there is no way to distinguish between a full STOP and a simple output
close on the client side, HTTP agents should be conservative and consider
that the client might only have closed its output channel while waiting for
the response. However, this introduces risks of congestion when lots of users
do the same, and is completely useless nowadays because probably no client at
all will close the session while waiting for the response. Some HTTP agents
support this behaviour (Squid, Apache, HAProxy), and others do not (TUX, most
hardware-based load balancers). So the probability for a closed input channel
to represent a user hitting the "STOP" button is close to 100%, and the risk
of being the single component to break rare but valid traffic is extremely
low, which adds to the temptation to be able to abort a session early while
still not served and not pollute the servers.

In HAProxy, the user can choose the desired behaviour using the option
"abortonclose". By default (without the option) the behaviour is HTTP
compliant and aborted requests will be served. But when the option is
specified, a session with an incoming channel closed will be aborted while
it is still possible, either pending in the queue for a connection slot, or
during the connection establishment if the server has not yet acknowledged
the connection request. This considerably reduces the queue size and the load
on saturated servers when users are tempted to click on STOP, which in turn
reduces the response time for other users.

option httpclose

As stated in section 2.1, HAProxy does not yes support the HTTP keep-alive
mode. So by default, if a client communicates with a server in this mode, it
will only analyze, log, and process the first request of each connection. To
workaround this limitation, it is possible to specify "option httpclose". It
will check if a "Connection: close" header is already set in each direction,
and will add one if missing. Each end should react to this by actively
closing the TCP connection after each transfer, thus resulting in a switch to
the HTTP close mode. Any "Connection" header different from "close" will also
be removed.

It seldom happens that some servers incorrectly ignore this header and do not
close the connection eventough they reply "Connection: close". For this
reason, they are not compatible with older HTTP 1.0 browsers. If this
happens it is possible to use the "option forceclose" which actively closes
the request connection once the server responds.

And now for something completely different.....TCP stack tuning! Even with all the tuning above, we were still seeing occasional high latency numbers. Willy Tarreau to the rescue again....he was kind enough to troubleshoot things by means of the haproxy log and a tcpdump. It turned out that some of the TCP/IP-related OS variables were set too low. You can find out what those values are by running:

sysctl -a | grep ^net

In our case, the main one that was out of tune was:

net.ipv4.tcp_max_syn_backlog = 1024

Because of this, when there were more than 1,024 concurrent sessions on the machine running HAProxy, the OS had to recycle through the SYN backlog, causing the latency issues. Here are all the variables we set in /etc/sysctl.conf at the advice of Willy:
net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65023
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 10000
(to have these values take effect, you need to run 'sysctl -p')

That's it for now. As I continue to use HAProxy in production, I'll report back with other tips/tricks/suggestions.

Wednesday, March 04, 2009

HAProxy, X-Forwarded-For, GeoIP, KeepAlive

I know the title of this post doesn't make much sense, I wrote it that way so that people who run into issues similar to mine will have an easier time finding it.

Here's a mysterious issue that I recently solved with the help of my colleague Chris Nutting:

1) Apache/PHP server sitting behind an HAProxy instance
2) MaxMind's GeoIP module installed in Apache
3) Application making use of the geotargeting features offered by the GeoIP module was sometimes displaying those features in a drop-down, and sometimes not

It turns out that the application was using the X-Forwarded-For headers in the HTTP requests to pass the real source IP of the request to the mod_geoip module and thus obtain geotargeting information about that IP. However, mysteriously, HAProxy was sometimes (once out of every N requests) not sending the X-Forwarded-For headers at all. Why? Because KeepAlive was enabled in Apache, so HAProxy was sending those headers only on the first request of the HTTP connection that was being "kept alive". Subsequent requests in that connection didn't have those headers set, so those requests weren't identified properly by mod_geoip.

The solution in this case was to disable KeepAlive in Apache. Willy Tarreau, the author of HAProxy, also recommends setting 'option httpclose' in the HAProxy configuration file. Here's an excerpt from the official HAProxy documentation:

option forwardfor [ except  ] [ header  ]
....
It is important to note that as long as HAProxy does not support keep-alive
connections, only the first request of a connection will receive the header.
For this reason, it is important to ensure that "option httpclose" is set
when using this option.

I hope this post will be of some use to people who might run into this issue.

Tuesday, February 24, 2009

You're not a cloud provider if you don't provide an API

Cloud computing is all the rage these days, with everybody and their brother claiming to be a 'cloud provider'. Just because hosting companies have a farm of virtual servers that they can parcel out to their customers, it doesn't mean that they are operating 'in the cloud'. For that to be the case, they need to offer a solid API that allows their customers to manage resources such as virtual server instances, storage mounts, IP addresses, load balancer pools, firewall rules, etc.

A short discussion on 'XaaS' nomenclature is in order here: 'aaS' stands for 'as a Service', and X can take various values, for example P==Platform, S==Software, I==Infrastructure. You will see these acronyms in pretty much every industry-sponsored article about cloud computing. Pundits seem to love this kind of stuff. When I talk about cloud providers in this post, I mean providers of 'Infrastructure as a Service', things like the ones I mentioned above -- virtual servers, networking and storage resources, in short the low-level plumbing of an infrastructure.

A good example of 'Platform as a Service' is Google AppEngine, which offers both a development environment (right now Python-specific), and an API to interact with the 'Google cloud' when deploying your GAE application.

'Software as a Service' is pretty much what 'ASP' used to be in the dot com days (ASP == Application Service Provider if you don't remember your acronyms). The poster child for SaaS these days seems to be salesforce.com. I do however emphasize that one significant difference between SaaS and ASP is that SaaS providers DO offer an API for your application to interact with the resources they expose.

So...the common thread between the XaaS offerings is the existence of an API which allows you, as a systems and/or application architect, to interact with and manage the resources offered by the particular provider.

I've been using two cloud APIs here at OpenX, one from AppNexus and one from Amazon EC2. The AppNexus API allows you to reserve physical servers, start up, shut down and delete virtual instances on each server, clone a virtual instance, manage load balancer pools and SSL certificates at the LB level, etc. In short, it's a very solid and easy to use API.

The Amazon EC2 API is more fine grained than the one from AppNexus, which can be an advantage, but also makes it hard to coordinate the management of various resources. For example, to launch an EC2 instance you first need to create a keypair, potentially a security group, maybe an EBS volume and an elastic IP, and only then you can tie everything together via yet other EC2 API calls. For this reason, we're building our own tools around the Amazon API, tools which allow us to deploy an instance with all its associated resources via a single command-line script (and yes, we call this collection of tools the MCP). We're also using slack to deploy specific packages and applications to each instance we launch, but that's a topic for another post.

So what does all this mean to you as a systems or application architect? For a system administrator, I think it means that you need to shore up your programming skills so that you will be able to take advantage of these APIs and automate the deployment, testing and scaling of your infrastructure. For an application architect, it means that you need to shore up your sysadmin skills so you can understand the lower-level resources exposed by cloud APIs and use them to your full advantage. I think the future is bright for people who possess both sets of skills.

Tuesday, February 17, 2009

Helping the 'printable world wide web' movement

Alexander Artemenko pointed out to me that my blog lacks a sane CSS stylesheet for printing. He was nice enough to provide one for me (since I'm no CSS wizard), and I inserted it into my blog's template. Alexander runs a campaign for convincing bloggers to make their blog content printable.

BTW, here's all I had to add to my Blogger template to make the content printable:


<style type="text/css">
@media print {
#sidebar, #navbar-iframe, #blog-header,
#comments h4, #comments-block, #footer,
span.statcounter, #b-backlink {display: none;}
#wrap, #content, #main-content {width: 100%; margin: 0; background: #FFFFFF;}
}
</style>

Wednesday, February 04, 2009

Load Balancing in Amazon EC2 with HAProxy

Until the time comes when Amazon will offer a load balancing service in their EC2 environment, people are forced to use a software-based load balancing solution. One of the most common out there is HAProxy. I've been looking at it for the past 2 months or so, and recently we started to use it in production here at OpenX. I am very impressed with its performance and capabilities. I'll explore here some of the functionality that HAProxy offers, and also discuss some of the non-obvious aspects of its configuration.

Installation

I installed HAProxy via yum. Here's the version that was installed using the default CentOS repositories on a CentOS 5.x box:
# yum list installed | grep haproxy

haproxy.i386 1.3.14.6-1.el5 installed


The RPM installs an init.d service called haproxy that you can use to start/stop the haproxy process.

Basic Configuration

In true Unix fashion, all configuration is done via a text file: /etc/haproxy/haproxy.cfg. It's very important that you read the documentation for the configuration file. The official documentation for HAProxy 1.3 is here.

Emulating virtual servers

In version 1.3, you can specify a frontend section, which defines an IP address/port pair for requests coming into the load balancer (think of it as a way to specify a virtual server/virtual port pair on a traditional load balancer), and multiple backend sections for each frontend, which correspond to the real IP addresses and ports of the backend servers handling the requests. If you can assign multiple external IP addresses to your HAProxy server, then you can have each one of these IPs function as a virtual server (via a frontend declaration), sending traffic to real servers declared in a backend.

However, one fairly large limitation of EC2 instances is that you only get one external IP address per instance. This means that you can have HAProxy listen on port 80 on a single IP address in EC2. How then can you have multiple 'virtual servers' on an EC2 HAProxy load balancer? The answer is in a new feature of HAProxy called ACLs.

Here's what the official documentation says:

2.3) Using ACLs
---------------

The use of Access Control Lists (ACL) provides a flexible solution to perform
content switching and generally to take decisions based on content extracted
from the request, the response or any environmental status. The principle is
simple :

- define test criteria with sets of values
- perform actions only if a set of tests is valid

The actions generally consist in blocking the request, or selecting a backend.

So let's say for example that you want to handle both www.example1.com and www.example2.com using the same HAProxy instance, but you want to load balance traffic for www.example1.com to server1 and server2 with IP addresses 192.168.1.1 and 192.168.1.2, while traffic for www.example2.com gets load balanced to server3 and server4 with IP addresses 10.0.0.3 and 10.0.0.4. Traffic for other domains will be sent to a default backend.

First, you define a frontend section in haproxy.cfg similar to this:

frontend myfrontend *:80
log global
maxconn 25000
option forwardfor
acl acl_example1 url_sub example1
acl acl_example2 url_sub example2
use_backend example1_farm if acl_example1
use_backend example2_farm if acl_example2
default_backend default_farm

This tells haproxy that there are 2 ACLs defined -- one called acl_example1, which is triggered if the incoming HTTP request is for a URL that contains the expression 'example1', and one called acl_example2, which is triggered if the incoming HTTP request is for a URL that contains the expression 'example2'.

If acl_example1 is triggered, the backend used will be example1_farm. If acl_example2 is triggered, the backend used will be example2_farm. If no acl is triggred, the default backend used will be default_farm.

This is the simplest form of ACLs. HAProxy supports many more, and you're strongly advised to read the ACL section in the documentation for a more in-depth discussion. However, the URL-based ACLs are very useful especially in an EC2 environment.

The backend sections of haproxy.cfg will look similar to this:

backend example1_farm
mode http
balance roundrobin
server server1 192.168.1.1:80 check
server server2 192.168.1.2:80 check
backend example2_farm
mode http
balance roundrobin
server server3 10.0.0.3:80 check
server server4 10.0.0.4:80 check
backend default_farm
mode http
balance roundrobin
server server5 192.168.1.5:80 check
server server6 192.168.1.6:80 check

Logging

You can have haproxy log to syslog, but first you need to allow syslog to receive UDP traffic from 127.0.0.1 on port 514. I'll discuss syslog-ng here, with its configuration file in /etc/syslog-ng/syslog-ng.conf. To allow the UDP traffic I mention, add the line 'udp(ip(127.0.0.1) port(514));' to the source s_sys section, which in my case looks like this:

source s_sys {
file ("/proc/kmsg" log_prefix("kernel: "));
unix-stream ("/dev/log");
internal();
udp(ip(127.0.0.1) port(514));
};
Also add a filter for facility local 0:

filter f_filter9 { facility(local0); };
And finally associate that filter with the d_mesg destination, which sends messages to /var/log/messages:

log { source(s_sys); filter(f_filter9); destination(d_mesg); };
Restart syslog-ng via its init.d script.

Now for the HAProxy configuration -- you need to have a line similar to this in the 'global' section of haproxy.cfg:

global
log 127.0.0.1 local0 info
This tells haproxy to log to facility 'local0' on the localhost using the severity 'info'. You could send logs to a remote syslog server just as well.

Once you define this in the global section, you can specify the logging mechanism either in the default section (which means that all frontends will log in this way), or by a frontend-to-frontend case. If you want to have it in the default section, just write:

defaults
log global
Once you restart haproxy, you should see messages like this in /var/log/messages:
Feb  2 22:39:49 127.0.0.1 haproxy[19150]: Connect from A.B.C.D:44463 to 10.0.0.1:80 (your_frontend_name/HTTP)
However, if you want you're handling HTTP traffic and you would like to see the exact HTTP requests handled by HAProxy, you need to add this line either to the default section, or to a specific frontend:

mode httplog
In this case, the log will contain lines that look like a regular Apache combined log line.

A caveat: if you do enable logging in httplog mode, make sure /var has lots of disk space. If your HAProxy will handle a lot of traffic, the messages file will become very large, very fast. Just don't have /var be part of the typically small / partition, or you can be in a world of trouble.

Logging the client source IP in the backend web logs

One issue with load balancers and reverse proxies is that the backend servers will see traffic as always originating from the IP address of the LB or reverse proxy. This is obviously a problem when you're trying to get stats from your web logs. To mitigate this issue, many LBs/proxies use the X-Forwarded-For header to send the IP address of the client to the destination server. HAProxy offers this functionality via the forwardfor option. You can simply declare

option forwardfor

in your backend, and all your backend servers will receive the X-Forwarded-For header.

Of course, you also have to tell your Web server to handle this header in its log file. In Apache you need to modify the LogFormat directive and replace %h with %{X-Forwarded-For}i.

SSL

To handle SSL traffic in HAProxy, you need 3 things:

1) Define a frontend with a unique name which handles *:443
2) Send traffic to real_server_IP_1:443 through real_server_IP_N:443 in the backend(s) associated with the frontend
3) Specify 'mode tcp' instead of 'mode http' both in the frontend section and in the backend section(s) which handle port 443. Otherwise you won't see any SSL traffic hitting your real servers, and you'll wonder why....

Load balancing algorithms

HAProxy can handle several load balancing algorithms:
  • round-robin: requests are rotated among the servers in the backend -- note that servers declared in the backend section also accept a weight parameter which specifies their relative weight in that backend; the round-robin algorithm will respect that weight ratio
  • leastconn: the request is sent to the server with the lowest number of connections; round-robin is used if servers are similarly loaded
  • source: a hash of the source IP is divided by the total weight of the running servers to determine which server will receive the request; this ensures that clients from the same IP address always hit the same server, which is a poor man's session persistence solution
  • uri: the part of the URL up to a question mark is hashed and used to choose a server that will handle the request; this is useful when you want certain sub-parts of your web site to be served by certain servers (this is used with proxy caches to maximize the cache hit rate)
  • url_param: can be used to check certain parts of the URL, for example values sent via POST requests; for example a request which specifies a user_id parameter with a certain value can get directed to the same server using the url_param method -- so this is another form of achieving session persistence in some cases (see the documentation for more details)
Session persistence with cookies

If you're OK with the fact that not all client browsers accept cookies, and you still want to use cookies as a session persistence mechanism, then HAProxy offers an easy way to do so. If you add this line to the backend section:

cookie SERVERID insert nocache indirect

then you're telling HAProxy to insert a cookie named SERVERID in the HTTP response; the cookie will be sent to the client browser via a Set-Cookie header in the response, and which is sent back by the client in a Cookie header in all subsequent requests. Note that this cookie is only a session cookie, and will not be written to disk by the client browser. For this reason, and for issues related to caching, the documentation recommends specifying the other 2 options 'nocache' and 'indirect'. In particular, 'indirect' means that the cookie will be removed from the HTTP request once it is processed by HAProxy, so your application running on the backend servers will never see it.

Once you define the cookie, you need to associate it with the servers in the backend, like this:

server server1 10.1.1.1:80 cookie server01 check
server server2 10.1.1.2:80 cookie server02 check

If a client request will get sent to server serverN initially, the cookie will insert a SERVERID corresponding to serverN in the response. In the requests that follow, the client will send back this SERVERID in the cookie and hence will be directed to the same server for the duration of the session.

Server health checks

HA
Proxy verifies the health of the servers declared in the backend section by sending them periodic HTTP requests. You need to specify 'check' in the server declaration line. Here is the appropriate section from the official documentation:
check
This option enables health checks on the server. By default, a server is
always considered available. If "check" is set, the server will receive
periodic health checks to ensure that it is really able to serve requests.
The default address and port to send the tests to are those of the server,
and the default source is the same as the one defined in the backend. It is
possible to change the address using the "addr" parameter, the port using the
"port" parameter, the source address using the "source" address, and the
interval and timers using the "inter", "rise" and "fall" parameters. The
request method is define in the backend using the "httpchk", "smtpchk",
and "ssl-hello-chk" options. Please refer to those options and parameters for
more information.

Performance tuning

Section 1.2 of the official documentation details the variables you can set to tweak maximum performance out of your HAProxy. The only parameter I found critical so far is maxconn, which in some of the sample configuration files was set to 2,000. This means that if HAProxy is hit with more than 2,000 concurrent connections, only the first 2,000 will be serviced, and the subsequent ones will be queued. For this reason, I recommend you set maxconn to a high number (such as 25,000 for example) in all the sections of your haproxy.cfg file: default, frontend and backend.

From what I've seen so far, the performance of HAProxy itself is very satisfactory. Even on an EC2 m1.small instance, HAProxy took less than 1% CPU for a web site we maintain that was hit with around 20,000 connections. I can guarantee that you will discover many other bottlenecks in your infrastructure long before HAProxy itself becomes your bottleneck. The only caveat in all this is the maxconn parameter above, which you do need to set to a high value to avoid unnecessary throttling of connections at the HAProxy layer.

Utilization statistics

HAProxy offers very nice utilization statistics, with tables showing the servers in all declared backends. Here's how these tables look like:

my_website

QueueSessionsBytesDeniedErrorsWarningsServer
CurMaxLimitCurMaxLimitTotalLbTotInOutReqRespReqConnRespRetrRedisStatusWghtActBckChkDwnDwntmeThrtle
server0100-12704-177432175934132022286108951418
0
5672 1498
1h6m UP1Y-861056s-
server0200-13716-177364176665132655773110034272
0
13218 699
10m48s UP1Y-54525s

To enable stats, add lines such as these to the either the 'defaults' section, or to a specific backend section:

stats enable
stats uri /lb?stats
stats realm Haproxy\ Statistics
stats auth myusername:mypassword


Then hit http://external.ip.of.haproxy/lb?stats and you'll be presented with a basic HTTP authentication dialog. Log in with the credentials you specified.

High-availability strategies

In an ideal situation, you would have 2 HAProxy instances using a heartbeat-type protocol and sharing an external IP address. In case one of them goes down, the other one would assume the IP and your site will be available at all times. You could use Linux-HA, or Wackamole and the Spread toolkit. However, this is not possible in Amazon EC2 because IP addresses cannot be shared among instances in the manner that heartbeat-type protocols expect.

What you can do instead is to use an Elastic IP and associate it with your HAProxy instance. Then you can have another stand-by HAProxy instance kept in sync with the live one (only the haproxy.cfg needs to be rsync-ed across). Your monitoring system can then detect when the live HAProxy instance goes down, and automatically assign the Elastic IP address to the other instance using for example the EC2 API Tools command ec2-associate-address.

Tuesday, January 27, 2009

Tadalist -- simple but powerful task management

I've been using the free Tadalist service lately to keep track of my TODO lists. It has a deceivingly simple user interface, but for me at least it turned out to be a powerful ally in keeping on top of my ever-increasing TODO lists.

In Tadalist you can only do a few things: create a list, add an item to a list, edit the list title, edit the description of an item, check off an item as done, and reorder the items in the list. It turns out this is really all you need. I especially like the feeling of checking off an item and seeing it drop to the bottom of the list, in smaller font, joining the list of tasks that are DONE! It's almost as addictive as seeing those dots when you run unit tests. Reordering items is also a very nice feature, because an item that wasn't so hot yesterday can become really critical today, in which case you want it at the top of the list.

One feature I'd like to see is for checked off items to also get a timestamp, so you can go back and see when exactly you completed a given task.

If you're not using any task management software (in which case I hope you're still using old-school pen and paper), then give Tadalist a try.

BTW -- what task management software have YOU used successfully? Please leave a comment.

Saturday, January 24, 2009

Book review: "Pro Django" by Marty Alchin

I volunteered to write a book review for Apress and I chose "Pro Django" by Marty Alchin (of course, I received the book for free at that point). Here's my review:

If you are serious about developing Web applications in Django, then "Pro Django" will be a great addition to your technical library. Note, however, that the "Pro" in the title really means "professional", "in-depth", at times even "obscure" -- so please, do not pick up this book if you're just starting out with Django. To really get the most out of this book, you need to already have at least one, and preferably several Django applications under your belt.

I personally just finished a small fun project for my daughter's 8th grade Science Fair: a Web site written of course in Django where her friends can take a fun science-related quiz and see if they improve their score the second time around, after being told the correct answer for each question. It was my first Django application, and I used the online documentation and tutorial (both very good), as well as the online Django book. I also used Sams' "Teach yourself Django in 24 hours" by Brad Dayley, which was very helpful for a beginner like me.

I say all this because in reading "Pro Django", you need to be already familiar with the core concepts of Django: models, views, templates, forms, and the all-important URL configuration. You won't get a feel for these concepts unless you actually start writing a Web application and understand the hard way how everything fits together. Once you have this understanding, and if you want to continue on the path of creating more Django apps, it's time for you to pick up "Pro Django".

Marty Alchin doesn't waste time delving into aspects of Python which are typically not used to their full potential by many people (including me): metaclasses, introspection, decorators, descriptors. In fact, the themes of introspection, customization and extension (which all take advantage of the dynamic nature of Python) keep coming up in almost every chapter of the book.

For example, the 'Models' chapter shows how to subclass model fields and how the use of metaclasses allows a field to know its name, and the class it was assigned to. The chapter also talks about the nifty technique of creating models dynamically at runtime. The 'URLs and Views' chapter goes into the gory details of the Django URL configuration mechanism, and shows how to use decorators to make views as generic as possible.

My favorite chapter was 'Handling HTTP'. It exemplifies what for me is the best part about Alchin's book: showing readers where and how to insert their own advanced processing code into the hooks provided by Django, without disturbing the flow of the framework. This is typically one of the hard parts of learning a Web framework, and Marty Alchin does a great job of explaining how to achieve a maximum of effect with a minimum of effort in this area, for example by writing your own middleware modules and inserting them into Django.

I also liked the last two chapters, 'Coordinating applications' and 'Enhancing applications', which show practical examples of code aggregated in mini-applications. In fact, this is also the main gripe I have about this book: I wish the author used more mini-applications throughout the book to explain the advanced concepts he described. He did show code snippets for each concept, but they were all isolated, and sometimes hard to place into the context of an application. I realize that space was limited, but it would have been so much nicer to see a real application being built and described throughout the book, with more and more functionality added at each stage.

Overall, I really enjoyed reading "Pro Django". However, reading such a book is just a start. What I really need to do is to start writing code and applying some of the new techniques I learned. I can't wait to do it!

Wednesday, January 21, 2009

Watch that Apache KeepAlive setting!

The Apache KeepAlive directive specifies that TCP/IP connections from clients to the Apache server are to be kept 'alive' for a given duration specified by the value of KeepAliveTimeout (the default is 15 seconds). This is useful when you serve out heavy HTML with embedded images or other resources, since browsers will open just one TCP/IP connection to the Apache server and all the resources from that page will be retrieved via that connection.

If, however, your Apache server handles small individual resources (such as images), then KeepAlive is overkill, since it will make every TCP connection linger for N seconds. Given a lot of clients, this can quickly saturate your Apache server in terms of network connections.

So...if you have a decent server that doesn't seem to be overloaded in terms of CPU/memory, yet Apache is slow-to-unresponsive, check out the KeepAlive directive and try setting it to Off. Note that the default value is On.

More Apache performance tuning tips are in the official Apache documentation.

Sunday, January 04, 2009

Happy New Year and....Teach Me Web Testing!

Happy New Year everybody! I hope 2009 won't rush by as quickly as 2008 did...

Now for the 'Teach Me Web Testing' part: Steve Holden graciously offered to be the host of an Open Space at PyCon 2009 on this topic. Steve started the 'Teach Me...' series at the last PyCon, with his now famous 'Teach Me Twisted' session.

For this format to work, we need to put together an audience which is formed of at least 3 types of people:

1) people interested in learning about Web testing in Python
2) people who write Python Web testing tools for fun and profit
3) people who use Python Web testing tools extensively for fun and profit

My role here is to rally people in categories 2 and 3. So if you're either a Web testing tool author or somebody who uses Web testing tools extensively in your job, please either comment on this post, or send me email at grig at gheorghiu dot net and let me know if you'd be interested in attending this Open Space session. Knowing Steve, I can guarantee it will be LOTS of fun.

Tuesday, December 16, 2008

Some issues when restoring files using duplicity

I blogged a while back about how to do incremental encrypted backups to S3 using duplicity. I've been testing the restore procedure for some of my S3 backups, and I had a problem with the way duplicity deals with temporary directories and files it creates during the restore.

By default, duplicity will use the system default temporary directory, which on Unix is usually /tmp. If you have insufficient disk space in /tmp for the files you're trying to restore from S3, the restore operation will eventually fail with "IOError: [Errno 28] No space left on device".

One thing you can do is create another directory on a partition with lots of disk space, and specify that directory in the duplicity command line using the --tempdir command line option. Something like: /usr/local/bin/duplicity --tempdir=/lotsofspace/temp

However, it turns out that this is not sufficient. There's still a call to os.tmpfile() buried in the patchdir.py module installed by duplicity. Consequently, duplicity will still try to create temporary files in /tmp, and the restore operation will still fail. As a workaround, I solved the issue in a brute-force kind of way by editing /usr/local/lib/python2.5/site-packages/duplicity/patchdir.py (the path is obviously dependent on your Python installation directory) and replacing the line:

tempfp = os.tmpfile()

with the line:

tempfp, filename = tempdir.default().mkstemp_file()

(I also needed to import tempdir at the top of patchdir.py; tempdir is a module which is part of duplicity and which deals with temporary file and directory management -- I guess the author of duplicity just forgot to replace the call to os.tmpfile() with the proper calls to the tempdir methods such as mkstemp_file).

This solved the issue. I'll try to open a bug somehow with the duplicity author.

Friday, December 12, 2008

Working with Amazon EC2 regions

Now that Amazon offers EC2 instances based in data centers in Europe, there is one more variable that you need to take into account when using the EC2 API: the concept of 'region'. Right now there are 2 regions to choose from: us-east-1 (based of course in the US on the East Coast), and the new region eu-west-1 based in Western Europe. Knowing Amazon, they will probably launch data centers in other regions across the globe -- Asia, South America, etc.

Each region has several availability zones. You can see the current ones in this nice article from the AWS Developer Zone. The default region is us-east-1, with 3 availability zones (us-east-1a, 1b and 1c). If you don't specify a region when you call an EC2 API tool, then the tool will query the default region. That's why I was baffled when I tried to launch a new AMI in Europe; I was calling 'ec2-describe-availability-zones' and it was returning only the US ones. After reading the article I mentioned, I realized I need to have 2 versions of my scripts: the old one I had will deal with the default US-based region, and the new one will deal with the Europe region by adding '--region eu-west-1' to all EC2 API calls (you need the latest version of the EC2 API tools from here).

You can list the zones available in a given region by running:

# ec2-describe-availability-zones --region eu-west-1
AVAILABILITYZONE eu-west-1a available eu-west-1
AVAILABILITYZONE eu-west-1b available eu-west-1
Note that all AWS resources that you manage belong to a given region. So if you want to launch an AMI in Europe, you have to create a keypair in Europe, a security group in Europe, find available AMIs in Europe, and launch a given AMI in Europe. As I said, all this is accomplished by adding '--region eu-west-1' to all EC2 API calls in your scripts.

Another thing to note is that the regions are separated in terms of internal DNS too. While you can access AMIs within the same zone based on their internal DNS names, this access doesn't work across regions. You need to use the external DNS name of an instance in Europe if you want to ssh into it from an instance in the US (and you also need to allow the external IP of the US instance to access port 22 in the security policy for the European instance.)

All this introduces more headaches from a management/automation point of view, but the benefits obviously outweigh the cost. You get low latency for your European customers, and you get more disaster recovery options.

Thursday, December 11, 2008

Deploying EC2 instances from the command line

I've been doing a lot of work with EC2 instances lately, and I wrote some simple wrappers on top of the EC2 API tools provided by Amazon. These tools are Java-based, and I intend to rewrite my utility scripts in Python using the boto library, but for now I'm taking the easy way out by using what Amazon already provides.

After downloading and unpacking the EC2 API tools, you need to set the following environment variables in your .bash_profile file:
export EC2_HOME=/path/to/where/you/unpacked/the/tools/api
export EC2_PRIVATE_KEY = /path/to/pem/file/containing/your/ec2/private/key
export EC2_CERT = /path/to/pem/file/containing/your/ec2/cert
You also need to add $EC2_HOME/bin to your PATH, so the command-line tools can be found by your scripts.

At this point, you should be ready to run for example:
# ec2-describe-images -o amazon
which lists the AMIs available from Amazon.

If you manage more than a handful of EC2 AMIs (Amazon Machine Instances), it quickly becomes hard to keep track of them. When you look at them for example using the Firefox Elasticfox extension, it's very hard to tell which is which. One solution I found to this is to create a separate keypair for each AMI, and give the keypair a name that specifies the purpose of that AMI (for example mysite-db01). This way, you can eyeball the list of AMIs in Elasticfox and make sense of them.

So the very first step for me in launching and deploying a new AMI is to create a new keypair, using the ec2-add-keypair API call. Here's what I have, in a script called create_keypair.sh:
# cat create_keypair.sh
#!/bin/bash

KEYNAME=$1

if [ -z "$KEYNAME" ]
then
echo "You must specify a key name"
exit 1

fi

ec2-add-keypair $KEYNAME.keypair > ~/.ssh/$KEYNAME.pem
chmod 600 ~/.ssh/$KEYNAME.pem

Now I have a pem file called $KEYNAME.pem containing my private key, and Amazon has my public key called $KEYNAME.keypair.

The next step for me is to launch an 'm1.small' instance (the smallest instance you can get from EC2) whose AMI ID I know in advance (it's a 32-bit Fedora Core 8 image from Amazon with an AMI ID of ami-5647a33f). I am also using the key I just created. My script calls the ec2-run-instances API.
# cat launch_ami_small.sh
#!/bin/bash

KEYNAME=$1

if [ -z "$KEYNAME" ]
then
echo "You must specify a key name"
exit 1

fi

# We launch a Fedora Core 8 32 bit AMI from Amazon
ec2-run-instances ami-5647a33f -k $KEYNAME.keypair --instance-type m1.small -z us-east-1a
Note that the script makes some assumptions -- such as the fact that I want my AMI to reside in the us-east-1a availability zone. You can obviously add command-line parameters for the availability zone, and also for the instance type (which I intend to do when I rewrite this in Python).

Next, I create an EBS volume which I will attach to the AMI I just launched. My create_volume.sh script takes an optional argument which specifies the size in GB of the volume (and otherwise sets it to 50 GB):
# cat create_volume.sh
#!/bin/bash

SIZE=$1
if [ -z "$SIZE" ]
then
SIZE=50

fi

ec2-create-volume -s $SIZE -z us-east-1a
The volume should be created in the same availability zone as the instance you intend to attach it to -- in my case, us-east-1a.

My next step is to attach the volume to the instance I just launched. For this, I need to specify the instance ID and the volume ID -- both values are returned in the output of the calls to ec2-run-instances and ec2-create-volume respectively.

Here is my script:

# cat attach_volume_to_ami.sh
#!/bin/bash

VOLUME_ID=$1
AMI_ID=$2

if [ -z "$VOLUME_ID" ] || [ -z "$AMI_ID" ]
then
echo "You must specify a volume ID followed by an AMI ID"
exit 1

fi

ec2-attach-volume $VOLUME_ID -i $AMI_ID -d /dev/sdh

This attaches the volume I just created to the AMI I launched and makes it available as /dev/sdh.

The next script I use does a lot of stuff. It connects to the new AMI via ssh and performs a series of commands:
* format the EBS volume /dev/sdh as an ext3 file system
* mount /dev/sdh as /var2, and copy the contents of /var to /var2
* move /var to /var.orig, create new /var
* unmount /var2 and re-mount /dev/sdh as /var
* append the mounting as /dev/sdh as /var to /etc/fstab so that it happens upon reboot

Before connecting via ssh to the new AMI, I need to know its internal DNS name or IP address. I use ec2-describe-instances to list all my running AMIs, then I copy and paste the internal DNS name of my newly launched instance (which I can isolate because I know the keypair name it runs with).

Here is the script which formats and mounts the new EBS volume:

# cat format_mount_ebs_as_var_on_ami.sh
#!/bin/bash

AMI=$1
KEYNAME=$2

if [ -z "$AMI" ] || [ -z "$KEY" ]
then
echo "You must specify an AMI DNS name or IP followed by a keypair name"
exit 1

fi

CMD='mkdir /var2; mkfs.ext3 /dev/sdh; mount -t ext3 /dev/sdh /var2; \
mv /var/* /var2/; mv /var /var.orig; mkdir /var; umount /var2; \
echo "/dev/sdh /var ext3 defaults 0 0" >>/etc/fstab; mount /var'

ssh -i ~/.ssh/$KEY.pem root@$AMI $CMD
The effect is that /var is now mapped to a persistent EBS volume. So if I install MySQL for example, the /var/lib/mysql directory (where the data resides by default in Fedora/CentOS) will be automatically persistent. All this is done without interactively logging in to the new instance. so it can be easily scripted as part of a larger deployment procedure.

That's about it for the bare-bones stuff you have to do. I purposely kept my scripts simple, since I use them more to remember what EC2 API tools I need to run than anything else. I don't do a lot of command-line option stuff and error-checking stuff, but they do their job.

If you run scripts similar to what I have, you should have at this point a running AMI with a 50 GB EBS volume mounted as /var. Total running time of all these scripts -- 5 minutes at most.

As soon as I have a nicer Python script which will do all this and more, I'll post it here.

Thursday, December 04, 2008

New job at OpenX

I meant to post this for a while, but haven't had the time, because...well, it's a new job, so I've been quite swamped. I started 2 weeks ago as a system engineer at OpenX, a company based in Pasadena, whose main product is an Open Source ad server. I am part of the 'black ops' team, and my main task for now is to help with deploying and scaling the OpenX Hosted service within Amazon EC2 -- which is just one of several cloud computing providers that OpenX uses (another one is AppNexus for example).

Lots of Python involved in this, lots of automation, lots of testing, so all this makes me really happy :-)

Here is some stuff I've been working on, which I intend to post on with more details as time permits:

* command-line provisioning of EC2 instances
* automating the deployment of the OpenX application and its pre-requisites
* load balancing in EC2 using HAProxy
* monitoring with Hyperic
* working with S3-backed file systems

I'll also start working soon with slack, a system developed at Google for automatic provisioning of files via the interesting concept of 'roles'. It's in the same family as cfengine or puppet, but simpler to use and with a powerful inheritance concept applied to roles.

All in all, it's been a fun and intense 2 weeks :-)

Sunday, November 30, 2008

The sad state of open source monitoring tools

I've been looking lately at open source network monitoring tools. I'm not impressed at all by what I've seen so far. Pretty much the least common denominator when it comes to this type of tools is Nagios, which is not a bad tool (I used it a few years ago), but did you see its Web interface? It's soooooo 1999 -- think 'Perl CGI scripts'!

A slew of other tools are based on the Nagios engine, and are trying hard to be more pleasing to the eye -- Opsview and GroundWork are some examples. Opsview seems just a wrapper around Nagios, with not a lot of improvements in terms of both functionality and UI.

I looked at the GroundWork screencast and it seemed promising, but when I tried to install it I had a very unpleasant experience. First of all, the install script uses curses (did those guys hear about unattended installs?), and requires Java 1.5. Although I had both Java 1.5 and 1.6 on my CentOS server, and JAVA_HOME set correctly, it didn't stop the installer from complaining and exiting. Good riddance.

I should say that the first open source network monitoring tool that I tried was Zenoss, which is supposed to be the poster child for Python-based monitoring tools. Believe me, I tried hard to like it. I even went back and gave it a second chance, after noticing that other tools aren't any better. But to no avail -- I couldn't get past the sensation that it's a half-baked tool, with poor documentation and obscure user interface. It could work fine if you just want to monitor some devices with SNMP, but as soon as you try to extend it with your own plugins (called Zen Packs), or if you try to use their agents (called Zen Plugins), you run into a wall. At least I did. I got tired of Python tracebacks, obscure references to 'restarting Zope' (I thought it's based on twisted), fiddling with values for the so-called zProperties of a device, trying unsuccessfully to get ssh key authentication to work with the Zen Plugins, etc, etc. I'm not the only one who went through these frustrations either -- there are plenty of other users saying in the Zenoss forums that they've had it, and that they're going to look for something else. Which is what I did too.

I also tried OpenNMS, which was better than Zenoss, but it still had a CGI feel in terms of its Web interface.

So...for now I settled on Hyperic. It's a Java-based tool with a modern Web interface, very good documentation, and it's extensible via your own plugins (which you can write in any language you want, as long as you conform to some conventions which are not overly restrictive). Hyperic uses agents that you install on every server you need to monitor. I don't mind this, I find it better than configuring SNMP to death. It does have it quirks -- for example it calls devices that it monitors 'platforms' (instead of just 'devices' or 'servers'), and it calls the plugins that monitor specific services 'servers' (instead of services). Once you get used to it, it's not that bad. However, I wish there was a standard nomenclature for this stuff, as well as a standard way for these tools to inter-operate. As it is, you have to learn each tool and train your brain to ignore all the weirdness that it encounters. Not an optimal scenario by any means.

I'm very curious to see what tools other people use. If you care to leave a comment about your monitoring tool of choice, please do so!

I'll report back with more stuff about my experiences with Hyperic.

Friday, November 21, 2008

Issues with Ubuntu 8.10 on Lenovo T61p laptop

I got a new Lenovo ThinkPad T61p, and of course I promptly installed Ubuntu Ibex 8.10 on it. The first day I used it, I had no issues, but this morning it froze no less than 3 times, and each time the Caps Lock light flashed. I googled around, and I found what I hope is the solution in this post on the Ubuntu forums. It seems that this is the core issue:

System lock-ups with Intel 4965 wireless

The version of the iwlagn wireless driver for Intel 4965 wireless chipsets included in Linux kernel version 2.6.27 causes kernel panics when used with 802.11n or 802.11g networks. Users affected by this issue can install the linux-backports-modules-intrepid package, to install a newer version of this driver that corrects the bug. (Because the known fix requires a new version of the driver, it is not expected to be possible to include this fix in the main kernel package.)

As recommended, I did 'apt-get install backports-modules-intrepid' and I rebooted. That was around 1 hour ago, and I haven't seen any issues since. Hopefully that was it. BTW, when the Caps Lock light blinks, it means 'kernel panic'. Who knew.

Thursday, November 13, 2008

Python and MS Azure

You've probably heard by now of Microsoft's entry in the cloud computing race, dubbed Azure. What I didn't know until I saw it this morning on InfoQ was that Microsoft encourages the use of languages and tools other than their official ones. Here's what they say on the 'What is the Azure Service Platform' page:

"Windows Azure is an open platform that will support both Microsoft and non-Microsoft languages and environments. Windows Azure welcomes third party tools and languages such as Eclipse, Ruby, PHP, and Python."

While you and I may think MS says this just for marketing/PR purposes, it turns out they are walking the walk a bit. I was glad to see in the InfoQ article that a Microsoft guy wrote a Python wrapper on top of the Azure Data Storage APIs. Note that this is classic CPython, not IronPython. I assume more interesting stuff can be done with IronPython.

Wednesday, November 12, 2008

"phrase from nearest book" meme

Via Elliot:

  • Grab the nearest book.
  • Open it to page 56.
  • Find the fifth sentence.
  • Post the text of the sentence in your journal along with these instructions.
  • Don’t dig for your favorite book, the cool book, or the intellectual one: pick the CLOSEST.
Here's mine, from 'Kim' by Kipling:

"A little later a marriage procession would strike into the Grand Trunk with music and shoutings, and a smell of marigold and jasmine stronger even than the reek of the dust."

Not bad, I like it :-)

Friday, October 31, 2008

Migrating SSL certs from IIS to Apache

If you ever find yourself facing this task, use this guide -- worked perfectly for me.

I work for this guy


Best Halloween costume I've ever seen. You rock Nate!!!

Monday, October 27, 2008

This is depressing: Ken Thompson is also a googler

Just found out that Ken Thompson, one of the creators of Unix, works at Google. You can see his answers to various questions addressed to Google engineers by following the link with his name on this page.

So let's see, Google has hired:

* Ken Thompson == Unix
* Vint Cerf == TCP/IP
* Andrew Morton == #2 in Linux
* Guido van Rossum == Python
* Ben Collins-Sussman and Brian Fitzpatrick == subversion
* Bram Moolenaar == vim

...and I'm sure there are countless others that I missed.

If this isn't a march towards world domination, I don't know what is :-)

Thursday, October 16, 2008

The case of the missing profile photo

Earlier today I posted a blog entry, then I went to view it on my blog, only to notice that my profile photo was conspicuously absent. I double-checked the URL for the source of the image -- it was http://agile.unisonis.com/gg.jpg. Then I remembered that I recently migrated agile.unisonis.com to my EC2 virtual machine. I quickly ssh-ed into my EC2 machine and saw that the persistent storage volume was not mounted. I ran uptime and noticed that it only showed 8 hours, so the machine had somehow been rebooted. In my experiments with setting up that machine, I had failed to add a line to /etc/fstab that causes the persistent storage volume to be mounted after the rebooted. Easily rectified:

echo "/dev/sds /ebs1 ext3 defaults 0 0" >> /etc/fstab

I connected to my EC2 environment with ElasticFox and saw that the EBS volume was still attached to my machine instance as /dev/sds, so I mounted it via 'mount /dev/sds/ /ebs1', then restarted httpd and mysqld, and all my sites were again up and running.

I tested my setup by rebooting. After the reboot, another surprise: httpd and mysqld were not chkconfig-ed on, so they didn't start automatically. I fixed that, I rebooted again, and finally everything came back as expected.

A few lessons learned here in terms of hosting your web sites in 'the cloud':

1) you need to test your machine setup across reboots
2) you need automated tests for your machine setup -- things like 'is httpd chkconfig-ed on?'; 'is /dev/sds mounted as /ebs1 in /etc/fstab?'
3) you need to monitor your sites from a location outside the cloud which hosts your sites; I shouldn't have to eyeball a profile photo to realize that my EC2 instance is not functioning properly!

I'll cover all these topics and more soon in some other posts, so stay tuned!

Recommended book: "Scalable Internet Architectures"

One of my co-workers, Nathan, introduced me to this book -- "Scalable Internet Architectures" by Theo Schlossnagle. I read it in one sitting. Recommended reading for anybody who cares about scaling their web site in terms of both web/application servers and database servers. It's especially appropriate in our day and age, when cloud computing is all the rage (more on this topic in another series of posts). My preferred chapters were "Static Content Serving" (talks about wackamole and spread) and "Static Meets Dynamic" (talks about web proxy caches such as squid).

I wish the database chapter contained more in-depth architectural discussions; instead, the author spends a lot of time showing a Perl script that is supposed to illustrate some of the concepts in the chapter, but falls very short of that in my opinion.

Overall though, highly recommended.

Wednesday, October 08, 2008

Example Django app needed

Dear lazyweb, I need a good sample Django application (with a database backend) to run on Amazon EC2. If the application has Ajax elements, even better.

Comments with suggestions would be greatly appreciated!

Thursday, October 02, 2008

Update on EC2 and EBS

I promised I'll give an update on my "Experiences with Amazon EC2 and EBS" post from a month ago. Well, I just got an email from Amazon, telling me:

Greetings from Amazon Web Services,

This e-mail confirms that your latest billing statement is available on the AWS web site. Your account will be charged the following:

Total: $73.74

So there you have it. That's how much it cost me to run the new SoCal Piggies wiki, as well as some other small sites, with very little traffic. Your mileage will definitely vary, especially if you run a high-traffic site.

I also said I'll give an update on running a MySQL database on EBS. It turns out it's really easy. On my Fedora Core 8 AMI, I did this:

* installed mysql packages via yum:

yum -y install mysql mysql-server mysql-devel

* moved the default data directory for mysql (/var/lib/mysql) to /ebs1/mysql (where /ebs1 is the mount point of my 10 GB EBS volume), then symlinked /ebs1/mysql back to /var/lib, so that everything continues to work as expected as far as MySQL is concerned:

service mysqld stop
mv /var/lib/mysql /ebs1/mysql
ln -s /ebs1/mysql /var/lib
service mysqld start

That's about it. I also used the handy snapshot functionality in the ElasticFox plugin and backed up the EBS volume to S3. In case you lose your existing EBS volume, you just create another volume from the snapshot, specify a size for it, and associate it with your AMI instance. Then you mount it as usual.

Update 10/03/08

In response to comments inquiring about a more precise breakdown of the monthly cost, here it is:

$0.10 per Small Instance (m1.small) instance-hour (or partial hour) x 721 hours = $72.10

$0.100 per GB Internet Data Transfer - all data transfer into Amazon EC2 x 0.607 GB = $0.06

$0.170 per GB Internet Data Transfer - first 10 TB / month data transfer out of Amazon EC2 x 2.719 GB = $0.46

$0.010 per GB Regional Data Transfer - in/out between Availability Zones or when using public IP or Elastic IP addresses x 0.002 GB = $0.01

$0.10 per GB-Month of EBS provisioned storage x 9.958 GB-Mo = $1.00

$0.10 per 1 million EBS I/O requests x 266,331 IOs = $0.03

$0.15 per GB-Month of EBS snapshot data stored x 0.104 GB-Mo = $0.02

$0.01 per 1,000 EBS PUT requests (when saving a snapshot) x 159 Requests = $0.01

EC2 TOTAL: $73.69
Other S3 costs (outside of EC2): $0.05

GRAND TOTAL: $73.74

Friday, September 19, 2008

Presubmit testing at Google

Here is an interesting blog post from Marc Kaplan, test engineering manager at Google, on their strategy of running what they call 'presubmit tests' -- tests that are run automatically before the code gets checked in. They include performance tests, and they compare the performance of the new code with baselines from the previous week, then report back nice graphs showing the delta. Very cool.