Saturday, February 27, 2010

Use HAProxy 1.4 if you need MySQL health checks

I was using the latest version of HAProxy 1.3 to load balance backend MySQL servers while also checking their ports, so that a server that went down would be taken out of the load balancing pool. However, since the port checks in HAProxy happen at the TCP level, the MySQL instance being hit by the port checks wasn't happy, because the checks weren't proper MySQL connections. As a result, after a certain number of checks, MySQL refused to allow clients to connect, with a message like this:

OperationalError: (1129, "Host 'myhost' is blocked because of many connection errors; unblock with 'mysqladmin flush-hosts'")

Solution: upgrade HAProxy to the newly released version 1.4 (at the date of this writing, the exact version is 1.4.0). Documentation is here.

For MySQL specific checks, you can specify 'option mysql-check' in a backend or 'listen' section of the configuration file. For example, I have something similar to this in my HAProxy configuration file:

listen mysql
mode tcp
option mysql-check
balance roundrobin
server mysql1 check port 3306
server mysql2 check port 3306 backup

Monday, February 22, 2010

My top 10 PyCon takeaways

PyCon 2010 in Atlanta was a blast as always. While I still have things fresh on my mind, here are my top 10 takeaways from the conference, in no particular order.

1) Alternative Python implementations are getting increased attention

It seemed to me that PyPy, Unladen Swallow, IronPython and Jython got much more buzz this year. Maybe this is also due to the announcement that Unladen Swallow will be merged into Python 3.x. I recommend you watch Holger Krekel's talk on the topic of the diverse and healthy Python ecosystem, 'The Ring of Python'.

I was also glad to see that 2 core Jython developers, Frank Wierzbicki and Jim Baker, were hired by Sauce Labs.

2) Testing has gone mainstream

When Mark Shuttleworth mentions automated testing in his keynote as one of the most important ingredients of a sound software engineering process, you know that automated testing has arrived.

There were also no fewer than 6 testing-related talks, all very good, given by the usual suspects in the testing world, people like Ned Batchelder, Titus Brown, Holger Krekel and Michael Foord.

3) Packaging and packaging

I liked Antonio Rodriguez's distinction between packaging with small 'p' (distutils, setuptools, distribute) and Packaging with big 'P' (the web site). Both are very important. There was a lot of attention to packaging, and a great show of support for Tarek Ziade's efforts in leading the way to improve how Python packages are distributed. And I think Antonio was right in pointing out that the site needs some redesign to get a more modern and streamlined look and feel.

4) I tweet, thus I exist

I came late to the Twitter party, barely a month ago. I was resistant at first because I considered tweeting a waste of time. I still think it has a strong tendency to shorten your attention span and break your focus, so I personally need to discipline myself in how I use it.

But Twitter is a great way to keep your finger on the pulse of topics that interest you -- and at PyCon, if you didn't tweet or at least read other people's tweets, you were out of touch, out of the picture. Alex Gaynor and company did a great job with their PyCon Live Stream site, which was pretty much the dashboard of the conference.

5) The testing goat

Terry Peppers started a new meme during the TiP BoF: the testing goat. Read Terry's post and also Titus's post for more details on it, but suffice it to say it was a huge success. And speaking of the TiP BoF, it ballooned from last year to this one. I estimate around 120 attendees, so more than 10% of the people at the conference. Pizza provided by Disney (thanks to Paul Hildebrandt and Roy Turner), beer provided by Dr. Brown and friends, great lightning talks and unceasing heckling made this into one of the highlights of the conference.

6) Healthy ecosystem of Python web frameworks

Two or three years ago, all the buzz was about Django and maybe TurboGears. This year, a lot of presenters talked about other frameworks -- Pylons in particular, but also tornado, CherryPy, restish. It does feel like Django is the granddaddy of them all, but it also feels to me like Pylons is being preferred by big name/big traffic web sites such as reddit. Tornado of course is a newcomer, and we're using it very successfully at Evite. The presenter from Lolapps said they were also experimenting with it and were going to put it in production for some portions of their site.

7) Inspirational keynotes

I thought the keynotes were of much higher quality than in previous years. Mark Shuttleworth talked about 'Cadence, quality and design' (see the bitsource interview), while Antonio Rodriguez gave a very inspirational presentation on topics such as involving everybody in your company in coding (he knows, it sounds crazy...), about the strategic advantages of using Python, about putting more stable libraries into the stdlib (he mentioned httplib2, and I couldn't agree more -- we need that library in the stdlib!), and other stuff that you can see on his pycon 2010 page. You need to watch the video of his keynote though in order to appreciate the impact that it had (videos from PyCon are being made available as we speak on

One thing though -- I am a big Ubuntu fan, I have it on both my laptop(s) and desktop, and yet I was pained to see that Mark Shuttleworth couldn't use his slide deck because his laptop couldn't properly display a dual screen when using the conference projector. I struggled to make it work myself before delivering my presentation. Ubuntu really needs to get better dual-screen configuration management software.

8) It's all about the hallway discussions

For first comers to PyCon, or for people who intend to go next year, a word of advice: skip some of the presentations and instead join random people in hallway discussions, or for a beer at the bar. Trust me, you'll learn more than in almost any presentation. And you'll potentially make friends that you'll recognize the next time you go to PyCon. I've done this for 6 years now, and it never fails to amaze me how easy it is to get into deep technical discussions over a mind-bending range of topics. Non-technical discussions are usually mind-bending at PyCon too ;-)

9) More talks of the advanced type please

I heard from many people (and Titus has been saying it for years) that they wished the talks were a bit more advanced. I realize PyCon needs to cater to all types of Python users, from beginning to intermediate to expert, but still the conference track could use a larger number of advanced, mind-exploding, challenging presentations (such as Raymond Hettinger's talk). I understand though that next year there will be exactly such a track, dubbed 'Extreme Python', so I'm very much looking forward to it.

10) Top-notch organization

Finally, kudos to Van Lindberg, this year's PyCon chair, and the rest of the organizers, for delivering an almost perfect experience to the more than 1,000 attendees. I thought the food was great, the WiFi was better than usual, the sessions almost always went smoothly (minus projector issues), and there was great fun and camaraderie in the air. That's why PyCon is my (and many other people's) favorite conference. Keep it up guys!

Friday, February 19, 2010

Slides for 'Creating RESTful Web Services with restish'

For those who are interested, the slides for my PyCon 2010 talk 'Creating RESTful Web Services with restish' are online. Leave comments please if you attended the talk and want to start discussions on some of the topics I mentioned.

Update 02/25/10: the video is up too

Thursday, February 04, 2010

Web site monitoring techniques and tools

When monitoring a Web site, you need to look at it both from a 'micro' perspective (i.e. are the individual devices and servers in your infrastructure running smoothly?) and from a 'macro' perspective (i.e. do your customers have a pleasant experience when accessing and using your site, and can they use your site's functionality to conduct their business online?). You can also think about these types of monitoring as 'engineering' monitoring vs. 'business' monitoring. They're both important, but most system administrators obviously focus on the engineering/micro-type monitoring, since it's closer to their skills and interests, and tend not to put much emphasis on the business/macro-type monitoring.

In this post I'll talk about various types of monitoring we're doing for the Evite Web site. I'll focus on what I call 'deep application monitoring', which to me means gathering and displaying metrics about your Web application's behavior as seen by the users.

System infrastructure monitoring

This is the 'micro' or 'engineering' type of monitoring that is the bread and butter of a sysops team. I need to eat crow here and say that there is a great open source monitoring tool out there, and its name is Nagios. That's what we use to monitor the health of all our internal servers and network devices, and to alert us via email and SMS if anything goes wrong. It works great, it's easy to set up and deploy automatically, it has tons of plugins to monitor pretty much everything under the sun. My mistake in the past was expecting it to also be a resource graphing tool, but that was simply asking too much from one single tool.

For resource graphing and visualization we use a combination of cacti, ganglia and munin. They're all good, but going forward I'll use munin because it's easy to set up and automate, and also allows you to group devices together and see the same metric (for example system load) across the devices in the group. This makes it easy to spot trends and outliers.

One non-typical aspect of our use of Nagios is that we run passive checks from the servers being monitored to the main Nagios server using NSCA. This avoids the situation where the central Nagios server becomes a bottleneck because it needs to poll all the devices it monitors either via ssh or via NRPE. In our setup, each plugin deployed on a monitored server pushes its own notifications to the central server using send_nsca. It's also fairly easy to write your own Nagios plugins.
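
As a rough illustration of the push model described above, here is a minimal Python sketch of a check that reports its own result through send_nsca. It assumes the send_nsca client is installed on the monitored server and that it reads tab-delimited "host, service, return code, plugin output" lines on standard input. The function names and the /etc/send_nsca.cfg path are hypothetical, not taken from our actual setup.

```python
import subprocess

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def format_passive_result(host, service, code, output):
    """Build one tab-delimited passive service check line for send_nsca."""
    if code not in (OK, WARNING, CRITICAL, UNKNOWN):
        raise ValueError("invalid Nagios return code: %r" % (code,))
    # A result must fit on a single line, so collapse tabs and newlines.
    clean = " ".join(output.split())
    return "%s\t%s\t%d\t%s\n" % (host, service, code, clean)


def push_result(line, nagios_host, config="/etc/send_nsca.cfg"):
    """Pipe one result line to send_nsca; the config path is an assumption."""
    subprocess.run(
        ["send_nsca", "-H", nagios_host, "-c", config],
        input=line.encode("utf-8"),
        check=True,
    )
```

A plugin run from cron would call format_passive_result() with its own host name and service description, then hand the line to push_result(). The host and service names have to match a passive service defined on the central Nagios server.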

What kinds of things do we monitor and graph? The usual suspects -- system load, CPU, memory, disk, processes -- but also things like number of database connections on each Web/app server, number of Apache and Tomcat/Java processes AND threads, size of the mail queue on mail servers, MySQL queries/second, threads, throughput...and many others.

In general, the more metrics you capture and graph, the easier it is to spot trends and correlations. Because that's what you need to do during normal day-to-day traffic on your Web site -- you need to establish baseline numbers which will tell you right away if something goes wrong, for example when you're being hit with more traffic than usual. Having a baseline makes it easy to spot spikes and flare-ups in the metrics you capture. Noticing correlations between these metrics makes it easier for you to troubleshoot things like site slowness or unresponsiveness.
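
One simple way to turn a baseline into an automated check is sketched below in Python. It flags a sample that sits more than k standard deviations above the mean of recent normal samples. The function name and the default k=3 are illustrative assumptions, not something our graphing tools do for us, and a real baseline would also need to account for daily and weekly traffic cycles.

```python
from statistics import mean, stdev


def is_spike(baseline, value, k=3.0):
    """Return True if value is more than k sample standard deviations
    above the mean of the baseline samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    return value > mean(baseline) + k * stdev(baseline)
```

For example, with recent database connection counts of 10, 12, 11, 9 and 13 per Web server, a reading of 30 would be flagged while a reading of 14 would not.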

For example, if you see that the Web servers aren't being pounded with more traffic than usual, yet the number of database connections on each Web server keeps going up, you can quickly look at the database metrics and see that indeed the database is slower than usual. It could be that there's a disk I/O problem, or maybe the database is working hard to reindex a large table...but in any case your monitoring and graphing systems should be your friends in diagnosing the problem.

I compare system infrastructure monitoring with unit testing in a development project. Just like when you develop you write a unit test every time you find a bug, with monitoring you add another metric or resource to be monitored and graphed every time you discover a problem with your Web site. Both system monitoring and unit testing work at the micro/engineering level.

By the way, one of the best ways to learn a new system architecture is to delve into the monitoring system, or to roll one out if there's none. This will make you deeply familiar with every server and device in your infrastructure.

External HTTP monitoring

This type of monitoring typically involves setting up your own monitoring server outside of your site's infrastructure, or using a 3rd party service such as Pingdom, Keynote or Gomez. We've been using Pingdom and have been very happy with it. While it doesn't provide the in-depth analysis that Keynote or Gomez do, it costs much less and still gives you a great window into the performance and uptime of your site, as seen by an external user.

In any such system, you choose certain pages within your Web application and you have the external monitor check for a specific character string within the page, to make sure the expected content is there, as opposed to a cryptic error message.
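
If you run your own external monitor instead of a 3rd party service, the content check can be sketched in a few lines of Python. This is only an illustration using the standard library. The function names are hypothetical, and the URL and the marker string you pass in depend entirely on your application.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def evaluate(status, body, marker):
    """Decide whether a fetched page looks healthy: HTTP 200 and the
    expected text present in the body."""
    if status != 200:
        return False, "unexpected HTTP status %d" % status
    if marker not in body:
        return False, "expected text %r not found" % marker
    return True, "ok"


def check_page(url, marker, timeout=10):
    """Fetch url and return (ok, detail) for the monitor to report."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, marker)
    except HTTPError as exc:
        # 4xx/5xx responses are raised as exceptions by urlopen.
        return evaluate(exc.code, "", marker)
    except OSError as exc:
        # Covers DNS failures, refused connections and timeouts.
        return False, "fetch failed: %s" % exc
```

Checking for a string that only appears on a correctly rendered page catches the case where the server answers with HTTP 200 but the body is an error message.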

However, with Pingdom you can go deeper than that. They provide 'custom HTTP checks' which allow you to deploy an internal monitor reachable from the outside via HTTP, which checks whatever resource you need within your internal infrastructure, and sends some well-formatted XML back to Pingdom. This is useful when you have a farm of web/app servers behind a load balancer, and you want to check each one without exposing an external IP address for each server. Or you may have processes running on various port numbers on your web/app servers, and you don't want to expose those directly to Pingdom checks. Instead, you only need to open port 80 to one server, the one running the custom checks.

We've built a RESTful Web service based on restish for this purpose. We call it from Pingdom with certain parameters in the query string, and based on those we run various checks such as 'is our Web application app1 running on web servers A, B, C and D functioning properly?'. If the application doesn't respond as expected on server C, we send details about the error in the XML payload back to Pingdom and we get alerted on it.
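
The sketch below shows the general shape of such a check in plain Python, leaving out the restish resource that would wrap it. It is not our production code. The XML element names follow Pingdom's custom HTTP check format as I understand it (a status element that must read OK, plus a response time in milliseconds), so verify them against Pingdom's documentation. The helper names are hypothetical.

```python
import time
import xml.etree.ElementTree as ET


def pingdom_xml(ok, response_time_ms, detail=""):
    """Render a check result as the XML payload sent back to Pingdom.
    Any status text other than "OK" is meant to be treated as a failure."""
    root = ET.Element("pingdom_http_custom_check")
    ET.SubElement(root, "status").text = "OK" if ok else (detail or "DOWN")
    ET.SubElement(root, "response_time").text = "%.3f" % response_time_ms
    return ET.tostring(root, encoding="unicode")


def run_check(check):
    """Time a zero-argument check callable; an exception means failure
    and its message becomes the status text."""
    start = time.perf_counter()
    try:
        check()
        ok, detail = True, ""
    except Exception as exc:
        ok, detail = False, "%s: %s" % (type(exc).__name__, exc)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return pingdom_xml(ok, elapsed_ms, detail)
```

The Web service would pick the check callable based on the query string parameters, for example one callable per application and server, and return the resulting XML as the HTTP response body.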

We also use Pingdom for monitoring cloud-based resources that we interact with, for example Amazon S3 or SQS. For SQS, we have another custom HTTP check which goes against our restish-based Web service, which then connects to SQS, does some operations and reports back the elapsed time. We want to know if that time/latency exceeds certain thresholds. While we could have done that from our internal Nagios monitoring system, it would have required writing a Nagios plugin, while our restish application already knew how to send the properly formatted XML payload back to Pingdom.

This type of monitoring doesn't go very deep into checking the business logic rules of your application. It is however a 'macro'-type monitoring technique that should be part of your monitoring strategy because it mimics the user experience of your site's customers. I compare it to doing functional testing for your application. You don't test things end-to-end, but you choose some specific portion of your functionality (a specific Web page that is being hit by many users for example) and you make sure that your application does indeed provide that functionality to its users in a timely manner.

Also, if you use a CDN provider, then they probably offer you a way to see statistics about your HTTP traffic using some sort of Web dashboard. Most providers also give you a breakdown of HTTP codes returned over some period of time. Again trending is your friend here. If you notice that the percentage of non-200 HTTP codes suddenly increases, you know something is not going well somewhere in your infrastructure. Many CDN providers offer a Web service API for you to query and get these stats, so you can integrate that functionality into your monitoring and alerting system.
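
Once you have pulled the per-status-code counts from such an API, the trend check itself is simple. Here is a minimal Python sketch. The input is assumed to be a mapping of HTTP status code to request count for one time window, the function names are hypothetical, and the 2-point threshold is an arbitrary example.

```python
def non_200_percent(status_counts):
    """Percentage of requests in the window that did not return 200."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    return 100.0 * (total - status_counts.get(200, 0)) / total


def error_rate_jumped(previous, current, max_increase=2.0):
    """True if the non-200 percentage rose by more than max_increase
    percentage points between two consecutive windows."""
    return non_200_percent(current) - non_200_percent(previous) > max_increase
```

Keep in mind that some non-200 codes, such as 304 responses, are perfectly normal for a CDN, so in practice you may want to count only 4xx and 5xx codes.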

Business intelligence monitoring

We've come to the 'deep application monitoring' section of this post. Business intelligence (BI) deals with metrics pertaining to the business of your Web site. Think of all the metrics you can get with Google Analytics -- and yes, Google Analytics in itself is a powerful monitoring tool.

BI is very much concerned with trends. Everybody wants to see traffic and number of users to the site going up and up. Your site may have very specific metrics that you want to keep track of. Let's say you want users to register and fill in details about themselves, upload a photo, etc. You want to keep track of how many user profiles are created per day, and you want to keep track of the rate of change from day to day, or even from hour to hour. It would be nice if you could spot a drop in that rate right away, and then identify the root cause of the problem. Well, we do this here at Evite with a combination of tools, the main one being a non-open-source non-free tool called SAM (for Simple Application Monitoring) written by Randy Wigginton.

In a nutshell, SAM uses a modified memcached server to capture URLs and SQL queries as they are processed by your application, and aggregates them by ignoring query strings in URLs and values in SQL statements. It saves their count, average/min/max time of completion and failure count in a database, then displays stats and graphs in a simple Web application that queries that database. The beauty of this approach is that you can see metrics related to specific portions of your application aggregated across all users hitting that functionality. So if a user profile is saved via a POST to a URL with a specific query string and HTTP payload, SAM will aggregate all such requests and will allow you to see exactly how many users have saved their profiles in the last hour, how many failures you had, how long it took, etc. What's more, you can query the database directly and take action based on certain thresholds (for example if the failure rate for a specific action, e.g. a POST to a specific URL, has surpassed a given threshold). We're doing exactly this within our restish-based Web service monitoring tool.
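
SAM itself is not open source, so the following Python sketch only illustrates the aggregation idea described above and is not SAM's implementation. It drops query strings from URLs, replaces literal values in SQL statements with placeholders, and keeps count, failure count and min/max/average times per normalized key. All names are hypothetical, and the regular expressions are a deliberately rough approximation of SQL literal handling.

```python
import re
from urllib.parse import urlsplit

_STRING = re.compile(r"'(?:[^'\\]|\\.)*'")   # quoted string literals
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")   # standalone numeric literals


def normalize_url(url):
    """Keep only the path, so all query strings aggregate together."""
    return urlsplit(url).path


def normalize_sql(sql):
    """Replace literal values with '?' and collapse whitespace."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.split())


class Stats:
    def __init__(self):
        self.count = 0
        self.failures = 0
        self.total_ms = 0.0
        self.min_ms = None
        self.max_ms = None

    def add(self, elapsed_ms, failed):
        self.count += 1
        self.failures += 1 if failed else 0
        self.total_ms += elapsed_ms
        self.min_ms = elapsed_ms if self.min_ms is None else min(self.min_ms, elapsed_ms)
        self.max_ms = elapsed_ms if self.max_ms is None else max(self.max_ms, elapsed_ms)

    def average(self):
        return self.total_ms / self.count if self.count else 0.0


class Aggregator:
    def __init__(self):
        self.stats = {}

    def record(self, key, elapsed_ms, failed=False):
        """Fold one observation into the stats for its normalized key."""
        self.stats.setdefault(key, Stats()).add(elapsed_ms, failed)
```

With this in place, two profile saves by different users land under the same "/profile/save" key, which is what makes a per-action failure rate or threshold check possible.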

Note that this technique based on aggregation can be implemented using other approaches, for example by mining your Web server logs, or database logs. The nice thing about SAM is that it captures all this information in real time. As far as your application is concerned, it only interacts with memcached using standard memcache client calls, and it doesn't even bother to check the result of the memcache push.

Another advantage of having such a macro/business-type of monitoring and graphing in place is that your business stakeholders (including your CEO) can see these metrics and trends almost in real time, and can alert you (the ops team) if something seems out of whack. This is almost guaranteed to correlate with something going amiss in your infrastructure. Once you figure out what that is, see if your internal monitoring system caught it or not. If it didn't, it's a great opportunity for you to add it in so as to know about it next time before your CEO does. Trust me, these things can happen, I speak from experience.

End-to-end application monitoring

Companies such as BrowserMob and Sauce Labs allow you to run Selenium scripts against your Web application using clients hosted by them 'in the cloud'. With BrowserMob, you can also use these scripts to monitor a given path through your application. This is obviously the monitoring equivalent of an end-to-end application test. If you have an e-commerce site for example, you can test and monitor your entire check out process this way. I am still scratching the surface when it comes to this type of monitoring, but it's high on my TODO list.

So there you have it. Similar to having a sound application testing strategy, your best bet is to implement all these types of monitoring, or as many of them as your time and budget allow. If you do, you can truly say you have a holistic monitoring strategy in place. And remember, the more dashboards you have into the health of your application and your systems, the better off you are. It's an uncontested rule that the one with the most dashboards wins.

For further reading and more dashboard ideas, see these resources:
