Tuesday, January 22, 2008

Joel on checklists

Another entertaining blog post from Joel Spolsky, this time on some issues they had with servers and networking equipment hosted at a data center in Manhattan. It all came down to a network switch whose ports were configured to negotiate their speed automatically. As a result, one port misbehaved and brought their whole web site down. The conclusion reached by Joel and his sysadmin team: we need documentation, we need checklists. I concur, but as I said in a recent post, this is still not enough. Human beings are notoriously prone to skipping steps on checklists. What Joel and his team really need are AUTOMATED TESTS that run periodically and check every single thing on those checklists. You can easily automate the step that verifies that a port on the switch is set to 100 Mbps or 1 Gbps; you can use either SNMP or an expect-like script.
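As a rough illustration of the SNMP route, here is a minimal Python sketch. It assumes the Net-SNMP `snmpget` command-line tool is installed and that the switch exposes the standard IF-MIB `ifHighSpeed` object, which reports a port's current speed in Mbps. The host, community string, interface index, and function names are all hypothetical. Note that this reads the operating speed only; whether the port is hard-set or auto-negotiated is vendor-specific and not covered here.

```python
import re
import subprocess

# Speeds (in Mbps) that the checklist accepts for a switch port.
ALLOWED_MBPS = {100, 1000}


def fetch_if_high_speed(host, community, if_index):
    """Run snmpget for IF-MIB::ifHighSpeed on one port and return the raw output line."""
    cmd = ["snmpget", "-v2c", "-c", community, host,
           "IF-MIB::ifHighSpeed.%d" % if_index]
    return subprocess.check_output(cmd, text=True).strip()


def parse_if_high_speed(snmp_line):
    """Extract the Mbps value from a line like 'IF-MIB::ifHighSpeed.3 = Gauge32: 1000'."""
    match = re.search(r"=\s*Gauge32:\s*(\d+)", snmp_line)
    if match is None:
        raise ValueError("unexpected snmpget output: %r" % snmp_line)
    return int(match.group(1))


def port_speed_ok(snmp_line, allowed=ALLOWED_MBPS):
    """True if the reported port speed is one of the allowed values."""
    return parse_if_high_speed(snmp_line) in allowed
```

A periodic test would then assert something like `port_speed_ok(fetch_if_high_speed("switch1.example.com", "public", 3))`, with the host and community string replaced by real values.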

In fact, at my own company I'm developing a pretty extensive automated test suite (written in Python of course, and using nose) that verifies all the steps we go through whenever we deploy a server or a network device. It's very satisfying to see those dots and have a total of N passed tests and 0 failed tests, with N increasing daily. Automated testing of sysadmin tasks is a little-explored area, so there's lots of potential for cool stuff to happen. If you're doing something similar and have ideas to share, please leave a comment.

9 comments:

stan said...

I've started using nose and twill to verify that our various websites (some static, some dynamic) are up and running properly after web server configuration changes. This includes checking that passwords are requested where expected, and that important links work.
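The kind of check described in this comment might look roughly like the sketch below. It is not stan's twill code: it swaps in Python's standard `urllib` instead, and it assumes the protected pages use HTTP authentication, so a request without credentials returns status 401. The function names and paths are hypothetical.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=10):
    """Return the HTTP status code for a GET on url, including 4xx/5xx codes."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def check_site(base_url, public_paths, protected_paths, fetch=http_status):
    """Return a list of (path, problem) tuples; an empty list means every check passed."""
    problems = []
    for path in public_paths:
        code = fetch(base_url + path)
        if code != 200:
            problems.append((path, "expected 200, got %d" % code))
    for path in protected_paths:
        code = fetch(base_url + path)
        if code != 401:
            problems.append((path, "expected 401 (password prompt), got %d" % code))
    return problems
```

The `fetch` parameter exists so the logic can be exercised without a network; in real use the default is enough, for example `check_site("http://www.example.com", ["/", "/contact"], ["/admin"])`.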

Anonymous said...

Checklists can be ignored, but generally the known existence of a checklist is enough to improve quality. If you can automate it, great, do so. Many things cannot be automated (with current technology), but a human can still look at the checklist.

Note that I specified known. Checklists that nobody knows about are worthless. A checklist that people know about will improve quality.

Anonymous said...

My dad used to do data center support for one of the major banks up here. They had entire binders of procedures on how to install a new banking platform onto the AS/400s. I recall coming home one night (well, morning ;)) to find him on the phone to some island, trying to recover their database. His comment to the person at the time was something along the lines of "Yes, you were supposed to follow the stuff on page 12. That's why we put page 12 in there. It's not as if we had to keep pages 11 and 13 from each other!" :)

Grig Gheorghiu said...

Henry -- that has not been my experience. I was fully aware of the existence of a checklist for deploying a complex Web application, but that didn't stop me from sometimes skipping important steps by mistake.

Grig

Grig Gheorghiu said...

Adam -- I sympathize with your dad... Great story, thanks for sharing it!

Grig

Grig Gheorghiu said...

Stan -- thanks for sharing your scenario. nose and twill are a great combination!

Grig

Anonymous said...

"Automated tests for sysadmin tasks", I'd really like to read an article or tutorial on this.

Anonymous said...

Our sys admin writes tests, and runs them continuously, using Zabbix.

Great stuff--it pushes notifications, makes nice graphs for system dashboards, etc.

Grig Gheorghiu said...

Stephen -- having a monitoring system is necessary, but sometimes not sufficient. Let's say you have a checklist for deploying a LAMP server, or an Apache/Tomcat server. Your monitoring system will be hard pressed to verify that all the steps in the checklist have been applied. Examples:
* is httpd set to start at boot time?
* are the tomcat directories owned by the correct user?
* does the sudoers file contain certain user IDs?

All these things can be tested automatically with a real scripting language, using a test framework.
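A minimal sketch of helpers for those three checks, in Python, could look like this. It makes several assumptions that are not from the post: the host uses systemd (so `systemctl is-enabled` answers the boot-time question; older systems would query `chkconfig` instead), the sudoers file is readable by the user running the tests, and all names and paths are hypothetical.

```python
import os
import subprocess


def service_enabled_at_boot(service):
    """True if systemd reports the unit as enabled (assumes a systemd host)."""
    result = subprocess.run(["systemctl", "is-enabled", service],
                            capture_output=True, text=True)
    return result.stdout.strip() == "enabled"


def paths_not_owned_by(root, uid):
    """Return every directory and file under root whose owner is not uid."""
    bad = []
    for dirpath, dirnames, filenames in os.walk(root):
        candidates = [dirpath] + [os.path.join(dirpath, f) for f in filenames]
        for path in candidates:
            if os.lstat(path).st_uid != uid:
                bad.append(path)
    return bad


def sudoers_mentions(user, sudoers_path="/etc/sudoers"):
    """True if some line of the sudoers file has user as its first field."""
    with open(sudoers_path) as handle:
        for line in handle:
            fields = line.split()
            if fields and fields[0] == user:
                return True
    return False
```

Under nose, each checklist item then becomes a one-line test function that asserts on one helper, for instance asserting that `service_enabled_at_boot("httpd")` is true, or that `paths_not_owned_by("/opt/tomcat", pwd.getpwnam("tomcat").pw_uid)` returns an empty list.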

Grig
