Another entertaining blog post from Joel Spolsky, this time on some issues he and his team had with servers and networking equipment hosted at a data center in Manhattan. It all comes down to a network switch which had its ports configured to automatically negotiate their speed. As a result, one port was misbehaving and brought their whole web site down. The conclusion reached by Joel and his sysadmin team: we need documentation, we need checklists. I concur, but as I said in a recent post, this is still not enough. Human beings are notoriously prone to skipping steps on checklists. What Joel and his team really need are AUTOMATED TESTS that run periodically and check every single thing on those checklists. You can easily automate the step which verifies that a port on the switch is set to 100 Mbps or 1 Gbps; you can use either SNMP or an expect-like script.
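To make the SNMP option concrete, here is a minimal sketch of such a check. It assumes the net-snmp `snmpget` command is installed and that the switch answers SNMP v2c queries; the host, community string and interface index passed to `query_if_speed` are placeholders of my own, not anything from Joel's setup. Note that `ifSpeed` only reports the speed the port is currently running at, so this catches a port that came up at the wrong speed but does not by itself prove that auto-negotiation is turned off.

```python
import subprocess

# IF-MIB::ifSpeed, reported in bits per second; the interface index is appended.
IF_SPEED_OID = "1.3.6.1.2.1.2.2.1.5"

# The two speeds the checklist accepts: 100 Mbps and 1 Gbps.
ALLOWED_SPEEDS_BPS = {100_000_000, 1_000_000_000}


def parse_if_speed(raw):
    """Turn the value printed by `snmpget -Oqv` into an integer (bits/s)."""
    return int(raw.strip())


def speed_is_allowed(speed_bps, allowed=ALLOWED_SPEEDS_BPS):
    """Return True if the reported speed is one of the accepted values."""
    return speed_bps in allowed


def query_if_speed(host, community, if_index):
    """Ask the switch for the current speed of one interface.

    Shells out to net-snmp's snmpget; -Oqv prints only the value.
    """
    oid = f"{IF_SPEED_OID}.{if_index}"
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True, timeout=10,
    )
    return parse_if_speed(result.stdout)
```

A periodic test would then boil down to something like `assert speed_is_allowed(query_if_speed(host, community, if_index))` for every port you care about.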
In fact, at my own company I'm developing a pretty extensive automated test suite (written in Python, of course, and using nose) that verifies all the steps we go through whenever we deploy a server or a network device. It's very satisfying to see those dots and have a total of N passed tests and 0 failed tests, with N increasing daily. Automated testing of sysadmin tasks is a little-explored area, so there's lots of potential for cool stuff to happen. If you're doing something similar and have ideas to share, please leave a comment.
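To give a flavor of what one of these tests might look like, here is a rough sketch and not a piece of my actual suite. The inventory, the host name and `fake_get_speed` are all made up for illustration; in a real test the lookup would talk to the device over SNMP or through an expect-like script. Any function whose name starts with `test_` is picked up by nose's default test discovery.

```python
# Hypothetical inventory: (switch, interface index) -> expected speed in bits/s.
EXPECTED_PORT_SPEEDS = {
    ("switch1.example.com", 1): 1_000_000_000,
    ("switch1.example.com", 2): 100_000_000,
}


def find_mismatches(expected, get_speed):
    """Return (port, expected, actual) for every port whose speed differs."""
    mismatches = []
    for port, want in sorted(expected.items()):
        got = get_speed(*port)
        if got != want:
            mismatches.append((port, want, got))
    return mismatches


def fake_get_speed(host, if_index):
    """Stand-in for a real lookup; it simply echoes the inventory."""
    return EXPECTED_PORT_SPEEDS[(host, if_index)]


def test_port_speeds_match_inventory():
    # An empty list means every port runs at the speed the checklist demands.
    assert find_mismatches(EXPECTED_PORT_SPEEDS, fake_get_speed) == []
```

Collecting the mismatches in a list, instead of failing on the first bad port, means a failed run tells you about every misconfigured port at once.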