Friday, January 25, 2008

Checklist automation and testing

This is a follow-up to my previous post on writing automated tests for sysadmin-related checklists. That post seems to have struck a chord, judging by the comments it generated.

Here's the scenario I'm thinking about: you need to deploy a standardized set of packages and configurations to a bunch of servers. You put together a checklist detailing the steps you need to take on each server -- kickstart the box, run some post-install scripts, do some configuration customization, etc. At this point, you're already ahead of the game, because you're not relying solely on human memory. However, if you rely on a human being manually going through each step of the checklist on each server, you're in for some surprises in the form of missed steps. The answer, of course, is to automate as many steps as you can, ideally all of them.

Now we're getting to the main point of my post: assuming you did automate all the steps of the checklist, and you ran your scripts on each server, do you REALLY have that warm and fuzzy feeling that everything is OK? You don't, unless you also have a comprehensive automated test suite that runs on every server and actually checks that stuff happened the way you intended.

Here are some concrete examples of stuff I verify after deploying a certain type of server in our environment (running Apache/Tomcat).

OS-specific tests

* does the sudoers file contain the users that need sudo rights
* is the sshd process set to start at boot time
* is the ClientAliveInterval variable set correctly in /etc/ssh/sshd_config
* are certain NFS mount points defined in /etc/fstab, and do they actually exist on the server
* is sendmail set to start at boot time, and running
* are iptables and/or SELinux configured the way they should be
* ....and more

Apache-specific tests

* is httpd set to start at boot time
* do the virtual host configuration files in /etc/httpd/conf.d contain the expected information
* has mod_jk been installed and configured properly (mod_jk provides the glue between Apache and Tomcat)
* is SSL configured properly
* does the /etc/logrotate.d/httpd configuration file contain the correct options (for example keep the logs for N days and compress them)
* etc.

Tomcat-specific tests

* has a specific version of Java been installed
* has Tomcat been installed in the correct directory, with the correct permissions
* has Tomcat been set to start at boot time
* etc.

Media-specific tests

* has ImageMagick been installed in the correct location
* does ImageMagick support certain file formats (JPG, PNG, etc)
* can ImageMagick actually process certain types of files (JPG, PNG, etc.)

Some of these tests could be run from a monitoring system (one of the commenters on my previous post mentioned that their sysadmins use Zimbrix; an Open Source alternative is Nagios, and there are many others). However, a monitoring system typically doesn't go into the level of detail I described, especially when it comes to configuration files and other more advanced customizations. That's why I think it's important to use a real test framework and a real scripting language for this type of automated test.

In my case, each type of test resides in its own file -- for example test_os.py, test_apache.py, test_tomcat.py, test_media.py. I run the tests using the nose test framework.

Here are some examples of small test functions. I'm using sets to make sure that the expected lines appear in certain files or in the output of certain commands, since most of the time I don't care about the order in which those lines appear.

From test_os.py:

import popen2

def test_sshd_on():
    # verify via chkconfig that sshd starts at boot in runlevels 2 through 5
    stdout, stdin = popen2.popen2('chkconfig sshd --list')
    lines = stdout.readlines()
    assert "sshd \t0:off\t1:off\t2:on\t3:on\t4:on\t5:on\t6:off\n" in lines

From test_apache.py:

def test_logrotate_httpd():
    # check that httpd log rotation keeps 100 rotations and compresses old logs
    lines = open('/etc/logrotate.d/httpd').readlines()
    lines = set(lines)
    expected = set([
        " rotate 100\n",
        " compress\n",
    ])
    assert lines.issuperset(expected)
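
A similar check can be pointed at the virtual host files in /etc/httpd/conf.d. Here's a sketch; the config file name, ServerName and mod_jk worker name are placeholders you'd replace with your own values:

def test_vhost_config():
    # hypothetical vhost file and directives -- adjust to your own setup
    lines = set(open('/etc/httpd/conf.d/myapp.conf').readlines())
    expected = set([
        "    ServerName www.example.com\n",
        "    JkMount /* myworker\n",
    ])
    assert lines.issuperset(expected)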

From test_tomcat.py:

import os

def test_homedir():
    target_dir = '/opt/target'
    assert os.path.isdir(target_dir)
    # check ownership; TARGET_UID and TARGET_GID are defined elsewhere in the module
    (st_mode, st_ino, st_dev, st_nlink, st_uid, st_gid,
     st_size, st_atime, st_mtime, st_ctime) = os.stat(target_dir)
    assert st_uid == TARGET_UID, 'User wrong for %s' % target_dir
    assert st_gid == TARGET_GID, 'Group wrong for %s' % target_dir
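
The Java check from the Tomcat list can be done in a similar fashion. This is only a sketch, assuming java is on the PATH and that 1.5.0 is the version you expect; adjust the command and the version string to your environment:

import os

def test_java_version():
    # 'java -version' prints to stderr, so redirect it to stdout
    output = os.popen('java -version 2>&1').read()
    # the expected version string is an example only
    assert 'java version "1.5.0' in output, 'unexpected java version: %s' % output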

From test_media.py:

import popen2

def test_ImageMagick_listformat():
    # make sure ImageMagick reports support for the JPEG/PNG family of formats
    stdout, stdin = popen2.popen2('/usr/local/bin/identify -list format')
    lines = stdout.readlines()
    lines = set(lines)
    expected = set([
        " JNG* PNG rw- JPEG Network Graphics\n",
        " JPEG* JPEG rw- Joint Photographic Experts Group JFIF format (62)\n",
        " JPG* JPEG rw- Joint Photographic Experts Group JFIF format\n",
        " PJPEG* JPEG rw- Progessive Joint Photographic Experts Group JFIF\n",
        " MNG* PNG rw+ Multiple-image Network Graphics (libpng 1.2.10)\n",
        " PNG* PNG rw- Portable Network Graphics (libpng 1.2.10)\n",
        " PNG24* PNG rw- 24-bit RGB PNG, opaque only (zlib 1.2.3)\n",
        " PNG32* PNG rw- 32-bit RGBA PNG, semitransparency OK\n",
        " PNG8* PNG rw- 8-bit indexed PNG, binary transparency only\n",
    ])
    assert lines.issuperset(expected)
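
Finally, for the last item in the media list -- checking that ImageMagick can actually process files -- something like the following sketch could work. The test image path is hypothetical; you'd point it at a known-good image deployed alongside the test suite:

import os

def test_ImageMagick_convert_jpg_to_png():
    # hypothetical test image shipped alongside the test suite
    src = '/usr/local/share/test_images/sample.jpg'
    dst = '/tmp/sample.png'
    status = os.system('/usr/local/bin/convert %s %s' % (src, dst))
    assert status == 0, 'convert exited with status %d' % status
    assert os.path.isfile(dst)
    os.remove(dst)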

As always, comments and suggestions are very welcome! Also see Titus's post for some sysadmin-related automated tests that he's running on a regular basis.

7 comments:

Evgeny Zislis said...

You are probably aware that this idea is very old.

So, if you somehow missed these projects -- this is the grandparent of unix system bootstrapping:
cfengine (also at gnu)

And these two are its children, written in Ruby:
puppet
cfruby

Anonymous said...

Any Python-using Linux sysadmin who thinks this is the coolest thing since sliced bread, please email me, I'll probably hire you.

Grig Gheorghiu said...

Kesor -- thanks for the comment. Yes, the idea of having an automation engine that aids in deployment is old. My main point is that you still need automated tests that verify that the engine did its job correctly. I haven't seen the need for automated tests stressed too much out there in the wild.

Grig

Anonymous said...

Seems like you could take this idea a step further and turn this into a monitoring system, a la nagios... but with less suck and more python.

Evgeny Zislis said...

That is exactly what "cfengine" is all about. It's both for checking the checklist, and as a bootstrapper for that same checklist for a new server.

Grig Gheorghiu said...

Evgeny -- interesting, I didn't know that cfengine does the testing as well. I wanted to look into puppet anyway; too bad it's written in Ruby, but maybe there's a Google SoC project waiting there to port it to Python :-)

Thanks for your comments.

Grig

Anonymous said...

Another approach is versioning OS images. Changes are made on a single system in the development environment. This system's image is taken, and it is labeled and checked into a repository.

Production systems are then stamped out from these labeled images. A complete re-image of a single system takes about 3 minutes. A set of k systems in a multicast domain takes about 10 minutes.

This has the wonderful effect of removing the possibility of configuration drift from your production environment, and discouraging *anyone* from making ad-hoc changes and expecting them to persist.
