When you work as a systems engineer at a company that has a large scale system infrastructure, sooner or later you realize that you need to automate pretty much everything you do. You can't afford not to, if you want to keep up with the ever-present demands of scaling up and down the infrastructure.
The main promise of cloud computing -- infinite elastic scaling based on demand -- is real, but you can only achieve it if you automate your deployments. It's fairly safe to say that most teams that are involved in such infrastructures have achieved high levels of automation. Some fearless teams practice continuous deployment, others do frequent dark launches. All these practices are great, but my thesis is that in order to achieve fearlessness you need automated tests of your production deployments.
Note the word 'production' -- I believe it is necessary to go one step beyond running automated tests in an isolated staging environment (although that is a very good thing to do, especially if staging mirrors production at a smaller scale). That next step is to run your test harness in production, every time you deploy. And deployment, at a fast moving Web company these days, can happen multiple times a day. Trust me, with no automated tests in place, you'll never get rid of that nagging feeling in the pit of your stomach that you might have broken things horribly, in production.
So how do you go about writing automated tests for your deployments? I wrote a while ago about automating and testing your system setup checklists. Even testing small things such as 'is httpd/mysqld/postfix setup to run at boot time' will go a long way in achieving peace of mind.
Assuming you have a list of things to test (it can be just a couple of critical things for starters), how and when do you run the tests? Again, you can do the simplest thing that works -- a bash shell that iterates through your production servers and runs the test scripts remotely on the servers via ssh. Some things I test this way these days are:
* do local MySQL databases on servers in a particular cluster contain the same data in certain tables? (this shows me that things are in sync across servers)
* is MySQL replication working as expected across the cluster of read-only slaves?
* are periodic operations happening as expected (here I can do a simple tail of a log file to figure it out)
* are certain PHP modules correctly installed?
* is Apache serving a number of requests per second that is not too high, but not too low either (where high and low are highly dependent on your traffic and application obviously)
I run these tests (and many others) each time I push a change to production. No matter how small the change can seem, it can have unanticipated side effects. I found that having tests that probe the system from as many angles as possible are the most efficient -- the angles in my case being Apache, MySQL, PHP, memcached for example. I also found that this type of testing (push-based if you want) is very good at showing discrepancies between servers. If you see a server being out of wack this way, then you know you need to attempt to fix it, or even terminate it and deploy a new one.
Another approach in your automated testing strategy is to run your test harness periodically (via cron for example) and also to write the harness in a proper language (Python comes to mind), integrated into a test framework. You can have the results of the tests emailed to you in case of failure. The advantage of this approach is that you can have things run automatically without your intervention (in the first approach, you still have to remember to run the test suite!).
The ultimate in terms of automated testing is to integrate it with your monitoring infrastructure. If you use Nagios for example, you can easily write plugins that essentialy probe for the same things that your tests probe for. The advantage of this approach is that the tests will run every time Nagios runs, and you can set up alerts easily. One disadvantage is that it can slow down your monitoring, depending on the number of tests you need to run on each server. Monitoring typically happens very often (every 5 minutes is a common practice), so it may be overkill to run all the tests every 5 minutes. Of course, this should be configurable in your monitoring tool, so you can have a separate class of checks that only happen every N hours for example.
In any case, let me assure you that even if you take the first approach I mentioned (ssh into all servers and run commands remotely that way), you'll reap the rewards very fast. In fact, you'll like it so much that you'll want to keep adding more tests, so you can achieve more inner peace. It's a sure way to becoming test infected, but also to achieve deployment nirvana.