This blog post was inspired by the Velocity 2012 keynote given by Dr. Richard Cook and titled "How Complex Systems Fail". Approximately 6 minutes into the presentation, Dr. Cook relates a story which resonated with me. He talks about upgrading hospital equipment, specifically infusion pumps, which perform and regulate the infusion of fluids in patients. Pretty important and critical task. The hospital bought brand new infusion pumps from a single vendor. The pumps worked without a glitch for exactly 1 year. Then, at 20 minutes past midnight, the technician on call was alerted to the fact that one of the pumps stopped working. He fiddled with it, rebooted the equipment and brought it back to life (not sure about the patient attached to the pump though). Then, minutes later, other calls started to pour in. It turns out that approximately 20% of the pumps stopped working around the same time that night. Nightmare night for the technician on call, and we can only hope he retained his sanity.
The cause of the problems was this: the pumps have a series of pretty complicated settings, one of which being the period of time that needs to elapse until a mandatory software upgrade. That period of time was initially set to, you guessed it, 1 year, because it seemed like such a distant point in time. Well, after 1 year, the pumps begged to be upgraded (it's not clear whether that was a manual process to be initiated by the technician, or an automated process) -- but the gotcha was that normal functionality was suspended during the upgrade process, so the pumps effectively stopped working.
This story resonates with me on 2 fronts related to uniformity: the first is uniformity in time (most of the pumps were put in production around the same time), and the second is uniformity or monoculture in setup (and by this I mean single vendor/hardware/OS/software). These 2 aspects can introduce very subtle and hard to avoid bugs, which usually hit you when you least expect it.
I have a few stories of my own to tell regarding these issues.
First story: at Evite we purchased Dell C2100 servers to act as our production database servers. We got them in the summer of 2011 and we set them up in time for our high season of the year, which is late November/early-mid December. We installed Ubuntu 10.04 on all of them, and they performed magnificently, with remarkable uptime -- in fact, once we set them up, we never needed to reboot them. That is, until they started to crash one by one with kernel panic messages approximately 200 days after putting them in production. This seemed to be too much of a coincidence. After consulting with Dell support specialists, we were pointed to this Linux kernel bug which says that the scheduler code for kernel 18.104.22.168 crashes after 200+ days of uptime. We had 2 crashes in 24 hours, after which we preemptively rebooted the other servers during the night and this seemed to solve that particular issue. We also started to update Ubuntu to 12.04 on all servers so we can get an updated kernel version not affected by that bug.
As you can see, we had both uniformity/monoculture in setup (same vendor, same hardware, same OS), and uniformity in time (all servers had been put in production within a few days of each other). This combination hit us hard and unexpectedly.
Second story: one of the Dell C2100 servers mentioned above was displaying an unusual behavior. Every Saturday morning at 9 AM, we would get a monitoring alert about increased CPU I/O wait on that server. This would last for about 1 hour, after which things would return to normal. At first we thought it's a one-off, but after 2 or 3 consecutive Saturdays we looked more deeply on the system and we figured out (with the help of Percona, since those servers were running the Percona MySQL distribution) that the behavior was due to the RAID controller battery discharging and recharging itself as part of a relearning process. This had the effect or disabling the RAID write cache, so the I/O on the system suffered. I wrote more extensively about these battery issues in another post. The lesson here: if you see a cyclic behavior (which belongs to the issue of uniformity in time), investigate your hardware settings, especially the RAID controller!
Third story: this is the well known story of the 'leapocalypse', the addition of a leap second at midnight on July 1st 2012. While we weren't actually affected by the leap second bug, we still monitored our servers nervously as soon as word broke out on Twitter that servers running mostly Java apps, but also some MySQL servers, would become almost unresponsive, with 100% CPU utilization. The fix found by Mozilla was to run the date command and set it to the current date. The fact that they spread this fix to all their affected servers using Puppet also helped them.
Related to this story, it was interesting to read this article on how Google's servers weren't affected. They had been bit by leap second issues in the past, so on days that preceded the introduction of a leap second they introduced 'leap smear' to their NTP servers, adding a few milliseconds to every NTP update so that the overall effect was to get in sync more slowly (of course, they had to run their own modified NTP code for this to work).
The leapocalypse story contains elements of both unformity of time (everybody was affected at the same time due to the nature of the leap second update) and to a lesser degree uniformity of setup (Linux servers were affected because of a bug in the Linux kernel; Solaris and its variants weren't affected).
So...what steps can we take to try to alleviate these issues? I think we can do a few things:
1) Don't restart all servers at the same time prior to putting them in production; instead, introduce a 'jitter' of a couple of days in between restarts. This will give you time to react to 'uniformity in time' bugs, so that not all your servers will exhibit the bug at the same time.
Same strategy applies to clearing caches -- memcached or other cache servers.
2) Don't buy all your servers from the same vendor. This is harder to do, since vendors like to sell you things in bulk, and provide incentives for you to do so. But this would avoid issues with uniformity/monoculture in setup. Of course, even if you do buy servers from a single vendors, you can still install different OSes on them, different versions of the same OS, etc. This is only feasible I think if you use a configuration management tool such as Puppet or Chef, which in theory at least abstracts away the OS from you.
3) Make sure your monitoring is up to par. Monitor every single component of your servers which can be monitored. Pay attention to RAID controller cards! And of course graph the resources you monitor as well, to see spikes or dips that maybe are not caught by your alerting thresholds.