This post was inspired by Henrik Warne's post "Top 5 Surprises When Starting Out as a Software Developer". I thought it was a good idea to put together a similar list for sysadmins. I won't call them 'surprises', just 'things to know'. I found them useful when I started out, and I still find them useful today. I won't prioritize them either, because they're all important in their own way.
Backups are good only if you can restore them
You would be right to roll your eyes and tell yourself this is obvious, but in my experience most people run backups regularly yet rarely try to restore from those backups. Especially if your backup scheme is one full backup every N days followed by incremental or differential backups every day, it's important to test that you can obtain a recent backup (yesterday's at a minimum) by applying those incrementals or differentials to the full backup. And remember, if it's not backed up, it's not in production.
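Here is a minimal sketch of what a periodic restore test might look like, assuming GNU tar full + incremental archives; the backup directory, archive naming and sentinel file are hypothetical placeholders for your own layout:

```python
#!/usr/bin/env python
# Hypothetical restore test: extract the latest full backup plus its
# incrementals into a scratch directory and verify a known file exists.
# Assumes GNU tar archives created with --listed-incremental, and that
# all incrementals follow the latest full; every path is a placeholder.
import glob
import os
import subprocess
import sys
import tempfile

BACKUP_DIR = "/backups/db01"          # hypothetical backup location
SENTINEL = "var/lib/mysql/ibdata1"    # a file we expect after a good restore

scratch = tempfile.mkdtemp(prefix="restore-test-")

fulls = sorted(glob.glob(os.path.join(BACKUP_DIR, "full-*.tar")))
incrs = sorted(glob.glob(os.path.join(BACKUP_DIR, "incr-*.tar")))
archives = fulls[-1:] + incrs         # latest full, then the incrementals

for archive in archives:
    # GNU tar's documented way to apply incrementals on extraction.
    subprocess.run(
        ["tar", "--listed-incremental=/dev/null", "-xpf", archive, "-C", scratch],
        check=True,
    )

if os.path.exists(os.path.join(scratch, SENTINEL)):
    print("restore test OK in %s" % scratch)
else:
    print("restore test FAILED: sentinel file missing")
    sys.exit(1)
```

Run something like this from cron and alert on a non-zero exit code, and you'll know your backups are actually restorable, not just present.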
If it's not monitored, it's not in production
This is one of those things you learn pretty quickly, especially when your boss calls you up in the middle of the night to tell you the site is down. I wrote before about how, in my opinion, monitoring is for ops what testing is for dev, and I also wrote about how monitoring is the foundation for whipping your infrastructure into shape.
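To make the "is the site up?" question concrete, here is a minimal sketch of an HTTP check you could run from cron and hook into your alerting; the URL and timeout are placeholders:

```python
#!/usr/bin/env python
# Minimal HTTP health check: fetch the home page, fail loudly if it is
# down or slow. The URL and timeout are placeholders for your own site.
import sys
import time
import urllib.request

URL = "https://www.example.com/"    # hypothetical production URL
TIMEOUT = 10                        # seconds before we consider it down

start = time.time()
try:
    resp = urllib.request.urlopen(URL, timeout=TIMEOUT)
    elapsed = time.time() - start
    if resp.status != 200:
        print("CRITICAL: %s returned HTTP %d" % (URL, resp.status))
        sys.exit(2)
    print("OK: %s answered HTTP %d in %.2fs" % (URL, resp.status, elapsed))
except Exception as exc:
    print("CRITICAL: %s is unreachable: %s" % (URL, exc))
    sys.exit(2)
```

The exit codes follow the Nagios plugin convention (0 for OK, 2 for critical), so a check like this drops into most alerting setups with little extra work.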
If a protocol has an acronym, you need to learn it
SNMP, LDAP, NFS, NIS, SMTP are just some examples of such protocols. As a sysadmin, you need to be deeply familiar with them if you want to have any chance of troubleshooting complex issues. And I want to single out two protocols that are the most important in my book: DNS and HTTP. Get the RFCs out and study them. And speaking of troubleshooting complex issues...
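As a small illustration of why knowing DNS and HTTP at the protocol level pays off, here is a sketch that resolves a name and issues a bare HTTP HEAD request with nothing but the standard library, so you can see exactly what the resolver and the web server return; the hostname is a placeholder:

```python
#!/usr/bin/env python
# Resolve a hostname and issue a raw HTTP HEAD request, printing what
# DNS and the web server actually hand back. The hostname is a placeholder.
import http.client
import socket

HOST = "www.example.com"   # hypothetical host

# DNS: print every address the resolver returns for this name.
for family, _, _, _, sockaddr in socket.getaddrinfo(HOST, 80, type=socket.SOCK_STREAM):
    print("resolved %s -> %s" % (HOST, sockaddr[0]))

# HTTP: a HEAD request shows the status line and headers without the body.
conn = http.client.HTTPConnection(HOST, 80, timeout=10)
conn.request("HEAD", "/")
resp = conn.getresponse()
print("HTTP %d %s" % (resp.status, resp.reason))
for name, value in resp.getheaders():
    print("%s: %s" % (name, value))
conn.close()
```

When the RFC-level details (status codes, Host headers, TTLs, CNAME chains) are second nature, output like this is often all you need to spot a misbehaving load balancer or a stale DNS record.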
The most important skill you need to master is problem solving
The issues you'll face in your career as a sysadmin will get more and more complex, in direct relation to the complexity of the infrastructures you build and maintain. You need to be able to analyze a problem, come up with the variables that could be causing it, then eliminate those variables one by one until you discover the root cause. This one-by-one elimination strategy is really important, and I've been struck throughout my career by how many people have never mastered it, and instead flail around hopelessly when faced with a non-trivial issue.
You need at least 2 of everything in production
As soon as you are in charge of a non-trivial Web site, you realize that you need to eliminate single points of failure as much as possible. It starts with border routers, it continues with firewalls and load balancers, then web/app/database servers, and network switches that tie everything together. All of a sudden, you have a pretty complex infrastructure to build and maintain.
One of the most important things you can do in this context is to test the failover of the various devices (firewalls, load balancers, routers, switches), which are usually in an active/passive configuration. I've been bitten many times by forced failovers (when the active device unexpectedly failed) that didn't go well because the passive device wasn't configured properly or wasn't syncing properly from the active one.
I also want to mention in this context the necessity of deeply understanding how networks work, both at Layer 2 (MAC) and at Layer 3 (IP routing). You can only fake your way around a lack of understanding of these layers for so long. The most subtle and hard-to-solve issues I've faced in my career as a sysadmin have all been networking issues (which, for some reason, have involved ARP tables many times). You need to become best friends with tcpdump.
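Since ARP keeps coming up, here is a small Linux-only sketch that dumps the kernel's ARP cache from /proc/net/arp; looking here is often the quickest first step before reaching for tcpdump:

```python
#!/usr/bin/env python
# Dump the Linux ARP cache (/proc/net/arp) so you can spot incomplete or
# duplicate MAC entries at a glance. Linux-specific; run it on the host
# whose layer 2 view you are debugging.
with open("/proc/net/arp") as f:
    f.readline()   # skip the header line: IP address, HW type, Flags, HW address, Mask, Device
    for line in f:
        fields = line.split()
        ip, flags, mac, device = fields[0], fields[2], fields[3], fields[5]
        # Flags 0x2 means the entry is complete; 0x0 means resolution failed.
        status = "complete" if flags == "0x2" else "incomplete"
        print("%-15s %-17s %-10s %s" % (ip, mac, device, status))
```

If two IPs show up with the same MAC, or an entry keeps flipping to incomplete, you have a layer 2 problem worth chasing with tcpdump.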
Keep your systems secure
The days when telnet was enabled by default in most OSes are long gone, but you still need to worry about security issues. Fortunately there are simple things you can do that go a long way towards improving the security of your infrastructure -- things like putting firewalls in front of everything and only allowing the ports necessary for your production traffic, disabling services you don't need on your servers, monitoring your logs for unauthorized access attempts, and not running Windows (just kidding, kinda).
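For the "monitor your logs for unauthorized access attempts" part, here is a minimal sketch that counts failed SSH logins per source IP; the log path assumes a Debian/Ubuntu-style auth.log:

```python
#!/usr/bin/env python
# Count failed SSH password attempts per source IP from the auth log.
# The path assumes Debian/Ubuntu; on Red Hat-style systems it would be
# /var/log/secure instead.
import re
from collections import Counter

AUTH_LOG = "/var/log/auth.log"
pattern = re.compile(r"Failed password for .* from (\d+\.\d+\.\d+\.\d+)")

attempts = Counter()
with open(AUTH_LOG) as f:
    for line in f:
        match = pattern.search(line)
        if match:
            attempts[match.group(1)] += 1

for ip, count in attempts.most_common(10):
    print("%5d failed attempts from %s" % (count, ip))
```

A report like this, run daily, tells you very quickly whether you need to tighten your firewall rules or deploy something like fail2ban.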
One issue I faced in this context was applying security patches to various OSes. You need to be careful when doing this, and make sure you test those patches in staging before applying them in production (otherwise you run the risk of rebooting a production server and having it not come back because of the effects of that patch -- trust me, it happens).
Logging is your best friend
Logging goes hand in hand with monitoring as one of those sine-qua-non conditions for having a good grasp of what's going on with your infrastructure. You'll learn soon that you need to have a strategy for logging, in most cases a central log server where you send logs from your other systems. There are tools such as Flume and Scribe that help these days, but even good old syslog-ng works just fine for this purpose. Logging by itself is not enough -- you need to also monitor the logs and send alerts when you identify error conditions. It's not easy, but it needs to be done.
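A bare-bones sketch of the "watch the logs and alert" idea: follow a file on the central log server and flag lines matching an error pattern. The file path and pattern are placeholders, and in practice you would hand the alert off to your monitoring or paging system rather than just printing it:

```python
#!/usr/bin/env python
# Follow a log file (tail -f style) and flag lines that match an error
# pattern. The path and regex are placeholders; this sketch does not
# handle log rotation, and a real deployment would send the alert to
# your monitoring/paging system instead of printing it.
import re
import time

LOG_FILE = "/var/log/central/app.log"          # hypothetical central log
ERROR_PATTERN = re.compile(r"ERROR|CRITICAL|Traceback")

with open(LOG_FILE) as f:
    f.seek(0, 2)                               # start at the end of the file
    while True:
        line = f.readline()
        if not line:
            time.sleep(1)                      # nothing new yet
            continue
        if ERROR_PATTERN.search(line):
            print("ALERT: %s" % line.rstrip())
```

Tools like Flume, Scribe or even logcheck do this more robustly, but the principle is the same: logs that nobody reads or alerts on might as well not exist.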
You need to know a scripting language
You can only go so far in your sysadmin career if you don't master a decent scripting language. I started with Perl (after programming in C/C++ for a living for several years) but discovered Python in 2004 and never looked back. Ruby will do the trick too. You don't need to be a ninja programmer, but you need to have decent skills -- know how to split a program into modules, know how to use OOP techniques, know enough of the language to be able to read and extend other people's code, and maybe most important of all, KNOW HOW TO TEST YOUR CODE! Always test your code in staging before you put it in production.
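To make the "know how to test your code" point concrete, here is a toy example in Python: a small function that parses df-style output, plus a unit test that exercises it without touching a real system; the sample output in the test is made up for illustration:

```python
#!/usr/bin/env python
# A small, testable function plus a unit test: parse 'df -P' style output
# and return filesystems at or above a usage threshold.
import unittest


def full_filesystems(df_output, threshold=90):
    """Return (mount_point, use_percent) pairs at or above the threshold."""
    results = []
    for line in df_output.strip().splitlines()[1:]:   # skip the header line
        fields = line.split()
        use_percent = int(fields[4].rstrip("%"))
        mount_point = fields[5]
        if use_percent >= threshold:
            results.append((mount_point, use_percent))
    return results


class FullFilesystemsTest(unittest.TestCase):
    SAMPLE = """\
Filesystem     1024-blocks      Used Available Capacity Mounted on
/dev/sda1         10321208   9850000    471208      96% /
/dev/sdb1        103212320  20000000  83212320      20% /data
"""

    def test_flags_only_filesystems_over_threshold(self):
        self.assertEqual(full_filesystems(self.SAMPLE, threshold=90),
                         [("/", 96)])


if __name__ == "__main__":
    unittest.main()
```

Keeping the parsing logic separate from the command that produces the output is what makes it testable in the first place, and that habit carries over directly to bigger automation scripts.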
Document everything
This is very important when you start out, because you learn something new every day. Write it down (I used to do it with old-fashioned pen and paper), but also share it with your team. Wikis are decent for this purpose, although they become hard to organize as they grow. But having some sort of searchable knowledge base is definitely 'a good thing', especially as your team grows and new people need to be brought up to speed. Of course, these days you can also use 'executable documentation' in the form of Chef recipes or Puppet manifests.
And speaking of teams...
Always try to be a leader
You start out on the bottom rung of the ladder, but you can still be a leader. I once saw a definition of leadership that really resonated with me: "a leader is somebody who makes something happen which otherwise wouldn't happen". There are countless opportunities to do just that even if you are just starting out in your career. If something is hard (or 'not that fun') and people on your team either postpone it or seem to just forget to do it, that's a good sign you need to step up and be a leader and do it. You will help your team and you will help yourself in the process.
One thing you can make happen (for example by blogging) is to share lessons that you've learned the hard way. Many of the solutions I've found to thorny issues I've faced have come from blogs, so I am always happy to contribute back to the community by sharing some of my own experiences via blogging. I strongly advise you to do the same.
The dangers of uniformity
This blog post was inspired by the Velocity 2012 keynote given by Dr. Richard Cook and titled "How Complex Systems Fail". Approximately 6 minutes into the presentation, Dr. Cook relates a story which resonated with me. He talks about upgrading hospital equipment, specifically infusion pumps, which perform and regulate the infusion of fluids in patients. Pretty important and critical task. The hospital bought brand new infusion pumps from a single vendor. The pumps worked without a glitch for exactly 1 year. Then, at 20 minutes past midnight, the technician on call was alerted to the fact that one of the pumps stopped working. He fiddled with it, rebooted the equipment and brought it back to life (not sure about the patient attached to the pump though). Then, minutes later, other calls started to pour in. It turns out that approximately 20% of the pumps stopped working around the same time that night. Nightmare night for the technician on call, and we can only hope he retained his sanity.
The cause of the problems was this: the pumps have a series of pretty complicated settings, one of which is the period of time that must elapse before a mandatory software upgrade. That period was initially set to, you guessed it, 1 year, because it seemed like such a distant point in time. Well, after 1 year, the pumps begged to be upgraded (it's not clear whether that was a manual process to be initiated by the technician, or an automated one) -- but the gotcha was that normal functionality was suspended during the upgrade process, so the pumps effectively stopped working.
This story resonates with me on 2 fronts related to uniformity: the first is uniformity in time (most of the pumps were put in production around the same time), and the second is uniformity or monoculture in setup (and by this I mean single vendor/hardware/OS/software). These 2 aspects can introduce very subtle and hard to avoid bugs, which usually hit you when you least expect it.
I have a few stories of my own to tell regarding these issues.
First story: at Evite we purchased Dell C2100 servers to act as our production database servers. We got them in the summer of 2011 and we set them up in time for our high season of the year, which is late November/early-mid December. We installed Ubuntu 10.04 on all of them, and they performed magnificently, with remarkable uptime -- in fact, once we set them up, we never needed to reboot them. That is, until they started to crash one by one with kernel panic messages approximately 200 days after being put in production. This seemed to be too much of a coincidence. After consulting with Dell support specialists, we were pointed to this Linux kernel bug, which says that the scheduler code in kernel 2.6.32.11 crashes after 200+ days of uptime. We had 2 crashes in 24 hours, after which we preemptively rebooted the other servers during the night, and this seemed to solve that particular issue. We also started to upgrade Ubuntu to 12.04 on all servers so we could get an updated kernel version not affected by that bug.
As you can see, we had both uniformity/monoculture in setup (same vendor, same hardware, same OS), and uniformity in time (all servers had been put in production within a few days of each other). This combination hit us hard and unexpectedly.
Second story: one of the Dell C2100 servers mentioned above was displaying an unusual behavior. Every Saturday morning at 9 AM, we would get a monitoring alert about increased CPU I/O wait on that server. This would last for about 1 hour, after which things would return to normal. At first we thought it was a one-off, but after 2 or 3 consecutive Saturdays we looked more deeply at the system and figured out (with the help of Percona, since those servers were running the Percona MySQL distribution) that the behavior was due to the RAID controller battery discharging and recharging itself as part of a relearning process. This had the effect of disabling the RAID write cache, so the I/O on the system suffered. I wrote more extensively about these battery issues in another post. The lesson here: if you see a cyclic behavior (which belongs to the issue of uniformity in time), investigate your hardware settings, especially the RAID controller!
Third story: this is the well-known story of the 'leapocalypse', the addition of a leap second at midnight on July 1st 2012. While we weren't actually affected by the leap second bug, we still monitored our servers nervously as soon as word broke on Twitter that servers running mostly Java apps, but also some MySQL servers, were becoming almost unresponsive, with 100% CPU utilization. The fix found by Mozilla was to run the date command and set it to the current date. The fact that they could push this fix to all their affected servers using Puppet also helped them.
Related to this story, it was interesting to read this article on how Google's servers weren't affected. They had been bitten by leap second issues in the past, so in the days preceding the introduction of a leap second they applied a 'leap smear' to their NTP servers, adding a few milliseconds to every NTP update so that the extra second was absorbed gradually instead of all at once (of course, they had to run their own modified NTP code for this to work).
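To get a feel for the arithmetic behind a smear, here is a toy linear version -- not Google's actual implementation, which used a different curve and their own NTP code -- that spreads the extra second evenly over a fixed window before midnight:

```python
#!/usr/bin/env python
# Toy linear leap smear: spread one extra second evenly over a window
# before midnight, so each NTP response lies by a tiny, growing amount
# instead of the clock jumping a full second at once. This is an
# illustration only, not Google's real algorithm.
WINDOW_SECONDS = 20 * 3600      # smear over the 20 hours before midnight
LEAP_SECONDS = 1.0              # the extra second being inserted


def smear_offset(seconds_into_window):
    """Fraction of the leap second absorbed at this point in the window."""
    fraction = min(max(seconds_into_window / WINDOW_SECONDS, 0.0), 1.0)
    return LEAP_SECONDS * fraction


# Example: how much of the leap second has been absorbed at a few points?
for hours in (0, 5, 10, 15, 20):
    offset = smear_offset(hours * 3600)
    print("%2d hours into the window: %.3f s of the leap second absorbed" % (hours, offset))
```

The point is simply that each individual adjustment stays far below anything applications would notice, while the total still adds up to exactly one second by midnight.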
The leapocalypse story contains elements of both uniformity in time (everybody was affected at the same time due to the nature of the leap second update) and, to a lesser degree, uniformity of setup (Linux servers were affected because of a bug in the Linux kernel; Solaris and its variants weren't).
So...what steps can we take to try to alleviate these issues? I think we can do a few things:
1) Don't restart all servers at the same time prior to putting them in production; instead, introduce a 'jitter' of a couple of days between restarts. This gives you time to react to 'uniformity in time' bugs, because not all your servers will exhibit the bug at the same time (see the sketch after this list).
The same strategy applies to clearing caches -- memcached or other cache servers.
2) Don't buy all your servers from the same vendor. This is harder to do, since vendors like to sell you things in bulk and provide incentives for you to do so, but it would avoid issues with uniformity/monoculture in setup. Of course, even if you do buy servers from a single vendor, you can still install different OSes on them, different versions of the same OS, etc. I think this is only feasible if you use a configuration management tool such as Puppet or Chef, which at least in theory abstracts the OS away from you.
3) Make sure your monitoring is up to par. Monitor every single component of your servers that can be monitored. Pay attention to RAID controller cards! And of course graph the resources you monitor as well, to catch spikes or dips that may not trip your alerting thresholds.
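Here is the sketch mentioned in item 1: a trivial way to generate a staggered restart schedule with some random jitter, so identical servers don't accumulate identical uptime; the hostnames, spacing and start date are placeholders:

```python
#!/usr/bin/env python
# Generate a staggered restart schedule with random jitter so that
# identical servers do not all hit an uptime-related bug at the same
# moment. Hostnames, spacing and start date below are placeholders.
import random
from datetime import datetime, timedelta

SERVERS = ["db01", "db02", "db03", "db04"]   # hypothetical hosts
BASE_SPACING = timedelta(days=2)             # days between planned restarts
MAX_JITTER = timedelta(hours=12)             # extra randomness per host

start = datetime(2012, 9, 1, 3, 0)           # first maintenance window
for i, server in enumerate(SERVERS):
    jitter = timedelta(seconds=random.randint(0, int(MAX_JITTER.total_seconds())))
    when = start + i * BASE_SPACING + jitter
    print("%s: restart around %s" % (server, when.strftime("%Y-%m-%d %H:%M")))
```

The same idea works for cache flushes or any other operation where you would otherwise synchronize the state of a whole fleet at a single point in time.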