Sunday, August 26, 2012

10 things to know when starting out as a sysadmin

This post was inspired by Henrik Warne's post "Top 5 Surprises When Starting Out as a Software Developer". I thought it was a good idea to put together a similar list for sysadmins. I won't call them 'surprises', just 'things to know'. I found them useful when I started out, and I still find them useful today. I won't prioritize them either, because they're all important in their own way.

Backups are good only if you can restore them

You would be right to roll your eyes and call this obvious, but in my experience most people run backups regularly yet never periodically try to restore from them. This matters especially if your backup scheme takes one full backup every N days followed by daily incremental or differential backups: you need to test that you can reconstruct a recent state (yesterday's at a minimum) by applying those incrementals or differentials to the full backup. And remember, if it's not backed up, it's not in production.
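
To make the full-plus-incrementals idea concrete, here is a minimal Python sketch that works out which backups a restore test would need to apply, and in what order. The `Backup` class and `restore_chain` function are hypothetical names for illustration, not part of any backup tool, and the sketch assumes a schedule uses either incrementals or differentials, not a mix of both.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Backup:
    day: date
    kind: str  # "full", "incremental", or "differential" (hypothetical labels)


def restore_chain(backups: List[Backup], target: date) -> List[Backup]:
    """Return the backups to apply, in order, to rebuild the state as of `target`."""
    candidates = sorted((b for b in backups if b.day <= target), key=lambda b: b.day)
    fulls = [b for b in candidates if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup on or before the target date")
    base = fulls[-1]
    later = [b for b in candidates if b.day > base.day]

    # A differential holds every change since the last full backup,
    # so only the most recent one is needed on top of the full.
    diffs = [b for b in later if b.kind == "differential"]
    if diffs:
        return [base, diffs[-1]]

    # Each incremental holds only the changes since the previous backup,
    # so every one of them is needed, oldest first.
    return [base] + [b for b in later if b.kind == "incremental"]
```

Knowing the chain is only the first step. The actual test is to restore those files onto a scratch machine and check that the data is usable.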

If it's not monitored, it's not in production

This is one of those things that you learn pretty quickly, especially when your boss calls you up in the middle of the night telling you the site is down. I wrote before on how in my opinion monitoring is for ops what testing is for dev, and I also wrote how monitoring is the foundation for whipping your infrastructure into shape.

If a protocol has an acronym, you need to learn it

SNMP, LDAP, NFS, NIS, and SMTP are just some examples of such protocols. As a sysadmin, you need to be deeply familiar with them if you want to have any chance of troubleshooting complex issues. I want to single out two protocols that are the most important in my book: DNS and HTTP. Get the RFCs out and study them. And speaking of troubleshooting complex issues...
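
One way to study an RFC is to build a packet by hand. The sketch below assembles a single-question DNS query following the message layout in RFC 1035, section 4.1. The function name `build_dns_query` and the default query ID are my own choices for illustration. It only builds the bytes and does not send anything over the network.

```python
import struct


def build_dns_query(hostname: str, query_id: int = 0x1234, qtype: int = 1) -> bytes:
    """Build a DNS query packet with one question (QTYPE 1 = A record)."""
    # Header: ID, flags (only the "recursion desired" bit set), then
    # QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0, all 16-bit big-endian.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)

    # QNAME: each label is prefixed by its length; a zero byte ends the name.
    qname = b""
    for label in hostname.rstrip(".").split("."):
        encoded = label.encode("ascii")
        if not 0 < len(encoded) <= 63:
            raise ValueError(f"invalid label: {label!r}")
        qname += bytes([len(encoded)]) + encoded
    qname += b"\x00"

    # Question section: QNAME, QTYPE, QCLASS (1 = IN, the Internet class).
    return header + qname + struct.pack("!HH", qtype, 1)
```

Comparing these bytes with what tcpdump shows for a real lookup is a good exercise in matching the RFC text to the wire format.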

The most important skill you need to master is problem solving

The issues you'll face in your career as a sysadmin will get more and more complex, in direct relation with the complexity of the infrastructures you'll build and maintain. You need to be able to analyze a problem and come up with several variables that could cause the issue, then eliminate the variables one by one until you discover the root cause. This one-by-one variable elimination strategy is really important, and I've been struck throughout my career by how many people have never mastered it, and instead flail around hopelessly when faced with a non-trivial issue.
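
The elimination strategy can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the helper `first_failing` is a hypothetical name, and each check is any function you write that returns True when that variable is ruled out as the cause.

```python
from typing import Callable, List, Optional, Tuple

# A check pairs a label with a function returning True when that layer is healthy.
Check = Tuple[str, Callable[[], bool]]


def first_failing(checks: List[Check]) -> Optional[str]:
    """Run the checks in order and return the name of the first one that fails.

    Returns None when every check passes, meaning none of the listed
    variables explains the problem and the list needs new candidates.
    """
    for name, check in checks:
        if not check():
            return name
    return None
```

For a site that does not load, the list might be ordered from the bottom up: does the name resolve, does the port accept connections, does the server return a response. The ordering is the important part, because each passing check removes one variable before you look at the next.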

You need at least 2 of everything in production

As soon as you are in charge of a non-trivial Web site, you realize that you need to eliminate single points of failure as much as possible. It starts with border routers, it continues with firewalls and load balancers, then web/app/database servers, and network switches that tie everything together. All of a sudden, you have a pretty complex infrastructure to build and maintain.

One of the most important things you can do in this context is to test the failover of the various devices (firewalls, load balancers, routers, switches), which are usually in an active/passive configuration. I've been bitten many times by forced failovers (when the active device unexpectedly failed) that didn't go well because the passive device wasn't configured properly or wasn't syncing properly from the active one.

I also want to mention in this context the necessity of deeply understanding how networks work at both Layer 2 (MAC) and Layer 3 (IP routing). You can only fake an understanding of these issues for so long. The most subtle and hard-to-solve issues I've faced in my career as a sysadmin have all been networking issues (which for some reason involved ARP tables many times). You need to become best friends with tcpdump.
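
Since ARP tables come up so often, here is a small Python sketch that reads text in the format of Linux's /proc/net/arp and lists MAC addresses that answer for more than one IP. The function names and the sample table are made up for illustration. A MAC with several IPs is not always a problem (routers and hosts with aliases do this legitimately), so treat the output as a list of things to look at, not a verdict.

```python
from collections import defaultdict
from typing import Dict, List

ATF_COM = 0x2  # flag bit meaning the ARP entry is complete

# Made-up sample in the column layout of /proc/net/arp.
SAMPLE = """\
IP address       HW type     Flags       HW address            Mask     Device
10.0.0.1         0x1         0x2         00:16:3e:aa:bb:01     *        eth0
10.0.0.20        0x1         0x2         00:16:3e:aa:bb:02     *        eth0
10.0.0.21        0x1         0x2         00:16:3e:aa:bb:02     *        eth0
10.0.0.30        0x1         0x0         00:00:00:00:00:00     *        eth0
"""


def parse_arp_table(text: str) -> List[dict]:
    """Parse /proc/net/arp style text into a list of entries, skipping the header."""
    entries = []
    for line in text.splitlines()[1:]:
        fields = line.split()
        if len(fields) != 6:
            continue  # ignore blank or malformed lines
        ip, _hw_type, flags, mac, _mask, device = fields
        entries.append(
            {"ip": ip, "flags": int(flags, 16), "mac": mac.lower(), "device": device}
        )
    return entries


def macs_with_multiple_ips(entries: List[dict]) -> Dict[str, List[str]]:
    """Map each MAC that appears for more than one complete entry to its IPs."""
    by_mac = defaultdict(list)
    for entry in entries:
        if entry["flags"] & ATF_COM:  # skip incomplete entries
            by_mac[entry["mac"]].append(entry["ip"])
    return {mac: ips for mac, ips in by_mac.items() if len(ips) > 1}
```

On a Linux host you could feed it `open("/proc/net/arp").read()` instead of `SAMPLE`.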


Keep your systems secure

The days when telnet was enabled by default in most OSes are long gone, but you still need to worry about security issues. Fortunately there are simple things you can do that go a long way towards improving the security of your infrastructure -- things like putting firewalls in front of everything and only allowing the ports necessary for your production traffic, disabling services you don't need on your servers, monitoring your logs for unauthorized access attempts, and not running Windows (just kidding, kinda).
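
As an example of monitoring logs for unauthorized access attempts, the Python sketch below counts failed SSH password logins per source address. It assumes the usual OpenSSH "Failed password for ... from ... port ..." message format, and the function names and the threshold of 3 are arbitrary choices for illustration.

```python
import re
from collections import Counter
from typing import Iterable, List

# Matches OpenSSH lines such as:
#   sshd[1234]: Failed password for invalid user admin from 203.0.113.5 port 52311 ssh2
FAILED_LOGIN = re.compile(
    r"sshd\[\d+\]: Failed password for (?:invalid user )?(\S+) from (\S+) port \d+"
)


def failed_logins_by_source(lines: Iterable[str]) -> Counter:
    """Count failed password attempts per source address."""
    counts: Counter = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return counts


def suspicious_sources(lines: Iterable[str], threshold: int = 3) -> List[str]:
    """Return the sorted source addresses with at least `threshold` failures."""
    counts = failed_logins_by_source(lines)
    return sorted(ip for ip, n in counts.items() if n >= threshold)
```

You could run it over your auth log (the path varies by distribution) from cron and mail yourself the result, or use it as a starting point before adopting a dedicated tool.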

One issue I faced in this context was applying security patches to various OSes. You need to be careful when doing this, and make sure you test those patches in staging before applying them in production (otherwise you run the risk of rebooting a production server and having it not come back because of the effects of that patch -- trust me, it happens).

Logging is your best friend

Logging goes hand in hand with monitoring as one of those sine qua non conditions for having a good grasp of what's going on with your infrastructure. You'll learn soon that you need to have a strategy for logging, in most cases a central log server where you send logs from your other systems. There are tools such as Flume and Scribe that help these days, but even good old syslog-ng works just fine for this purpose. Logging by itself is not enough -- you also need to monitor the logs and send alerts when you identify error conditions. It's not easy, but it needs to be done.
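
On the application side, sending logs to a central server can be as simple as the following Python sketch, which uses the standard library's `SysLogHandler`. The function name `build_logger` and the default address are placeholders I picked for illustration. In practice you would pass the address of your own log server, which needs to be listening for syslog over UDP.

```python
import logging
import logging.handlers
from typing import Tuple


def build_logger(name: str, server: Tuple[str, int] = ("127.0.0.1", 514)) -> logging.Logger:
    """Return a logger that forwards records to a syslog server over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)

    # SysLogHandler uses UDP by default when given a (host, port) address.
    handler = logging.handlers.SysLogHandler(address=server)
    # Prefix each message with the logger name so the central server
    # can tell applications apart.
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger
```

A call such as `build_logger("billing").error("payment gateway timeout")` would then show up on the central server, where the alerting rules live.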

You need to know a scripting language

You can only go so far in your sysadmin career if you don't master a decent scripting language. I started with Perl (after programming in C/C++ for a living for several years) but discovered Python in 2004 and never looked back. Ruby will do the trick too. You don't need to be a ninja programmer, but you need to have decent skills -- know how to split a program into modules, know how to use OOP techniques, know enough of the language to be able to read and extend other people's code, and maybe most important of all, KNOW HOW TO TEST YOUR CODE! Always test your code in staging before you put it in production.
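
To show what testing your code can look like at the smallest scale, here is a Python sketch of a typical sysadmin helper together with its unit tests, using the standard `unittest` module. The `parse_size` function is a made-up example, not taken from any real tool.

```python
import unittest

_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text: str) -> int:
    """Convert a size such as '512', '10K', '2M', or '1G' to a number of bytes."""
    text = text.strip().upper()
    if not text:
        raise ValueError("empty size")
    if text[-1] in _UNITS:
        return int(text[:-1]) * _UNITS[text[-1]]
    return int(text)


class ParseSizeTest(unittest.TestCase):
    def test_plain_bytes(self):
        self.assertEqual(parse_size("512"), 512)

    def test_suffixes(self):
        self.assertEqual(parse_size("10K"), 10 * 1024)
        self.assertEqual(parse_size("2m"), 2 * 1024 ** 2)
        self.assertEqual(parse_size(" 1G "), 1024 ** 3)

    def test_rejects_garbage(self):
        with self.assertRaises(ValueError):
            parse_size("ten gigs")
        with self.assertRaises(ValueError):
            parse_size("")
```

Saved in a file, the tests can be run with `python -m unittest`. The point is less the helper itself than the habit: the tests pin down the edge cases (lowercase suffixes, stray whitespace, bad input) before the script ever touches a production box.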

Document everything

This is very important when you start out because you learn something new every day. Write it down (I used to do it with old-fashioned pen and paper) but also share it with your team. Wikis are decent for this purpose, although they become hard to organize as they grow. But having some sort of searchable knowledge base is definitely 'a good thing', especially as your team grows and new people need to be brought up to speed. Of course, these days you can also use 'executable documentation' in the form of Chef recipes or Puppet manifests.

And speaking of teams...

Always try to be a leader

You start out on the bottom rung of the ladder, but you can still be a leader. I once saw a definition of leadership that really resonated with me: "a leader is somebody who makes something happen which otherwise wouldn't happen". There are countless opportunities to do just that even if you are just starting out in your career. If something is hard (or 'not that fun') and people on your team either postpone it or seem to just forget to do it, that's a good sign you need to step up and be a leader and do it. You will help your team and you will help yourself in the process.

One thing you can make happen (for example by blogging) is to share lessons that you've learned the hard way. Many of the solutions I've found to thorny issues I've faced have come from blogs, so I am always happy to contribute back to the community by sharing some of my own experiences via blogging. I strongly advise you to do the same.

7 comments:

Atmospheric said...

Excellent stuff Grig!

Unknown said...

Well Done!

silviu dicu said...

true true true !

-thanks

Unknown said...

Really good solid advice to anyone, I think it really helps to show the Ops job is much more than building systems and pushing code.

Jirka said...

Great advice. Thanks for sharing.

Sandeep Netha said...

Awesome posting..

Really, it's very interesting.

Larry said...

Hi,
This one is too good!!! I agree with each and every point you have mentioned. It is always true that the complexity of the issues will increase with the size of the network you are building and using.
