Friday, April 19, 2013

Setting up keepalived with Chef on Ubuntu 12.04

We have 2 servers running HAProxy on Ubuntu 12.04. We want to set them up in an HA configuration, and for that we chose keepalived.

The first thing we did was look for an existing Chef cookbook for keepalived -- luckily, @jtimberman already wrote it. It's a pretty involved cookbook, probably one of the most complex I've seen. The usage instructions are pretty good though. In any case, we ended up writing our own wrapper cookbook on top of keepalived -- let's call it frontend-keepalived.

The usage documentation for the Opscode keepalived cookbook contains a role-based example and a recipe-based example. We took inspiration from both. In our frontend-keepalived/recipes/default.rb file we have:


include_recipe 'keepalived'

node[:keepalived][:check_scripts][:chk_haproxy] = {
  :script => 'killall -0 haproxy',
  :interval => 2,
  :weight => 2
}
node[:keepalived][:instances][:vi_1] = {
  :ip_addresses => '172.30.10.10',
  :interface => 'frontend_if',
  :track_script => 'chk_haproxy',
  :nopreempt => false,
  :advert_int => 1,
  :auth_type => :pass, # :pass or :ah
  :auth_pass => 'mypass'
}


This code overrides the default values for many of the attributes defined in the Opscode keepalived cookbook. It specifies the floating IP address that will be common between the 2 servers that will each run HAProxy (:ip_addresses). It also specifies the network interface where the multicast-based keepalived protocol (:interface) and the 'check script' which tests whether HAProxy is still running on each server.

However, we still needed a way to specify which of the 2 servers is the master and which is the backup (in keepalived parlance), as well as indicating priorities for each server. The usage document in the keep alived cookbook shows this as an example of using a single role to define the master and the backup:


override_attributes(
  :keepalived => {
    :global => {
      :router_ids => {
        'node1' => 'MASTER_NODE',
        'node2' => 'BACKUP_NODE'
      }
    }
  }
)

We couldn't get this to work (if somebody who did reads this, please leave a comment and tell me how you did it!). Instead, we defined 2 roles, one for the master and one for the backup. Here's the master role:


$ cat frontend-keepalived-master.rb
name "frontend-keepalived-master"
description "install keepalived and set state to MASTER"

override_attributes(
    "keepalived" => {
      "instance_defaults" => {
        "state" => "MASTER",
        "priority" => "101"
}
      }
)

run_list(
  "recipe[frontend-keepalived]"
)

Here's the backup role:


$ cat frontend-keepalived-backup.rb
name "frontend-keepalived-backup"
description "install keepalived and set state to BACKUP"

override_attributes(
    "keepalived" => {
      "instance_defaults" => {
        "state" => "BACKUP",
        "priority" => "100"
}
    }
)

run_list(
  "recipe[frontend-keepalived]"
)

Notice that we override 2 attributes, the state and the priority. The defaults for these are in the Opscode keepalived cookbook, under attributes/default.rb


default['keepalived']['instance_defaults']['state'] = 'MASTER'
default['keepalived']['instance_defaults']['priority'] = 100

This was useful in determining how to specify the stanza overriding them in our roles -- it made us see that we needed to specify the instance_defaults key under keepalived in the role files.

At this point, we added the master role to the Chef run_list of server #1 and the backup role to the Chef run_list of server #2. We had to do one more thing on each server (which we'll add to the default recipe of our frontend-keepalived cookbook): per this very helpful blog post on setting up HAProxy and keepalived, we edited /etc/systctl.conf and added:

net.ipv4.ip_nonlocal_bind=1
then applied it via 'sysctl -p'. This was needed so that HAProxy can listen on the keepalived-created 'floating IP' common to the 2 servers, which is not a real IP tied to an existing local network interface.

Once we ran chef-client on each of the 2 servers, we were able to verify that keepalived does its job by pinging the common floating IP from a 3rd server, then shutting down the network interface 'frontend_if' on each server, with no interruption in the ICMP responses sent from the floating IP. Our next step is to do some heavy-duty testing involving HTTP requests handled by HAProxy, and see that there is no interruption in service when we fail over from one HAProxy server to the other.

UPDATE

My colleague Zmer Andranigian discovered an attribute in the Opscode keepalived cookbook that deals with the sysctl setup. The default value for this attribute is:

default['keepalived']['shared_address'] = false

If this attribute is set to 'true' (for example in one of the 2 roles we defined above), then the keepalived cookbook will create a file called /etc/sysctl.d/60-ip-nonlocal-bind.conf containing:

net.ipv4.ip_nonlocal_bind=1

and will also set it in the running configuration of sysctl.

For reference, the role frontend-keepalived-master would contain the following attributes:


override_attributes(
    "keepalived" => {
      "instance_defaults" => {
        "state" => "MASTER",
        "priority" => "101"
}
      "shared_address" => "true"
   }
)




Tuesday, April 02, 2013

Using wrapper cookbooks in Chef

Not sure if this is considered a Chef best practice or not -- I would like to get some feedback, hopefully via constructive comments on this blog post. But I've started to see this pattern when creating application-specific Chef cookbooks: take a community cookbook, include it in your own, and customize it for your specific application.

A case in point is the haproxy community cookbook. We have an application that needs to talk to a Riak cluster. After doing some research (read 'googling around') and asking people on Twitter (because Tweeps are always right), it looks like the preferred way of putting a load balancer in front of Riak is to run haproxy on each application server that needs to talk to Riak, and have haproxy listen on 127.0.0.1 on some port number, then load balance those requests to the Riak backend. Here is an example of such an haproxy.cfg file.

So what I did was to create a small cookbook called haproxy-riak, add a default.rb recipe file that just calls

include_recipe haproxy

and then customize the haproxy.cfg file via a template file in my new cookbook (I actually didn't do it via the template yet, only via a hardcoded cookbook file, but my colleague and Chef guru Jeff Roberts is working on templatizing it).

I also added

depends haproxy

to metadata.rb.

I think this is pretty much all that's needed in order to create this 'wrapper' cookbook. This cookbook can then be used by any application that needs to talk to Riak. As a matter of fact, we (i.e. Jeff) are thinking that we should have such a cookbook per application (so we would call it app1-haproxy-riak for example) just so we can do things like search for different Riak clusters that we may have for different types of applications, and populate haproxy.cfg with the search results.

In any case, I would be curious to find out if other people are using this 'pattern', or if they found other ways to apply the DRY principle in their Chef cookbooks. Please leave a comment!

Monday, March 18, 2013

Installing ruby 2.0 on Ubuntu 12.04

I wanted to find a way to install ruby 2.0 on Ubuntu 12.04 via Chef, without going the rvm route, which is harder to automate. After several tries, I finally found a list of .deb packages which are necessary and sufficient as pre-requisites. Writing it down here for my future reference, and maybe it will prove useful to others out there.

I downloaded most of these packages from packages.ubuntu.com (if you google their names you'll find them), then I installed them via 'dpkg -i'. Here they are:


libffi5_3.0.9-1_amd64.deb
libssl0.9.8_0.9.8o-7ubuntu3.1_amd64.deb
zlib1g-dev_1.2.3.4.dfsg-3ubuntu4_amd64.deb

The actual ruby .deb package was built with fpm by my colleague Jeff Roberts courtesy of this blog post.




Wednesday, March 06, 2013

No snowflakes allowed

"Snowflake" is a term I learned from my colleague Jeff Roberts. It is used in the Chef community (maybe in the configuration management community at large as well) to designate a server/node that is 'unique', i.e. not in configuration management control. In a Chef environment, it means that the node in question was never added to Chef and never had chef-client run on it.

We've all been in situations where it seems overkill to go through the effort of automating the setup of a server. Maybe the server has a unique purpose within our infrastructure. Maybe we didn't feel like spending the time to create Chef recipes for that server. Whatever the reasoning, it seemed low-risk at the time.

Well, I am here to tell you there is danger in this way of thinking. Example: we deployed a server in EC2 manually. We installed the Sensu client on it manually and pointed it at our Sensu server. Everything seemed fine. Then one day we updated our Sensu configuration (via Chef) both on the Sensu server and on all the Sensu clients. Of course, the Sensu configuration on our snowflake server never got updated, since chef-client wasn't running on that server. As a result, the Sensu client wasn't checking in properly with the Sensu server, and the snowflake behaved as if it was falling off the map as far as our monitoring system was concerned. We had to manually update Sensu on the snowflake to bring it in sync with our configuration changes.

Basically, the result of having snowflake servers is that they do fall off the map as far as the overall automation of your infrastructure is concerned. They suffer bitrot, and you end up spending lots of time on their care and feeding, thus defeating the purpose of saving the time to automate them in the first place.

This being said, it's hard to be disciplined enough to run chef-client periodically on every single server in your infrastructure. I've never been able to do that before, but we are doing it now, mostly because of the insistence of Jeff. I do see the advantages of this discipline, and I do recommend it to everybody.

Tuesday, March 05, 2013

Video and slides for 'Five years of EC2 distilled' talk

On February 19th I gave a talk at the Silicon Valley Cloud Computing Meetup about my experiences and lessons learned while using EC2 for the past 5 years or so. I posted the slides on Slideshare and there's also a video recording of my presentation. I think the talk went pretty well, judging by the many questions I got at the end. Hopefully it will be useful to some people out there who are wondering if EC2 or The Cloud in general is a good fit for their infrastructure (short answer: it very much depends).

I want to thank Sebastian Stadil from Scalr for inviting me to give the talk (he first contacted me about giving a talk like this in -- believe it or not -- early 2008!). I also want to thank Adobe for hosting the meeting, and CDNetworks for sponsoring the meeting.

Thursday, February 07, 2013

A workflow of managing Chef with knife

My colleague Jeff Roberts has been trying hard to teach me how to properly manage Chef cookbooks, roles, and nodes using the knife utility. Here is a workflow for doing this in a way that aims to be as close as possible to 'best practices'. This assumes that you already have a Chef server set up.

1) Install chef on your personal machine

In my case, my personal laptop is running OS X (gasp! I used to be a big-time Ubuntu-everywhere user, but I changed my mind after being handed a MacBook Air).

I won't go into the gory details on installing chef locally, but here are a few notes:

  •  I installed the XCode command line tools, then I installed Homebrew. Note that plain XCode wasn't sufficient, I had to download the dmg package for the command line tools and install that.
  •  I installed git via brew.
  •  I installed chef via
  •  I installed the EC2 plugin for knife via
    • cd /opt/chef/embedded/bin/; sudo ./gem install knife-ec2
2) Clone chef-repo locally

Best practices dictate that you keep your chef-repo directory structure in version control. If you are using git, like we do, then you need to clone that locally via a git clone command.

3) Deploy chef client and validation keys

The keys are kept in chef-repo/.chef in our case. You need 2 keys: your_username.pem and validation.pem. You need to coordinate with your Chef server administrator to get them. A good way of passing keys around is to encrypt them on encypher.it and send the link in an email, and communicate the decryption  by some out of band mechanism (such as voice).

4) Configure knife via knife.rb

You need a knife.rb file which sits in chef-repo/.chef as well. Here's a sample (replace username with your actual username, and set the proper EC2 access keys):
log_level :info
log_location STDOUT
node_name 'username'
client_key '/Users/username/chef-repo/.chef/username.pem'
validation_client_name 'chef-validator'
validation_key '/Users/username/chef-repo/.chef/validation.pem'
chef_server_url 'http://chef.example.com:4000'
cache_type 'BasicFile'
cache_options( :path => '/tmp/checksums' )
cookbook_path [ './cookbooks' ]
# EC2:
knife[:aws_access_key_id] = "xxxxxxxxxx"
knife[:aws_secret_access_key] = "XXXXXXXXXXXXXXXXXXX"
5) Test your knife setup

An easy way to see if knife can communicate properly with the Chef server at this point is to list the nodes in your infrastructure via

knife node list

If this doesn't work, you need to troubleshoot it until you make it work ;-)

BTW, in my case, I need to run knife while in the chef-repo directory, for it to properly read the files in the .chef subdirectory.

6) Create a new cookbook

For this example, I'll create a cookbook called myblog. The idea is to install nginx and Octopress.

The proper command to use is:

# knife cookbook create myblog

This will create a directory called myblog under chef-repo/cookbooks, and it will populate it with files and subdirectories pertaining to that cookbook (such as attributes, definitions, recipes, etc).

7) Download any other required cookbooks

For this example, I will download the nginx cookbook from the Opscode community cookbooks. I first search for the nginx cookbook, then I install it:

# knife cookbook site search nginx
# knife cookbook site install nginx


Once the nginx cookbook is installed locally, you still need to upload it to the Chef server:

# knife cookbook upload nginx

8) Create recipe for installing nginx and Octopress in new cookbook

Now that the pre-requisite cookbook is installed and uploaded to Chef, you can use it in your custom cookbook. You need to add references to the pre-requisite cookbook (nginx) in the following 2 files under cookbooks/myblog:


Add this to metadata.rb:

depends "nginx"


Add this to README.md:

Requirements
============
* nginx


The actual custom recipe for myblog lives in cookbooks/myblog/recipes/default.rb. In my case, here's what I do to install Octopress:


include_recipe 'nginx'
include_recipe 'ruby::1.9.1'

# set default ruby to point to 1.9.1 (which is actually 1.9.3!)
system("update-alternatives --install /usr/bin/ruby ruby /usr/bin/ruby1.9.1 400 --slave /usr/share/man/man1/ruby.1.gz ruby.1.gz /usr/share/man/man1/ruby1.9.1.1.gz --slave /usr/bin/ri ri /usr/bin/ri1.9.1 --slave /usr/bin/irb irb /usr/bin/irb1.9.1 --slave /usr/bin/rdoc rdoc /usr/bin/rdoc1.9.1")

# install bundler via gems
system("gem install bundler")

# get octopress source code and install it via bundle and rake
system("cd /opt/; git clone git://github.com/imathis/octopress.git octopress")
system("cd /opt/octopress; bundle install; rake install")


It's a pretty convoluted way of installing Octopress, and it requires installing version 1.9.1 of ruby via the Opscode ruby cookbook first. It took me a few tries to get it right, but it seems to do the job, although I know running system commands on the remote node is not the preferred way of configuring nodes with Chef.

9) Upload new cookbook to Chef server

In order for the new cookbook you created to be available, you need to upload it to the Chef server:

# knife cookbook upload myblog

10) Test new cookbook by deploying new node

At this point, you are ready to test your shiny new cookbook. I did this by launching a new EC2 instance associated with the recipe in the myblog cookbook.

What follows is a command line using the knife EC2 plugin which took me a few tries to get right. It works for me, so I hope it will work for you too if you ever decide to do something similar. I had to dig into the knife-ec2 source code to get to some of these options, since they aren't documented in the README.

# knife ec2 server create -r "role[base], recipe[myblog]" -I ami-0d153248 --flavor c1.medium --region us-west-1 -g sg-7babb117 -i ~/.ssh/mykey.pem -x ubuntu -N myblog -S mykey -s subnet-ca6d20a3 -T T=techblog --ebs-size 50

This tells knife to launch an Ubuntu 12.04 instance (the -I AMI_ID option) associated with a 'base' role and the 'myblog' recipe (the -r option), size c1.medium (the --flavor option), in the us-west-1 region (the --region option), in a given security group (the -g option) and a given VPC subnet (the -s option), using the mykey.pem to ssh into the instance (the -i option -- where mykey.pem is the private key corresponding to the keypair you specify with the -S option) as user ubuntu (the -x option), using the mykey keypair name (the -S option -- this is a keypair that you must already have created), with a Chef node name of myblog (the -N option), an EC2 tag of techblog (the -T option), and finally an EBS root volume size of 50 GB (the --ebs-size option). Whew.

If everything goes well, you'll see something similar to this:

Instance ID: i-3c98b265
Flavor: c1.medium
Image: ami-0d153248
Region: us-west-1
Availability Zone: us-west-1b
Security Group Ids: sg-7babb117
Tags: TtechblogNametechblog
SSH Key: mykey

Waiting for server................
Subnet ID: subnet-ca7e10a3
Private IP Address: 10.10.14.4

followed by the output of an ssh session in which chef-client will run on the newly created instance. You'll be able to see if the chef-client run was successful or not. In either case, you should able to ssh into the new instance with the mykey.pem private key.

11) Commit any new or modified cookbooks

Now that you tested you cookbooks (both the pre-requisite ones such as nginx, and new ones such as myblog), you need to commit them to the chef-repo git repository so other members of your team can take advantage of them. You do this with git add, git commit and git push.

12) Other useful knife commands

You should be able to get information from the Chef server about the new node you just launched by running:

# knife node show techblog
Node Name:   techblog
Environment: _default
FQDN:        
IP:          10.10.14.4
Run List:    role[base],  recipe[myblog]
Roles:       base, sysadmin_sudoers
Recipes:     apt, ntp, timezone, chef-client::service, chef-client::delete_validation, base-apps, users::sysadmins, sudo, nagios-plugins, ruby, rubygems, sensu::client, myblog
Platform:    ubuntu 12.04
Tags:        

You can also edit the run list of a given node by running:

# knife node edit techblog

(you need to set your EDITOR variable to your favorite editor first).

To inspect a given role, use:

# knife role show monitoring
chef_type:            role
default_attributes:   
description:          Installs the sensu monitoring client and related software
env_run_lists:        
json_class:           Chef::Role
name:                 monitoring
override_attributes:  
run_list:            
    recipe[nagios-plugins]
    recipe[ruby]
    recipe[rubygems]
    recipe[sensu::client]

There are many other knife commands you can use -- in fact, using knife to its full potential is an art in itself. Here is a sample of knife commands, courtesy of our Chef guru Jeff Roberts:

This command searches sensu.client.subscriptions and finds node that are running the mysql check.

knife search node "sensu_client_subscriptions:mysql"  

Show the sensu subscriptions for the jira.corp node.

knife node show jira.corp -a sensu.client.subscriptions

Show the EC2 attrs for the test_box node.

knife node show test_box -a "ec2"

Search all nodes and find ones in the "us-west-*" availability zone.

knife search node "ec2_placement_availability_zone:us-west-*" -a "ec2"

Search for all nodes in the role, "webserver" and show the "apache.sites" attribute.

knife search node "role:webserver" -a apache.sites

List all of the versions of the cookbook "nginx".

knife cookbook show nginx

Find all of the nodes in the "prod" environment.

knife search node "chef_environment:prod"

Find the last next available UID.

knife search users "*:*" -a uid | grep uid | sort

Monday, February 04, 2013

Some gotchas when installing Octopress on Ubuntu

Here are some quick notes I took while trying to install the Octopress blog engine on a box running Ubuntu 12.04. I tried following the official instructions and I chose the RVM method. The first gotcha is that you have to have the development tools (compilers, linkers etc) installed already. So you need to run:

# apt-get install build-essential


Then I ran the recommended commands in order to install rvm and rubygems:

# curl -L https://get.rvm.io | bash -s stable --ruby
# source /usr/local/rvm/scripts/rvm
# rvm install 1.9.3
# rvm use 1.9.3
# rvm rubygems latest



I then installed git and grabbed the octopress source code:
 
# apt-get install git
# git clone git://github.com/imathis/octopress.git octopress
# cd octopress/


When trying to install bundler, I got this error:

# gem install bundler

ERROR:  Loading command: install (LoadError)
   cannot load such file -- zlib
ERROR:  While executing gem ... (NameError)
   uninitialized constant Gem::Commands::InstallCommand


Googling around, I found this answer on Stack Overflow which talked about the same error. The solution was to install the zlib1g-dev package, then reinstall rvm so it's aware of zlib, then install bundler.

# apt-get install  zlib1g-dev
# rvm reinstall 1.9.3
# gem install bundler


At this point I was able to continue the installation by running these commands (in the octopress source top directory):

# bundle install
# rake install


That's about it about installing Octopress. I still have to figure out how to front Octopress with nginx, and how to actually start using it.

Friday, February 01, 2013

IT stories from the trenches #2

The year was 1996. I was a graduate student in CS at USC. My first contact with Unix (I had started my career as a programmer in C++ under DOS and later Windows 3.1; yes, I am dating myself big time here!). My first task as a research assistant was to recompile the X server so it can use shared memory extensions. This was part of an experiment that was studying multimedia applications over a high speed proprietary network.

I started optimistically. I got acquainted with the compilers and linkers on the HP-UX platform we were using, I learned all kinds of neat Unix command line utilities, but there was only one glitch -- the X binary wouldn't compile. I tried everything in my power. I even searched on Altavista (no Google in 1996). Nothing worked. Finally, I posted a plea for help on comp.windows.x. A gentle soul by the name of Kaleb Keithley replied first on that thread, then on separate email threads (I was using pine as my email client at the time) and nudged me towards the solution to my issue. It turns out that the X server's frame buffer was not supported on that particular HP workstation where I was trying to compile it. I tried immediately on another type of HP workstation and everything worked like a charm. This was almost 3 months after I started on this project.

Lesson learned? Don't give up. It's a very unpleasant feeling to bang your head against a wall, but in my experience, I noticed over and over again that it's absolutely the best way to learn a new technology or tool. In my case, I knew Unix commands and the Unix development toolchain pretty well at the end of those 3 months. Just persevere in the face of that sinking feeling in the pit of your stomach that another day has passed and you haven't made much progress. There will be an EUREKA moment, I guarantee it.

Another lesson learned: ask for help early and often. These days it's probably on IRC that you would help most quickly, but between IRC, Stack Exchange, Google Groups and Twitter, chances are somebody had already seen the problem you're facing.

As a side note, when you do find a solution to a long-standing problem, do everybody a favor and blog about it. Most often than not, I find solutions to my technical problems by reading blog posts. Even if they bring me only 80% of the way to a solution, it's a huge help.

For the curious, here's a PDF of the paper that resulted from my X server experiments.

Tuesday, January 29, 2013

IT stories from the trenches #1


I thought it might be interesting to tell some stories/vignettes that capture various lessons I learned throughout my career in IT. I call them 'stories from the trenches' because in general the lessons were acquired the hard way (but maybe that's the best way to acquire lessons...).

Here's the first one.

It was my second day on the job as a Unix system architect.

We weren't using LDAP or NIS to centralize user management so we were copying user entries in /etc/passwd and /etc/shadow from one server and pasting them on other servers that we needed new users created on.

On one of these (production) servers I typed 'ci /etc/passwd' instead of 'vi /etc/passwd'. This had the unfortunate effect of invoking the RCS check-in command line utility ci, which then moved '/etc/passwd' to a file named '/etc/passwd,v'. Instead of trying to get back the passwd file, I panicked and exited the ssh shell. Of course, at this point there was no passwd file, so nobody could log in anymore. Ouch. I had to go to my boss, admit my screw-up, and together we took the server down (it was a Solaris server) then booted in single user mode off of an installation CD, mounted /etc and moved passwd,v back to passwd. I mentioned this was a production server, right? DATABASE production server. MAIN database production server. Miraculously, I kept my job.

Anyway, lesson learned? Well, several lessons in fact:

1) use system utilities for creating users and groups, and ssh keys instead of passwords; of course, these days all these menial tasks should be automated via configuration management tools
2) DON'T PANIC. As long as you are logged in as root on a remote system, there's ample opportunity for fixing things that you may have broken.
3) know how to fix things by taking a machine offline in single user mode; it will come in handy one day

Thursday, December 27, 2012

2012 Year in Review: devops blog posts and articles

This is a whirlwind tour of blog posts and articles that I saved in Instapaper in 2012.

January
February
March
July
August

Tuesday, December 18, 2012

15 things I learned from "Shipping Greatness"

I finished reading "Shipping Greatness" by Chris Vander Mey (ex-Googler, ex-Amazonian). It's a great book, and I highly recommend it to anyone who is interested in learning how to ship software as close to schedule as possible, with as few bugs as possible, which is really the best we can hope for in this industry. Here are some things I jotted down while reading the book. You should really read the whole book, if only for the 'in-the-trenches' vignettes from Google and Amazon.

1. When tackling a new product, always start with the needs/problems of your customers (of course Jeff Bezos is famous for this approach at Amazon).

2. Have a mission statement that inspires and that also fits on a t-shirt if possible (for example the mission statement of the personalization team at Amazon is "increase customer delight").

3. Define your product by writing a press release (Amazon does it; it forces you to be succinct and to capture the essence of your product).

4. Avoid building a product that nobody wants. Engage real customers in testing early versions of the product or in giving feedback from on mockups and wireframes.

5. When putting together project schedules, remember that 60% is the best you can hope for in terms of productivity, so adjust the schedules accordingly.

6. Never release on a Friday! You don't want to scramble to find engineers available to fix critical issues over the weekend.

7. Apply the High School Embarrassment Test to the product you want to ship: will you be embarrassed if one of your old high school friends sees your product?

8. Run periodic bug bashes that are scheduled in the project plan and that are short (1 hour or so). Incentivize participation by giving away t-shirts, be grateful for every bug found -- better you find it than your customers!

9. Report on meaningful metrics: for example Google measures "seven-day active user count" (unique users who used the product in the last 7 days) so they can compare week to week.

10. Have a launch checklist - makes sure all moving pieces are accounted for, facilitates communication across different functions of the team (commercial pilots go through a checklist before every flight).

11. When you write email messages, make sure they are "crisp", i.e. clear, succinct, and covering a single topic if possible.

12. Before the daily standup meeting, have the dev lead give a 30-second brief on the status of the project (it sets the context of the meeting and reminds the team of critical things).

13. If one person in a team meeting is being strongly negative and dismissive of other people or of the product, take it offline for a one-on-one meeting rather than arguing in front of the team. 

14. If you need to give a presentation, make it short (10-15 min), deliver one and only one message, and tell a story if you can. Also use blue background with yellow and white font because it looks great on any projector.

15. In terms of getting things done, start by doing things that you, and only you, can do, so you're not blocking others (e.g. approve an expense report).

I really struggled to keep the list somewhat short. The book is full of similar nuggets. Go read it!

Thursday, November 29, 2012

Code performance vs system performance

Just a quick thought: as non-volatile storage becomes faster and more affordable, I/O will cease to be the bottleneck it currently is, especially for database servers. Granted, there are applications/web sites out there which will always have to shard their database layer because they deal with a volume of writes well above what a single DB server can handle (and I'm talking about mammoth social media sites such as Facebook, Twitter, Tumblr etc).  By database in this context I mean relational databases. NoSQL-like databases worth their salt are distributed from the get go, so I am not referring to them in this discussion.

For people who are hoping not to have to shard their RDBMS, things like memcached for reads and super fast storage such as FusionIO for writes give them a chance to scale their single database server up for a much longer period of time (and by a single database server I mostly mean the server where the writes go, since reads can be scaled more easily by sending them to slaves of the master server in the MySQL world for example).

In this new world, the bottleneck at the database server layer becomes not the I/O subsystem, but the CPU. Hence the need to squeeze every ounce of performance out of your code and out of your SQL queries. Good DBAs will become more important, and good developers writing efficient code will be at a premium. Performance testing will gain a greater place in the overall testing strategy as developers and DBAs will need to test their code and their SQL queries against in-memory databases to make sure there are no inefficiencies in the code.

I am using the future tense here, but the future is upon us already, and it's exciting!

Friday, November 09, 2012

Quick troubleshooting of Sensu 'no keepalive from client' issue

As I mentioned in a previous post, we started using Sensu as our internal monitoring tool. We also integrated it with Pager Duty. Today we terminated an EC2 instance that had been registered as a client with Sensu. I started to get paged soon after with messages of the type:

 keepalive : No keep-alive sent from client in over 180 seconds

Even after removing the client from the Sensu dashboard, the messages kept coming. My next step was of course to get on the #sensu IRC channel. I immediately got help from robotwitharose and portertech.  They had me try the following:

1) Try to remove the client via the Sensu API.

I used curl and ran:

curl -X DELETE http://sensu.server.ip.address:4567/client/myclient

2) Try to retrieve the client via the Sensu API and make sure I get a 404

curl -v http://sensu.server.ip.address:4567/client/myclient

This indeed returned a 404.

3) Check that there is a single redis process running

BINGO -- when I ran 'ps -def | grep redis', the command returned TWO redis-server processes! I am not sure how they got to be both running, but this solved the mystery: sensu-server was talking to one redis-server process, and sensu-api was talking to another. When the client was removed via the sensu-api, the Sensu server was still seeing events sent by the client, such as this one from /var/log/sensu/sensu-server.log:


{"timestamp":"2012-11-10T01:41:14.154418+0000","message":"handling event","event":{"client":{"subscriptions":["all"],"name":"myclient","address":"10.2.3.4","timestamp":1352502348},"check":{"name":"keepalive","issued":1352511674,"output":"No keep-alive sent from client in over 180 seconds","status":2,"history":["2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2"],"flapping":false},"occurrences":305,"action":"create"},"handler":{"type":"pipe","command":"/etc/sensu/handlers/ops_pagerduty.rb","api_key":"myapikey","name":"ops_pagerduty"},"level":"info"}

To actually solve this, I killed the 2 redis-server processes (since 'service redis-server stop' didn't seem to do it), then stopped sensu-server and sensu-api, then started redis-server, and finally started sensu-server and sensu-api again.

At this point, the Sensu dashboard showed the 'myclient' client again. I removed it one more time from the dashboard (I could have done it via the API too) and it finally went away for good.

This was quite some obscure issue. I wouldn't have been able to solve it were it not for the awesomeness of the #sensu IRC channel (and kudos to the aforementioned robotwitharose and portertech!)

I hope google searches for 'sensu no keepalive from client' will result in this blog post helping somebody out there! :-)

Sensu rocks BTW.

Saturday, October 27, 2012

Monitoring doesn't have to suck

This is a follow up to my previous post, where I detailed some of the things I want in a modern monitoring tool. A month has passed, and I thought I'd give a quick overview of some tools we started to use as part of our monitoring and graphing strategy.

Several people recommended Sensu, so my colleague Jeff Roberts has given it a shot, and he liked what he saw (blog post with more technical details hopefully soon from Jeff!). We're still tinkering with it, but we're already using it to monitor our Linux-based machines (both Ubuntu and CentOS). Jeff is working on our Chef infrastructure, and soon we'll be able to deploy the Sensu client via Chef. What's nice about Sensu, and different from other tools, is the queueing mechanism it uses for client-server communication, and for posting events such as 'send this metric to Graphite' or 'send this alert to Pager Duty' or 'send this notification to this email address'. It does have a few rough edges, but @portertech seems to be on IRC 24x7, so that helps a lot :-)

Sensu, similar to many other monitoring tools, doesn't have good support for Windows. We looked around and we settled on Server Density for monitoring our Windows machines (mostly used for our backend infrastructure). We also monitor the Sensu server itself with Server Density, and we're looking into adding RabbitMQ-specific checks for Sensu.

We integrated both Sensu and Server Density with Pager Duty, and we consider every alert sent to Pager Duty as critical (which means the engineer on duty has to acknowledge, solve or escalate it). Pager Duty is an amazingly useful service, and most useful of all are probably the escalation processes and policies it provides. No more excuses for people who are on pager that they weren't notified when shit hit the fan!

For non-critical alerts we send email notifications outside of Pager Duty. With Sensu we do this via its mailer handler, and for Server Density by using a profile that goes to an email alias.

Two more monitoring tools we signed up for are Boundary, which we haven't been using in anger yet, but will do so in the next couple of weeks, and New Relic, whose sweet spot seems to be application-level monitoring. We'll deploy some Java app servers soon and we'll try the JVM New Relic plugin at the same time. Boundary will be very useful for looking at network traffic, establishing patterns and baselines, and getting notified when something gets out of whack. It will tell us what we don't know that we don't know.

Update: I forgot to mention Pingdom, which is a very inexpensive but extremely useful external monitoring service. We use it to monitor important user-facing resources (mostly web pages, but Pingdom can also do generic TCP checks, and mail-related checks) so we can get alerted when something is wrong from the perspective of our users, looking outside in so to speak.

For graphing, we deployed Graphite. We haven't started to use it very heavily, but again this will change soon, as we'll send there various business metrics obtained by querying the critical databases within our infrastructure. Jeff already integrated it with Sensu, so we'll be able to use the same queuing mechanism for Graphite as we are using for the monitoring alerts.

So there you have it: Sensu, Pingdom, Server Density, Pager Duty, Boundary, New Relic and Graphite. Modern tools that give a good name to monitoring. No, monitoring no longer sucks.

Sunday, September 30, 2012

What I want in a monitoring tool

I started a new job a few weeks ago, and I'm now at a point where I'm investigating monitoring options. At past jobs I used Nagios, which I know will work, but I would like to look into other more modern tools. I am aware that #monitoringsucks, and I am pretty sure people have hashed these topics before, but here are some of the things I want from a modern monitoring tool:
  • Ideally open source, of if not affordable per host per month pricing (we already signed up as a paying customer of Boundary for example)
  • Installation and configuration should be easily scriptable
    • server installation, as well as addition/modification of clients should be easily automated so it can be done with Puppet/Chef
    • API would be ideal
  • Robust notifications/alerting rules
    • escalations
    • service dependencies
    • event handler scripts
    • alerts based on subsets of hosts/services
      • for example alert me only when 2+ servers of the same type are down
  • Out-of-the-box plugins
    • database-specific checks for example
  • Scalability
    • the monitoring server shouldn't become a bottleneck as more clients are added
      • nagios is OK with 100-200 clients (with passive checks)
    • hierarchy of servers should be supported
    • agent-based clients
  • Reporting/dashboards
    • Hosts/services status dashboards
    • Downtime/outages dashboards
    • Latency (for HTTP checks)
  • Resource graphing would be great
    • but in my experience very few tools do both alerting and resource/metrics graphing well
    • in the past I used Nagios for alerting and Ganglia/Graphite for graphing
  • Integration with other tools
    • Send events to graphing tools (Graphite), alerting tools (PagerDuty), notification mechanisms (irc, Campfire), logging tools (Graylog2)
I also asked a question on Twitter about what monitoring tool people would recommend, and here are some of the tools mentioned in the replies:
  • Sensu
  • OpenNMS
  • Icinga
  • Zenoss
  • Riemann
  • Ganglia
  • Datadog
Several people told me to look into Sensu, and a quick browsing of the home page tells me it would be worth giving it a whirl. So I think I'll do that next. Stay tuned, and also please leave a comment if you know of other tools that might fit the profile I am looking for.

Update: more tools mentioned in comments or on Twitter after I posted a link to this blog post:
  • New Relic (which I am actually in the process of evaluating, having paid for 1 host)
  • Circonus
  • Zabbix
  • Server Density
  • Librato
  • Comostas
  • OpsView
  • Shinken
  • PRTG
  • NetXMS
  • Tracelytics

Sunday, September 23, 2012

3 things to know when starting out with cloud computing

In the same vein as my previous post, I want to mention some of the basic but important things that someone starting out with cloud computing needs to know. Many times people see 'the cloud' as something magical, as the silver bullet that will solve all their scalability and performance problems. These people are in for a rude awakening if they don't pay attention to the following points.

Expect failure at any time

There are no guarantees in the cloud. Failures can and will happen, suddenly and mercilessly. Their frequency will increase as you increase the number of instances that you run in the cloud. It's a sickening feeling to realize that one of your database instances is gone, and there's not much you can do to bring it back. At that point, you need to rely on your disaster recovery plan (and you have a DR plan, don't you?) and either launch a new instance from scratch, or, in the case of a MySQL master server for example, promote a slave to a master. The silver lining in all this is that you need a disaster recovery plan anyway, even if you host your own servers in your own data center space. The cloud just makes failures happen more often, so it will test your DR plan more thoroughly. In a few months, you will be an expert at recovering from such failures, which is not a bad thing.

In fact, knowing that anything can fail in the cloud at any time forces you to architect your infrastructure such that you can recover from failures quickly. This is a worthy goal to have no matter where your infrastructure is hosted.

There's more to expecting failures at any time. You need a solid monitoring system to alert you that failures are happening. You need a solid configuration management system to spin up new instances of a given role quickly and with as little human intervention as possible. All these are great features to have in any infrastructure.

Automation is key

If you only have a handful of servers to manage, 'for' loops with ssh commands in a shell script can still seem like a good and progressive way of running tasks across the servers. This method breaks at scale. You need to use more industrial-grade configuration management tools such as Puppet, Chef or CFEngine. The learning curve can be pretty steep for most of these tools, but it's worth investing your time in them.

Deploying automatically is not enough though. You also need ways to ensure that things have been deployed as expected, and that you haven't inadvertently introduced new bugs. Automated testing of your infrastructure is key. You can achieve this either by writing scripts that check to see if certain conditions have been met, or, even better, by adding those checks to your monitoring system. This way, your monitoring checks are your deployment unit tests.

If the cloud is a kingdom, its APIs are the crown jewels. You need to master those APIs in order to automate the launching and termination of cloud instances, as well as the creation and management of other resources (storage, load balancers, firewall rules). Libraries such as jclouds and libcloud help, but in many situations you need to fall back to the raw cloud provider API. Having a good handle on a scripting language (any decent scripting language will do) is very important.

Dashboards are essential

One of the reason, and maybe the main reason, that people use the cloud is that they want to scale their infrastructures horizontally. They want to be able to add new instances at every layer (web app, database, caching) in order to handle anything that their users throw at their web site. Simply monitoring those instances for failures is not enough. You also need to graph the resources that are consumed, and the metrics that are generated at all layers: hardware, operating system, application, business logic. Tools such as Graphite, Ganglia, Cacti, etc can be of great assistance. Even homegrown dashboards based on tools such as the Google Visualization API can be extremely useful (I wrote about this topic here).

One aspect of having these graphs and dashboards is that they are essential for capacity planning, by helping you in establishing correlations between traffic that hits your web site and resources that are consumed and that can become a bottleneck (such as CPU, memory, disk I/O, network bandwidth etc). If you run your database servers in the cloud for example, you will very quickly realize that disk I/O is your main bottleneck. You will be able to correlate high CPU wait numbers with slowness in your database, which reflects in slowness in your overall application. You will then know that if you reach a certain CPU wait threshold, it's time to either scale your database server farm horizontally (easier said than done especially if you do manual sharding), or vertically (which may not even be possible in the cloud, because you have pretty restrictive choices of server hardware).

As an aside: do not run your relational database servers in the cloud. Disk I/O is so bad, and relational databases are so hard to scale horizontally, that you will soon regret it. Distributed NoSQL databases such as Riak, Voldemort or HBase are a different matter, since they are designed from the ground up to scale horizontally. But for MySQL or PostgreSQL, go for bare metal servers hosted at a good data center.

One last thing about scaling horizontally is the myth of autoscaling. I've very rarely seen companies that are able to scale their infrastructure up and down based on traffic patterns. Many say they do it, but when you get down to it it's mostly a manual process. It's somewhat easier to scale it up, but scaling it down is a totally different matter. You need to worry at that point that resources that you are taking away from your infrastructure are properly accounted for. Let's say you terminate a web server instance behind a load balancer -- is the load balancer aware of the fact that it shouldn't send traffic to that instance any more?

Another aspect of having graphs and dashboards is to see how business metrics (such as sales per day, or people that abandoned their shopping carts, or payment transactions, etc.) are affected by bottlenecks in your infrastructure. This assumes that you do graph business metrics, which you should. Without dashboards, you are flying blind.


I could go on with more caveats about the cloud, but I think that these 3 topics are a good starting point. The interesting thing about all 3 topics though is that they apply to any infrastructure, not only to cloud-based ones. But the scale of the cloud puts these things into their proper perspective and forces you to think them through.



Sunday, August 26, 2012

10 things to know when starting out as a sysadmin

This post was inspired by Henrik Warne's post "Top 5 Surprises When Starting Out as a Software Developer". I thought it was a good idea to put together a similar list for sysadmins. I won't call them 'surprises', just 'things to know'. I found them useful when I started out, and I still find them useful today. I won't prioritize them either, because they're all important in their own way.

Backups are good only if you can restore them

You would be right to roll your eyes and tell yourself this is so obvious, but in my experience most people run backups regularly, but omit to try to restore from those backups periodically. Especially if you have a backup scheme with one full backup every N days followed by either incremental or differential backups every day, it's important to test that you can obtain a recent backup (yesterday's at a minimum) by applying those incrementals or differentials to the full backup. And remember, if it's not backed up, it's not in production.

If it's not monitored, it's not in production

This is one of those things that you learn pretty quickly, especially when your boss calls you up in the middle of the night telling you the site is down. I wrote before on how in my opinion monitoring is for ops what testing is for dev, and I also wrote how monitoring is the foundation for whipping your infrastructure into shape.

If a protocol has an acronym, you need to learn it

SNMP, LDAP, NFS, NIS, SMTP are just some examples of such protocols. As a sysadmin, you need to be deeply familiar with them if you want to have any chance of troubleshooting complex issues. And I want to single out two protocols that are the most important in my book: DNS and HTTP. Get the RFCs out and study them. And speaking of troubleshooting complex issues...

The most important skill you need to master is problem solving

The issues you'll face in your career as a sysadmin will get more and more complex, in direct relation with the complexity of the infrastructures you'll build and maintain. You need to be able to analyze a problem and come up with several variables that could cause the issue, then eliminate the variables one by one until you discover the root cause. This one-by-one variable elimination strategy is really important, and I've been struck throughout my career by how many people have never mastered it, and instead flail around hopelessly when faced with a non-trivial issue.

You need at least 2 of everything in production

As soon as you are in charge of a non-trivial Web site, you realize that you need to eliminate single points of failure as much as possible. It starts with border routers, it continues with firewalls and load balancers, then web/app/database servers, and network switches that tie everything together. All of a sudden, you have a pretty complex infrastructure to build and maintain.

One of the most important things you can do in this context is to test the failover of the various devices (firewalls, load balancers, routers, switches), which are usually in an active/passive configuration. I've been bit many times by forced failovers (when the active device unexpectedly failed) which didn't go well because the passive device wasn't configured properly or wasn't syncing properly from the active one.

I also want to mention in this context the necessity of deeply understanding how networks work both at Layer 2 (MAC) and Layer 3 (IP routing). You can only fake so much a lack of understanding of these issues. The most subtle and hard to solve issues I've faced in my career as a sysadmin have all been networking issues (which for some reason involved ARP tables many times). You need to become best friends with tcpdump.


Keep your systems secure

The days when telnet was enabled by default in most OSes are long gone, but you still need to worry about security issues. Fortunately there are simple things you can do that go a long way towards improving the security of your infrastructure -- things like putting firewalls in front of everything and only allowing the ports necessary for your production traffic, disabling services you don't need on your servers, monitoring your logs for unauthorized access attempts, and not running Windows (just kidding, kinda).

One issue I faced in this context was applying security patches to various OSes. You need to be careful when doing this, and make sure you test out those patches in staging before applying them in production (otherwise you run the risk of rebooting a production server and have it not come back because of the effects of that patch -- trust me, it happens).

Logging is your best friend

Logging goes hand in hand with monitoring as one of those sine-qua-non conditions for having a good grasp of what's going on with your infrastructure. You'll learn soon that you need to have a strategy for logging, in most cases a central log server where you send logs from your other systems. There are tools such as Flume and Scribe that help these days, but even good old syslog-ng works just fine for this purpose. Logging by itself is not enough -- you need to also monitor the logs and send alerts when you identify error conditions. It's not easy, but it needs to be done.

You need to know a scripting language

You can only go so far in your sysadmin career if you don't master a decent scripting language. I started with Perl (after programming in C/C++ for a living for several years) but discovered Python in 2004 and never looked back. Ruby will do the trick too. You don't need to be a ninja programmer, but you need to have decent skills -- know how to split a program into modules, know how to use OOP techniques, know enough of the language to be able to read and extend other people's code, and maybe most important of all, KNOW HOW TO TEST YOUR CODE! Always test your code in staging before you put it in production.

Document everything

This is very important when you start out because you learn something new every day. Write it down (I used to do it with old fashioned pen and paper) but also share it with your team. Wikis are decent for this purpose, although they become hard to organize as they grow. But having some sort of searchable knowledge base is definitely 'a good thing', especially as you team grows and new people need to be brought up to speed. Of course, these days you can also use 'executable documentation' in the form of Chef recipes or Puppet manifests.

And speaking of teams...

Always try to be a leader

You start out on the bottom rung of the ladder, but you can still be a leader. I once saw a definition of leadership that really resonated with me: "a leader is somebody who makes something happen which otherwise wouldn't happen". There are countless opportunities to do just that even if you are just starting out in your career. If something is hard (or 'not that fun') and people on your team either postpone it or seem to just forget to do it, that's a good sign you need to step up and be a leader and do it. You will help your team and you will help yourself in the process.

One thing you can make happen (for example by blogging) is to share lessons that you've learned the hard way. Many of the solutions I've found to thorny issues I've faced have come from blogs, so I am always happy to contribute back to the community by sharing some of my own experiences via blogging. I strongly advise you to do the same.

Monday, August 13, 2012

The dangers of uniformity

This blog post was inspired by the Velocity 2012 keynote given by Dr. Richard Cook and titled "How Complex Systems Fail". Approximately 6 minutes into the presentation, Dr. Cook relates a story which resonated with me. He talks about upgrading hospital equipment, specifically infusion pumps, which perform and regulate the infusion of fluids in patients. Pretty important and critical task. The hospital bought brand new infusion pumps from a single vendor. The pumps worked without a glitch for exactly 1 year. Then, at 20 minutes past midnight, the technician on call was alerted to the fact that one of the pumps stopped working. He fiddled with it, rebooted the equipment and brought it back to life (not sure about the patient attached to the pump though). Then, minutes later, other calls started to pour in. It turns out that approximately 20% of the pumps stopped working around the same time that night. Nightmare night for the technician on call, and we can only hope he retained his sanity.

The cause of the problems was this: the pumps have a series of pretty complicated settings, one of which being the period of time that needs to elapse until a mandatory software upgrade. That period of time was initially set to, you guessed it, 1 year, because it seemed like such a distant point in time. Well, after 1 year, the pumps begged to be upgraded (it's not clear whether that was a manual process to be initiated by the technician, or an automated process) -- but the gotcha was that normal functionality was suspended during the upgrade process, so the pumps effectively stopped working.

This story resonates with me on 2 fronts related to uniformity: the first is uniformity in time (most of the pumps were put in production around the same time), and the second is uniformity or monoculture in setup (and by this I mean single vendor/hardware/OS/software). These 2 aspects can introduce very subtle and hard to avoid bugs, which usually hit you when you least expect it.

I have a few stories of my own to tell regarding these issues.

First story: at Evite we purchased Dell C2100 servers to act as our production database servers. We got them in the summer of 2011 and we set them up in time for our high season of the year, which is late November/early-mid December. We installed Ubuntu 10.04 on all of them, and they performed magnificently, with remarkable uptime -- in fact, once we set them up, we never needed to reboot them. That is, until they started to crash one by one with kernel panic messages approximately 200 days after putting them in production. This seemed to be too much of a coincidence. After consulting with Dell support specialists, we were pointed to this Linux kernel bug which says that the scheduler code for kernel 2.6.32.11 crashes after 200+ days of uptime. We had 2 crashes in 24 hours, after which we preemptively rebooted the other servers during the night and this seemed to solve that particular issue. We also started to update Ubuntu to 12.04 on all servers so we can get an updated kernel version not affected by that bug.

As you can see, we had both uniformity/monoculture in setup (same vendor, same hardware, same OS), and uniformity in time (all servers had been put in production within a few days of each other). This combination hit us hard and unexpectedly.

Second story: one of the Dell C2100 servers mentioned above was displaying an unusual behavior. Every Saturday morning at 9 AM, we would get a monitoring alert about increased CPU I/O wait on that server. This would last for about 1 hour, after which things would return to normal. At first we thought it's a one-off, but after 2 or 3 consecutive Saturdays we looked more deeply on the system and we figured out (with the help of Percona, since those servers were running the Percona MySQL distribution) that the behavior was due to the RAID controller battery discharging and recharging itself as part of a relearning process. This had the effect or disabling the RAID write cache, so the I/O on the system suffered. I wrote more extensively about these battery issues in another post. The lesson here: if you see a cyclic behavior (which belongs to the issue of uniformity in time), investigate your hardware settings, especially the RAID controller!

Third story: this is the well known story of the 'leapocalypse', the addition of a leap second at midnight on July 1st 2012. While we weren't actually affected by the leap second bug, we still monitored our servers nervously as soon as word broke out on Twitter that servers running mostly Java apps, but also some MySQL servers, would become almost unresponsive, with 100% CPU utilization. The fix found by Mozilla was to run the date command and set it to the current date. The fact that they spread this fix to all their affected servers using Puppet also helped them.

Related to this story, it was interesting to read this article on how Google's servers weren't affected. They had been bit by leap second issues in the past, so on days that preceded the introduction of a leap second they introduced 'leap smear' to their NTP servers, adding a few milliseconds to every NTP update so that the overall effect was to get in sync more slowly (of course, they had to run their own modified NTP code for this to work).

The leapocalypse story contains elements of both unformity of time (everybody was affected at the same time due to the nature of the leap second update) and to a lesser degree uniformity of setup (Linux servers were affected because of a bug in the Linux kernel; Solaris and its variants weren't affected).

So...what steps can we take to try to alleviate these issues? I think we can do a few things:

1) Don't restart all servers at the same time prior to putting them in production; instead, introduce a 'jitter' of a couple of days in between restarts. This will give you time to react to 'uniformity in time' bugs, so that not all your servers will exhibit the bug at the same time.

Same strategy applies to clearing caches -- memcached or other cache servers.

2) Don't buy all your servers from the same vendor. This is harder to do, since vendors like to sell you things in bulk, and provide incentives for you to do so. But this would avoid issues with uniformity/monoculture in setup. Of course, even if you do buy servers from a single vendors, you can still install different OSes on them, different versions of the same OS, etc. This is only feasible I think if you use a configuration management tool such as Puppet or Chef, which in theory at least abstracts away the OS from you.

3) Make sure your monitoring is up to par. Monitor every single component of your servers which can be monitored. Pay attention to RAID controller cards! And of course graph the resources you monitor as well, to see spikes or dips that maybe are not caught by your alerting thresholds.

Monday, July 09, 2012

Installing Python scientific and statistics packages on Ubuntu

I tried to install the pandas Python library a while ago using easy_install/pip and I hit some roadblocks when it came to installing all the dependencies. So I tried it again, but this time I tried to install most of the required packages from source. Here are my notes, hopefully they'll be useful to somebody out there.

This is on an Ubuntu 12.04 machine.

Install NumPy

# wget http://downloads.sourceforge.net/project/numpy/NumPy/1.6.2/numpy-1.6.2.tar.gz
# tar xvfz numpy-1.6.2.tar.gz; cd numpy-1.6.2
# cat INSTALL.txt
# apt-get install libatlas-base-dev libatlas3gf-base
# apt-get install python-dev
# python setup.py install



Install SciPy


# wget http://downloads.sourceforge.net/project/scipy/scipy/0.11.0b1/scipy-0.11.0b1.tar.gz
# tar xvfz scipy-0.11.0b1.tar.gz; cd scipy-0.11.0b1/
# cat INSTALL.txt
# apt-get install gfortran g++
# python setup.py install


Install pandas


Prereq #1: NumPy 

- already installed (see above)

Prereq #2: python-dateutil

# wget http://labix.org/download/python-dateutil/python-dateutil-1.5.tar.gz
# tar xvfz python-dateutil-1.5.tar.gz; cd python-dateutil-1.5/
# python setup.py install



Prereq #3: pyTables (optional, needed for HDF5 support)

pyTables was the hardest package to install, since it has its own many dependencies:

numexpr

# wget http://numexpr.googlecode.com/files/numexpr-1.4.2.tar.gz
# tar xvfz numexpr-1.4.2.tar.gz; cd numexpr-1.4.2/
# python setup.py install


Cython

# wget http://www.cython.org/release/Cython-0.16.tar.gz
# tar xvfz Cython-0.16.tar.gz; cd Cython-0.16/
#python setup.py install


HDF5

# wget http://www.hdfgroup.org/ftp/HDF5/current/src/hdf5-1.8.9.tar.gz
# tar xvfz hdf5-1.8.9.tar.gz; cd hdf5-1.8.9/
# ./configure --prefix=/usr/local
# make; make install


pyTables itself

# wget http://downloads.sourceforge.net/project/pytables/pytables/2.4.0b1/tables-2.4.0b1.tar.gz
# tar xvfz tables-2.4.0b1.tar.gz; cd tables-2.4.0b1/
# python setup.py install


Edit 07/10/12: statsmodels is not a prereq, see below.

Prereq #4: statsmodels


Wasn't able to install it, it said 'requires pandas' but this is what I tried:


# wget http://pypi.python.org/packages/source/s/statsmodels/statsmodels-0.4.3.tar.gz
# tar xvfz statsmodels-0.4.3.tar.gz; cd statsmodels-0.4.3/
# python setup.py install --> requires pandas?


Prereq #4: pytz

# wget http://pypi.python.org/packages/source/p/pytz/pytz-2012c.tar.gz
# tar xvfz pytz-2012c.tar.gz; cd pytz-2012c/
# python setup.py install


Prereq #5: matplotlib

This was already installed on my target host during the EC2 instance bootstrap via Chef: 

# apt-get install python-matplotlib

pandas itself

# git clone git://github.com/pydata/pandas.git
# cd pandas
# python setup.py install

NOTE: Ralf Gommers added a comment that statsmodels is not a prerequisite to pandas, but instead needs to be installed once pandas is there. So I did this:

Install statsmodels

# wget http://pypi.python.org/packages/source/s/statsmodels/statsmodels-0.4.3.tar.gz
# tar xvfz statsmodels-0.4.3.tar.gz; cd statsmodels-0.4.3/
# python setup.py install


Finally, if you also want to dabble into machine learning algorithms:

Install scikit-learn

# wget http://pypi.python.org/packages/source/s/scikit-learn/scikit-learn-0.11.tar.gz
# tar xvfz scikit-learn-0.11.tar.gz; cd scikit-learn-0.11/
# python setup.py install