Friday, April 19, 2013

Setting up keepalived with Chef on Ubuntu 12.04

We have 2 servers running HAProxy on Ubuntu 12.04. We want to set them up in an HA configuration, and for that we chose keepalived.

The first thing we did was look for an existing Chef cookbook for keepalived -- luckily, @jtimberman already wrote it. It's a pretty involved cookbook, probably one of the most complex I've seen. The usage instructions are pretty good though. In any case, we ended up writing our own wrapper cookbook on top of keepalived -- let's call it frontend-keepalived.

The usage documentation for the Opscode keepalived cookbook contains a role-based example and a recipe-based example. We took inspiration from both. In our frontend-keepalived/recipes/default.rb file we have:


include_recipe 'keepalived'

node[:keepalived][:check_scripts][:chk_haproxy] = {
  :script => 'killall -0 haproxy',
  :interval => 2,
  :weight => 2
}
node[:keepalived][:instances][:vi_1] = {
  :ip_addresses => '172.30.10.10',
  :interface => 'frontend_if',
  :track_script => 'chk_haproxy',
  :nopreempt => false,
  :advert_int => 1,
  :auth_type => :pass, # :pass or :ah
  :auth_pass => 'mypass'
}
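
One caveat worth noting: with Chef 11, assigning to node attributes directly like this no longer works, because node attributes become immutable outside of the node.set / node.override family of methods. A sketch of the same override written with node.set (which we have not tested ourselves):

# Chef 11-friendly version of the check script override
node.set[:keepalived][:check_scripts][:chk_haproxy] = {
  :script => 'killall -0 haproxy',
  :interval => 2,
  :weight => 2
}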


This code overrides the default values for many of the attributes defined in the Opscode keepalived cookbook. It specifies the floating IP address that will be shared between the 2 servers running HAProxy (:ip_addresses), the network interface over which keepalived's multicast-based VRRP protocol runs (:interface), and the 'check script' that tests whether HAProxy is still running on each server (:check_scripts).
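
For reference, together with the state and priority that the roles below will set, we'd expect these attributes to produce an /etc/keepalived/keepalived.conf along these lines (a sketch based on standard keepalived syntax, not the cookbook's exact template output; the virtual_router_id value is illustrative):

vrrp_script chk_haproxy {
  script "killall -0 haproxy"
  interval 2
  weight 2
}

vrrp_instance vi_1 {
  state MASTER
  interface frontend_if
  virtual_router_id 1
  priority 101
  advert_int 1
  authentication {
    auth_type PASS
    auth_pass mypass
  }
  virtual_ipaddress {
    172.30.10.10
  }
  track_script {
    chk_haproxy
  }
}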

However, we still needed a way to specify which of the 2 servers is the master and which is the backup (in keepalived parlance), as well as a priority for each server. The usage document in the keepalived cookbook shows this example of using a single role to define both the master and the backup:


override_attributes(
  :keepalived => {
    :global => {
      :router_ids => {
        'node1' => 'MASTER_NODE',
        'node2' => 'BACKUP_NODE'
      }
    }
  }
)

We couldn't get this to work (if somebody reading this did get it to work, please leave a comment and tell me how!). Instead, we defined 2 roles, one for the master and one for the backup. Here's the master role:


$ cat frontend-keepalived-master.rb
name "frontend-keepalived-master"
description "install keepalived and set state to MASTER"

override_attributes(
  "keepalived" => {
    "instance_defaults" => {
      "state" => "MASTER",
      "priority" => "101"
    }
  }
)

run_list(
  "recipe[frontend-keepalived]"
)

Here's the backup role:


$ cat frontend-keepalived-backup.rb
name "frontend-keepalived-backup"
description "install keepalived and set state to BACKUP"

override_attributes(
  "keepalived" => {
    "instance_defaults" => {
      "state" => "BACKUP",
      "priority" => "100"
    }
  }
)

run_list(
  "recipe[frontend-keepalived]"
)

Notice that we override 2 attributes: the state and the priority. The defaults for these are in the Opscode keepalived cookbook, under attributes/default.rb:


default['keepalived']['instance_defaults']['state'] = 'MASTER'
default['keepalived']['instance_defaults']['priority'] = 100

This was useful in figuring out how to structure the overriding stanza in our roles -- it showed us that we needed to nest our overrides under the instance_defaults key, itself under the keepalived key, in the role files.
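
After uploading the roles and running chef-client, one quick way to confirm that the override actually landed on each node is knife's --attribute flag (node name hypothetical):

$ knife node show fe1.example.com -a keepalived.instance_defaults.state

This should print MASTER on server #1 and BACKUP on server #2.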

At this point, we added the master role to the Chef run_list of server #1 and the backup role to the Chef run_list of server #2. We had to do one more thing on each server (which we'll add to the default recipe of our frontend-keepalived cookbook): per this very helpful blog post on setting up HAProxy and keepalived, we edited /etc/sysctl.conf and added:

net.ipv4.ip_nonlocal_bind=1
Then we applied it via 'sysctl -p'. This was needed so that HAProxy can listen on the keepalived-managed 'floating IP' shared by the 2 servers, which is not a real IP tied to an existing local network interface.
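
To make this repeatable, here is a minimal sketch of how the same change could be expressed in frontend-keepalived/recipes/default.rb (the drop-in file name under /etc/sysctl.d is our own choice; as the UPDATE below shows, the keepalived cookbook can also do this for you):

# Reload the setting into the running kernel; action :nothing means
# this only runs when notified by the file resource below.
execute 'load ip_nonlocal_bind' do
  command 'sysctl -p /etc/sysctl.d/60-ip-nonlocal-bind.conf'
  action :nothing
end

# Persist the setting so HAProxy can bind to the floating IP
# across reboots.
file '/etc/sysctl.d/60-ip-nonlocal-bind.conf' do
  content "net.ipv4.ip_nonlocal_bind=1\n"
  notifies :run, 'execute[load ip_nonlocal_bind]', :immediately
end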

Once we ran chef-client on each of the 2 servers, we verified that keepalived does its job: we pinged the shared floating IP from a 3rd server, then shut down the 'frontend_if' network interface on each server in turn, and saw no interruption in the ICMP responses from the floating IP. Our next step is some heavy-duty testing involving HTTP requests handled by HAProxy, to confirm that there is no interruption in service when we fail over from one HAProxy server to the other.
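
For the record, the failover test looked roughly like this (hypothetical session, using the interface and floating IP from above):

# on a 3rd server: start a continuous ping against the floating IP
$ ping 172.30.10.10

# on the current master: take down the keepalived interface to force a failover
$ sudo ifdown frontend_if

# the ping should continue without dropped packets while the backup
# takes over the floating IP; bring the interface back up and repeat
# on the other server to test failback
$ sudo ifup frontend_if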

UPDATE

My colleague Zmer Andranigian discovered an attribute in the Opscode keepalived cookbook that deals with the sysctl setup. The default value for this attribute is:

default['keepalived']['shared_address'] = false

If this attribute is set to 'true' (for example in one of the 2 roles we defined above), then the keepalived cookbook will create a file called /etc/sysctl.d/60-ip-nonlocal-bind.conf containing:

net.ipv4.ip_nonlocal_bind=1

and will also set it in the running configuration of sysctl.

For reference, the role frontend-keepalived-master would contain the following attributes:


override_attributes(
  "keepalived" => {
    "instance_defaults" => {
      "state" => "MASTER",
      "priority" => "101"
    },
    "shared_address" => "true"
  }
)
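
With shared_address set this way, a quick sanity check on each node after a chef-client run would be (hypothetical output):

$ cat /etc/sysctl.d/60-ip-nonlocal-bind.conf
net.ipv4.ip_nonlocal_bind=1
$ sysctl net.ipv4.ip_nonlocal_bind
net.ipv4.ip_nonlocal_bind = 1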




4 comments:

Anonymous said...

To avoid using two roles for MASTER/BACKUP set attributes states and priorities like for router_ids in opscode's sample. E.g.

node.set[:keepalived][:instances][:vi_1] = {
  :states => {
    "node1" => 'MASTER',
    "node2" => 'BACKUP'
  },
  :priorities => {
    "node1" => '101',
    "node2" => '100'
  }
}

Anonymous said...

Should node1, node2 be mapped to the hostnames of the master and backup nodes?

Rob Berger said...

Very helpful. To clarify, would you set

"shared_address" => "true"

on both master and slave or just on master?

Thanks!

Grig Gheorghiu said...

Hi Robert -- glad the post was useful to you. We set shared_address to true on both the master and the slave.
