Friday, October 18, 2013

Avoiding keepalive storms in sensu

Sensu is a great new monitoring tool, but also a bit rough around the edges. We've been willing to live with that, because of its benefits, in particular ease of automation and increased scalability due to its use of a queuing system. Speaking of queueing systems, Sensu uses RabbitMQ for that purpose. We haven't had performance or stability issues with the rabbit, but we have been encountering a pretty severe issue with the way Sensu and RabbitMQ interact with each other.

We have systems deployed across several cloud providers and data centers, with site-to-site VPN links between locations. What started to happen fairly often for us was what we call a "keepalive storm", where all of a sudden all Sensu clients were seen by the Sensu server as unavailable, since no keepalive had been sent by the clients to RabbitMQ.  The thresholds for the keepalive timers in Sensu are hardcoded (at least in the Sensu version we are using, which is 0.10.2) and are defined in /opt/sensu/embedded/lib/ruby/gems/2.0.0/gems/sensu-0.10.2/lib/sensu/server.rb as 120 seconds for warnings and 180 seconds for critical alerts:

             thresholds = {
                :warning => 120,
                :critical => 180

What we think was happening is that the connections between the Sensu clients and RabbitMQ (which in our case is running on the same box as the Sensu server) were reset, either because of a temporary glitch in the site-to-site VPN connection, or because of some other undetermined but probably network-related cause. In any case, this issue was becoming severe and was causing the engineer on pager duty to not get a lot of sleep at night.

After lots of hair-pulling, we found a workaround by specifying a non-default value for the heartbeat parameter in the RabbitMQ configuration file rabbitmq.config. Here's what the documentation says about the heartbeat parameter:

Value representing the heartbeat delay, in seconds, that the server sends in the connection.tune frame. If set to 0, heartbeats are disabled. Clients might not follow the server suggestion, see the AMQP reference for more details. Disabling heartbeats might improve performance in situations with a great number of connections, but might lead to connections dropping in the presence of network devices that close inactive connections.
Default: 600

Note that the default value is 600 seconds, much larger than the 120 and 180 second keepalive thresholds defined in Sensu. So what we did was set a heartbeat value of less than 120. We chose 60 seconds for this value and it seemed to work fine. We still have keepalive storms, but they are definitely due to real but temporary issues in site-to-site VPN connectivity and they usually resolve themselves immediately.

One more thing: we install Sensu via its Chef community cookbook. The Sensu cookbook uses the RabbitMQ community cookbook, which doesn't define the heartbeat parameter as an attribute. We had to add that attribute, as well as use it in the rabbitmq.config.erb template file.

Just for reference, we modified cookbooks/rabbitmq/attributes/default.rb and added:

#avoid sensu keepalive storms!
default['rabbitmq']['heartbeat'] = 60

We also modified cookbooks/rabbitmq/templates/default/rabbitmq.config.erb and added:

{heartbeat, <%= node['rabbitmq']['heartbeat'] %>}


Mathieu Sauve-Frankel said...

Hi, thought I would do you a small favor and get keepalive added to the official cookbook.

Grig Gheorghiu said...

Thanks, Mathieu! I hope your pull request gets accepted!

Alexis said...

Nice post. I hope the suggestions are accepted. This would be a boon. Thanks!

Modifying EC2 security groups via AWS Lambda functions

One task that comes up again and again is adding, removing or updating source CIDR blocks in various security groups in an EC2 infrastructur...