We have systems deployed across several cloud providers and data centers, with site-to-site VPN links between locations. What started to happen fairly often for us was what we call a "keepalive storm", where all of a sudden all Sensu clients were seen by the Sensu server as unavailable, since no keepalive had been sent by the clients to RabbitMQ. The thresholds for the keepalive timers in Sensu are hardcoded (at least in the Sensu version we are using, which is 0.10.2) and are defined in /opt/sensu/embedded/lib/ruby/gems/2.0.0/gems/sensu-0.10.2/lib/sensu/server.rb as 120 seconds for warnings and 180 seconds for critical alerts:
thresholds = {
:warning => 120,
:critical => 180
}
What we think was happening is that the connections between the Sensu clients and RabbitMQ (which in our case is running on the same box as the Sensu server) were reset, either because of a temporary glitch in the site-to-site VPN connection, or because of some other undetermined but probably network-related cause. In any case, this issue was becoming severe and was causing the engineer on pager duty to not get a lot of sleep at night.
After lots of hair-pulling, we found a workaround by specifying a non-default value for the heartbeat parameter in the RabbitMQ configuration file rabbitmq.config. Here's what the documentation says about the heartbeat parameter:
Value representing the heartbeat delay, in seconds, that the server sends in the connection.tune frame. If set to 0, heartbeats are disabled. Clients might not follow the server suggestion, see the AMQP reference for more details. Disabling heartbeats might improve performance in situations with a great number of connections, but might lead to connections dropping in the presence of network devices that close inactive connections.
Default: 600
Note that the default value is 600 seconds, much larger than the 120 and 180 second keepalive thresholds defined in Sensu. So what we did was set a heartbeat value of less than 120. We chose 60 seconds for this value and it seemed to work fine. We still have keepalive storms, but they are definitely due to real but temporary issues in site-to-site VPN connectivity and they usually resolve themselves immediately.
One more thing: we install Sensu via its Chef community cookbook. The Sensu cookbook uses the RabbitMQ community cookbook, which doesn't define the heartbeat parameter as an attribute. We had to add that attribute, as well as use it in the rabbitmq.config.erb template file.
Just for reference, we modified cookbooks/rabbitmq/attributes/default.rb and added:
#avoid sensu keepalive storms!
default['rabbitmq']['heartbeat'] = 60
We also modified cookbooks/rabbitmq/templates/default/rabbitmq.config.erb and added:
{heartbeat, <%= node['rabbitmq']['heartbeat'] %>}