Friday, November 09, 2012

Quick troubleshooting of Sensu 'no keepalive from client' issue

As I mentioned in a previous post, we started using Sensu as our internal monitoring tool. We also integrated it with PagerDuty. Today we terminated an EC2 instance that had been registered as a client with Sensu. Soon after, I started to get paged with messages of the type:

 keepalive : No keep-alive sent from client in over 180 seconds

Even after I removed the client from the Sensu dashboard, the messages kept coming. My next step was of course to get on the #sensu IRC channel. I immediately got help from robotwitharose and portertech. They had me try the following:

1) Try to remove the client via the Sensu API.

I used curl and ran:

curl -X DELETE http://sensu.server.ip.address:4567/client/myclient

2) Try to retrieve the client via the Sensu API and make sure I get a 404

curl -v http://sensu.server.ip.address:4567/client/myclient

This indeed returned a 404.
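
Steps 1 and 2 can also be scripted together. The following is a minimal Python sketch, assuming the Sensu API listens on port 4567 without authentication, as in the curl commands above. The function names (client_url, delete_client, client_exists) are hypothetical helpers, not part of Sensu.

```python
import urllib.error
import urllib.request
from urllib.parse import quote


def client_url(host, name, port=4567):
    """Build the /client/<name> URL used by the curl commands above."""
    return "http://%s:%d/client/%s" % (host, port, quote(name, safe=""))


def delete_client(host, name, port=4567):
    """Send DELETE /client/<name> and return the HTTP status code."""
    req = urllib.request.Request(client_url(host, name, port), method="DELETE")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code


def client_exists(host, name, port=4567):
    """Return False if GET /client/<name> answers 404, True if it answers 200."""
    try:
        with urllib.request.urlopen(client_url(host, name, port), timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise
```

Calling delete_client("sensu.server.ip.address", "myclient") and then checking that client_exists(...) returns False mirrors the two curl commands.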

3) Check that there is a single redis process running

BINGO -- when I ran 'ps -def | grep redis', the command returned TWO redis-server processes! I am not sure how both of them came to be running, but this solved the mystery: sensu-server was talking to one redis-server process, and sensu-api was talking to the other. So although the client had been removed via sensu-api, the Sensu server still had it registered in its own redis-server and kept handling keepalive events for it, such as this one from /var/log/sensu/sensu-server.log:

{"timestamp":"2012-11-10T01:41:14.154418+0000","message":"handling event","event":{"client":{"subscriptions":["all"],"name":"myclient","address":"","timestamp":1352502348},"check":{"name":"keepalive","issued":1352511674,"output":"No keep-alive sent from client in over 180 seconds","status":2,"history":["2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2","2"],"flapping":false},"occurrences":305,"action":"create"},"handler":{"type":"pipe","command":"/etc/sensu/handlers/ops_pagerduty.rb","api_key":"myapikey","name":"ops_pagerduty"},"level":"info"}
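
The fields that matter in that log line can be pulled out with a few lines of Python. This is only a sketch: summarize_event is a hypothetical helper, the sample is a trimmed copy of the entry above that keeps just the fields being read, and it assumes the client "timestamp" is the time of the last keepalive the server received.

```python
import json


def summarize_event(line):
    """Extract client, check, and keepalive age from a 'handling event' log line."""
    event = json.loads(line)["event"]
    client, check = event["client"], event["check"]
    return {
        "client": client["name"],
        "check": check["name"],
        "status": check["status"],
        "occurrences": event["occurrences"],
        # Seconds between the client timestamp and the time the check was issued.
        "seconds_since_keepalive": check["issued"] - client["timestamp"],
    }


# Trimmed copy of the log entry above (only the fields used here).
sample = (
    '{"message":"handling event","event":{"client":{"name":"myclient",'
    '"timestamp":1352502348},"check":{"name":"keepalive","issued":1352511674,'
    '"status":2},"occurrences":305,"action":"create"}}'
)
summary = summarize_event(sample)
print(summary["client"], summary["seconds_since_keepalive"])
```

For this entry the gap is 1352511674 - 1352502348 = 9326 seconds, far beyond the 180-second threshold, which fits a client that was terminated hours earlier.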

To actually solve this, I killed the 2 redis-server processes (since 'service redis-server stop' didn't seem to do it), then stopped sensu-server and sensu-api, then started redis-server, and finally started sensu-server and sensu-api again.
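
After a restart like this, it is worth confirming that exactly one redis-server is left. Below is a minimal sketch, assuming a ps that supports '-eo pid,comm' (as on Linux); count_processes and redis_server_count are hypothetical helper names.

```python
import subprocess


def count_processes(ps_output, name="redis-server"):
    """Count rows of `ps -eo pid,comm` output whose command name equals `name`."""
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) == 2 and parts[1].strip() == name:
            count += 1
    return count


def redis_server_count():
    """Run ps on this host and count redis-server processes."""
    out = subprocess.run(
        ["ps", "-eo", "pid,comm"], capture_output=True, text=True, check=True
    ).stdout
    return count_processes(out)
```

Matching on the command name rather than grepping the full command line avoids counting the grep process itself. Anything other than 1 from redis_server_count() suggests the situation described above.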

At this point, the Sensu dashboard showed the 'myclient' client again. I removed it one more time from the dashboard (I could have done it via the API too) and it finally went away for good.

This was quite an obscure issue. I wouldn't have been able to solve it were it not for the awesomeness of the #sensu IRC channel (and kudos to the aforementioned robotwitharose and portertech!)

I hope Google searches for 'sensu no keepalive from client' will lead to this blog post and help somebody out there! :-)

Sensu rocks BTW.


jeffbmartinez said...

Google brought me here like you hoped. Although I didn't have two redis instances, the curl -X DELETE helped me clear out my 'no keep-alive sent from client' messages.

guizm0 said...

I would like to add to this post. While it was very helpful in troubleshooting the same issue, I also had to flush the redis database (redis-cli flushdb) in order for the alerts to clear from the dashboard.
I'm new to Sensu, but I hope this addition will be useful to those of you out there who (like me) are new to the Sensu monitoring tool.


Grig Gheorghiu said...

Thanks guizm0! Very good tip!

Anonymous said...

Another cause is out-of-sync time between the client and the server. I found my server's clock was a few seconds out, so the threshold for the keepalive timer was always hit.
