Wednesday, July 10, 2013

The mystery of stale haproxy processes

We ran into a situation on our haproxy-based load balancers where monitoring alerts fired because several haproxy processes were running when only one was supposed to be. Looking into it, we determined that each Chef client run (every 30 minutes by default) launched a new haproxy process. The haproxy cookbook applied to that node did a 'service haproxy reload' every time the haproxy configuration file changed. Our haproxy configuration file is rendered from a Chef template populated via a Chef search, and in our case Chef saw the rendered file as changed on every run, so the haproxy reload was happening on each Chef client run.

If you look in /etc/init.d/haproxy, you'll see that reload launches a new haproxy process, while the existing process is supposed to finish serving its existing connections and then exit. The symptom we were seeing, however, was that the existing haproxy process never finished closing its outstanding connections, so it never exited. Inspection via lsof also revealed that the old haproxy process kept many network connections in the CLOSE_WAIT state. I should mention that this particular haproxy box was load balancing requests from Ruby clients across a Riak cluster. After some research, it turned out that haproxy connections stuck in CLOSE_WAIT are a sign that the client side of the connection has gone away while haproxy is still waiting for confirmation that the connection has terminated. See this haproxy mailing list thread for a great in-depth explanation of the issue by the haproxy author, Willy Tarreau.
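
To check for the two symptoms described above (extra haproxy processes and sockets stuck in CLOSE_WAIT), something like the following can be used on the load balancer. This is only a rough sketch: the helper name count_close_wait is made up for illustration, and the usage lines assume that lsof and pgrep are installed and that the processes are named haproxy.

```shell
#!/bin/sh
# Hypothetical helper: count the lines of lsof-style output that show a
# socket in CLOSE_WAIT. It reads from stdin, so it works on live lsof
# output as well as on a saved capture.
count_close_wait() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are no matches, so "|| true" keeps the helper's status at 0.
    grep -c 'CLOSE_WAIT' || true
}

# Usage on the load balancer (assumes lsof and pgrep are available):
#   pgrep -x haproxy | wc -l      # number of haproxy processes; we expected 1
#   lsof -nP -a -c haproxy -iTCP -sTCP:CLOSE_WAIT | count_close_wait
```

A CLOSE_WAIT count that stays high on an old haproxy process long after a reload matches the behavior we saw.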

In short, the solution in our case (per the mailing list thread) was to add

option forceclose

to the defaults section of haproxy.cfg.
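
Since our haproxy.cfg is generated by Chef, it can be useful to confirm that the option really ended up in the defaults section of the rendered file. Below is a minimal sketch of such a check. The function name has_forceclose_in_defaults is hypothetical, the list of section keywords is a simplification, and the /etc/haproxy/haproxy.cfg path in the usage comment is an assumption about where the file lives.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 if the defaults section of the given haproxy
# config file contains "option forceclose", and exits 1 otherwise.
has_forceclose_in_defaults() {
    awk '
        # A line starting with a section keyword opens a new section.
        # Remember whether the section we are now in is "defaults".
        /^(global|defaults|frontend|backend|listen|userlist|peers)([ \t]|$)/ {
            in_defaults = ($1 == "defaults")
        }
        # Only count the option when it appears inside a defaults section.
        in_defaults && /^[ \t]*option[ \t]+forceclose([ \t]|$)/ { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage (the path is an assumption; adjust it to your layout):
#   has_forceclose_in_defaults /etc/haproxy/haproxy.cfg && echo "forceclose is set"
#   haproxy -c -f /etc/haproxy/haproxy.cfg    # syntax-check the file before reloading
```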