Agile Testing: March 2009

I want to start my post by a big shout-out to Willy Tarreau, the author of HAProxy, for his help in fine-tuning one of our HAProxy installations and working with us through some issues we had. Willy is amazingly responsive and obviously lives and breathes stuff related to load balancing, OS and TCP stack tuning, and other arcane subjects ;-)

Let's assume you have a cluster of Apache servers behind an HAProxy and you want to sustain 500 requests/second with low latency per request. First of all, you need to bump up MaxClients and ServerLimit in your Apache configuration, as I explained in another post. In this case you would set both variables to 500. Note that you actually need to stop and start the httpd service, because simply restarting it won't change the built-in limit (which is 256). Also ignore the warning that Apache gives you on startup:

WARNING: MaxClients of 500 exceeds ServerLimit value of 256 servers,
lowering MaxClients to 256. To increase, please see the ServerLimit
directive.

Note that the more httpd processes you have, the more CPU and RAM will be consumed on the server. You need to decide how much to push the envelope in terms of concurrent httpd processes you can sustain on a given server. A good measure is the latency / responsiveness you expect from your Web application. At some point, it will start to suffer, and that will be a sign that you need to add a new Web server to your server farm (of course, this over-simplifies things a bit, since there's always the question of the database layer; I'm assuming you can use memcache to minimize database access.) Here's a good overview of the trade-offs related to MaxClients.

Other Apache configuration variables I've tweaked are StartServers, MinSpareServers and MaxSpareServers. It sometimes pays to bump up the values for these variables, so you can have spare httpd processes waiting around for those peak times when the requests hitting your server suddenly increase. Again, there's a trade-off here between server resources and number of spare httpd processes you want to maintain.

Assuming you fine-tuned your Apache servers, it's time to tweak some variables in the HAProxy configuration. Perhaps the most important ones for our discussion are the number of maximum connections per server (maxconn), httpclose and abortonclose.

It's a good idea to throttle the maximum number of connections per server and set it to a number related to the request/second rate you're shooting for. In our case, that number is 500. Since HAProxy itself needs some connections for healthchecking and other internal bookkeeping, you should set the maxconn per server to something slightly lower than 500. In terms of syntax, I have something similar to this in the backend section of haproxy.cfg:


server server1 10.1.1.1:80 check maxconn 500

I also have the following 2 lines in the backend section:


option abortonclose
option httpclose

According to the official HAProxy documentation, here's what these options do:

option abortonclose

In presence of very high loads, the servers will take some time to respond.
The per-instance connection queue will inflate, and the response time will
increase respective to the size of the queue times the average per-session
response time. When clients will wait for more than a few seconds, they will
often hit the "STOP" button on their browser, leaving a useless request in
the queue, and slowing down other users, and the servers as well, because the
request will eventually be served, then aborted at the first error
encountered while delivering the response.

As there is no way to distinguish between a full STOP and a simple output
close on the client side, HTTP agents should be conservative and consider
that the client might only have closed its output channel while waiting for
the response. However, this introduces risks of congestion when lots of users
do the same, and is completely useless nowadays because probably no client at
all will close the session while waiting for the response. Some HTTP agents
support this behaviour (Squid, Apache, HAProxy), and others do not (TUX, most
hardware-based load balancers). So the probability for a closed input channel
to represent a user hitting the "STOP" button is close to 100%, and the risk
of being the single component to break rare but valid traffic is extremely
low, which adds to the temptation to be able to abort a session early while
still not served and not pollute the servers.

In HAProxy, the user can choose the desired behaviour using the option
"abortonclose". By default (without the option) the behaviour is HTTP
compliant and aborted requests will be served. But when the option is
specified, a session with an incoming channel closed will be aborted while
it is still possible, either pending in the queue for a connection slot, or
during the connection establishment if the server has not yet acknowledged
the connection request. This considerably reduces the queue size and the load
on saturated servers when users are tempted to click on STOP, which in turn
reduces the response time for other users.

option httpclose

As stated in section 2.1, HAProxy does not yes support the HTTP keep-alive
mode. So by default, if a client communicates with a server in this mode, it
will only analyze, log, and process the first request of each connection. To
workaround this limitation, it is possible to specify "option httpclose". It
will check if a "Connection: close" header is already set in each direction,
and will add one if missing. Each end should react to this by actively
closing the TCP connection after each transfer, thus resulting in a switch to
the HTTP close mode. Any "Connection" header different from "close" will also
be removed.

It seldom happens that some servers incorrectly ignore this header and do not
close the connection eventough they reply "Connection: close". For this
reason, they are not compatible with older HTTP 1.0 browsers. If this
happens it is possible to use the "option forceclose" which actively closes
the request connection once the server responds.

And now for something completely different.....TCP stack tuning! Even with all the tuning above, we were still seeing occasional high latency numbers. Willy Tarreau to the rescue again....he was kind enough to troubleshoot things by means of the haproxy log and a tcpdump. It turned out that some of the TCP/IP-related OS variables were set too low. You can find out what those values are by running:


sysctl -a | grep ^net

In our case, the main one that was out of tune was:


net.ipv4.tcp_max_syn_backlog = 1024

Because of this, when there were more than 1,024 concurrent sessions on the machine running HAProxy, the OS had to recycle through the SYN backlog, causing the latency issues. Here are all the variables we set in /etc/sysctl.conf at the advice of Willy:

net.ipv4.tcp_tw_reuse = 1
net.ipv4.ip_local_port_range = 1024 65023
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_max_orphans = 60000
net.ipv4.tcp_synack_retries = 3
net.core.somaxconn = 10000

(to have these values take effect, you need to run 'sysctl -p')

That's it for now. As I continue to use HAProxy in production, I'll report back with other tips/tricks/suggestions.

I know the title of this post doesn't make much sense, I wrote it that way so that people who run into issues similar to mine will have an easier time finding it.

Here's a mysterious issue that I recently solved with the help of my colleague Chris Nutting:

1) Apache/PHP server sitting behind an HAProxy instance
2) MaxMind's GeoIP module installed in Apache
3) Application making use of the geotargeting features offered by the GeoIP module was sometimes displaying those features in a drop-down, and sometimes not

It turns out that the application was using the X-Forwarded-For headers in the HTTP requests to pass the real source IP of the request to the mod_geoip module and thus obtain geotargeting information about that IP. However, mysteriously, HAProxy was sometimes (once out of every N requests) not sending the X-Forwarded-For headers at all. Why? Because KeepAlive was enabled in Apache, so HAProxy was sending those headers only on the first request of the HTTP connection that was being "kept alive". Subsequent requests in that connection didn't have those headers set, so those requests weren't identified properly by mod_geoip.

The solution in this case was to disable KeepAlive in Apache. Willy Tarreau, the author of HAProxy, also recommends setting 'option httpclose' in the HAProxy configuration file. Here's an excerpt from the official HAProxy documentation:

option forwardfor [ except  ] [ header  ]
....
It is important to note that as long as HAProxy does not support keep-alive
connections, only the first request of a connection will receive the header.
For this reason, it is important to ensure that "option httpclose" is set
when using this option.

I hope this post will be of some use to people who might run into this issue.

Agile Testing

Tuesday, March 17, 2009

HAProxy and Apache performance tuning tips

Wednesday, March 04, 2009

HAProxy, X-Forwarded-For, GeoIP, KeepAlive

Modifying EC2 security groups via AWS Lambda functions

Followers