TCP_DEFER_ACCEPT causes random HTTP connection failures in load-balanced web-server farms
Binary package hint: apache2
This applies to Apache 2.1.5+.
In a web-server farm fronted by hardware load-balancers (in this case Juniper Redline, aka DX) that are configured to use TCP multiplexing (holding open and re-using HTTP connections to the web servers), there is the potential for random, unexplained and untraceable connection failures.
From the end user's perspective, the browser simply shows a blank page. This happens so randomly that any reports from end users would be dismissed as network glitches.
I've spent the last two weeks working with a large IT e-commerce retailer. Their system administrator initially came to me with the belief that something in the Linux kernel network stack was faulty. He had already done extensive diagnostic work with the Juniper support engineers, and neither had been able to pinpoint the cause of the failure.
What they knew was the persistent connection between the DXs and the web-servers would occasionally, and seemingly randomly, be RESET by the server. Some web servers in some clusters were affected; others weren't.
When I examined the tcpdump capture taken on a web server it quickly became evident that Linux was ignoring ACKs from the DX during the initial handshake, was retrying the SYN ACK the default 5 times, and then closing the half-open connection.
After a lot of work with custom-written tools that captured packets at the PF_PACKET level (libpcap) and checked that they were seen by the netfilter/iptables layer, we decided to hack a custom kernel. I added printk() statements into net/ipv4/
As a result we discovered that handshakes were having their ACK from the client (in this case the Juniper DX) discarded because the listening socket had the TCP_DEFER_ACCEPT option set (the BSD equivalent is SO_ACCEPTFILTER).
The server's SYN_RECEIVED timer would time out, and the server would resend the SYN ACK. The DX would reply with a duplicate ACK, which would again be discarded.
The SYN ACK would be retransmitted 5 times (the default retry count). The time-out doubled on each attempt: 3, 6, 12, 24, 48 and 96 seconds respectively, about 190 seconds in total. If a request arrived from the DX *after* this, the DX received a RST from the server because the socket had been closed due to the handshake failure. This caused the end-user client to see a 'white page' (empty response).
If a request arrived from the DX *before* the retries and time-out expired it would cause the connection to be ESTABLISHED and the request would be handled.
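The timings above can be checked with a few lines of Python. This is only a model of the behaviour described in this report, not kernel code: the 3-second initial time-out and the 5 retries are the figures quoted above, and the function names are made up for illustration.

```python
# Model of the SYN ACK retry timeline described in this report.
# Not kernel code; names and defaults are illustrative assumptions.

INITIAL_TIMEOUT_S = 3  # first SYN_RECEIVED time-out, as observed above
SYNACK_RETRIES = 5     # default number of SYN ACK retransmissions


def synack_timeouts(initial=INITIAL_TIMEOUT_S, retries=SYNACK_RETRIES):
    """Return every wait in seconds: the initial time-out plus one per
    retry, doubling each time."""
    return [initial * 2 ** i for i in range(retries + 1)]


def handshake_window(initial=INITIAL_TIMEOUT_S, retries=SYNACK_RETRIES):
    """Total seconds before the server gives up on the half-open socket."""
    return sum(synack_timeouts(initial, retries))


def outcome(first_request_at_s):
    """What the DX sees if its first request arrives this many seconds
    after the handshake started (approximate model)."""
    if first_request_at_s < handshake_window():
        return "ESTABLISHED"
    return "RST"


if __name__ == "__main__":
    print(synack_timeouts())
    print(handshake_window())
    print(outcome(30), outcome(300))
```

With the defaults the waits sum to 189 seconds, which is the "~190 seconds" quoted above; a request arriving inside that window completes the handshake, and one arriving later is answered with a RST.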
The reason for the failures is that the Juniper DX maintains a group (by default 6) of persistent connections to each target host in a cluster of servers. It creates these persistent connections *before* it has HTTP requests for the target server. If the server is using Deferred Accept (TCP_DEFER_ACCEPT) on its listening sockets, the connection will not be promoted to ESTABLISHED until data is received.
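For anyone wanting to reproduce this outside Apache, the option is easy to set on a plain listening socket. The following is a minimal sketch, not how Apache does it internally: the helper name make_listener and its parameters are made up for illustration, and TCP_DEFER_ACCEPT is only exposed by Python's socket module on Linux, so the code skips it elsewhere.

```python
import socket


def make_listener(host="127.0.0.1", port=0, defer_seconds=None):
    """Create a listening TCP socket, optionally with TCP_DEFER_ACCEPT.

    defer_seconds=None leaves deferred accept off. The helper name and
    arguments are illustrative, not taken from Apache.
    """
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # TCP_DEFER_ACCEPT is Linux-specific; getattr avoids an
    # AttributeError on platforms that do not define it.
    opt = getattr(socket, "TCP_DEFER_ACCEPT", None)
    if defer_seconds is not None and opt is not None:
        srv.setsockopt(socket.IPPROTO_TCP, opt, defer_seconds)
    srv.bind((host, port))
    srv.listen(16)
    return srv


if __name__ == "__main__":
    listener = make_listener(defer_seconds=30)
    print("listening on port", listener.getsockname()[1])
    listener.close()
```

Connecting to such a listener without sending any data, while watching with tcpdump, should be a reasonable way to observe how a given kernel treats the idle handshake.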
It turns out that Apache made TCP_DEFER_ACCEPT the *default* for its listening sockets in version 2.1.5. No explicit
AcceptFilter http data
rule is needed in the Apache configuration files to enable it. In fact, it takes
AcceptFilter http none
to disable TCP_DEFER_ACCEPT on its sockets.
Because the Juniper DX OS up to at least version 5.2.6 doesn't correctly implement the HTTP protocol when using persistent connections, the interaction between Apache 2.1.5+ and the DX persistent connections brings about this issue when *traffic is light* - it won't happen if the workload is medium or heavy.
The root cause of the failure, exacerbated by the change in Apache 2.1.5+ to using TCP_DEFER_ACCEPT, is that the Juniper DX OS opens a connection to the HTTP server *but doesn't send a request*.
Unlike protocols such as telnet, HTTP expects a connection to be accompanied by a request, so that data follows the handshake. RFC 2616 (HTTP/1.1) section 1.4 states:
"...a connection may be used for one or more request/response exchanges..."
The Juniper DX, however, creates a connection and in low-traffic situations doesn't send "one or more request[s]...", causing the Linux kernel network stack to time out the socket.
The work-around is to disable TCP_DEFER_ACCEPT when deploying Apache 2.1.5+ behind load-balancing systems such as the Juniper Redline / DX by adding to the Apache configuration:
AcceptFilter http none
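To audit a number of servers for this work-around, a small script can check whether a configuration file contains the directive. This is a rough sketch under stated assumptions: the function names are made up, it only looks at the text it is given (it does not follow Include directives or inspect the running server), and it treats the last matching directive as the effective one.

```python
import re

# Matches lines such as "AcceptFilter http none". Commented-out lines
# start with "#", so the anchored pattern does not match them.
_DIRECTIVE = re.compile(r"^\s*AcceptFilter\s+(\S+)\s+(\S+)\s*$", re.IGNORECASE)


def accept_filters(config_text):
    """Map protocol -> filter for each AcceptFilter line; later lines win."""
    found = {}
    for line in config_text.splitlines():
        match = _DIRECTIVE.match(line)
        if match:
            found[match.group(1).lower()] = match.group(2).lower()
    return found


def defer_accept_disabled(config_text, protocol="http"):
    """True only if the text explicitly sets 'AcceptFilter <protocol> none'."""
    return accept_filters(config_text).get(protocol) == "none"


if __name__ == "__main__":
    sample = "Listen 80\nAcceptFilter http none\n"
    print(defer_accept_disabled(sample))
```

A result of False covers both an explicit "AcceptFilter http data" and a missing directive, since in either case deferred accept remains in effect by default as described above.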
|Changed in apache2:|
|importance:||Undecided → Low|
|Changed in apache2:|
|assignee:||intuitivenipple → nobody|