nginx incorrectly handles slow connection in deferred mode

Bug #1619687 reported by Cory Benfield on 2016-09-02
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
nginx (Ubuntu)
Undecided
Unassigned
Trusty
Undecided
Unassigned
Xenial
Undecided
Unassigned

Bug Description

Summary:

When nginx has a server that is configured in the 'deferred' listen mode and it takes the client more than approx. 1 second to send data, nginx will hang and eventually time the connection out.

Tested Versions:

  lsb_release -rd:
    Description: Ubuntu 14.04.5 LTS
    Release: 14.04

  apt-cache policy nginx
    nginx:
      Installed: 1.4.6-1ubuntu3.5
      Candidate: 1.4.6-1ubuntu3.5
      Version table:
     *** 1.4.6-1ubuntu3.5 0
            500 http://mirrors.digitalocean.com/ubuntu/ trusty-updates/main amd64 Packages
            500 http://security.ubuntu.com/ubuntu/ trusty-security/main amd64 Packages
            100 /var/lib/dpkg/status
         1.4.6-1ubuntu3 0
            500 http://mirrors.digitalocean.com/ubuntu/ trusty/main amd64 Packages

Reproduction Guidelines:

1. Install nginx on a vanilla system: apt-get update && apt-get upgrade -y && apt-get install -y nginx
2. Configure nginx with a basic SSL setup. A self-signed certificate is fine. The only divergence from the regular nginx SSL setup should be that the 'listen' line must contain the "deferred" directive.
3. Confirm nginx functions correctly using curl (`curl -k https://localhost`). This should print the HTML of the default welcome page.
4. Open a file called test.py. Copy the following script into it:

        import socket
        import ssl
        import sys
        import time

        TARGET = ('localhost', 443)

        def connect_slowly():
            print "connecting...",
            sys.stdout.flush()
            s = socket.create_connection(TARGET)
            print "connected"

            print "sleeping...",
            sys.stdout.flush()
            time.sleep(2)
            print "slept"

            print "beginning ssl handshake...",
            sys.stdout.flush()
            s = ssl.wrap_socket(s)
            print "complete"
            s.sendall(
                    b'GET / HTTP/1.1\r\n'
                    b'Host: localhost\r\n'
                    b'\r\n'
            )
            print s.recv(8192)

        connect_slowly()

5. Execute `python test.py`

Expected behaviour:

Expect the script to run fairly promptly, and to print:

    connecting... connected
    sleeping... slept
    beginning ssl handshake...complete

followed by some portion (possibly all) of the default landing page.

Actual behaviour:

The script produces the following output:

    connecting... connected
    sleeping... slept
    beginning ssl handshake...

and then hangs for approximately a minute, after which it prints the following traceback:

    Traceback (most recent call last):
      File "test.py", line 33, in <module>
        connect_slowly()
      File "test.py", line 23, in connect_slowly
        s = ssl.wrap_socket(s)
      File "/usr/lib/python2.7/ssl.py", line 487, in wrap_socket
        ciphers=ciphers)
      File "/usr/lib/python2.7/ssl.py", line 243, in __init__
        self.do_handshake()
      File "/usr/lib/python2.7/ssl.py", line 405, in do_handshake
        self._sslobj.do_handshake()
    socket.error: [Errno 104] Connection reset by peer

Explanation:

Some misbehaviour appears to be occurring in nginx around the TCP_DEFER_ACCEPT timeout and TLS handshakes. Specifically, nginx seems to "lose" the connection once the deferred timeout fires. Observing a packet capture demonstrates the problem more clearly (though note that for obvious reasons you can't do this on localhost!)

13:59:06.546468 IP 51.6.64.223.65442 > 67.205.133.224.443: Flags [S], seq 4198376207, win 65535, options [mss 1452,nop,wscale 5,nop,nop,TS val 146528037 ecr 0,sackOK,eol], length 0
13:59:06.546543 IP 67.205.133.224.443 > 51.6.64.223.65442: Flags [S.], seq 3994111434, ack 4198376208, win 28960, options [mss 1460,sackOK,TS val 729780 ecr 146528037,nop,wscale 6], length 0
13:59:06.835143 IP 51.6.64.223.65442 > 67.205.133.224.443: Flags [.], ack 3994111435, win 4140, options [nop,nop,TS val 146528326 ecr 729780], length 0
13:59:07.542989 IP 67.205.133.224.443 > 51.6.64.223.65442: Flags [S.], seq 3994111434, ack 4198376208, win 28960, options [mss 1460,sackOK,TS val 730030 ecr 146528326,nop,wscale 6], length 0
13:59:07.657420 IP 51.6.64.223.65442 > 67.205.133.224.443: Flags [.], ack 3994111435, win 4140, options [nop,nop,TS val 146529110 ecr 729780], length 0
13:59:08.943065 IP 51.6.64.223.65442 > 67.205.133.224.443: Flags [P.], seq 4198376208:4198376725, ack 3994111435, win 4140, options [nop,nop,TS val 146530274 ecr 729780], length 517
13:59:08.943087 IP 67.205.133.224.443 > 51.6.64.223.65442: Flags [.], ack 4198376725, win 470, options [nop,nop,TS val 730380 ecr 146530274], length 0
14:00:07.693962 IP 67.205.133.224.443 > 51.6.64.223.65442: Flags [R.], seq 3994111435, ack 4198376725, win 470, options [nop,nop,TS val 745067 ecr 146530274], length 0

In this capture, you can see that the TCP 3-way handshake is completed almost immediately: within 300ms. The client then pauses for 2 seconds. 1 second after the original SYN packet was received, the SYN+ACK from the server is retransmitted: this is the characteristic sign of the timeout from TCP_DEFER_ACCEPT firing. The client retransmits the ACK from packet 3. Two seconds after the original client ACK was sent in packet 3, the client sends the SSL Client Hello in a PSH+ACK segment that's 517 bytes long. This is ACK'd by the server (showing that it received it), but no Server Hello is ever sent back: nginx has *lost* the TLS handshake.

61 seconds after the connection was initiated, 60 seconds after the DEFER_ACCEPT timer fired, nginx tears the connection down as idle. This causes our Python traceback.

It's not yet clear to me where in nginx this problem is occurring, and I'm not enough of an expert to dive in. I also haven't checked whether this is patched in a future release: I'll investigate in Xenial shortly. Insight welcome!

Cory Benfield (lukasaoz) wrote :

I just investigated on Xenial, and this problem appears not to exist there. The behaviour is the same (SYN+ACK retransmission after TCP_DEFER_ACCEPT timeout), but this time nginx correctly continues the connection. So it seems that this was fixed somewhere between trusty and xenial.

For the fixed version:

  lsb_release -rd:
    Description: Ubuntu 16.04.1 LTS
    Release: 16.04

  apt-cache policy nginx:
    nginx:
      Installed: 1.10.0-0ubuntu0.16.04.2
      Candidate: 1.10.0-0ubuntu0.16.04.2
      Version table:
     *** 1.10.0-0ubuntu0.16.04.2 500
            500 http://mirrors.digitalocean.com/ubuntu xenial-updates/main amd64 Packages
            500 http://mirrors.digitalocean.com/ubuntu xenial-updates/main i386 Packages
            500 http://security.ubuntu.com/ubuntu xenial-security/main amd64 Packages
            500 http://security.ubuntu.com/ubuntu xenial-security/main i386 Packages
            100 /var/lib/dpkg/status
         1.9.15-0ubuntu1 500
            500 http://mirrors.digitalocean.com/ubuntu xenial/main amd64 Packages
            500 http://mirrors.digitalocean.com/ubuntu xenial/main i386 Packages

In essence, this appears to have been fixed in between 1.10.0 and 1.4.6 of nginx.

Cory Benfield (lukasaoz) wrote :

I have located the bug fix: http://hg.nginx.org/nginx/rev/5b549cc7f698. Such a fix would be extremely easy to backport if there is interest in doing so. Otherwise, the fix landed in 1.7.11, so any version later than that is safe.

Thomas Ward (teward) on 2016-09-02
Changed in nginx (Ubuntu):
status: New → Fix Released
Changed in nginx (Ubuntu Xenial):
status: New → Fix Released
Thomas Ward (teward) wrote :

Can you include a sample configuration for how you would configure this type of setup? (That way we are working with similar setups)

Cory Benfield (lukasaoz) wrote :

Sure. I would add this block to the top of the nginx default site file:

server {
        listen 443 ssl deferred;

        root /usr/share/nginx/html;
        index index.html index.htm;

        # Make site accessible from http://localhost/
        server_name localhost;

        location / {
                # First attempt to serve request as file, then
                # as directory, then fall back to displaying a 404.
                try_files $uri $uri/ =404;
                # Uncomment to enable naxsi on this location
                # include /etc/nginx/naxsi.rules
        }
}

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers