auth_token middleware leaves sockets in CLOSE_WAIT state

Bug #1475189 reported by Mark McLoughlin
Affects: python-keystoneclient
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

In a Juno deployment with 3 controllers, 96 logical CPUs on each controller, and one worker process per nova/glance/neutron API service, we hit a database connection issue ... but while debugging it, got distracted by a huge number of sockets from nova/glance/neutron to port 35357 that were in a CLOSE_WAIT state.

Versions:

  python-keystoneclient-0.11.1
  python-keystonemiddleware-1.2.0

Another aspect of our configuration is that these connections were being proxied through haproxy using tcp mode, and haproxy was closing the connection after a timeout.

Looking at the CLOSE_WAIT sockets from the logs:

$ for iii in lsof-*; do grep -rni 35357.*CLOSE_WAIT $iii | awk '{print $1}' | awk -F : '{print $2}' | sort | uniq -c | sort -n; done
      1 ceilomete
     89 nova-api
     96 glance-ap
     96 glance-re
    115 neutron-s

      2 ceilomete
     91 nova-api
     95 glance-re
     96 glance-ap
    101 neutron-s

      2 ceilomete
     92 nova-api
     95 glance-re
     96 glance-ap
     99 neutron-s

Each glance and nova worker process appears to have only one CLOSE_WAIT socket; neutron workers sometimes have 2 or 3. But we were hovering around a state where almost every worker process has a CLOSE_WAIT socket open.
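
For anyone reproducing this, a rough per-process count of CLOSE_WAIT sockets to the admin port can also be pulled with psutil. This snippet is purely illustrative: it assumes psutil is installed and that the script runs with enough privileges to see other users' sockets.

from collections import Counter

import psutil

# Count CLOSE_WAIT sockets to the keystone admin port (35357), per process name.
counts = Counter()
for conn in psutil.net_connections(kind='tcp'):
    if conn.status != psutil.CONN_CLOSE_WAIT:
        continue
    if not conn.raddr or conn.raddr[1] != 35357:
        continue
    name = psutil.Process(conn.pid).name() if conn.pid else '?'
    counts[name] += 1

for name, count in counts.most_common():
    print('%5d %s' % (count, name))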

Cedric Lecomte was able to reproduce the problem with a devstack stable/juno on Ubuntu 14.04, using stack.sh from scratch without any particular configuration (i.e. not using haproxy).

After using OpenStack for a little while, CLOSE_WAIT sockets start to appear:

ubuntu@keystone-close-wait:~/devstack$ lsof | grep CLOSE_WAIT
glance-re 23827 ubuntu 5u IPv4 76475 0t0 TCP 10.0.0.248:35927->IP-198-154-188-219.static.fibrenoire.ca:35357 (CLOSE_WAIT)
glance-re 23828 ubuntu 5u IPv4 79207 0t0 TCP 10.0.0.248:35936->IP-198-154-188-219.static.fibrenoire.ca:35357 (CLOSE_WAIT)
glance-re 23829 ubuntu 5u IPv4 75287 0t0 TCP 10.0.0.248:35922->IP-198-154-188-219.static.fibrenoire.ca:35357 (CLOSE_WAIT)
....

The number of workers for glance-registry was changed to 1, to make the process easier to strace.

This is an annotated strace of the socket in question:

/*** Socket Creation ***/
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0

/*** Connect to keystone admin API ***/

connect(6, {sa_family=AF_INET, sin_port=htons(35357), sin_addr=inet_addr("198.154.188.219")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=6, events=POLLOUT|POLLERR|POLLHUP}], 2, 60000) = 1 ([{fd=6, revents=POLLOUT}])
getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(35357), sin_addr=inet_addr("198.154.188.219")}, 16) = 0

/*** Get a token (POST /v2.0/tokens) ***/
sendto(6, "POST /v2.0/tokens HTTP/1.1\r\nHost"..., 323, 0, NULL, 0) = 323
recvfrom(6, 0x19e0a74, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=6, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 2, 60000) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "HTTP/1.1 200 OK\r\nDate: Tue, 14 J"..., 8192, 0, NULL, NULL) = 4141

/*** Verify Token ***/
poll([{fd=6, events=POLLIN}], 1, 0) = 0 (Timeout)
sendto(6, "GET /v3/auth/tokens HTTP/1.1\r\nHo"..., 263, 0, NULL, 0) = 263
recvfrom(6, 0x19e0a74, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=6, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 2, 60000) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "HTTP/1.1 200 OK\r\nDate: Tue, 14 J"..., 8192, 0, NULL, NULL) = 7092

/*** The socket stays idle in this state for a while; Apache+Keystone send a FIN, so the socket enters CLOSE_WAIT ***/

/*** New request ***/

/*** Poll before reuse: the fd is reported readable (= 1); on an idle connection that readability is the pending EOF ***/
poll([{fd=6, events=POLLIN}], 1, 0) = 1 ([{fd=6, revents=POLLIN}])

/*** ... so the CLOSE_WAIT socket is closed ***/
close(6) = 0

/*** Start reopening a socket ***/
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 6
fcntl(6, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(6, F_SETFL, O_RDWR|O_NONBLOCK) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(35357), sin_addr=inet_addr("198.154.188.219")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=6, events=POLLOUT|POLLERR|POLLHUP}], 2, 60000) = 1 ([{fd=6, revents=POLLOUT}])
getsockopt(6, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
connect(6, {sa_family=AF_INET, sin_port=htons(35357), sin_addr=inet_addr("198.154.188.219")}, 16) = 0
sendto(6, "GET /v3/auth/tokens HTTP/1.1\r\nHo"..., 263, 0, NULL, 0) = 263

recvfrom(6, 0x1825d34, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}, {fd=6, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 2, 60000) = 1 ([{fd=6, revents=POLLIN}])
recvfrom(6, "HTTP/1.1 200 OK\r\nDate: Tue, 14 J"..., 8192, 0, NULL, NULL) = 7092

/*** The new socket is now left idle in turn, and ends up in CLOSE_WAIT the same way ***/

The connection is reused, but not in an efficient way: the stale socket is closed and a new one is opened. As a result, the maximum number of CLOSE_WAIT sockets equals the number of workers for each service that uses keystonemiddleware.auth_token.
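
For illustration, the close-and-reopen seen in the strace is the effect of the kind of liveness check a connection pool performs when handing out an idle connection. The sketch below is a made-up minimal version of that check, not the actual requests/urllib3 code; all names are hypothetical.

import select


def connection_is_dropped(sock):
    # An idle HTTP connection should have nothing to read.  If select()
    # reports it readable, what is pending is either stray data or the
    # peer's EOF (its FIN), so the connection cannot safely be reused.
    readable, _, _ = select.select([sock], [], [], 0.0)
    return bool(readable)


def checkout(pool, create_connection):
    # Illustrative pool checkout: drop dead sockets, reopen as needed.
    while pool:
        sock = pool.pop()
        if not connection_is_dropped(sock):
            return sock
        # The peer sent a FIN while the socket sat idle (CLOSE_WAIT);
        # closing it here is the close(6)/socket() pair seen in the strace.
        sock.close()
    return create_connection()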

This doesn't seem to be causing any nasty side-effects, but it certainly results in a messy state that can appear problematic to operators attempting to diagnose an issue.

Note this requests issue on CLOSE_WAIT sockets: https://github.com/kennethreitz/requests/issues/1973

Fundamentally, an application with a socket that's held open for a long period should poll() on that socket even when it's not expecting data, and when it discovers (because read() returns zero) the other side has closed the connection (sending a FIN), it should close the socket.
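
A minimal sketch of that pattern for a plain blocking socket (the function name and structure are illustrative, not taken from any particular library):

import select
import socket


def reap_if_peer_closed(sock):
    # A zero-timeout poll() tells us whether the socket is readable.  On
    # an idle connection that should only happen when the peer has closed
    # its end, in which case a peek at the data returns b'' (EOF).
    poller = select.poll()
    poller.register(sock, select.POLLIN)
    if not poller.poll(0):
        return False  # nothing pending; the connection is still usable
    if sock.recv(1, socket.MSG_PEEK) == b'':
        sock.close()  # peer sent a FIN; closing releases the CLOSE_WAIT socket
        return True
    return False      # unexpected data arrived; let the caller decide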

The behavior we're seeing is that the socket gets returned to the requests connection pool and is then completely ignored until we need to make another request, at which point it is closed and a new socket is opened; during that whole time the socket sits in CLOSE_WAIT.

This is probably non-trivial to fix, given that the connection pool is in a library and our event loop is eventlet - I could imagine, for example, that nova-api would spawn off a greenthread to poll() or select() on the sockets in the connection pool but we'd need a way to have the greenthread release a socket when we need to pull it from the pool. Nasty.
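
Very roughly, the watchdog greenthread imagined above might look like the sketch below. This is purely illustrative: the pool is assumed to be a plain set of idle sockets, the lock is how the hard coordination problem gets hand-waved, and none of these names come from nova or keystonemiddleware.

import select

import eventlet


def reap_half_closed(pool, lock, interval=30):
    # Periodically drop idle sockets whose peer has already sent a FIN,
    # so they do not sit in CLOSE_WAIT until the next request needs them.
    while True:
        with lock:  # the same lock the pool checkout code would hold
            idle = list(pool)
            if idle:
                readable, _, _ = select.select(idle, [], [], 0)
                for sock in readable:
                    # Readable while idle almost always means EOF from the peer.
                    pool.discard(sock)
                    sock.close()
        eventlet.sleep(interval)


# Hypothetical usage:
#   eventlet.spawn_n(reap_half_closed, idle_sockets, eventlet.semaphore.Semaphore())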

Revision history for this message
Steve Martinelli (stevemar) wrote :

this seems like it's out of the hands of keystone, we've had a lot of major changes since juno and it looks like the requests issue is closed. are you still hitting this problem?

Changed in python-keystoneclient:
status: New → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for python-keystoneclient because there has been no activity for 60 days.]

Changed in python-keystoneclient:
status: Incomplete → Expired