swift proxy-server stops accepting connections
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Object Storage (swift) | New | Undecided | Unassigned |
Bug Description
We're running swift version 2.26.0-
Briefly: after an extended period of operation, we sometimes see our swift proxy-servers stop accepting connections, leading to connection timeouts for clients. Once a proxy-server has got into this state, the only way to recover it is to restart the proxy-server processes; removing load from it is not sufficient.
We use envoy for TLS termination (and have previously used nginx, where we saw the same issue), but we can show that the TLS terminator is not at fault by connecting to port 80 directly (e.g. with curl) and observing the Connection timed out error:
mvernon@
* Uses proxy env variable no_proxy == 'wikipedia.
% Total % Received % Xferd Average Speed Time Time Time Current
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.192.16.76:80...
0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0* connect to 10.192.16.76 port 80 failed: Connection timed out
* Failed to connect to ms-fe2010.
0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0
* Closing connection 0
curl: (28) Failed to connect to ms-fe2010.
[note here the client and server are on the same node]
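For anyone wanting to repeat the direct-connection check without curl, here is a minimal Python sketch. The function name `probe` and its return labels are hypothetical, chosen for illustration; only the standard `socket` module is used. It separates a refused connection (nothing listening) from a timeout (the SYN goes unanswered), which is the distinction that matters in this report.

```python
import errno
import socket


def probe(host, port, timeout=5.0):
    """Try one TCP connect and classify the outcome.

    Returns 'connected', 'refused', 'timeout', or 'error:<ERRNO_NAME>'.
    """
    try:
        # create_connection performs the TCP handshake; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # An RST came back: no listener on that port.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer within `timeout` seconds (or the kernel gave up).
        return "timeout"
    except OSError as exc:
        return "error:" + errno.errorcode.get(exc.errno, str(exc.errno))
```

Calling `probe("10.192.16.76", 80)` against a proxy-server in the failed state would be expected to return `'timeout'`, matching the curl output above. Note that `'connected'` only means the kernel completed the handshake; it does not prove that a process has called `accept`.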
If we repeat this test from a separate client machine and run tcpdump, we can see the connection requests from the client to port 80, but no response.
lsof -i shows that swift-proxy processes (and nothing else) have sockets in state LISTEN on port 80. Running strace on all of them while attempting a connection shows that none of the proxy-server processes calls `accept`, hence the Connection timed out errors.
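One further check that can be run on a server in the failed state is to look at the listening socket's queue counters. The sketch below parses the text of `/proc/net/tcp`; the helper name `listen_queues` is hypothetical. It assumes the usual Linux column layout (local address, remote address, state, then `tx_queue:rx_queue` in hex) and, as far as I understand the kernel's behaviour, that for a socket in LISTEN state (`0A`) the `rx_queue` value is the number of connections waiting to be accepted. Treat that interpretation as an assumption to verify, not as established fact. IPv6 listeners appear in `/proc/net/tcp6` instead.

```python
def listen_queues(proc_net_tcp_text, port):
    """Return [(local_addr_hex, rx_queue)] for LISTEN sockets on `port`.

    `proc_net_tcp_text` is the full contents of /proc/net/tcp.
    """
    results = []
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        local, state, queues = fields[1], fields[3], fields[4]
        addr_hex, port_hex = local.split(":")
        # State 0A is TCP_LISTEN; ports are stored as hex.
        if state != "0A" or int(port_hex, 16) != port:
            continue
        _tx_hex, rx_hex = queues.split(":")
        results.append((addr_hex, int(rx_hex, 16)))
    return results


# Example use on the affected host (not executed here):
#     with open("/proc/net/tcp") as f:
#         print(listen_queues(f.read(), 80))
```

If the reported value stays pinned at a high number while nothing calls `accept`, that would be consistent with the accept queue being full and new SYNs being dropped, which would also explain timeouts rather than refusals.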
Some more of our investigations are available here: https:/
If there are other things we could usefully do to debug this, do let me know (but they would need to be something that can be done to a server in the failed state, since a restart will clear the failure).