swift proxy-server stops accepting connections

Bug #2058945 reported by Matthew Vernon
This bug affects 1 person
Affects                            Status  Importance  Assigned to  Milestone
OpenStack Object Storage (swift)   New     Undecided   Unassigned

Bug Description

We're running swift version 2.26.0-10+deb11u1+wmf1, but have been seeing this issue for some time (including older versions of swift).

Briefly: after an extended period of operation, we sometimes see our swift proxy-servers stop accepting connections, leading to connection timeouts for the client. Once a proxy-server has got into this state, the only way to recover it is to restart the proxy-server processes; unloading it is not sufficient.

We use envoy for TLS termination (and have previously used nginx, where we saw the same issue), but we can demonstrate that the TLS terminator is not at fault: connecting to port 80 directly (e.g. with curl) also fails with a Connection timed out error:

mvernon@ms-fe2010:~$ curl -o /tmp/foo -v -H "Host: upload.wikimedia.org" http://$(hostname -f)/wikipedia/commons/thumb/1/1d/1-month-old_kittens_32.jpg/800px-1-month-old_kittens_32.jpg
* Uses proxy env variable no_proxy == 'wikipedia.org,wikimedia.org,wikibooks.org,wikinews.org,wikiquote.org,wikisource.org,wikiversity.org,wikivoyage.org,wikidata.org,wikiworkshop.org,wikifunctions.org,wiktionary.org,mediawiki.org,wmfusercontent.org,w.wiki,wmnet,127.0.0.1,::1'
  % Total % Received % Xferd Average Speed Time Time Time Current
                                 Dload Upload Total Spent Left Speed
  0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Trying 10.192.16.76:80...
  0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0* connect to 10.192.16.76 port 80 failed: Connection timed out
* Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out
  0 0 0 0 0 0 0 0 --:--:-- 0:00:07 --:--:-- 0
* Closing connection 0
curl: (28) Failed to connect to ms-fe2010.codfw.wmnet port 80: Connection timed out

[Note that here the client and server are on the same node.]
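
The direct check above can also be scripted, which may help when polling several proxy hosts. The following is a minimal Python sketch under stated assumptions: the function name probe, the 5-second default timeout, and the returned strings are hypothetical and not taken from any existing tool. It only distinguishes a completed TCP handshake from a refusal or a timeout; it sends no HTTP request.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try to open a TCP connection and classify the outcome.

    Returns "connected", "refused", "timed out", or "error: ...".
    The name and return values are illustrative assumptions.
    """
    try:
        # create_connection performs the TCP handshake only.
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except (socket.timeout, TimeoutError):
        # No reply to the connection request within the timeout.
        return "timed out"
    except ConnectionRefusedError:
        # Nothing is listening on that port.
        return "refused"
    except OSError as exc:
        return f"error: {exc}"


if __name__ == "__main__":
    # Probe port 80 on the local node, as in the curl test above.
    print(probe(socket.getfqdn(), 80))
```

A result of "timed out" (rather than "refused") would match the curl output: something is listening, but the connection request goes unanswered.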

If we repeat this test from a separate client machine and run tcpdump, we can see the connection requests from the client to port 80, but no response.

lsof -i shows that there are swift-proxy processes (and nothing else) with sockets in state LISTEN on port 80; running strace on all of them while attempting a connection shows that none of the proxy-servers calls `accept`, hence the Connection timed out errors.
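
Since no process calls `accept`, the accept queue of the listening socket may be worth recording on a server in the failed state. What follows is a minimal Python sketch, assuming Linux and assuming that the rx_queue column of /proc/net/tcp holds the current accept-queue length for sockets in LISTEN state (believed to match what `ss -ltn` shows as Recv-Q, but this should be checked against the kernel in use). The function name listen_queues and the dictionary keys are hypothetical.

```python
def listen_queues(proc_text, port):
    """Return LISTEN sockets on `port` found in /proc/net/tcp-style text.

    Assumption: for LISTEN sockets, rx_queue is the accept-queue length.
    """
    rows = []
    for line in proc_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 10:
            continue
        local, state = fields[1], fields[3]
        if state != "0A":  # 0A is the LISTEN state
            continue
        _addr_hex, port_hex = local.rsplit(":", 1)
        if int(port_hex, 16) != port:
            continue
        _tx_hex, rx_hex = fields[4].split(":")
        rows.append({
            "local": local,
            "accept_queue": int(rx_hex, 16),
            "inode": int(fields[9]),
        })
    return rows


if __name__ == "__main__":
    for path in ("/proc/net/tcp", "/proc/net/tcp6"):
        try:
            with open(path) as handle:
                text = handle.read()
        except FileNotFoundError:
            continue
        for row in listen_queues(text, 80):
            print(path, row)
```

The inode value can be matched against the socket entries under /proc/<pid>/fd to see which processes hold the listening socket.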

Some more of our investigations are available here: https://phabricator.wikimedia.org/T360913

If there are other things we could usefully do to debug this, do let me know (but they would need to be things that can be done to a server in the failed state, since a restart will clear the failures).
