I ran tcpdump on lo during a failed/hanging wget attempt. This showed the SYN packet coming in, but no SYN|ACK response going out. So the packet is coming in, but nothing appears to be accept()ing it. I took a look to see what pids are listening on port 5240: tcp6 0 0 :::5240 :::* LISTEN 1684/python3 tcp6 0 0 :::5240 :::* LISTEN 1694/python3 tcp6 0 0 :::5240 :::* LISTEN 1703/python3 tcp6 51 0 :::5240 :::* LISTEN 1680/python3 I then ran strace on those processes, and ran wget a few times until it hung: strace of hanging wget (mainly for timestamp): 19:12:28.826032 connect(4, {sa_family=AF_INET6, sin6_port=htons(5240), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28^Cstrace: Process 609353 detached strace of the 4 python processes (regiond): # strace -tt -p 1684 -p 1694 -p 1703 -p1680 [pid 1703] 19:12:28.089944 <... epoll_wait resumed>[], 10, 397) = 0 [pid 1703] 19:12:28.090230 epoll_wait(5, [pid 1694] 19:12:28.178543 <... epoll_wait resumed>[], 9, 206) = 0 [pid 1694] 19:12:28.178798 epoll_wait(5, [], 9, 0) = 0 [pid 1694] 19:12:28.179059 epoll_wait(5, [pid 1703] 19:12:28.190505 <... epoll_wait resumed>[], 10, 100) = 0 [pid 1703] 19:12:28.190798 epoll_wait(5, [], 10, 0) = 0 [pid 1703] 19:12:28.191106 epoll_wait(5, [pid 1684] 19:12:28.247185 <... epoll_wait resumed>[], 10, 300) = 0 [pid 1684] 19:12:28.247463 epoll_wait(5, [], 10, 0) = 0 [pid 1684] 19:12:28.247775 epoll_wait(5, [pid 1694] 19:12:28.677734 <... epoll_wait resumed>[], 9, 498) = 0 [pid 1694] 19:12:28.677978 epoll_wait(5, [], 9, 0) = 0 [pid 1694] 19:12:28.678228 epoll_wait(5, [pid 1703] 19:12:28.690792 <... epoll_wait resumed>[], 10, 499) = 0 [pid 1703] 19:12:28.691054 epoll_wait(5, [], 10, 0) = 0 [pid 1703] 19:12:28.691333 epoll_wait(5, [pid 1684] 19:12:28.747457 <... epoll_wait resumed>[], 10, 499) = 0 [pid 1684] 19:12:28.747698 epoll_wait(5, [], 10, 0) = 0 [pid 1684] 19:12:28.747951 epoll_wait(5, [], 10, 31) = 0 [pid 1684] 19:12:28.779333 epoll_wait(5, [pid 1694] 19:12:28.972708 <... epoll_wait resumed>[], 9, 294) = 0 [pid 1694] 19:12:28.973067 epoll_wait(5, [pid 1703] 19:12:29.090921 <... epoll_wait resumed>[], 10, 399) = 0 When the wget does not hang, we expect to see epoll return an fd followed by an accept(), such as: [pid 1684] 19:12:30.247918 epoll_wait(5, [{EPOLLIN, {u32=19, u64=140711718551571}}], 10, 499) = 1 [pid 1684] 19:12:30.319911 accept4(19, {sa_family=AF_INET6, sin6_port=htons(55268), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [28], SOCK_CLOEXEC) = 14 But we do not see that in the above output - each process just seems to be in an epoll_wait() loop, never noticing a new connection. But that's only 3 of the processes. We see nothing from pid 1680. What's it doing? # strace -p 1680 strace: Process 1680 attached read(39, Huh - this process just seems hung in a read() call. And it stays hung here for a very long time (at least an hour), even if I drop the external network interface from the container. Looking at another not-hung MAAS server of the same version, I can see that each of the 4 listening regiond processes are in epoll() loops, so this seems like an important difference. A regiond process has told the kernel it is listening on port 5240, but it isn't actually polling for new connections. So, what it is it trying to read? # ls -l /proc/1680/fd/39 lrwx------ 1 root root 64 Jan 12 18:20 /proc/1680/fd/39 -> 'socket:[2294475183]' # lsof -i -a -p 1680 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME python3 1680 root 14u IPv6 1987974159 0t0 TCP 10.229.32.21:5240->10.172.65.1:41332 (ESTABLISHED) python3 1680 root 15u IPv6 1057339385 0t0 TCP *:5250 (LISTEN) python3 1680 root 23u IPv6 1057352793 0t0 TCP *:5240 (LISTEN) python3 1680 root 28u IPv6 1082137789 0t0 TCP 10.229.32.21:5240->10.172.69.16:58096 (ESTABLISHED) python3 1680 root 30u IPv6 1652988559 0t0 TCP 10.229.32.21:5240->10.172.68.66:56308 (ESTABLISHED) python3 1680 root 31u IPv6 1466877827 0t0 TCP 10.229.32.21:5240->10.172.68.66:47920 (ESTABLISHED) python3 1680 root 35u IPv6 1937014214 0t0 TCP 10.229.32.21:5240->10.172.68.187:36276 (CLOSE_WAIT) python3 1680 root 39u IPv4 2294475183 0t0 TCP 10.229.32.21:54640->91.189.88.136:https (ESTABLISHED) python3 1680 root 57u IPv6 1699973134 0t0 TCP 10.229.32.21:5250->10.229.32.21:58460 (CLOSE_WAIT) $ host 91.189.88.136 136.88.189.91.in-addr.arpa domain name pointer images-maas-io.sawo.canonical.com. Ah.. maas images. So, is it possible that a regiond process remains bound to port 5240 while it goes off to fetch images, and does so w/o a reasonable timeout that causes it to block indefinitely? And is it the case that we therefore only have 3/4 of our regiond workers processing requests, making 1/4 of those requests timeout?