Comment 11 for bug 1908452

Revision history for this message
dann frazier (dannf) wrote :

I ran tcpdump on lo during a failed/hanging wget attempt. This showed the SYN packet coming in, but no SYN|ACK response going out. So the packet is coming in, but nothing appears to be accept()ing it.

I took a look to see what pids are listening on port 5240:
tcp6 0 0 :::5240 :::* LISTEN 1684/python3
tcp6 0 0 :::5240 :::* LISTEN 1694/python3
tcp6 0 0 :::5240 :::* LISTEN 1703/python3
tcp6 51 0 :::5240 :::* LISTEN 1680/python3

I then ran strace on those processes, and ran wget a few times until it hung:

strace of hanging wget (mainly for timestamp):
19:12:28.826032 connect(4, {sa_family=AF_INET6, sin6_port=htons(5240), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28^Cstrace: Process 609353 detached
 <detached ...>

strace of the 4 python processes (regiond):
# strace -tt -p 1684 -p 1694 -p 1703 -p1680
[pid 1703] 19:12:28.089944 <... epoll_wait resumed>[], 10, 397) = 0
[pid 1703] 19:12:28.090230 epoll_wait(5, <unfinished ...>
[pid 1694] 19:12:28.178543 <... epoll_wait resumed>[], 9, 206) = 0
[pid 1694] 19:12:28.178798 epoll_wait(5, [], 9, 0) = 0
[pid 1694] 19:12:28.179059 epoll_wait(5, <unfinished ...>
[pid 1703] 19:12:28.190505 <... epoll_wait resumed>[], 10, 100) = 0
[pid 1703] 19:12:28.190798 epoll_wait(5, [], 10, 0) = 0
[pid 1703] 19:12:28.191106 epoll_wait(5, <unfinished ...>
[pid 1684] 19:12:28.247185 <... epoll_wait resumed>[], 10, 300) = 0
[pid 1684] 19:12:28.247463 epoll_wait(5, [], 10, 0) = 0
[pid 1684] 19:12:28.247775 epoll_wait(5, <unfinished ...>
[pid 1694] 19:12:28.677734 <... epoll_wait resumed>[], 9, 498) = 0
[pid 1694] 19:12:28.677978 epoll_wait(5, [], 9, 0) = 0
[pid 1694] 19:12:28.678228 epoll_wait(5, <unfinished ...>
[pid 1703] 19:12:28.690792 <... epoll_wait resumed>[], 10, 499) = 0
[pid 1703] 19:12:28.691054 epoll_wait(5, [], 10, 0) = 0
[pid 1703] 19:12:28.691333 epoll_wait(5, <unfinished ...>
[pid 1684] 19:12:28.747457 <... epoll_wait resumed>[], 10, 499) = 0
[pid 1684] 19:12:28.747698 epoll_wait(5, [], 10, 0) = 0
[pid 1684] 19:12:28.747951 epoll_wait(5, [], 10, 31) = 0
[pid 1684] 19:12:28.779333 epoll_wait(5, <unfinished ...>
[pid 1694] 19:12:28.972708 <... epoll_wait resumed>[], 9, 294) = 0
[pid 1694] 19:12:28.973067 epoll_wait(5, <unfinished ...>
[pid 1703] 19:12:29.090921 <... epoll_wait resumed>[], 10, 399) = 0

When the wget does not hang, we expect to see epoll return an fd followed by an accept(), such as:
[pid 1684] 19:12:30.247918 epoll_wait(5, [{EPOLLIN, {u32=19, u64=140711718551571}}], 10, 499) = 1
[pid 1684] 19:12:30.319911 accept4(19, {sa_family=AF_INET6, sin6_port=htons(55268), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::ffff:127.0.0.1", &sin6_addr), sin6_scope_id=0}, [28], SOCK_CLOEXEC) = 14

But we do not see that in the above output - each process just seems to be in an epoll_wait() loop, never noticing a new connection. But that's only 3 of the processes. We see nothing from pid 1680. What's it doing?

# strace -p 1680
strace: Process 1680 attached
read(39,

Huh - this process just seems hung in a read() call. And it stays hung here for a very long time (at least an hour), even if I drop the external network interface from the container. Looking at another not-hung MAAS server of the same version, I can see that each of the 4 listening regiond processes are in epoll() loops, so this seems like an important difference. A regiond process has told the kernel it is listening on port 5240, but it isn't actually polling for new connections. So, what it is it trying to read?

# ls -l /proc/1680/fd/39
lrwx------ 1 root root 64 Jan 12 18:20 /proc/1680/fd/39 -> 'socket:[2294475183]'

# lsof -i -a -p 1680
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
python3 1680 root 14u IPv6 1987974159 0t0 TCP 10.229.32.21:5240->10.172.65.1:41332 (ESTABLISHED)
python3 1680 root 15u IPv6 1057339385 0t0 TCP *:5250 (LISTEN)
python3 1680 root 23u IPv6 1057352793 0t0 TCP *:5240 (LISTEN)
python3 1680 root 28u IPv6 1082137789 0t0 TCP 10.229.32.21:5240->10.172.69.16:58096 (ESTABLISHED)
python3 1680 root 30u IPv6 1652988559 0t0 TCP 10.229.32.21:5240->10.172.68.66:56308 (ESTABLISHED)
python3 1680 root 31u IPv6 1466877827 0t0 TCP 10.229.32.21:5240->10.172.68.66:47920 (ESTABLISHED)
python3 1680 root 35u IPv6 1937014214 0t0 TCP 10.229.32.21:5240->10.172.68.187:36276 (CLOSE_WAIT)
python3 1680 root 39u IPv4 2294475183 0t0 TCP 10.229.32.21:54640->91.189.88.136:https (ESTABLISHED)
python3 1680 root 57u IPv6 1699973134 0t0 TCP 10.229.32.21:5250->10.229.32.21:58460 (CLOSE_WAIT)

$ host 91.189.88.136
136.88.189.91.in-addr.arpa domain name pointer images-maas-io.sawo.canonical.com.

Ah.. maas images. So, is it possible that a regiond process remains bound to port 5240 while it goes off to fetch images, and does so w/o a reasonable timeout that causes it to block indefinitely? And is it the case that we therefore only have 3/4 of our regiond workers processing requests, making 1/4 of those requests timeout?