Comment 4 for bug 1506543

William Grant (wgrant) wrote :

The only scenario we've thoroughly tested is a 3.13 guest kernel on a 3.19 host kernel on mcdivitts (Moonshot X-Genes). We haven't tested a 3.19 guest kernel enough to rule it out, and we need at least 3.19 on the host for guest UEFI support.

There's normally nothing in dmesg during the hang, though on one occasion I did see "Oct 15 09:59:07 dogfood-bos01-arm64-003 kernel: [ 3840.420637] INFO: task tcpdump:2023 blocked for more than 120 seconds."

I've never seen an instance totally hang, but our buildds more often than not get stuck in ntpdate:

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDP) = 4
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
rt_sigaction(SIGALRM, {0x557a3ad8d8, [], 0}, {SIG_DFL, [], 0}, 8) = 0
setitimer(ITIMER_REAL, {it_interval={0, 200000}, it_value={0, 100000}}, NULL) = 0
setpriority(PRIO_PROCESS, 0, 4294967284) = -1 EACCES (Permission denied)
ppoll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, {60, 0}, NULL, 0

The ppoll never returns despite its 60s timeout, and the interval timer (first expiry after 100ms) never fires either. Other ntpdate processes also hang at the same point, but general shell operations continue to work fine.
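A possible starting point for a smaller reproducer is to replay just the syscall pattern from the strace above: two non-blocking UDP sockets, a SIGALRM interval timer (100ms first expiry, 200ms interval), and a long poll with nothing ever arriving. The sketch below is hypothetical and untested on the affected hosts. The function name `replay_ntpdate_wait` is made up, and it uses Python's `select.poll` (poll(2)) because the standard library has no ppoll(2) wrapper. It skips the `setpriority` call, which fails with EACCES anyway (4294967284 is strace printing -12 as an unsigned value). On a healthy kernel the alarm counter should keep rising and the poll should return empty after the timeout.

```python
import select
import signal
import socket
import time


def _udp_socket(family):
    # Mirrors socket(..., SOCK_DGRAM, IPPROTO_UDP) + SO_REUSEADDR + O_NONBLOCK.
    s = socket.socket(family, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if family == socket.AF_INET6:
        s.setsockopt(socket.IPPROTO_IPV6, socket.IPV6_V6ONLY, 1)
    s.setblocking(False)
    return s


def replay_ntpdate_wait(timeout_s=60.0):
    """Hypothetical sketch: replay ntpdate's wait and report
    (poll events, elapsed seconds, SIGALRMs delivered during the wait)."""
    alarms = 0

    def on_alarm(signum, frame):
        nonlocal alarms
        alarms += 1

    socks = [_udp_socket(socket.AF_INET)]
    try:
        socks.append(_udp_socket(socket.AF_INET6))
    except OSError:
        pass  # host without IPv6: carry on with the IPv4 socket only

    poller = select.poll()
    for s in socks:
        poller.register(s, select.POLLIN)

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    # First alarm after 100 ms, then every 200 ms, as in the setitimer call.
    signal.setitimer(signal.ITIMER_REAL, 0.1, 0.2)
    start = time.monotonic()
    try:
        # Nothing is sent to these sockets, so a healthy kernel returns []
        # after timeout_s. Python retries poll after each SIGALRM (PEP 475)
        # with the remaining timeout, running on_alarm in between.
        events = poller.poll(int(timeout_s * 1000))
    finally:
        elapsed = time.monotonic() - start
        signal.setitimer(signal.ITIMER_REAL, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
        for s in socks:
            s.close()
    return events, elapsed, alarms
```

Calling it repeatedly with a short timeout, e.g. `replay_ntpdate_wait(1.0)` in a loop that prints the elapsed time and alarm count, should make a stuck iteration obvious: on an affected guest the call would be expected to stop returning, with the counter frozen.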

A non-buildd instance left for 24 hours had apache2 and various other daemons all stuck in epoll and similar waiting calls.

I'll try to gather more logs and devise an easier reproducer than "run a buildd".