openstack instances on arm64 lock up on polling and other situations fairly repeatably

Bug #1506543 reported by Nick Moffitt
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

While working on deploying arm64 builders in openstack, we found that it was pretty easy to wedge them. Many hang just running ntpdate right off the bat.

William Grant ran more direct tests, but upgrading the kernel from 3.13 to 3.19 didn't seem to fix things.

Nick Moffitt (nick-moffitt) wrote :
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1506543

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: kernel-da-key
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: trusty vivid
Chris J Arges (arges) wrote :

A few questions:
1) Just to check, you are emulating arm64 on amd64, correct?
2) Can you gather logs from the hung arm64 instances? A crashdump, the output of tail -f /var/log/kern.log up to the hang, etc.
3) What kernel version are you running on the arm64 instances?

tags: added: kernel-key
removed: kernel-da-key
William Grant (wgrant) wrote :

The only scenario we've thoroughly tested is a 3.13 guest on a 3.19 host on mcdivitts (Moonshot X-Genes). We haven't tested 3.19 in the guest enough to rule it out, and we need at least 3.19 on the host for guest UEFI support.

There's normally nothing in dmesg during the hang, though I did see this once: "Oct 15 09:59:07 dogfood-bos01-arm64-003 kernel: [ 3840.420637] INFO: task tcpdump:2023 blocked for more than 120 seconds."
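
Since the hung-task message shows up so rarely, it may help to scan collected kern.log files for it automatically. What follows is only a rough sketch: the regular expression is derived from the single line quoted above, and the names HUNG_TASK_RE and find_hung_tasks are made up for illustration.

```python
import re

# Pattern inferred from the one kern.log line quoted in this comment;
# other kernel versions may word the message differently.
HUNG_TASK_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(lines):
    """Return one dict per hung-task report found in an iterable of log lines."""
    hits = []
    for line in lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            hits.append({
                "uptime": float(match.group("uptime")),
                "comm": match.group("comm"),
                "pid": int(match.group("pid")),
                "seconds": int(match.group("secs")),
            })
    return hits
```

It can be fed an open file object, for example the result of open("/var/log/kern.log"), since it only iterates over lines.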

I've never seen an instance totally hang, but our buildds more often than not get stuck in ntpdate:

socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 3
setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl(3, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDP) = 4
setsockopt(4, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(4, SOL_IPV6, IPV6_V6ONLY, [1], 4) = 0
fcntl(4, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
rt_sigaction(SIGALRM, {0x557a3ad8d8, [], 0}, {SIG_DFL, [], 0}, 8) = 0
setitimer(ITIMER_REAL, {it_interval={0, 200000}, it_value={0, 100000}}, NULL) = 0
setpriority(PRIO_PROCESS, 0, 4294967284) = -1 EACCES (Permission denied)
ppoll([{fd=3, events=POLLIN}, {fd=4, events=POLLIN}], 2, {60, 0}, NULL, 0

The ppoll never returns despite having a 60s timeout set, and the 100ms timer never fires either. Other ntpdates also hang there, but general shell operations continue to work fine.
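
As a possible starting point for a smaller reproducer, here is a hedged sketch that imitates the pattern in the strace above: an ITIMER_REAL interval timer plus a timed poll on idle UDP sockets. It is not ntpdate's code. It uses Python's select.poll rather than ppoll, the function name probe_poll_timeout is invented here, and whether it actually triggers the hang on an affected guest is untested.

```python
import select
import signal
import socket
import time


def probe_poll_timeout(timeout_s=60.0, first_tick_s=0.1, interval_s=0.2):
    """Run one timed poll on idle UDP sockets; return (elapsed_seconds, timer_ticks).

    Defaults mirror the strace: 60 s poll timeout, first tick after 100 ms,
    then one tick every 200 ms. Must be called from the main thread.
    """
    ticks = 0

    def on_alarm(signum, frame):
        nonlocal ticks
        ticks += 1

    old_handler = signal.signal(signal.SIGALRM, on_alarm)
    socks = [socket.socket(socket.AF_INET, socket.SOCK_DGRAM)]
    try:
        socks.append(socket.socket(socket.AF_INET6, socket.SOCK_DGRAM))
    except OSError:
        pass  # IPv6 may be unavailable; the IPv4 socket is enough for the probe

    poller = select.poll()
    for sock in socks:
        sock.setblocking(False)
        poller.register(sock, select.POLLIN)

    signal.setitimer(signal.ITIMER_REAL, first_tick_s, interval_s)
    start = time.monotonic()
    try:
        # Nothing is ever sent to these sockets, so a healthy kernel should
        # return from poll once the timeout (in milliseconds) has elapsed.
        poller.poll(int(timeout_s * 1000))
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0, 0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)
        for sock in socks:
            sock.close()
    return time.monotonic() - start, ticks
```

On a healthy system the call should return after roughly timeout_s seconds with a nonzero tick count. On an affected guest the expectation, going by the strace, is that it would never return, so it is best run under an external watchdog such as timeout(1).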

A non-buildd instance left for 24 hours had apache2 and various other daemons all stuck in epoll and similar.

I'll try to gather more logs and devise an easier reproducer than "run a buildd".

Adam Conrad (adconrad) wrote :

(For the record, this isn't emulated on amd64, Nick's dump was clearly from the wrong machine...)

Nick Moffitt (nick-moffitt) wrote :

Crap, I tried to run it on the right box and copy it over.

William Grant (wgrant) wrote :

3.19-on-3.19 is much more stable, and we haven't seen anything like this issue.

tags: added: kernel-da-key
removed: kernel-key
Martin Pitt (pitti) wrote :

FTR, I'm also having problems with the wily arm64 kernel on multi-CPU machines. I filed that as bug 1531768 as the logs here look rather different to mine. But maybe it's related.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired