qemu-nbd on ARM64 deadlock? Stuck in rt_sigtimedwait([BUS ALRM IO], ..) and futex(0x7f749ec230, FUTEX_WAIT, ...)

Bug #1512185 reported by Haw Loeung
This bug affects 3 people
Affects                        Status   Importance  Assigned to  Milestone
linux (Ubuntu)                 Expired  Medium      Unassigned
linux-meta-lts-vivid (Ubuntu)  Invalid  Undecided   Unassigned
qemu (Ubuntu)                  Expired  Medium      Unassigned

Bug Description

Hi,

We often see qemu-nbd processes lock up on our HP Moonshot ARM64 nova-compute nodes. At the same time, the kernel log fills with messages like the following:

| [605282.018238] block nbd3: Attempted send on closed socket
| [605282.018242] block nbd3: Attempted send on closed socket
| [605282.018245] block nbd3: Attempted send on closed socket
| [605282.018249] block nbd3: Attempted send on closed socket
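
To gauge how widespread the kernel messages are across devices, the log can be tallied per nbd device. The following is a minimal sketch, not part of the original report: the function name `count_closed_socket_errors` is hypothetical, and it assumes the message format shown above.

```python
import re
from collections import Counter

# Matches kernel log lines such as:
#   [605282.018238] block nbd3: Attempted send on closed socket
NBD_ERR = re.compile(
    r"\[\s*\d+\.\d+\]\s+block (?P<dev>nbd\d+): Attempted send on closed socket"
)

def count_closed_socket_errors(dmesg_text):
    """Return a Counter mapping nbd device name to number of matching lines."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = NBD_ERR.search(line)
        if match:
            counts[match.group("dev")] += 1
    return counts
```

Feeding it the captured output of dmesg gives one count per affected device, which makes it easier to see whether the messages follow a single stuck qemu-nbd or several.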

swirlix01:

| hloeung@swirlix01:~$ uname -a
| Linux swirlix01 3.19.0-30-generic #34~14.04.1-Ubuntu SMP Fri Oct 2 22:15:46 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
| hloeung@swirlix01:~$ ps afx | grep qe\\mu-nbd
| 27782 ? Ssl 0:00 /usr/bin/qemu-nbd -c /dev/nbd10 /var/lib/nova/instances/ba50751e-56d7-4bc4-8742-1193fe7a138e/disk
| hloeung@swirlix01:~$ sudo cat /proc/$(ps afx | grep qe\\mu-nbd | awk '{ print $1 }')/stack
| [<ffffffc0000875b0>] __switch_to+0x74/0x8c
| [<ffffffc000125dac>] futex_wait_queue_me+0xf4/0x184
| [<ffffffc0001268b4>] futex_wait+0x154/0x24c
| [<ffffffc000128638>] do_futex+0x1a0/0x9ec
| [<ffffffc000128f1c>] SyS_futex+0x98/0x1cc
| [<ffffffc00008642c>] el0_svc_naked+0x20/0x28
| [<ffffffffffffffff>] 0xffffffffffffffff

swirlix08:

| hloeung@swirlix08:~$ uname -a
| Linux swirlix08 3.19.0-31-generic #36~14.04.1-Ubuntu SMP Thu Oct 8 10:50:10 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
| hloeung@swirlix08:~$ ps afx | grep qe\\mu-nbd
| 31976 ? Ssl 0:00 /usr/bin/qemu-nbd -c /dev/nbd6 /var/lib/nova/instances/92ceb061-2ea4-4212-be20-ab0ded6eb3cd/disk
| hloeung@swirlix08:~$ sudo cat /proc/$(ps afx | grep qe\\mu-nbd | awk '{ print $1 }')/stack
| [<ffffffc0000875b0>] __switch_to+0x74/0x8c
| [<ffffffc000125d6c>] futex_wait_queue_me+0xf4/0x184
| [<ffffffc000126874>] futex_wait+0x154/0x24c
| [<ffffffc0001285f8>] do_futex+0x1a0/0x9ec
| [<ffffffc000128edc>] SyS_futex+0x98/0x1cc
| [<ffffffc00008642c>] el0_svc_naked+0x20/0x28
| [<ffffffffffffffff>] 0xffffffffffffffff

swirlix11:

| hloeung@swirlix11:~$ uname -a
| Linux swirlix11 3.19.0-31-generic #36~14.04.1-Ubuntu SMP Thu Oct 8 10:50:10 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux
| hloeung@swirlix11:~$ ps afx | grep qe\\mu-nbd
| 18149 ? Ssl 0:00 /usr/bin/qemu-nbd -c /dev/nbd3 /var/lib/nova/instances/84cac137-c1e4-46ac-894a-efcd55ef7e05/disk
| hloeung@swirlix11:~$ sudo cat /proc/$(ps afx | grep qe\\mu-nbd | awk '{ print $1 }')/stack
| [<ffffffc0000875b0>] __switch_to+0x74/0x8c
| [<ffffffc000125d6c>] futex_wait_queue_me+0xf4/0x184
| [<ffffffc000126874>] futex_wait+0x154/0x24c
| [<ffffffc0001285f8>] do_futex+0x1a0/0x9ec
| [<ffffffc000128edc>] SyS_futex+0x98/0x1cc
| [<ffffffc00008642c>] el0_svc_naked+0x20/0x28
| [<ffffffffffffffff>] 0xffffffffffffffff

| hloeung@swirlix11:~$ sudo strace -f -p 18149
| Process 18149 attached with 3 threads
| [pid 18150] rt_sigtimedwait([BUS ALRM IO], NULL, NULL, 8 <unfinished ...>
| [pid 18149] futex(0x7f749ec230, FUTEX_WAIT, 18152, NULL
| ... (hangs here) ...
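
The `cat /proc/<pid>/stack` commands above show only the main thread's kernel stack, while strace reports three threads. A minimal sketch for collecting the stack of every thread follows. It is not from the original report: the function names are hypothetical, it assumes the process name in `/proc/<pid>/comm` is `qemu-nbd`, and reading the `stack` files normally requires root.

```python
import os

def find_pids_by_name(name, proc_root="/proc"):
    """Return sorted PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return sorted(pids)

def read_thread_stacks(pid, proc_root="/proc"):
    """Return {tid: kernel stack text} for every thread of `pid`."""
    stacks = {}
    task_dir = os.path.join(proc_root, str(pid), "task")
    for tid in sorted(os.listdir(task_dir), key=int):
        try:
            with open(os.path.join(task_dir, tid, "stack")) as f:
                stacks[int(tid)] = f.read()
        except OSError:
            continue  # thread exited or stack is not readable
    return stacks
```

Calling `read_thread_stacks(pid)` for each PID returned by `find_pids_by_name("qemu-nbd")` would show where the thread blocked in rt_sigtimedwait and the remaining worker thread are waiting, not just the main thread stuck in futex_wait.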

We're using the QEMU package backported from Vivid, as per LP:1457639.

| hloeung@swirlix11:~$ apt-cache policy qemu-utils
| qemu-utils:
| Installed: 1:2.2+dfsg-5expubuntu9.5+bug1457639~ubuntu14.04.1
| Candidate: 1:2.2+dfsg-5expubuntu9.5+bug1457639~ubuntu14.04.1
| Version table:
| *** 1:2.2+dfsg-5expubuntu9.5+bug1457639~ubuntu14.04.1 0
| 500 http://ppa.launchpad.net/canonical-is-sa/arm64-infra-workarounds/ubuntu/ trusty/main arm64 Packages

I'm also not sure if this is related to LP:1505564, which is for amd64/x86_64.
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Oct 25 17:42 seq
 crw-rw---- 1 root audio 116, 33 Oct 25 17:42 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.18
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
Lsusb: Error: command ['lsusb'] failed with exit code 1: unable to initialize libusb: -99
Package: qemu (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_GB
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: console=ttyS0,9600n8r ro
ProcVersionSignature: Ubuntu 3.19.0-31.36~14.04.1-generic 3.19.8-ckt7
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images trusty uec-images
Uname: Linux 3.19.0-31-generic aarch64
UnreportableReason: This is not an official Ubuntu package. Please remove any third party package and try again.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm
_MarkForUpload: True

Haw Loeung (hloeung)
description: updated
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1512185

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Haw Loeung (hloeung) wrote : BootDmesg.txt

apport information

tags: added: apport-collected trusty uec-images
description: updated
Haw Loeung (hloeung) wrote : CurrentDmesg.txt

apport information

Haw Loeung (hloeung) wrote : IwConfig.txt

apport information

Haw Loeung (hloeung) wrote : KvmCmdLine.txt

apport information

Haw Loeung (hloeung) wrote : Lspci.txt

apport information

Haw Loeung (hloeung) wrote : ProcCpuinfo.txt

apport information

Haw Loeung (hloeung) wrote : ProcInterrupts.txt

apport information

Haw Loeung (hloeung) wrote : ProcModules.txt

apport information

Haw Loeung (hloeung) wrote : RelatedPackageVersions.txt

apport information

Haw Loeung (hloeung) wrote : UdevDb.txt

apport information

Haw Loeung (hloeung) wrote : UdevLog.txt

apport information

Haw Loeung (hloeung) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.3 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3-unstable/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Haw Loeung (hloeung) wrote :

Kernel oops (an rcu_sched self-detected stall) on one of the mcdivitts:

| [544599.231964] block nbd0: Attempted send on closed socket
| [544599.231968] block nbd0: Attempted send on closed socket
| [544599.231972] block nbd0: Attempted send on closed socket
| [544599.231975] block nbd0: Attempted send on closed socket
| [544627.046717] INFO: rcu_sched self-detected stall on CPU { 3} (t=15040176 jiffies g=8751031 c=8751030 q=246988591)
| [544627.046719] Task dump for CPU 3:
| [544627.046723] qemu-nbd R running task 0 32375 1 0x0000000a
| [544627.046724] Call trace:
| [544627.046733] [<ffffffc00008acf4>] dump_backtrace+0x0/0x170
| [544627.046737] [<ffffffc00008ae84>] show_stack+0x20/0x2c
| [544627.046740] [<ffffffc0000e066c>] sched_show_task+0xa0/0xf8
| [544627.046742] [<ffffffc0000e3c24>] dump_cpu_task+0x44/0x54
| [544627.046745] [<ffffffc00010bb90>] rcu_dump_cpu_stacks+0x98/0xec
| [544627.046747] [<ffffffc00010f350>] rcu_check_callbacks+0x410/0x740
| [544627.046751] [<ffffffc000114974>] update_process_times+0x40/0x74
| [544627.046754] [<ffffffc0001246d8>] tick_sched_handle.isra.15+0x38/0x7c
| [544627.046756] [<ffffffc000124764>] tick_sched_timer+0x48/0x84
| [544627.046758] [<ffffffc000114ff4>] __run_hrtimer+0x90/0x1d0
| [544627.046760] [<ffffffc000115ae4>] hrtimer_interrupt+0xec/0x28c
| [544627.046764] [<ffffffc00063f4b4>] arch_timer_handler_phys+0x38/0x48
| [544627.046766] [<ffffffc000105554>] handle_percpu_devid_irq+0x90/0x12c
| [544627.046769] [<ffffffc0001010c0>] generic_handle_irq+0x38/0x54
| [544627.046770] [<ffffffc000101404>] __handle_domain_irq+0x64/0xc0
| [544627.046772] [<ffffffc000082478>] gic_handle_irq+0x38/0x88
| [544627.046773] Exception stack(0xffffffc634947610 to 0xffffffc634947730)
| [544627.046776] 7600: 00c1a000 ffffffc0 00c1f000 ffffffc0
| [544627.046778] 7620: 34947750 ffffffc6 000ffe0c ffffffc0 00000900 00000000 000001c0 00000000
| [544627.046780] 7640: 00000005 00000000 004a1870 ffffffc0 004a1870 ffffffc0 004a433c ffffffc0
| [544627.046782] 7660: 000000ff 00000000 00b5c658 ffffffc0 6465736f 636f7320 00ba157e 00000000
| [544627.046784] 7680: 00ba1449 00000000 00000000 00000000 00000006 00000000 66666666 20666366
| [544627.046786] 76a0: 34393433 30353737 ec9de100 0038d0cb 0023ceec ffffffc0 004bc7f8 00000000
| [544627.046788] 76c0: aff5f5d0 0000007f 00c1a000 ffffffc0 00c1f000 ffffffc0 00000140 00000000
| [544627.046790] 76e0: 00c1adc0 ffffffc0 00b7c198 ffffffc0 00000001 00000000 00c1f1c0 ffffffc0
| [544627.046792] 7700: 00000003 00000000 00000000 00000000 fff47938 ffffffcf 34947750 ffffffc6
| [544627.046793] 7720: 000ffe08 ffffffc0 34947750 ffffffc6
| [544627.046795] [<ffffffc000085da4>] el1_irq+0x64/0xc0
| [544627.046798] [<ffffffc000100170>] vprintk_emit+0x33c/0x59c
| [544627.046801] [<ffffffc0004c1c4c>] dev_vprintk_emit+0xc8/0x204
| [544627.046802] [<ffffffc0004c1dfc>] dev_printk_emit+0x74/0x84
| [544627.046804] [<ffffffc0004c1e60>] __dev_printk+0x54/0x9c
| [544627.046805] [<ffffffc0004c2114>] dev_err+0x70/0x80
| [544627.046814] [<ffffffbffc3c8544>] __nbd_ioctl+0x810/0x944 [nbd]
| [544627.046817] [<ffffffbffc3c86f4>] nbd_ioctl+0x7c/0x228 [nbd]
| [544627.046821] [<ffffffc0003b2fe8>] blkdev_ioctl+...
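
Reading the trace above, the frames below `el1_irq` are the interrupted context: `nbd_ioctl` and `__nbd_ioctl` call `dev_err`, which is in the printk path when the timer interrupt reports the stall. To pull function names and modules out of such a trace for comparison across hosts, a minimal parsing sketch is shown here. The names `parse_frames` and `frames_in_module` are hypothetical, and the regular expression assumes the `[<addr>] symbol+0xoff/0xsize [module]` frame format seen in this log.

```python
import re

# One stack frame, e.g.:
#   [<ffffffbffc3c8544>] __nbd_ioctl+0x810/0x944 [nbd]
FRAME = re.compile(
    r"\[<(?P<addr>[0-9a-f]+)>\]\s+(?P<sym>[\w.]+)"
    r"\+0x(?P<off>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)

def parse_frames(trace_text):
    """Return a list of (symbol, module_or_None) in the order printed."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match:
            frames.append((match.group("sym"), match.group("mod")))
    return frames

def frames_in_module(frames, module):
    """Return the symbols that belong to the given kernel module."""
    return [sym for sym, mod in frames if mod == module]
```

Truncated frames such as the final `blkdev_ioctl+...` line do not match the pattern and are skipped rather than guessed at.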

Barry Price (barryprice) wrote :

Hi Joseph,

Happy to test upstream kernels, but this setup requires a 64-bit native arm64/aarch64 kernel, which I can't see at the link provided - only 32-bit armhf kernels.

Currently we're running the kernel from the linux-generic-lts-wily package:

Linux swirlix18 4.2.0-16-generic #19~14.04.1-Ubuntu SMP Thu Oct 8 15:36:19 UTC 2015 aarch64 aarch64 aarch64 GNU/Linux

If you know where we can find an appropriate mainline kernel, we can get it tested. Thanks.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-meta-lts-vivid (Ubuntu):
status: New → Confirmed
Changed in qemu (Ubuntu):
status: New → Confirmed
Changed in qemu (Ubuntu):
status: Confirmed → Incomplete
importance: Undecided → Medium
Changed in linux-meta-lts-vivid (Ubuntu):
status: Confirmed → Invalid
Launchpad Janitor (janitor) wrote :

[Expired for qemu (Ubuntu) because there has been no activity for 60 days.]

Changed in qemu (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired