[Supermicro X9DR3-F] Virtual machines hang in kernel 3.13

Bug #1332409 reported by Steve on 2014-06-20
This bug affects 4 people
Affects: linux (Ubuntu)
Importance: High
Assigned to: Unassigned

Bug Description

Multi-CPU hosts running kernel 3.13 are seeing guest kernels hang or crash, for both Windows and Linux guests. Windows guests get bluescreens; Linux guests just seem to lock up.

Downgrading to 3.12 eliminates the problem.

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: linux-image-3.13.0-29-lowlatency 3.13.0-29.53
ProcVersionSignature: Ubuntu 3.13.0-29.53-lowlatency 3.13.11.2
Uname: Linux 3.13.0-29-lowlatency x86_64
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 18 16:43 seq
 crw-rw---- 1 root audio 116, 33 Jun 18 16:43 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay'
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory: 'iw'
Date: Fri Jun 20 01:07:59 2014
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig'
Lsusb:
 Bus 002 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 003: ID 0557:2221 ATEN International Co., Ltd Winbond Hermon
 Bus 001 Device 002: ID 8087:0024 Intel Corp. Integrated Rate Matching Hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Supermicro X9DR3-F
PciMultimedia:

ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-29-lowlatency root=UUID=0d5098de-e0c2-4884-bf6a-5d68f9ad5266 ro console=tty1 console=ttyS1,115200n8 nomodeset rootdelay=15 text nomdmonddf nomdmonisw
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-29-lowlatency N/A
 linux-backports-modules-3.13.0-29-lowlatency N/A
 linux-firmware 1.127.2
RfKill: Error: [Errno 2] No such file or directory: 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to trusty on 2014-04-19 (61 days ago)
dmi.bios.date: 07/31/2013
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 3.0a
dmi.board.asset.tag: To be filled by O.E.M.
dmi.board.name: X9DR3-F
dmi.board.vendor: Supermicro
dmi.board.version: 0123456789
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 3
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 0123456789
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr3.0a:bd07/31/2013:svnSupermicro:pnX9DR3-F:pvr0123456789:rvnSupermicro:rnX9DR3-F:rvr0123456789:cvnSupermicro:ct3:cvr0123456789:
dmi.product.name: X9DR3-F
dmi.product.version: 0123456789
dmi.sys.vendor: Supermicro

Steve (lp-z) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

Steve, can you try using the generic flavour as your hypervisor? The lowlatency flavour has never been vetted for virtual environments.

tags: added: latest-bios-3.0a needs-bisect regression-release
description: updated
summary: - Virtual machines hang in kernel 3.13
+ [Supermicro X9DR3-F] Virtual machines hang in kernel 3.13
description: updated
Changed in linux (Ubuntu):
importance: Undecided → Low
status: Confirmed → Incomplete
Steve (lp-z) wrote : BootDmesg.txt

apport information

tags: added: apport-collected
description: updated
Steve (lp-z) wrote : CurrentDmesg.txt

apport information

Steve (lp-z) wrote : Lspci.txt

apport information

Steve (lp-z) wrote : ProcCpuinfo.txt

apport information

Steve (lp-z) wrote : ProcModules.txt

apport information

Steve (lp-z) wrote : UdevDb.txt

apport information

Steve (lp-z) wrote : UdevLog.txt

apport information

Steve (lp-z) wrote :

Updated with apport-collect info from kernel 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Steve (lp-z) wrote :

Other symptoms I noticed while the kernel was in this state (I collected the apport-collect data a few minutes ago and was then in the process of rebooting back to the 3.12 kernel):

 'virsh destroy' errored out with "Device or resource busy" when sending SIGKILL:

root# virsh destroy centos-nessus
error: Failed to destroy domain centos-nessus
error: Failed to terminate process 3170 with SIGKILL: Device or resource busy

and a 'ps axl' hung for several minutes; when it completed, this was the line it had been hanging on:

6 109 3170 1 20 0 13198936 5996384 exit Dl ? 341:47 qemu-system-x86_64 -enable-kvm -name centos-nessus -S -machine pc-1.0,accel=kvm,usb=off -cpu SandyBridge,+pdpe1gb,+osxsave,+dca,

and the process exited shortly thereafter.

There weren't any messages in dmesg from this.
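
The `Dl` in the `ps axl` line above is the process state: `D` means uninterruptible sleep, which is consistent with SIGKILL not taking effect. As a rough sketch (not part of the original report), processes in that state can be listed by reading `/proc/<pid>/stat` directly, which avoids the fields that made `ps` stall here. The function names `parse_stat` and `uninterruptible_processes` are made up for illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, _, right = stat_text.rpartition(")")
    pid_str, _, comm = left.partition("(")
    state = right.split()[0]
    return int(pid_str), comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Yield (pid, comm) for every process currently in state D."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except OSError:
            continue  # process exited in the meantime or is not readable
        if state == "D":
            yield pid, comm


if __name__ == "__main__":
    for pid, comm in uninterruptible_processes():
        print(pid, comm)
```

Note that `comm` in `/proc/<pid>/stat` is truncated by the kernel, so a process such as `qemu-system-x86_64` shows up as `qemu-system-x86`, matching the name in the hung-task messages.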

Steve, could you please test the latest upstream kernel available (the topmost entry on the page, not the daily folder) following https://wiki.ubuntu.com/KernelMainlineBuilds? This will allow additional upstream developers to examine the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-3.16-rc1

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.
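
The tagging rules above reduce to a small mapping from the test outcome to the tags to add and remove. The sketch below only restates those rules; `upstream_test_tags` is a hypothetical helper name, and nothing here talks to Launchpad.

```python
def upstream_test_tags(version, fixed):
    """Return the tag changes requested after testing an upstream kernel.

    version: the kernel version tested, e.g. "3.16-rc1"
    fixed:   True if the bug no longer reproduces on that kernel
    """
    prefix = "kernel-fixed-upstream" if fixed else "kernel-bug-exists-upstream"
    return {
        "add": [prefix, f"{prefix}-{version}"],
        # The same tag is removed in both cases.
        "remove": ["needs-upstream-testing"],
    }


if __name__ == "__main__":
    print(upstream_test_tags("3.16-rc1", fixed=True))
```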

description: updated
tags: removed: apport-collected
Changed in linux (Ubuntu):
importance: Low → Medium
Steve (lp-z) wrote :

Tested with the latest upstream trusty kernel version as requested --

Linux 3.13.11-03131104-generic #201406201536 SMP Fri Jun 20 19:37:17 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Seeing similar behavior there with things like:

[88719.002167] INFO: task qemu-system-x86:215923 blocked for more than 120 seconds.
[88719.009603] Not tainted 3.13.11-03131104-generic #201406201536
[88719.015985] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[88719.023857] qemu-system-x86 D ffffffff818114c0 0 215923 1 0x00000000
[88719.023859] ffff881e43703d30 0000000000000002 ffffffff81091670 ffff881e43703fd8
[88719.023862] 0000000000014440 0000000000014440 ffff881fd2c0afe0 ffff881e435dc7d0
[88719.023864] ffffffff81091748 ffff881e435dc7d0 ffff883fd1633160 ffff883fd1633168
[88719.023867] Call Trace:
[88719.023869] [<ffffffff81091670>] ? lock_hrtimer_base.isra.24+0x30/0x60
[88719.023871] [<ffffffff81091748>] ? hrtimer_try_to_cancel+0x58/0x110
[88719.023874] [<ffffffff8174c5e9>] schedule+0x29/0x70
[88719.023876] [<ffffffff8174efc5>] rwsem_down_read_failed+0xb5/0x140
[88719.023881] [<ffffffff81091822>] ? hrtimer_cancel+0x22/0x30
[88719.023889] [<ffffffff813780b4>] call_rwsem_down_read_failed+0x14/0x30
[88719.023892] [<ffffffff8174ea74>] ? down_read+0x24/0x30
[88719.023895] [<ffffffff8175439a>] __do_page_fault+0x20a/0x570
[88719.023897] [<ffffffff81116d6c>] ? acct_account_cputime+0x1c/0x20
[88719.023900] [<ffffffff810a1099>] ? account_user_time+0x99/0xb0
[88719.023902] [<ffffffff810a171d>] ? vtime_account_user+0x5d/0x70
[88719.023904] [<ffffffff8175471a>] do_page_fault+0x1a/0x70
[88719.023906] [<ffffffff81750888>] page_fault+0x28/0x30

---
and the VMs with hyperv enabled see the delay: [kernel 3.11]
[80683.552092] INFO: rcu_sched self-detected stall on CPU { 0} (t=15897 jiffies g=315984 c=315983 q=0)
and [kernel 2.6]
hrtimer: interrupt took 3950739 ns
Clocksource tsc unstable (delta = -8589936825 ns)

Not seeing this at all with kernel 3.12.
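
Since the problem took 12-24 hours to appear, it can help to scan saved dmesg output for the hung-task header instead of watching the console. The following is a minimal sketch that assumes the message format shown in the trace above; the names `HUNG_TASK` and `find_hung_tasks` are illustrative only.

```python
import re

# Matches lines such as:
# [88719.002167] INFO: task qemu-system-x86:215923 blocked for more than 120 seconds.
HUNG_TASK = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+INFO: task (?P<comm>.+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds\."
)


def find_hung_tasks(log_text):
    """Return a list of (timestamp, comm, pid, seconds) for each hung-task report."""
    results = []
    for line in log_text.splitlines():
        m = HUNG_TASK.search(line)
        if m:
            results.append(
                (float(m["ts"]), m["comm"], int(m["pid"]), int(m["secs"]))
            )
    return results


if __name__ == "__main__":
    import sys

    # Usage (assumed): dmesg | python3 find_hung.py
    for ts, comm, pid, secs in find_hung_tasks(sys.stdin.read()):
        print(f"{ts:.6f} {comm} (pid {pid}) blocked > {secs}s")
```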

Added tags kernel-bug-exists-upstream kernel-bug-exists-upstream-3.13.11-03131104-generic as requested.

tags: added: kernel-bug-exists-upstream kernel-bug-exists-upstream-3.13.11-03131104-generic
Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Steve, could you please test the latest mainline kernel (3.16-rc3) via http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc3-utopic/ and advise on the results?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Steve (lp-z) wrote :

It's been running on upstream 3.16-rc3 for 30 hours now and the problem has not recurred. With 3.13 I'd see the problem after 12-24 hours, so this is a good sign.

Linux 3.16.0-031600rc3-generic #201406291835 SMP Sun Jun 29 22:36:41 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

Marking it as kernel-fixed-upstream-3.16-rc3.

I'll keep it on this kernel for at least the next few days (probably longer) and will update this bug if anything changes.

tags: added: kernel-fixed-upstream kernel-fixed-upstream-3.16-rc3

Steve, the next step is to fully reverse commit bisect the kernel in order to identify the fix commit. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection#How_do_I_reverse_bisect_the_upstream_kernel.3F ?

Steve (lp-z) wrote :

Results to date (bisection still in progress)

broken in: v3.13.11.4
fixed in: v3.16-rc3
fixed in: v3.14.5

trying now: v3.14.0

Steve (lp-z) wrote :

fixed in: v3.16-rc3
fixed in: v3.14.5
fixed in: v3.14.0
fixed in: v3.14.0rc1
broken in: v3.13.11.4 [the most recent v3.13 kernel]
broken in: v3.13.0 [the earliest v3.13 kernel, without going into the rc versions]
fixed in: 3.12.22 [one of the more recent 3.12 kernels; was the most recent at the time I had tried it]
[and broken in all other v3.13 versions I had tried]
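
To make the pattern in these results easier to see, here is a small sketch that orders the tested versions and reports the broken range together with its nearest working neighbours. It treats version numbers as one linear sequence, which is a simplification (stable branches such as 3.12.y and 3.13.y are not a single line of history), and the names `version_key` and `summarize` are illustrative.

```python
import re

VERSION = re.compile(r"^v?(\d+(?:\.\d+)*)(?:-?rc(\d+))?$")


def version_key(tag):
    """Sort key for tags like 'v3.13.11.4', 'v3.16-rc3', 'v3.14.0rc1', '3.12.22'."""
    m = VERSION.match(tag)
    if m is None:
        raise ValueError(f"unrecognised kernel version: {tag!r}")
    nums = [int(p) for p in m.group(1).split(".")]
    nums += [0] * (4 - len(nums))  # pad so 3.16 compares like 3.16.0.0
    # A release candidate sorts before the final release of the same version.
    rc = int(m.group(2)) if m.group(2) else float("inf")
    return (*nums, rc)


def summarize(results):
    """results maps version tag -> True if the VMs ran without hanging."""
    ordered = sorted(results, key=version_key)
    broken = [v for v in ordered if not results[v]]
    if not broken:
        return None
    lo, hi = version_key(broken[0]), version_key(broken[-1])
    before = [v for v in ordered if results[v] and version_key(v) < lo]
    after = [v for v in ordered if results[v] and version_key(v) > hi]
    return {
        "last_good_before": before[-1] if before else None,
        "broken": broken,
        "first_good_after": after[0] if after else None,
    }


# Results as reported in this comment.
RESULTS = {
    "v3.16-rc3": True,
    "v3.14.5": True,
    "v3.14.0": True,
    "v3.14.0rc1": True,
    "v3.13.11.4": False,
    "v3.13.0": False,
    "3.12.22": True,
}

if __name__ == "__main__":
    print(summarize(RESULTS))
```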

At this point it looks like the problem is specific to 3.13 (possibly some commit that was applied to 3.13 early on and either never made it into the 3.14 branch or was rewritten in 3.14...?). I'm not planning on testing more kernels, since the pattern seems clear.

Looking at http://kernelnewbies.org/Linux_3.13, there were some major changes around both "Improved performance in NUMA systems" (this is a multi-processor system) and "Improved page table access scalability in hugepage workloads" (this machine does have 256 GB of memory). One guess is that something in those changes was redesigned in 3.14 to work better(?), though scanning through the top features in 3.14, nothing jumps out at me.

Chris J Arges (arges) on 2014-07-21
tags: added: ksm-numa-guest-perf
Chris J Arges (arges) wrote :

I believe I've found the fix for this issue for 3.13.
If you can, please test the kernel posted on comment #1 on this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917

If this fixes the issue for you, you are welcome to mark this bug as a duplicate of 1346917.

Thanks!

Joseph Salisbury (jsalisbury) wrote :

Marking incomplete until requested testing is complete.

Changed in linux (Ubuntu):
importance: Medium → High
Steve (lp-z) wrote :

I upgraded to the new kernel. At this point it does appear that the fix in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917 has addressed the issue; I've been running now for ~38 hrs on that kernel with no issues.

Marking the bug as a dup of 1346917. If something changes (note: the problem has always manifested itself faster than this before) - I'll note that.
