QEMU Windows guest unstable after random amount of time

Bug #1322441 reported by Matthew Anderson
This bug affects 4 people
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  High        Unassigned

Bug Description

Ubuntu 14.04, all updates done as of 23/5/2014
Kernel : Linux 3.13.0-24-generic
Qemu : 1.7.1 and 2.0.0 tested
Tested using both Xeon 5620 and 5520 processors, 48GB RAM.

Anywhere from 20 minutes to 3+ hours after booting, Windows guests become unstable. Guests appear to freeze intermittently (VNC console unresponsive, pings to the guest dropped, I/O frozen) for 20-60 seconds at a time. The freezes repeat every few minutes, and guests do not recover unless QEMU is killed and the guest is started again. While a guest is frozen, CPU usage of the QEMU process jumps to 100-150%.

WORKAROUND: Disable automatic NUMA balancing:
echo 0 > /proc/sys/kernel/numa_balancing
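
The echo above only lasts until the next reboot. A minimal sketch for checking the current state and keeping the workaround persistent, assuming a procps `sysctl` and the usual `/etc/sysctl.d` directory; the helper name `numa_balancing_state` and the file name `60-numa-balancing.conf` are illustrative and do not come from this report:

```shell
# Print whether automatic NUMA balancing is on. The control file path is a
# parameter (defaulting to the real one) so the function can be tried on a copy.
# Hypothetical helper name, not part of any package.
numa_balancing_state() (
    ctl=${1:-/proc/sys/kernel/numa_balancing}
    if [ ! -r "$ctl" ]; then
        echo "unsupported"      # file absent: kernel has no NUMA balancing knob
        exit 0
    fi
    case "$(cat "$ctl")" in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
)

# To keep the workaround across reboots (run as root; the file name is an
# arbitrary choice):
#   echo 'kernel.numa_balancing = 0' > /etc/sysctl.d/60-numa-balancing.conf
#   sysctl --system
```

Calling `numa_balancing_state` with no argument reads the live setting on the host.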

---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 May 24 01:25 seq
 crw-rw---- 1 root audio 116, 33 May 24 01:25 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.1
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=UUID=38dc08b6-55f7-482d-8f82-f048b3dbad56
InstallationDate: Installed on 2013-03-21 (428 days ago)
InstallationMedia: Ubuntu-Server 12.04.1 LTS "Precise Pangolin" - Release amd64 (20120817.3)
MachineType: Supermicro X8DTT
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.13.0-24-generic root=UUID=386ad1f0-7ced-414f-9a09-4c078c36977c ro nomdmonddf nomdmonisw nomdmonddf nomdmonisw
ProcVersionSignature: Ubuntu 3.13.0-24.47-generic 3.13.9
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-24-generic N/A
 linux-backports-modules-3.13.0-24-generic N/A
 linux-firmware 1.127.2
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-24-generic x86_64
UpgradeStatus: Upgraded to trusty on 2014-05-21 (2 days ago)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 05/20/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 080016
dmi.board.asset.tag: 1234567890
dmi.board.name: X8DTT
dmi.board.vendor: Supermicro
dmi.board.version: 2.0
dmi.chassis.asset.tag: To Be Filled By O.E.M.
dmi.chassis.type: 17
dmi.chassis.vendor: Supermicro
dmi.chassis.version: 1234567890
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr080016:bd05/20/2010:svnSupermicro:pnX8DTT:pvr1234567890:rvnSupermicro:rnX8DTT:rvr2.0:cvnSupermicro:ct17:cvr1234567890:
dmi.product.name: X8DTT
dmi.product.version: 1234567890
dmi.sys.vendor: Supermicro

Revision history for this message
Matthew Anderson (matthewa) wrote :

The bug occurred when upgrading from 12.04.1 (3.2 kernel) to 14.04 (3.13 kernel). Booting with the old 3.2 kernel resolves the problem.

Guest command line -
qemu-system-x86_64 -enable-kvm -name test -S -machine pc-i440fx-1.4,accel=kvm,usb=off -cpu qemu64,hv_relaxed -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid e00d6bde-ca3e-8e6c-2c92-d5b5ec632b9a -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/test.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime,driftfix=slew -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=rbd:ssd/test:auth_supported=none,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=1 -netdev tap,fd=24,id=hostnet0,vhost=on,vhostfd=25 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:48:81:8a,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:0 -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4

Attached is a perf record of the qemu process taken while the fault was occurring.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1322441

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Matthew Anderson (matthewa) wrote : BootDmesg.txt

apport information

tags: added: apport-collected trusty
description: updated
Revision history for this message
Matthew Anderson (matthewa) wrote : CurrentDmesg.txt, IwConfig.txt, Lspci.txt, Lsusb.txt, ProcCpuinfo.txt, ProcEnviron.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, UdevLog.txt, WifiSyslog.txt

apport information (one comment per attachment)

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

Matthew Anderson, thank you for reporting this and helping make Ubuntu better. As per http://www.supermicro.com/products/motherboard/QPI/5500/X8DTT.cfm an update to your BIOS is available. If you update the BIOS following https://help.ubuntu.com/community/BiosUpdate, does it change anything? If it doesn't, could you please both describe what happened and provide the output of the following terminal command:
sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date

Please note your current BIOS is already in the Bug Description, so posting this on the old BIOS would not be helpful.

For more on BIOS updates and Linux, please see https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette .

Thank you for your understanding.

summary: - Windows guest unstable after random amount of time
+ QEMU Windows guest unstable after random amount of time
tags: added: bios-outdated
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Matthew Anderson (matthewa) wrote :

BIOS has now been updated
# sudo dmidecode -s bios-version && sudo dmidecode -s bios-release-date
2.1c
04/22/2014

I've been able to reproduce the problem using a fresh Windows 2008R2 server image. The guest functions correctly for a period of time then starts to 'freeze' for around 10 seconds every 5-20 seconds. While frozen the CPU usage of QEMU jumps to around 105%.

Attached is the updated perf record.

 11.74% qemu-system-x86 [kernel.kallsyms] [k] flush_tlb_page
  6.27% qemu-system-x86 [kernel.kallsyms] [k] kvm_handle_hva_range
  4.98% qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_lock
  4.76% qemu-system-x86 [kernel.kallsyms] [k] try_to_unmap_ksm
  4.62% qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_unlock
  4.30% qemu-system-x86 [kernel.kallsyms] [k] up_read
  4.05% qemu-system-x86 [kernel.kallsyms] [k] down_read
  3.97% qemu-system-x86 [kernel.kallsyms] [k] try_to_unmap_one
  3.85% qemu-system-x86 [kernel.kallsyms] [k] generic_exec_single
  3.72% qemu-system-x86 [kernel.kallsyms] [k] _raw_spin_lock
  3.45% qemu-system-x86 [kernel.kallsyms] [k] __srcu_read_unlock
  3.18% qemu-system-x86 [kernel.kallsyms] [k] mm_find_pmd
  3.00% qemu-system-x86 [kernel.kallsyms] [k] rmap_walk_ksm
  2.90% qemu-system-x86 [kernel.kallsyms] [k] cpumask_any_but
  2.87% qemu-system-x86 [kernel.kallsyms] [k] remove_migration_pte
  2.25% qemu-system-x86 [kernel.kallsyms] [k] __anon_vma_interval_tree_subtree_search
  1.64% qemu-system-x86 [kernel.kallsyms] [k] kvm_unmap_rmapp
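
To see how much of a profile like the one above goes to page unmapping, migration and TLB flushing, the overhead column can be summed per symbol group. This is a rough sketch, assuming the report is in `perf report --stdio` style with the overhead first and the symbol last on each line; the helper name `perf_share` is made up for illustration:

```shell
# Sum the overhead column of perf-report-style lines whose symbol (the last
# field) matches an extended regex. Reads the report on stdin and prints the
# total as a percentage with two decimals. Hypothetical helper name.
perf_share() {
    awk -v pat="$1" '
        $1 ~ /^[0-9.]+%$/ && $NF ~ pat { sub(/%/, "", $1); total += $1 }
        END { printf "%.2f\n", total }
    '
}

# Possible use on a live host (the capture command is an assumption; the
# report does not say how the perf record was taken):
#   perf record -g -p <qemu-pid> -- sleep 30
#   perf report --stdio | perf_share 'ksm|rmap|unmap|migration|tlb'
```

Fed the listing above, the pattern `ksm|rmap|unmap|migration|tlb` matches six of the symbols, which together account for about 28% of the samples.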

penalvch (penalvch)
tags: added: latest-bios-2.1c
removed: bios-outdated
Revision history for this message
penalvch (penalvch) wrote :

Matthew Anderson, did this problem not occur in a release prior to Trusty?

affects: linux (Ubuntu) → qemu (Ubuntu)
Revision history for this message
Matthew Anderson (matthewa) wrote :

Not this exact problem but one with similar symptoms. From kernel 3.5 onwards there was a problem with guests not receiving RTC ticks which was reported here - http://lists.gnu.org/archive/html/qemu-devel/2013-02/msg03827.html

The 3.2 kernel from 12.04.1 never had any issues, but the 3.5 kernel from 12.04.2 onwards hit the RTC bug, which was fixed in later releases of QEMU.

penalvch (penalvch)
Changed in qemu (Ubuntu):
importance: Medium → High
status: Incomplete → New
Revision history for this message
Matthew Anderson (matthewa) wrote :

Because flush_tlb_page() appears to take up most of the CPU time while QEMU is 'frozen', I started experimenting with memory settings. Disabling automatic NUMA balancing appears to have solved the problem:
echo 0 > /proc/sys/kernel/numa_balancing

The guest has been running 4+ hours now without an issue. I currently have KSM and transparent huge pages enabled.
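
For anyone comparing their host against this setup, the KSM and transparent huge page settings can be read from sysfs. A small sketch, assuming the standard `/sys/kernel/mm` layout; the function name `mm_feature_state` is illustrative only:

```shell
# Report KSM and transparent-huge-page state. The sysfs root is a parameter
# only so the function can be pointed at a copy; it defaults to /sys.
# Hypothetical helper name.
mm_feature_state() (
    root=${1:-/sys}
    ksm="$root/kernel/mm/ksm/run"
    thp="$root/kernel/mm/transparent_hugepage/enabled"
    if [ -r "$ksm" ]; then
        echo "ksm_run=$(cat "$ksm")"    # 1 means ksmd is merging pages
    else
        echo "ksm_run=unavailable"
    fi
    if [ -r "$thp" ]; then
        # The active mode is the bracketed word, e.g. "[always] madvise never".
        echo "thp=$(sed 's/.*\[\(.*\)\].*/\1/' "$thp")"
    else
        echo "thp=unavailable"
    fi
)
```

Run without arguments, it prints one `ksm_run=` line and one `thp=` line for the live host.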

penalvch (penalvch)
description: updated
Changed in qemu (Ubuntu):
importance: High → Medium
Revision history for this message
Serge Hallyn (serge-hallyn) wrote :

@penalvch, I'd suggest that since booting with the old 3.2 kernel prevented this bug, this bug should still be marked as affecting the kernel.

Lowering priority to medium since there is a workaround (technically 2)

penalvch (penalvch)
affects: qemu (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.15 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-rc7-utopic/

penalvch (penalvch)
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Matthew Anderson (matthewa) wrote :

Tested with kernel - Linux 3.15.0-031500rc7-generic

After 24 hours there's no sign of the problem.

After watching numastat for a while I noticed that with the 3.13.0 kernel the memory allocated to qemu suddenly drops from 6GB to 2GB. QEMU then freezes at 100% CPU until the allocation climbs back to 6GB. On 3.15 the memory stays almost constant and is migrated towards one NUMA node, as you'd expect. Just speculating, but there must be some kind of page invalidation bug in the NUMA balancer.
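
The comment does not say how numastat was invoked; `numastat -p <pid>` is one plausible form. As an alternative sketch, the per-node page counts can be totalled straight from `/proc/<pid>/numa_maps`, assuming its usual `N<node>=<pages>` tokens. The helper name `numa_pages_by_node` is made up, and the counts are in pages of each mapping's own page size, so totals are only approximate when huge pages are in use:

```shell
# Sum the per-node page counts (N0=..., N1=...) from numa_maps-style input
# on stdin and print one "node<k> <pages>" line per node, sorted by name.
# Hypothetical helper name.
numa_pages_by_node() {
    awk '
        {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^N[0-9]+=[0-9]+$/) {
                    split($i, kv, "=")
                    pages[substr(kv[1], 2)] += kv[2]
                }
        }
        END { for (n in pages) print "node" n, pages[n] }
    ' | sort
}

# Possible use, sampling every 5 seconds (needs permission to read the file):
#   while sleep 5; do date; numa_pages_by_node < /proc/<qemu-pid>/numa_maps; done
```

A sudden fall in the combined total between samples would correspond to the 6GB to 2GB drop described above.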

Revision history for this message
penalvch (penalvch) wrote :

Matthew Anderson, the next step would be to fully reverse commit bisect in order to identify the fix commit. Could you please do this following https://wiki.ubuntu.com/Kernel/KernelBisection#How_do_I_reverse_bisect_the_upstream_kernel.3F ?
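
One way to set up such a reverse bisect is git's custom bisect terms, which avoid having to swap the meaning of good and bad by hand. This is a sketch of that technique, not a copy of the wiki procedure; it assumes git 2.7 or newer, and the function name `reverse_bisect_start` and the example path are hypothetical:

```shell
# Start a reverse bisect in a kernel git tree: "broken" marks commits that
# still show the freeze, "fixed" marks commits that no longer do.
# Hypothetical helper name.
reverse_bisect_start() (
    repo=$1
    broken_ref=$2
    fixed_ref=$3
    git -C "$repo" bisect start --term-old=broken --term-new=fixed &&
    git -C "$repo" bisect broken "$broken_ref" &&
    git -C "$repo" bisect fixed "$fixed_ref"
)

# Example (upstream tags closest to the kernels in this report; the path is
# a placeholder):
#   reverse_bisect_start ~/src/linux v3.13 v3.15-rc7
# After building and testing each candidate kernel, run one of:
#   git bisect broken    # the freeze still happens
#   git bisect fixed     # the freeze is gone
```

When the bisect finishes, git reports the first commit marked fixed, which is the candidate fix commit. Note that the Ubuntu 3.13.0-24 kernel carries additional patches, so the upstream v3.13 tag is only an approximation of the broken starting point.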

tags: added: kernel-fixed-upstream-3.15-rc7 needs-reverse-bisect
Revision history for this message
Thiago Martins (martinx) wrote :

Hey guys,

I'm facing the following problem with Trusty:

---
Ubuntu 14.04 + QEMU 2.0 + KSM = 1 causes Windows 2008 R2 guests to crash (BSOD):

https://bugs.launchpad.net/qemu/+bug/1338277
---

Maybe those problems are related to each other?!

Best,
Thiago

Chris J Arges (arges)
tags: added: ksm-numa-guest-perf
Revision history for this message
Chris J Arges (arges) wrote :

I believe I've found the fix for this issue on 3.13.
If you can, please test the kernel posted on comment #1 on this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917
Make sure KSM is enabled; and any workarounds for this bug are disabled.

If this fixes the issue for you, you are welcome to mark this bug as a duplicate of 1346917.

Thanks!

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Marking incomplete until requested testing is complete.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
importance: Medium → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Jon Taylor (dosadi82) wrote :

Changing the status of this bug from 'expired' to 'confirmed', since all the way up here in Ubuntu 20 land (in 2020), I have found this bug when development testing my research OS. Random lockups and segfaults, after which the CPU usage of the qemu-system-x86 process sticks at over 100%.

Changed in linux (Ubuntu):
status: Expired → Confirmed