Ubuntu 14.04 + QEmu 2.0 + KSM = 1, makes Windows 2008 R2 guests to crash (BSOD)

Bug #1338277 reported by Thiago Martins on 2014-07-06
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Guys,

I'm trying to run Windows 2008 as a QEmu guest on my Ubuntu 14.04 but, after lots of tests, I figured out that it doesn't work: QEmu makes Windows 2008 crash. It is not a Windows fault; I'm pretty sure that it is a QEmu bug.

Lab environment (5 servers):

3 physical servers: Dell R610

2 physical servers: IBM x3650

* Where Windows crashes (5 servers tested):

Ubuntu 14.04 + QEmu 2.0 + VirtIO 0.1-81 = Windows 2008 crashes every hour

- Installed with "apt-get install ubuntu-virt-server".

* Where Windows does not crash (5 servers tested):

Ubuntu 14.04 + Xen 4.4 + gplpv_Vista2008x64_1.0.1092.9 = Windows working smoothly

- Installed with "apt-get install xen-system-amd64".

So, after removing QEmu from my environment, and using Xen instead, all Windows guests are now running without any crash.

What kind of information can I provide to help you guys debug this QEmu problem in depth?

Plus, it is interesting to note that, many times, all Windows guests (on top of QEmu / KVM) crash at exactly the same time! So it cannot be a problem within each Windows guest, but in the hypervisor itself! Something happens there that affects almost all Windows guests simultaneously.

Also, it is worth mentioning that this problem is probably affecting clouds based on OpenStack IceHouse, on top of Ubuntu + QEmu 2.0...

Screenshots:

http://i.imgur.com/vnJSTgg.png

http://i.imgur.com/34nADWr.png

NOTE: I'm using KSM (Kernel Samepage Merging) with QEmu, to save RAM. It seems that with Xen (+QEmu / HVM), KSM is not used :'( , although it is enabled ( echo 1 > /sys/kernel/mm/ksm/run in Dom0's kernel). I have not tried disabling KSM to see if Windows becomes more stable on QEmu 2.0...

Also, I did not run tests on this environment with Ubuntu 12.04.4 (or 12.04.4 with the Ubuntu Cloud Archive, to get newer versions of QEmu, but not 2.0, for the old LTS).

CURIOSITY: On older hardware, like a Dell R1950, and on my old Intel Core i7 desktop, I'm running Windows 2008 and 7 on Ubuntu 14.04 with QEmu 2.0 without any crash... I'd really like to figure out why QEmu is crashing Windows guests on the Dell R610 and on the IBM x3650...

Attaching the VM's configuration files in the next posts...

Best,
Thiago
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jul 15 08:32 seq
 crw-rw---- 1 root audio 116, 33 Jul 15 08:32 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
InstallationDate: Installed on 2014-06-23 (22 days ago)
InstallationMedia: Ubuntu-Server 14.04 LTS "Trusty Tahr" - Release amd64 (20140416.2)
IwConfig: Error: [Errno 2] No such file or directory
MachineType: Dell Inc. PowerEdge R610
Package: qemu
PciMultimedia:

ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-30-generic root=UUID=6a76ba24-a056-468d-b620-84a49c71d873 ro
ProcVersionSignature: Ubuntu 3.13.0-30.55-generic 3.13.11.2
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-30-generic N/A
 linux-backports-modules-3.13.0-30-generic N/A
 linux-firmware 1.127.4
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty
Uname: Linux 3.13.0-30-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 07/23/2013
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 6.4.0
dmi.board.name: 0DFXXD
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr6.4.0:bd07/23/2013:svnDellInc.:pnPowerEdgeR610:pvr:rvnDellInc.:rn0DFXXD:rvrA00:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R610
dmi.sys.vendor: Dell Inc.

Thiago Martins (martinx) on 2014-07-06
tags: added: 2008 crash qemu windows
tags: added: bsod
Thiago Martins (martinx) on 2014-07-06
description: updated
Thiago Martins (martinx) on 2014-07-06
description: updated
description: updated
Thiago Martins (martinx) on 2014-07-07
description: updated
Thiago Martins (martinx) on 2014-07-07
description: updated
Thiago Martins (martinx) on 2014-07-07
summary: - QEmu makes Windows 2008 guests to crash (BSOD)
+ QEmu 2.0 makes Windows 2008 guests to crash (BSOD)

Guys!

I can confirm that, after disabling KSM, all "guest problems" disappeared!! All Windows 2008 R2 guests are now very stable under QEmu 2.0, but only with KSM disabled.

* Windows 2008 R2 guests running for about 6 hours without any crash *

Workaround - disabling KSM:

---
root@hyper-kvm-1:~# cat /etc/default/qemu-kvm
KSM_ENABLED=0
SLEEP_MILLISECS=200
VHOST_NET_ENABLED=1
KVM_HUGEPAGES=0
---

---
root@hyper-kvm-1:~# cat /sys/kernel/mm/ksm/run
0
---
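
A minimal sketch of inspecting and changing the KSM state at runtime through sysfs, as a complement to the configuration file shown above. The function names `ksm_state` and `ksm_set` and the `KSM_DIR` override are hypothetical helpers introduced only for illustration; the meaning of the values 0, 1 and 2 follows the kernel's KSM documentation. Note that writing 0 stops ksmd but leaves already merged pages merged, while writing 2 also unmerges them.

```shell
#!/bin/sh
# Directory holding the KSM sysfs knobs. The KSM_DIR override is an
# assumption added so the helpers can be tried against a scratch directory.
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Print a human-readable description of the current KSM run state.
ksm_state() {
    case "$(cat "$KSM_DIR/run")" in
        0) echo "stopped (merged pages kept)" ;;
        1) echo "running" ;;
        2) echo "stopped (pages unmerged)" ;;
        *) echo "unknown" ;;
    esac
}

# Write a new run state (0, 1 or 2). Needs root on a real host.
ksm_set() {
    case "$1" in
        0|1|2) echo "$1" > "$KSM_DIR/run" ;;
        *) echo "usage: ksm_set 0|1|2" >&2; return 1 ;;
    esac
}
```

For example, after sourcing the snippet as root, `ksm_set 2` followed by `ksm_state` would stop KSM and undo existing merges, which is probably the state wanted when checking whether guests are stable without KSM.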

But I'm wasting a lot of RAM: since my guests are clones of each other, I really need QEmu with KSM... :-(

Is there any way to re-enable KSM while preserving system stability? Maybe updating to mainline Kernel? New QEmu version!? Backports?!

-
NOTE:

 Do you guys think that this problem might be related to the following bug:

 QEMU Windows guest unstable after random amount of time:
 https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1322441

 ???
-

Also, Windows 7, with KSM enabled, crashes during the installation, every time. After disabling KSM, I was able to install and use Windows 7 without any crash, and with SPICE! So, maybe, with a Windows 7 ISO in hand, this problem might be easy to reproduce, while Windows 2008 R2 crashes randomly within ~2 hours...

Cheers!
Thiago

summary: - QEmu 2.0 makes Windows 2008 guests to crash (BSOD)
+ Ubuntu 14.04 + QEmu 2.0 + KSM = 1, makes Windows 2008 R2 guests to crash
+ (BSOD)
Matthew Anderson (matthewa) wrote :

Hi Thiago,

Your issue probably isn't related to the one I reported [1322441]. My issue was with automatic NUMA balancing. You could certainly try disabling NUMA balancing ( echo 0 > /proc/sys/kernel/numa_balancing ) and enabling KSM again to see whether the fault still occurs; if the crash is resolved, it may point to a conflict between NUMA balancing and KSM.
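
The suggested NUMA balancing test could be sketched roughly as follows. The helper names and the `NUMA_KNOB` override are hypothetical, added only for illustration; the sysctl file itself is the one named above, and it is absent on kernels built without automatic NUMA balancing, so the sketch reports that case separately.

```shell
#!/bin/sh
# Path of the automatic NUMA balancing sysctl. The NUMA_KNOB override is an
# assumption so the helpers can be tried against an ordinary file.
NUMA_KNOB="${NUMA_KNOB:-/proc/sys/kernel/numa_balancing}"

# Report whether automatic NUMA balancing is enabled, disabled, or not
# available on this kernel.
numa_balancing_status() {
    if [ ! -r "$NUMA_KNOB" ]; then
        echo "unavailable"
    elif [ "$(cat "$NUMA_KNOB")" = "0" ]; then
        echo "disabled"
    else
        echo "enabled"
    fi
}

# Turn automatic NUMA balancing off. Needs root on a real host.
numa_balancing_disable() {
    echo 0 > "$NUMA_KNOB"
}
```

A change made this way does not survive a reboot, so it suits a short experiment with KSM re-enabled rather than a permanent setting.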

I have noticed, however, that on one of my hosts virtual machines running 2008 R2 have BSOD'd with PAGE_FAULT_IN_NON_PAGED_AREA. It's only two guests in particular, and the others run just fine (including 2012 R2). I had a theory that it may have something to do with the new VAPIC in the v2 Sandy Bridge processors, so I've changed the CPU option from host to QEMU64 and haven't been able to replicate the issue (yet). I know there is a current issue on the Red Hat bug tracker that causes Windows not to boot when the hv_apic and x2apic features are used together.

Hopefully that helps

Thiago Martins (martinx) on 2014-07-08
description: updated
affects: qemu → qemu (Ubuntu)
Thiago Martins (martinx) wrote :

Quoting myself from comment #1:

"Also, Windows 7, with KSM enabled, crashes during the installation, every time. After disabling KSM, I was able to install and use Windows 7 without any crash, and with SPICE! So, maybe, with a Windows 7 ISO in hand, this problem might be easy to reproduce, while Windows 2008 R2 crashes randomly within ~2 hours..."

My old Windows 7 ISO (from my collection) has some incompatibility with the new QEmu / Linux. After downloading it again from Micro$oft.com (file name: "7600.16385.090713-1255_x64fre_enterprise_en-us_EVAL_Eval_Enterprise-GRMCENXEVAL_EN_DVD.iso"), Windows 7 isn't crashing anymore on QEmu 2.0; it works now.

But KSM is still disabled. I cannot enable KSM anymore because the guests crash...

Best,
Thiago

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1338277

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty

apport information

tags: added: apport-collected
description: updated

Changed in qemu (Ubuntu):
status: New → Confirmed
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Thiago Martins (martinx) wrote :

Just for the record...

The Windows 2008 R2 (and Windows 7) guests are all 64-bit, with 4G of RAM each.

I'm trying it using VirtIO for both Disk and Network, configured during Windows installation (the second Virtual CD Drive contains: http://alt.fedoraproject.org/pub/alt/virtio-win/latest/virtio-win-0.1-81.iso )

Even when idle, W2K8-R2 crashes and, for example, if you try to install it as a Secondary Active Directory of another Windows, it crashes more often...

I'm using SPICE / QXL for guest video.

Anyway, I tried it with almost all virtual hardware combinations, from VirtIO devices to IDE and e1000, VGA, etc. It crashes every time.

Also, VirtIO for Network is very unstable: both Windows 7 and Windows 2008 crash when using it with OpenvSwitch bridges. When using the "default NAT from libvirt", it works better, so I'm using e1000 from now on.

Disabling KSM makes it far more stable but, as I said, VirtIO for Net is still unstable (it gives Windows a BSOD).

If you guys want (Serge), I can give root access to 1 or 2 Dell R610 with Ubuntu 14.04 + Qemu 2.0, so, you guys will be able to catch this problem while it happens... Since this problem is hard to reproduce, I'm available to help debug this, just let me know.

BTW, I'm planning to try it with the new QEmu 2.1 ASAP and/or with the new Linux 3.16, on different servers, to see which one is more stable... Honestly, I don't know where the problem is located; I mean, is this a Linux or a QEmu bug?!

Cheers!

tags: added: latest-bios-6.4.0
removed: 2008 bsod crash qemu windows

Hi,

you mention the virtio drivers several times - can you reproduce this at
all without using the windows virtio drivers?

Thiago Martins (martinx) wrote :

Hi Serge,

 At first, yes, I was trying with VirtIO.

 Later, I disabled VirtIO for Disk, didn't solve.

 Then, I disabled VirtIO for Net, didn't solve.

 Again, I disabled VirtIO for both Disk and Net, didn't solve.

 But, thinking about it again now, I have not tested it with a Windows 2008 R2 guest that never touched the VirtIO drivers; I mean, in all of my tests I was using at least the balloon RAM driver (and probably VirtIO Serial, or something like that).

 Next week, I'll try it again, without any VirtIO, from the beginning, with KSM=1.

Best,
Thiago

Ante Karamatić (ivoks) wrote :

Hm, this is also a NUMA node...

Serge Hallyn (serge-hallyn) wrote :

Hi Thiago,

> Next week, I'll try it again, without any VirtIO, from the
> beginning, with KSM=1.

Thanks, but I don't think that'll be necessary - seems virtio is completely
unrelated.

Serge Hallyn (serge-hallyn) wrote :

Quoting Serge Hallyn (<email address hidden>):
> Hi Thiago,
>
> > Next week, I'll try it again, without any VirtIO, from the
> > beginning, with KSM=1.
>
> Thanks, but I don't think that'll be necessary - seems virtio is completely
> unrelated.

On second thought, I may have spoken too soon. I'm not sure we've done any
tests without virtio. So reproduction without it would be informative.

Chris J Arges (arges) on 2014-07-21
tags: added: ksm-numa-guest-freeze
Chris J Arges (arges) wrote :

I believe I've found the fix for this issue on 3.13.
If you can, please test the kernel posted on comment #1 on this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917
Make sure KSM is enabled; and any workarounds for this bug are disabled.

If this fixes the issue for you, you are welcome to mark this bug as a duplicate of 1346917.

Thanks!

no longer affects: qemu (Ubuntu)
Thiago Martins (martinx) wrote :

Okay, I'll try it today!

Brooks Warner (brookswarner) wrote :

Thiago,
Any updates on the testing? Our tests still show this fix working, and we are looking for outside input. Thanks!

Thiago Martins (martinx) wrote :

Warner,

I installed a new Dell R610 server ~6 hours ago, with Trusty's default kernel, to see Windows crashing again (and it crashed after ~3 hours of uptime - KSM=1).

Now, I just installed (from: http://people.canonical.com/~arges/lp1346917/):

--
linux-headers-3.13.0-33_3.13.0-33.58~lp1346917v201407220903_all.deb
linux-image-3.13.0-33-generic_3.13.0-33.58~lp1346917v201407220903_amd64.deb
linux-headers-3.13.0-33-generic_3.13.0-33.58~lp1346917v201407220903_amd64.deb
linux-image-extra-3.13.0-33-generic_3.13.0-33.58~lp1346917v201407220903_amd64.deb
--

I'll leave it running all night; if tomorrow morning the Windows guests are still alive (uptime higher than ~10 hours), I'll mark it as "duplicate of 1346917 / fixed".

Thank you my friend! :-D

Thiago Martins (martinx) wrote :

Guys,

The following patch http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=64a9a34e22896dad430e21a28ad8cb00a756fefc fixed the problem!

--
Lab env:

1 R610 with 24G RAM
6 Windows 2008 R2 guests, 6G RAM each (min 4G - max 6G)

~17G of RAM being shared by KSM... YAY!! :-)

* All Windows guests have been stable for ~17 hours *
--

Accounting of RAM saved by KSM (ksmstat - shell script): https://gist.github.com/wankdanker/1206923
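
As a rough cross-check of such a figure, the memory saved by KSM can be estimated directly from sysfs. This is a minimal sketch and not the ksmstat script linked above: the function name `ksm_saved_mib` and the `KSM_DIR` and `PAGE_SIZE` overrides are hypothetical, and it assumes the kernel documentation's reading of `pages_sharing` as the number of additional page mappings that share an already merged page, which approximates the pages saved.

```shell
#!/bin/sh
# Directory holding the KSM sysfs counters. The KSM_DIR override is an
# assumption so the helper can be tried against a scratch directory.
KSM_DIR="${KSM_DIR:-/sys/kernel/mm/ksm}"

# Print an estimate, in MiB, of the memory KSM is currently saving:
# pages_sharing multiplied by the page size.
ksm_saved_mib() {
    pages_sharing=$(cat "$KSM_DIR/pages_sharing")
    # PAGE_SIZE may be set by the caller; otherwise ask the system.
    page_size=${PAGE_SIZE:-$(getconf PAGESIZE)}
    echo $(( pages_sharing * page_size / 1024 / 1024 ))
}
```

With 4 KiB pages, a `pages_sharing` value of 4456448 would correspond to 17408 MiB, i.e. 17 GiB.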

No signs of instability.

Nice job!!
