Multiple CPUs causes blue screen on Windows guest (14.04 regression)

Bug #1308341 reported by Hein Gustavsen on 2014-04-16
This bug affects 18 people
Affects Status Importance Assigned to Milestone
QEMU
Undecided
Unassigned
linux (Ubuntu)
Undecided
Unassigned
qemu (Ubuntu)
High
Unassigned

Bug Description

Configuring a Windows 7 guest with more than one CPU causes the guest to fail. This happens a few hours after the guest boots. This is the error on the blue screen:
"A clock interrupt was not received on a secondary processor within the allocated time interval"

After resetting, the guest will never boot and a new bluescreen with the error "STOP: 0x0000005c" appears. Shutting down the guest completely and restarting it will allow it to boot and run for a few hours again.

The guest was created using virt-manager. The error happens with or without virtio devices and with both 32-bit and 64-bit Windows 7 guests.

I am using Ubuntu 14.04 release candidate.

qemu-kvm version 2.0.0~rc1+dfsg-0ubuntu3

Hein Gustavsen (hein-gustavsen) wrote :

The command line used to start the guest (from log file):
LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/bin:/usr/sbin:/sbin:/bin QEMU_AUDIO_DRV=none /usr/bin/kvm-spice -name win7-test -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -uuid bc6a3c93-2221-4b61-ed29-07edda0a2043 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/win7-test.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/mnt/sw-test-nas/win7-test.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d6:60:55,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:2 -device VGA,id=video0,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

description: updated
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu-kvm (Ubuntu):
status: New → Confirmed

My Windows 7 32-bit and Windows 2008 R2 guests both experience this problem; the info is from the Windows 7 BSOD.
The host system for this VM is a Dell R510 running qemu-kvm_2.0.0~rc1+dfsg-0ubuntu3_amd64.deb.

VM command line:

qemu-system-x86_64 -enable-kvm -name win7_kc -S -machine pc-1.0,accel=kvm,usb=off -cpu kvm64,+rdtscp,+pdpe1gb,+dca,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -m 2048 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 6358fe75-bef9-3b4a-da4e-d0842e880d4f -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/win7_kc.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 -drive file=/home/VM/win7_kc.img,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/mnt/a/virtio-win-0.1-74.iso,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=34 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:09:18:1c,bus=pci.0,addr=0x3 -chardev spicevmc,id=charchannel0,name=vdagent -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.spice.0 -device usb-tablet,id=input0 -spice port=5901,addr=10.50.0.11,disable-ticketing,plaintext-channel=main,image-compression=auto_glz,seamless-migration=on -k pl -device qxl-vga,id=video0,ram_size=67108864,vram_size=67108864,bus=pci.0,addr=0x2 -device intel-hda,id=sound0,bus=pci.0,addr=0x6 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -chardev spicevmc,id=charredir0,name=usbredir -device usb-redir,chardev=charredir0,id=redir0 -chardev spicevmc,id=charredir1,name=usbredir -device usb-redir,chardev=charredir1,id=redir1 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
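The two command lines posted so far both give the guest 4 vCPUs but lay them out differently (one socket with four cores versus four sockets with one core each). A minimal Python sketch for pulling the `-smp` settings out of such a command line is shown below. The helper name `parse_smp` is hypothetical, and the two sample strings are shortened versions of the command lines above, keeping only a few options.

```python
import shlex

def parse_smp(cmdline):
    """Return the -smp settings of a QEMU command line as a dict."""
    args = shlex.split(cmdline)
    value = args[args.index("-smp") + 1]  # e.g. "4,sockets=4,cores=1,threads=1"
    topology = {}
    for part in value.split(","):
        key, sep, val = part.partition("=")
        if sep:
            topology[key] = int(val)
        else:
            # A bare leading number is the total vCPU count.
            topology["cpus"] = int(key)
    return topology

# Shortened copies of the two command lines from this thread.
first = parse_smp("kvm-spice -name win7-test -m 4096 -smp 4,sockets=1,cores=4,threads=1")
second = parse_smp("qemu-system-x86_64 -name win7_kc -m 2048 -smp 4,sockets=4,cores=1,threads=1")
print(first)
print(second)
```

Since both layouts show the blue screen, the socket/core split by itself does not appear to be the deciding factor.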

libvirt xml

Hein Gustavsen (hein-gustavsen) wrote :

BTW, my installation was an upgrade from Ubuntu 10.04 to 12.04. The motherboard is a dual-socket Xeon board from ASUS with two E5-2630 v2 CPUs.

Hein Gustavsen (hein-gustavsen) wrote :

Sorry, I meant from 12.04 to 14.04. 12.04 was a fresh installation. Hyper-threading is enabled.

My installation was upgraded from 12.04 to 14.04 as well. My machine has 2 CPUs, so I set the Windows 7 VM to be the only guest using CPU2 (1,3,5,7,9,11,13,15), but the error still persists.
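The comment does not say how the pinning was configured. One possible way in libvirt is a `<cputune>` element with one `<vcpupin>` entry per vCPU; the Python sketch below generates such a fragment for the host CPU list quoted above. The helper name `build_cputune` and the choice of 4 vCPUs are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

def build_cputune(vcpus, host_cpus):
    """Build a libvirt <cputune> fragment pinning every vCPU to host_cpus."""
    cputune = ET.Element("cputune")
    cpuset = ",".join(str(c) for c in host_cpus)
    for vcpu in range(vcpus):
        # Each vCPU may run on any of the listed host CPUs.
        ET.SubElement(cputune, "vcpupin", vcpu=str(vcpu), cpuset=cpuset)
    return ET.tostring(cputune, encoding="unicode")

# Host CPUs of the second socket as listed in the comment; 4 vCPUs is assumed.
fragment = build_cputune(4, [1, 3, 5, 7, 9, 11, 13, 15])
print(fragment)
```

The resulting fragment would go inside the `<domain>` element of the guest definition. As reported above, pinning did not make the blue screen go away.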

It looks like adding "hyperv" to the "features" section of the guest definition helps: my Win7 VM has now been running for ~12 hours, whereas without "hyperv" it lasted about 3-4 hours. I will test it for a few days and post here again.

  <features>
    <acpi/>
    <apic/>
    <pae/>
    <hyperv>
      <relaxed state='on'/>
    </hyperv>
  </features>
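For anyone who wants to apply the same change to several guest definitions, here is a minimal Python sketch that adds the `relaxed` setting to a domain XML string. The function name `enable_hyperv_relaxed` and the sample domain are hypothetical; in practice the XML would come from the guest's own definition. Note that a later comment in this thread reports guest freezes with this feature enabled, so treat it as an experiment rather than a confirmed fix.

```python
import xml.etree.ElementTree as ET

# Hypothetical, stripped-down domain definition used only for illustration.
DOMAIN_XML = """
<domain type='kvm'>
  <name>win7-test</name>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
</domain>
"""

def enable_hyperv_relaxed(domain_xml):
    """Return domain_xml with <hyperv><relaxed state='on'/></hyperv> in <features>."""
    root = ET.fromstring(domain_xml.strip())
    features = root.find("features")
    if features is None:
        features = ET.SubElement(root, "features")
    hyperv = features.find("hyperv")
    if hyperv is None:
        hyperv = ET.SubElement(features, "hyperv")
    relaxed = hyperv.find("relaxed")
    if relaxed is None:
        relaxed = ET.SubElement(hyperv, "relaxed")
    relaxed.set("state", "on")  # safe to call repeatedly
    return ET.tostring(root, encoding="unicode")

patched = enable_hyperv_relaxed(DOMAIN_XML)
print(patched)
```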

Adding "hyperv" seemed to work for me too.

Serge Hallyn (serge-hallyn) wrote :

Thanks, it sounds like at least we should have that enabled by default when, in virt-manager, a windows guest is selected.

Changed in qemu (Ubuntu):
importance: Undecided → High
status: New → Confirmed
no longer affects: qemu-kvm (Ubuntu)
summary: - Multiple CPUs causes blue screen on Windows guest
+ Multiple CPUs causes blue screen on Windows guest (14.04 regression)
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in virt-manager (Ubuntu):
status: New → Confirmed

After adding the "hyperv" feature, the guest freezes regularly. This happens on both Windows 7 64-bit and Windows 2012 R2 guests. When the "hyperv" feature is removed, the guest acts normally but fails with a blue screen as before. This may be a completely different issue, but it renders the workaround unusable, for me at least.

Gordon Kaltofen (kaltofen) wrote :

Hello to all, this is my first post here.

I have exactly the same problem; it appeared after a distribution upgrade of Ubuntu Server x64 from 12.04.4 to 14.04.

1. I have Windows 7 32/64-bit and Windows 2008 Server 64-bit VMs, and all show the same error with two dedicated cores (no pinning). In combination with the other statements, I would say it is a general Windows guest problem, not specific to one version.

2. I have an AMD Opteron 6272 (fam: 15, model: 01, stepping: 02, 16 cores) system. Therefore, this problem does not seem to be Intel/AMD architecture-specific.

3. I configured a couple of VMs with ONE core and let them run over the weekend. They didn't crash, but they responded only very slowly and choppily. It seems that there is a fundamental error that is also responsible for the multi-core errors. After restarting the VM, the error is initially gone, even though the VM is still slow due to having only one core.

4. I have the latest virtio drivers installed in the Windows guest systems and use the Red Hat VirtIO SCSI and Ethernet drivers (vers. 61.65.104.7400). Are these drivers installed in your VMs, or do you use the standard IDE/SATA and RTL/Intel NIC drivers?

5. The VM images (qcow2) are located on an mdadm RAID1 volume of two SSDs. Since Linux kernel 3.7, ATA TRIM is possible with Linux software RAID, so I use the mount option 'discard'. I do not want to completely exclude the possibility that the error is related to this.

Is there any indication yet of the cause of the failure, and possibly even a workaround?

I have done a clean install of the server and yes, the Windows VM freezes with hyperv both before and after the reinstall. I have reverted my servers to 12.04.4 until this is solved.
Krzysiek

Serge Hallyn (serge-hallyn) wrote :

Tried to reproduce this overnight with a windows 8 instance run by hand with 4 cores, but no hang. I'll keep trying with some more options added from your command line.

Serge Hallyn (serge-hallyn) wrote :

-smp 4 -realtime mlock=off -rtc base=localtime does not seem to help me reproduce this.

Does the system have to be under stress?

Can you reproduce this without virtio?

Steve (lp-z) wrote :

I was able to work around this by downgrading the kernel on a Ubuntu 14 box to 3.12.20-031220-generic #201405160935 (and of course wasn't seeing this with Ubuntu 12).

I've periodically tried booting back to the standard Ubuntu 14 3.13 kernel to see if it's been fixed (and also tried 3.13-lowlatency) but I get a W2k8R2 server hang with KVM within the first ~24 hours of boot each time.

This is a dual-processor machine. Also, with 3.13, I was getting these messages on a semi-periodic basis (may be related):

May 30 20:23:53 kernel: [ 0.000000] Linux version 3.13.0-27-lowlatency (buildd@akateko) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #50-Ubuntu SMP PREEMPT Thu May 15 18:36:04 UTC 2014 (Ubuntu 3.13.0-27.50-lowlatency 3.13.11

May 31 14:15:40 kernel: [64348.760175] INFO: task qemu-system-x86:4151 blocked for more than 120 seconds.
May 31 14:15:40 kernel: [64348.767491] Not tainted 3.13.0-27-lowlatency #50-Ubuntu
May 31 14:15:40 kernel: [64348.773291] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 31 14:15:40 kernel: [64348.781205] qemu-system-x86 D ffff881fffc34600 0 4151 1 0x00000000
May 31 14:15:40 kernel: [64348.781210] ffff881fcf5e3de8 0000000000000002 ffff881fbf140000 ffff881fcf5e3fd8
May 31 14:15:40 kernel: [64348.781215] 0000000000014600 0000000000014600 ffff881fbf140000 ffff881fbf140000
May 31 14:15:40 kernel: [64348.781218] ffff883fcfac7060 ffff883fcfac7068 00007f3809e00000 ffff881fbf140000
May 31 14:15:40 kernel: [64348.781221] Call Trace:
May 31 14:15:40 kernel: [64348.781230] [<ffffffff81722b89>] schedule+0x29/0x70
May 31 14:15:40 kernel: [64348.781237] [<ffffffff8172552d>] rwsem_down_read_failed+0xcd/0x130
May 31 14:15:40 kernel: [64348.781243] [<ffffffff81374b04>] call_rwsem_down_read_failed+0x14/0x30
May 31 14:15:40 kernel: [64348.781247] [<ffffffff81725007>] ? down_read+0x17/0x20
May 31 14:15:40 kernel: [64348.781252] [<ffffffff810a0db2>] task_numa_work+0xd2/0x300
May 31 14:15:40 kernel: [64348.781254] [<ffffffff8109f87b>] ? account_user_time+0x8b/0xa0
May 31 14:15:40 kernel: [64348.781259] [<ffffffff81089e87>] task_work_run+0xa7/0xe0
May 31 14:15:40 kernel: [64348.781264] [<ffffffff81014e57>] do_notify_resume+0x97/0xb0
May 31 14:15:40 kernel: [64348.781268] [<ffffffff8172e52a>] int_signal+0x12/0x17

I'm not seeing any kernel errors with the 3.12 kernel.
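To check whether a host shows the same hung-task warnings, a small Python sketch like the following could scan kernel log lines for them. The regular expression is modelled on the message quoted above; the function name `find_hung_tasks` and the two-line sample are illustrative assumptions, and on a real host the lines would be read from the system's kernel log.

```python
import re

# Pattern modelled on the "blocked for more than N seconds" line quoted above.
HUNG_TASK = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) blocked for more than (?P<secs>\d+) seconds"
)

def find_hung_tasks(lines):
    """Return (command, pid, seconds) for every hung-task warning in lines."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append(
                (match.group("comm"), int(match.group("pid")), int(match.group("secs")))
            )
    return hits

# Two lines copied from the log excerpt in this comment.
sample = [
    "May 31 14:15:40 kernel: [64348.760175] INFO: task qemu-system-x86:4151 blocked for more than 120 seconds.",
    "May 31 14:15:40 kernel: [64348.767491] Not tainted 3.13.0-27-lowlatency #50-Ubuntu",
]
hung = find_hung_tasks(sample)
print(hung)
```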

Serge Hallyn (serge-hallyn) wrote :

Thanks. Given that info, it seems clear that this is a kernel bug and not a qemu bug.

no longer affects: virt-manager (Ubuntu)
Serge Hallyn (serge-hallyn) wrote :

(Removed the task against virt-manager since hyperv is apparently *not* a safe workaround in all cases)

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1308341

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: trusty
Steve (lp-z) wrote :

marking as confirmed, see bug 1332409 with the apport-collect information.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Re-installing 14.04 fixed my problem. Running with the same virtual machine configurations on the same hardware without any problems. No hyperv feature needed.

I agree. This seems to me like a duplicate of bug 1307473.

Fred Thoma (drulenberg) wrote :

Just wanted to add that upgrading my kernel to a newer version fixed the problem for me, too.

Host: 2x E5-2620V2, Ubuntu 14.04 LTS
Guest: 24 virtual cores, Windows Server 2008 R2

Before fix:
sudo uname -a
Linux x.contabo.net 3.13.0-44-generic #73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Bluescreen stop 0x0000005c every few hours

After fix:
sudo uname -a
Linux x.contabo.net 3.16.0-23-generic #31-Ubuntu SMP Tue Oct 21 17:56:17 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
No bluescreens or other crashes for 7 days under full load

Upgraded with this tutorial http://askubuntu.com/questions/541775/how-can-i-install-ubuntu-14-10s-kernel-in-ubuntu-14-04-lts

Fred Thoma (drulenberg) wrote :

Same bluescreen again on day 9 after the kernel upgrade.

So upgrading the kernel from 3.13 to 3.16 did not help.

Still looking for a fix.

Peter Mráz (etki) wrote :

I have the same problem. After a crash, restarting the virtual PC does not help: on the next boot the BSOD with the c5 code persists. I must force the machine off and power it on again.

Procion (klebed) wrote :

Same issue here. 2 VMs with 2008 SP2 x86 and 2008 R2 SP1 x64 hang simultaneously with BSOD stop 0x0000005c (0x0000010b 0x00000003 0x00000000).
The issue arose after upgrading the kernel from 3.12 to 3.13.
Nothing has helped to work around this issue so far.

Cristian Aires (caires-droid) wrote :

Same problem.
I am using kernel 3.16.0-55-generic on Ubuntu 14.04.

Serge Hallyn (serge-hallyn) wrote :

Hi,

could you please file a new bug with debugging information as per https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917/comments/11 ?
