guest hang due to missing clock interrupt

Bug #1307473 reported by Damjan Marion on 2014-04-14
This bug affects 12 people
Affects: linux (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

I noticed on two different systems that, after upgrading from precise to the latest trusty, VMs are crashing:

- On Windows VMs I'm getting a BSOD with the error message: "A clock interrupt was not received on a secondary processor within the allocated time interval."
- On Linux VMs I'm noticing "hrtimer: interrupt took 2992229 ns" messages
- On some proprietary virtual appliances I'm noticing crashes due to missing timer interrupts

QEMU version is:
QEMU emulator version 1.7.91 (Debian 2.0.0~rc1+dfsg-0ubuntu3)

Full command line:

qemu-system-x86_64 -enable-kvm -name win7eval -S -machine pc-i440fx-1.7,accel=kvm,usb=off -cpu host -m 4096 -realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -uuid 05e5089a-4aa1-6bb2-ef06-a6666b4d020a -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/win7eval.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/var/vm/win7eval.qcow2,if=none,id=drive-virtio-disk0,format=qcow2 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -drive file=/home/damarion/iso/7600.16385.090713-1255_x86fre_enterprise_en-us_EVAL_Eval_Enterprise-GRMCENEVAL_EN_DVD.iso,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -drive file=/home/damarion/iso/virtio-win-0.1-74.iso,if=none,id=drive-ide0-1-0,readonly=on,format=raw -device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 -netdev tap,fd=24,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:38:31:0a,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:1 -device VGA,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

Changed in qemu (Ubuntu):
importance: Undecided → High
Damjan Marion (dmarion) wrote :

I left the following simple app running overnight inside a Linux VM (pinned to CPU 1). It prints how many TSC ticks elapsed during each 1-second sleep. On several occasions the sleep took much longer than one second.

code:

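#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Nominal TSC frequency of the test machine, in Hz. */
#define CPUSPEED 2533422000ULL

/* Read the time-stamp counter; cpuid serializes execution before rdtsc. */
static inline uint64_t getticks(void)
{
     unsigned a = 0, b, c = 0, d;
     /* cpuid overwrites eax, ebx, ecx and edx, so declare all four. */
     asm volatile("cpuid" : "+a" (a), "=b" (b), "+c" (c), "=d" (d) : : "memory");
     asm volatile("rdtsc" : "=a" (a), "=d" (d));
     return (((uint64_t)a) | (((uint64_t)d) << 32));
}

int main(void)
{
 uint64_t t0, t1;
 while (1) {
  t0 = getticks();
  sleep(1);
  t1 = getticks();
  /* delta is the excess over one second's worth of ticks. */
  printf("Ticks: %" PRIu64 " delta:%" PRIu64 "\n", t1 - t0, t1 - t0 - CPUSPEED);
 }
 return 0;
}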

Sample1:
Ticks: 2533748354 delta:326354
Ticks: 2533785458 delta:363458
Ticks: 2533889852 delta:467852
Ticks: 13309910165 delta:10776488165
Ticks: 2533823762 delta:401762
Ticks: 2533817164 delta:395164
Ticks: 2533894302 delta:472302

Sample2:
Ticks: 2533896753 delta:474753
Ticks: 2533876689 delta:454689
Ticks: 2533783931 delta:361931
Ticks: 20528401242 delta:17994979242
Ticks: 2533904102 delta:482102
Ticks: 2533740733 delta:318733
Ticks: 2533856266 delta:434266

Sample3:
Ticks: 2533761095 delta:339095
Ticks: 2533652242 delta:230242
Ticks: 2533855141 delta:433141
Ticks: 18943955180 delta:16410533180
Ticks: 2533780954 delta:358954
Ticks: 2533923283 delta:501283
Ticks: 2533909033 delta:487033

Great, thanks for the test case!

Tried this with current git.qemu.org git HEAD on a trusty
kernel, was not able to reproduce. Trying on another host.

Serge Hallyn (serge-hallyn) wrote :

I tried using 2.0.0~rc1+dfsg-0ubuntu3, using a trusty livecd iso, using the command

kvm -hda x.img -cdrom ubuntu-13.10-desktop-amd64.iso -m 1024 -realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -rtc base=localtime

but still have not seen this.

Serge Hallyn (serge-hallyn) wrote :

However, you mention that you have your VM pinned to CPU 1, while the command line is doing '-smp 4'. When I run a VM with -smp 4 locked to a single physical CPU, it definitely does not do well. I'm not sure whether to call that a bug or misuse.

Example:

cgm create cpuset qemu
cgm setvalue cpuset qemu cpuset.cpus 0
cgm movepid cpuset qemu $$
kvm -hda x.img -cdrom ubuntu-13.10-desktop-amd64.iso -m 1024 -realtime mlock=off -smp 4,sockets=1,cores=4,threads=1 -rtc base=localtime

(The resulting VM hangs; without '-smp 4,sockets=1,cores=4,threads=1' it runs fine.)

Serge Hallyn (serge-hallyn) wrote :

Reproduced just as easily with qemu.org git HEAD.

Again, this appears to only be a case when using -smp 4 while locking to 1 cpu with cpuset.

Damjan Marion (dmarion) wrote :

Just to clarify, I was pinning my test code inside the guest with "taskset -c 1". There was no pinning on the host side.

Also, I see the same issue with -smp 2.

Serge Hallyn (serge-hallyn) wrote :

So the only thing you ran under taskset was the program in comment #1?

And if you do not run that under taskset, then it doesn't skip?

Damjan Marion (dmarion) wrote :

Both systems I mentioned above were upgraded from precise to trusty. After reinstalling them with a clean install, the issue disappeared and the VMs are no longer crashing.

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in qemu (Ubuntu):
status: New → Confirmed

It seems to be related to https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1291321; there is a solution for Windows VMs there.

urusha (urusha) wrote :

I have the same symptoms with two trusty-amd64 virtual hosts:
 * win2003 and Linux guests hang for a period of time (~5 seconds, half a minute, or more)
 * win2008 blue screen with the same message

This happens with kernels (host):
Linux vsrv7 3.13.0-27-generic #50-Ubuntu SMP Thu May 15 18:06:16 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Linux vsrv9 3.13.0-29-generic #53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
Qemu version: 2.0.0+dfsg-2ubuntu1.1

Here are the qemu params of guests that definitely hang:
* precise with 3.11:
qemu-system-x86_64 -enable-kvm -name m -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 4096 -realtime mlock=off -smp 2,sockets=2,cores=1,threads=1 -uuid ab7f1e0b-e82e-ddb7-b743-903b8732e333 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/m.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -boot order=c,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x4 -drive file=/dev/vg00/kvm_m_1,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,aio=native -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=29 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:3a:76:ad,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -vnc 127.0.0.1:3 -device VGA,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
* win 2008 r2:
qemu-system-x86_64 -enable-kvm -name ts2 -S -machine pc-1.0,accel=kvm,usb=off -m 10000 -realtime mlock=off -smp 16,sockets=16,cores=1,threads=1 -uuid 4df29f97-7e47-8af3-0009-a5395c28e3c5 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/ts2.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=c,menu=on,strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device virtio-scsi-pci,id=scsi0,bus=pci.0,addr=0x8 -drive file=/dev/vg00/kvm_ts2_1,if=none,id=drive-scsi0-0-0-0,format=raw,cache=none,aio=native -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -drive file=/dev/vg00/kvm_ts2_2,if=none,id=drive-scsi0-0-0-1,format=raw,cache=none,aio=native -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=1,drive=drive-scsi0-0-0-1,id=scsi0-0-0-1 -drive file=/dev/vg00/kvm_ts2_3,if=none,id=drive-scsi0-0-0-2,format=raw,cache=none,aio=native -device scsi-hd,bus=scsi0.0,channel=0,scsi-id=0,lun=2,drive=drive-scsi0-0-0-2,id=scsi0-0-0-2 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:ac:28:3a,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -device VGA,id=video0,bus=pci.0,addr=0x2 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x4
* win 2003:
qemu-system-x86_64 -enable-kvm -name ts4 -S -machine pc-i440fx-trusty,accel=kvm,usb=off -m 8192 -realtime mlock=off -smp 4,sockets=4,cores=1,threads...


urusha (urusha) wrote :
Serge Hallyn (serge-hallyn) wrote :

Thanks, the soft lockup message in that dmesg may be helpful. Marking as affecting the kernel.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1307473

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Jun 30 18:31 seq
 crw-rw---- 1 root audio 116, 33 Jun 30 18:31 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.2
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=UUID=ae5e2d0f-021c-46c2-8bad-0cecbdfaff95
InstallationDate: Installed on 2012-11-14 (593 days ago)
InstallationMedia: Ubuntu-Server 12.10 "Quantal Quetzal" - Release amd64 (20121017.2)
MachineType: Intel Corporation S5500BC
Package: qemu 2.0.0+dfsg-2ubuntu1.1
PackageArchitecture: amd64
PciMultimedia:

ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.13.0-30-generic root=UUID=33d72c51-8774-4af2-9549-29b9c3bd2b62 ro nomdmonddf nomdmonisw nomdmonddf nomdmonisw
ProcVersionSignature: Ubuntu 3.13.0-30.54-generic 3.13.11.2
RelatedPackageVersions:
 linux-restricted-modules-3.13.0-30-generic N/A
 linux-backports-modules-3.13.0-30-generic N/A
 linux-firmware 1.127.4
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty trusty
Uname: Linux 3.13.0-30-generic x86_64
UpgradeStatus: Upgraded to trusty on 2014-06-26 (4 days ago)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 09/09/2011
dmi.bios.vendor: Intel Corp.
dmi.bios.version: S5500.86B.01.00.0060.090920111354
dmi.board.asset.tag: ....................
dmi.board.name: S5500BC
dmi.board.vendor: Intel Corporation
dmi.board.version: E25124-456
dmi.chassis.asset.tag: ....................
dmi.chassis.type: 17
dmi.chassis.vendor: ..............................
dmi.chassis.version: ..................
dmi.modalias: dmi:bvnIntelCorp.:bvrS5500.86B.01.00.0060.090920111354:bd09/09/2011:svnIntelCorporation:pnS5500BC:pvr....................:rvnIntelCorporation:rnS5500BC:rvrE25124-456:cvn..............................:ct17:cvr..................:
dmi.product.name: S5500BC
dmi.product.version: ....................
dmi.sys.vendor: Intel Corporation

tags: added: apport-collected trusty

apport information


Changed in linux (Ubuntu):
status: Incomplete → Confirmed
urusha (urusha) wrote :

After installing kernel 3.15.1-031501-generic from kernel-ppa, both machines have worked without issues since 2014-06-25. It seems to be a kernel issue that has already been solved upstream.

I can confirm that it's more of a kernel issue than a qemu one. I am running kernel 3.11.0-24-generic, which was left over after the upgrade from Saucy, and have had no issues for at least two days. Before that, with the current 3.13.0-30-generic kernel, my Windows guests crashed every 3-4 hours.

Serge Hallyn (serge-hallyn) wrote :

Thanks, that's great to know!

Ondergetekende (kvdveer) wrote :

I'm not confident yet that we're seeing the exact same problem, but it is pretty close. We're running a somewhat wide range of hypervisor kernels; these are our observations so far.

node-1-1 3.13.0-24-generic is affected for 0% of vms
node-1-3 3.13.0-24-generic is affected for 0% of vms
node-1-5 3.13.0-24-generic is affected for 0% of vms
node-1-6 3.13.0-27-generic is affected for 0% of vms
node-1-7 3.13.0-29-generic is affected for 0% of vms
node-2-3 3.13.0-30-generic is affected for 0% of vms
node-2-4 3.13.0-27-generic is affected for 0% of vms
node-2-5 3.13.0-24-generic is affected for 0% of vms
node-1-8 3.13.0-27-generic is affected for 2% of vms
node-1-10 3.13.0-30-generic is affected for 33% of vms
node-1-2 3.13.0-29-generic is affected for 48% of vms
node-1-9 3.13.0-30-generic is affected for 32% of vms
node-2-1 3.13.0-30-generic is affected for 20% of vms
node-2-2 3.13.0-30-generic is affected for 7% of vms
node-1-4 3.13.0-29-generic is affected for 61% of vms

Ondergetekende (kvdveer) wrote :

Note that my list of affected nodes also include migrated VMs, so there are some false positives (VMs that came from an affected node). The affected VMs on node 1-8 all seem to be migrated from another node.

John Johansen (jjohansen) wrote :

Ondergetekende, can you provide further details on why you believe Bug #1326367 is causing this? Would you be willing to test a 3.11.0-24-generic kernel (reported stable) plus the futex fix, or a chosen stable version of the 3.13 or 3.15 kernel with just the futex fix, to verify that the futex fix is the problem?

Ondergetekende (kvdveer) wrote :

We haven't been able to reproduce the issues under lab conditions, and I'm not willing to use our production setup as a guinea pig anymore. These issues have cost me too much credibility already.

We believe #1326367 is causing this, as we've bisected this issue to be between 3.13.0-27.50 and 3.13.0-29.53 (see our results earlier). #1326367 is the only change which felt relevant, but admittedly, this is just a hunch.

Ondergetekende: Physically is there *anything* different between the nodes in your #33 that exhibited no errors and those that exhibited a lot? CPU model/vendor, number of sockets, system vendor etc?
(I'm wondering about a synchronised/unsynchronised tsc type issue).

Mike Lowe (jomlowe) wrote :

I believe I have the same problem. Place a guest under any amount of load, say 'yum upgrade', and the network stack goes out to lunch for 1-5 seconds. Here is a sample of the ping statistics (host to guest) from doing such an operation on a 3.13.0-30.55 kernel:

213 packets transmitted, 213 received, 0% packet loss, time 211998ms
rtt min/avg/max/mdev = 0.136/106.283/2651.359/428.403 ms, pipe 3

And a 3.11.0-19.33 kernel:

62 packets transmitted, 62 received, 0% packet loss, time 61074ms
rtt min/avg/max/mdev = 0.189/0.434/1.987/0.228 ms

Mike Lowe (jomlowe) wrote :

I can confirm that rolling back to 3.13.0-27 from 3.13.0-30 alleviated my symptoms.

Ondergetekende (kvdveer) wrote :

We've resolved our issues by disabling KSM on the affected nodes. All of the non-affected nodes didn't have KSM enabled (due to a packaging bug elsewhere). After disabling KSM, our problems went away gradually in ~3 days.

This means we're no longer affected by this issue (and given the other reports, probably never were).

Quoting Ondergetekende (<email address hidden>):
> We've resolved our issues by disabling KSM on the affected nodes. All of
> the non-affected nodes didn't have KSM enabled (due to a packaging bug
> elsewhere). After disabling KSM, our problems went away gradually in ~3
> days.
>
> This means we're no longer affected by this issue (and given the other
> reports, probably never were).

And which specific kernel are you on?

Jeff Wilson (wilson-3) wrote :

I have a similar or the same problem with my Windows Server 2008 R2 virtual machines. The virtual machine stops with Blue Screen error 101: a clock interrupt was not received on a secondary processor. The error only occurs when the VM has 2 CPUs. It seems to occur when the VM is under some load over time (hours), or when I RDP to the VM after it has been running for a few hours.

The same VM ran perfectly under Ubuntu 12.04.

Host Server
ubuntu 14.04 LTS updated from 12.04 LTS
kernel: 3.13.0-30

Virtual Machine
Windows Server 2008 R2
2 cpus (when the error occurs)

Attached is the VM xml configuration file.

I did try adding the hyperv code and it seemed to help at first, but then errored in hours.

I did boot to kernel 3.13.0-24 and the same error occurred within an hour under some load.

Do people expect this problem to be resolved soon?

Thank you for the help.

Mike Lowe (jomlowe) wrote :

I need to amend comment #39, moving from 3.13.0-30 to 3.13.0-27 did not eliminate the problem. It would seem that it takes a couple of hours following a reboot for the symptoms to manifest with 3.13.0-27.

Jan Müller (jm-3) wrote :

dup of #1332409?

seems to be a 3.13 only bug.

Paolo Bonzini (bonzini) on 2014-07-17
no longer affects: qemu
Jeff Wilson (wilson-3) wrote :

I have resolved my problem by running kernel 3.14.1-031401 from http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14.1-trusty/, Ubuntu 14.04 LTS. The host has been running solid for a good 24 hours with 1 Windows Server 2008 R2, 2 cpu, VM running and two additional VMs running for three hours.

The pertinent xml entries that were changed or not included in the original xml configuration file are

  <os>
    <type arch='x86_64' machine='pc-i440fx-trusty'>hvm</type> #changed from <type arch='x86_64' machine='pc-i440fx-1.7'>hvm</type>
  </os>

  <features>
    <hap/> # added entry
  </features>

  <clock offset='localtime'>
    <timer name='hypervclock'/> # added entry
  </clock>

I'm not sure what will happen when a kernel 3.14 is included in the main distribution. Will the future kernel 3.14 from the distribution replace the 3.14 kernel that was installed via dpkg?

Thank you for everyone's messages.

Chris J Arges (arges) on 2014-07-21
tags: added: ksm-numa-guest-freeze
Chris J Arges (arges) wrote :

I believe I've found the fix for this issue on 3.13.
If you can, please test the kernel posted on comment #1 on this bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1346917
Make sure KSM is enabled; and any workarounds for this bug are disabled.

If this fixes the issue for you, you are welcome to mark this bug as a duplicate of 1346917.

Thanks!

no longer affects: qemu (Ubuntu)

I can confirm that, like Jeff Wilson, I had the same issue with 3.13 and it was resolved with 3.14.1.

I cannot right now test the kernel suggested in #46.

Fred Thoma (drulenberg) wrote :

Same here, had issues with 3.13.0-44-generic, upgraded to 3.16.0-23-generic and the problem was solved. Followed this tutorial http://askubuntu.com/questions/541775/how-can-i-install-ubuntu-14-10s-kernel-in-ubuntu-14-04-lts

Fred Thoma (drulenberg) wrote :

Same bluescreen STOP: 0x0000005c again on day 9. So it has not been fixed by a kernel upgrade to 3.16.0-23-generic with above method from askubuntu.com.

Paolo Bonzini (bonzini) wrote :

Fred, this bug is for STOP 0x101, not STOP 0x5c.

STOP 0x101 cannot be fixed by an upgrade. You have to disable the watchdog using QEMU option hv_relaxed or the equivalent in libvirt.
