KVM SMP Linux Guests Hang on AMD

Bug #714335 reported by Brian Knoll on 2011-02-07
This bug affects 6 people
Affects         Status        Importance  Assigned to   Milestone
linux (Ubuntu)  Fix Released  Medium      Unassigned
Lucid           Fix Released  Medium      Stefan Bader

Bug Description

Binary package hint: qemu-kvm
=========================================
SRU Justification
1. impact: KVM smp guests hang on AMD
2. how was the bug addressed: Two upstream commits (plus a third auxiliary) were re-implemented on lucid's kernel.
3. patch: see comment #27
4. TEST CASE: in lucid, on an AMD box, start a guest with '-smp 2'.
5. regression: Honestly this one scares me a bit because it was a tricky cherry-pick, and it affects timekeeping for all KVM users on x86 lucid. In theory timekeeping on guests could be adversely affected, or guests could fail to boot. However, guests have been tested on both AMD and Intel hosts.
=========================================

SMP Linux guests are hanging under KVM. This does not happen on every boot, but it happens at least 50% of the time.

If I start the guests with "-smp 1" or just completely omit the "-smp" parameter this doesn't happen.

I can also say that it doesn't seem to happen on a Nehalem-based 8-core Xeon machine I've tried it on, but it does happen on both an AMD Phenom II 965 machine (the one this bug report is from) and an AMD Phenom II 1090T-based machine.

I am starting the VMs from the command line, manually running qemu-kvm with a command like the following:

qemu-system-x86_64 -m 1024 -smp 4 -cpu host -drive file=/var/local/kvm/machine.qcow2,if=virtio,cache=off,boot=on -net nic,model=virtio,macaddr=77:88:99:12:34:56 -net tap -nographic -daemonize
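
For anyone trying to reproduce this, a minimal sketch of how the command above could be generated for different "-smp" values, so that the hanging and non-hanging cases differ in that one option only. The helper name build_qemu_argv is hypothetical, and the sketch leaves out the host-specific macaddr and the -daemonize flag; it only builds the argument list and does not start a guest.

```python
import shlex


def build_qemu_argv(image, smp=None, mem_mb=1024):
    """Build a qemu-system-x86_64 argument list modelled on the report.

    smp=None omits the -smp option entirely, which the report says
    behaves like "-smp 1" and does not hang.
    """
    argv = ["qemu-system-x86_64", "-m", str(mem_mb)]
    if smp is not None:
        argv += ["-smp", str(smp)]
    argv += [
        "-cpu", "host",
        "-drive", f"file={image},if=virtio,cache=off,boot=on",
        "-net", "nic,model=virtio",
        "-net", "tap",
        "-nographic",
    ]
    return argv


if __name__ == "__main__":
    # Print one command line per variant: no -smp, -smp 1, -smp 4.
    for smp in (None, 1, 4):
        print(shlex.join(build_qemu_argv("/var/local/kvm/machine.qcow2", smp)))
```

Printing the variants side by side makes it easy to confirm that only the CPU count changes between a run that boots and a run that hangs.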

Additional Information:

Description: Ubuntu 10.04.2 LTS
Release: 10.04

qemu-kvm:
  Installed: 0.12.3+noroms-0ubuntu9.3
  Candidate: 0.12.3+noroms-0ubuntu9.3
  Version table:
 *** 0.12.3+noroms-0ubuntu9.3 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid-updates/main Packages
        100 /var/lib/dpkg/status
     0.12.3+noroms-0ubuntu9 0
        500 http://us.archive.ubuntu.com/ubuntu/ lucid/main Packages

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: qemu-kvm 0.12.3+noroms-0ubuntu9.3
ProcVersionSignature: Ubuntu 2.6.32-28.55-server 2.6.32.27+drm33.12
Uname: Linux 2.6.32-28-server x86_64
Architecture: amd64
Date: Sun Feb 6 19:11:57 2011
InstallationMedia: Ubuntu 10.04.1 LTS "Lucid Lynx" - Release amd64 (20100816.1)
KvmCmdLine: Error: command ['ps', '-C', 'kvm', '-F'] failed with exit code 1: UID PID PPID C SZ RSS PSR STIME TTY TIME CMD
MachineType: System manufacturer System Product Name
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.32-28-server root=UUID=941072c0-f822-44b4-b61c-09b6daadcb7c ro quiet splash clocksource=pit
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: qemu-kvm
dmi.bios.date: 08/18/2010
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1006
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: M4A785-M
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1006:bd08/18/2010:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKComputerINC.:rnM4A785-M:rvrRevX.0x:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: System Product Name
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Brian Knoll (brianknoll) on 2011-02-07
description: updated
Brian Knoll (brianknoll) wrote :

I also want to mention that the guests are fully-updated 64-bit Ubuntu Lucid Server VMs.

Brian Knoll (brianknoll) on 2011-02-07
tags: added: kvm qemu-kvm virtualization
Serge Hallyn (serge-hallyn) wrote :

Thanks for taking the time to report this bug and helping to make Ubuntu better, Brian. Unfortunately AMD support definitely seems to have some holes compared to Intel. Though until now all the breakages I've seen have been with 32-bit support, so this is interesting.

Could you try with the back-ported kvm from https://launchpad.net/~ubuntu-virt/+archive/ppa/+packages and see if you fare any better?

Changed in qemu-kvm (Ubuntu):
status: New → Incomplete
Brian Knoll (brianknoll) wrote :

Thank you for your response. I did try the backported Maverick qemu-kvm packages from the PPA you mentioned, and they did not help the problem at all. Actually, they seem to have made it much worse. Now, instead of SMP VMs failing about 50% of the time on my AMD-based machines, it fails 100% of the time. If I start the VMs with a console, the last messages shown on the console are:

BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
EDD information not available.
Freeing initrd memory: 11176k freed

After that, the guest freezes and doesn't ever come back.

If I boot the VM guest without SMP support in KVM, then it boots and runs just fine, just like the version from Lucid.

Thanks, Brian. I'm going to play a bit more with the latest current
upstream to see what I can reproduce there. If that fixes it, then
I'll provide a backport to lucid for those running into problems with
AMD. In the more likely event that there are still problems, then we
can at least work with upstream to figure out what's going on.

Thank you, Serge. Please let me know how I can help. I am willing to test packages and help however I can. Thanks again for looking into this!

Hi, I have exactly the same problem on my quad Opteron 6174 hosts (no problem on Intel).

My hosts are running the Proxmox 1.7 distribution (Debian Lenny with a 2.6.32 kernel backported from Squeeze).

My Ubuntu guests can boot with the 2.6.32-24 kernel in SMP mode, but since 2.6.32-25 they hang at GRUB.

The 2.6.35 Maverick kernels don't work either.

I tried upgrading my host kernel to 2.6.35; same problem with my guests.

I also tried Debian Squeeze guests, and I have the same problem with the Debian 2.6.32 kernel.

Tell me if I can help.

Brian Knoll (brianknoll) wrote :

Someone else has confirmed the same bug so I am changing the status to "Confirmed".

Changed in qemu-kvm (Ubuntu):
status: Incomplete → Confirmed
Brian Knoll (brianknoll) wrote :

I also want to add that I am not 100% clear where the bug is at this point. My best guess, based on the information we have available in this ticket, is that there is an upstream bug in the Linux kernel itself, with regard to the way it handles AMD64 processors when running under KVM. If that's the case, perhaps we should add the Linux kernel to the affected packages list for the ticket. I think an Ubuntu developer should comment on this.

Serge Hallyn (serge-hallyn) wrote :

@Alexandre:

I'm not quite following your comment. Is all of your experience on a Debian host? Are you able to boot a Debian guest with 2.6.33 through 2.6.35 kernel?

Brian Knoll (brianknoll) wrote :

Okay, I have done some more research and testing and I now think the following upstream KVM bug is the problem:

http://sourceforge.net/tracker/?func=detail&aid=2968899&group_id=180599&atid=893831

Upstream KVM bug 2968899 describes a bug which causes the SMP guest to lock up when setting the time. I disabled NTP on my guests to test this, and my guests boot fine without NTP. Obviously, that's not an acceptable "workaround" because having managed time is a requirement for any well-managed machine, especially a server.

But it does look to me like the bug is an upstream bug. However, it doesn't look like it's getting any attention from anyone, unfortunately.

Brian Knoll (brianknoll) wrote :

I also want to mention that in my last comment, Launchpad turned the upstream bug number into a hyperlink and referenced an Ubuntu bug of the same number; that link is not correct, but I can't edit it because I don't have access. Please follow the first link in post #11 (the one on SourceForge) for information about the upstream bug I am describing.

Serge Hallyn (serge-hallyn) wrote :

@Brian

Awesome, thanks for finding the upstream bug. I'm afraid I won't be able to get to it today, but I'll look in detail tomorrow and try to devise a fix.

Thanks again.

Same bug here:
https://bugs.launchpad.net/ubuntu/+source/qemu-kvm/+bug/669818

I have marked it as a duplicate.

I just tried linux-image-2.6.38-3-server from Natty and it boots fine.

Serge Hallyn (serge-hallyn) wrote :

I wonder whether one of:

1d5f066e0b63271b67eac6d3752f8aa96adcbddb: KVM: x86: Fix a possible backwards warp of kvmclock
28e4639adf0c9f26f6bb56149b7ab547bf33bb95: KVM: x86: Fix kvmclock bug

could fix it.
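
Since both candidate commits concern kvmclock, it may help to record which clocksource an affected guest is actually using. The sketch below reads the standard Linux sysfs clocksource files; the function name read_clocksources and the root parameter (there only so the helper can be pointed at a copied directory tree) are hypothetical. Nothing in this report confirms which clocksource the hanging guests selected, so this is only a diagnostic aid.

```python
from pathlib import Path

# Standard sysfs location of the clocksource controls on Linux.
CLOCKSOURCE_DIR = "sys/devices/system/clocksource/clocksource0"


def read_clocksources(root="/"):
    """Return (current, available) clocksource names found under root."""
    base = Path(root) / CLOCKSOURCE_DIR
    current = (base / "current_clocksource").read_text().strip()
    available = (base / "available_clocksource").read_text().split()
    return current, available


if __name__ == "__main__":
    current, available = read_clocksources()
    print("current:", current)
    print("available:", " ".join(available))
```

Run inside the guest, a current value of kvm-clock would indicate that the paravirtual clock touched by those commits is in use.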

Serge Hallyn (serge-hallyn) wrote :

Is it possible in your environment to test with the kernel package from https://launchpad.net/~kernel-ppa/+archive/ppa?

The package is called linux-lts-backport-natty, and should include all kvm fixes which are currently upstream. If that does not fix it, then the two commits cited above in comment #16 can't fix it.

What is the fastest and easiest way for me to add these Natty backports packages to test them on my affected machines?  I went to the PPA you mentioned but I see a very large list of packages.  I would prefer to add some repo to my apt sources list and get them that way, but I'm not 100% certain exactly what to add at this point.

I think you should be able to just:

sudo add-apt-repository ppa:kernel-ppa/ppa
sudo apt-get install linux-lts-backport-natty

I'm waiting for a lucid vm to build so I can test that to make sure.

Brian Knoll (brianknoll) wrote :

It ended up being "apt-get install linux-image-server-backport-natty" after doing an update. But you told me what I needed to get it working. I'll try it out and let you know how it works. Thanks for all the help!

Hey there... same issue here, on a 6-core Phenom X6 1055T.

I already have 3 VMs running perfectly that I built with vm-builder, but a fourth just isn't going to work.

vmbuilder kvm ubuntu --suite maverick --flavour virtual --arch amd64 -o --libvirt qemu:///system --ip 203.4.172.214 --hostname xxxx-dev --mask 255.255.255.240 --net 203.4.172.208 --bcast 203.4.172.223 --gw 203.4.172.209 --dns 203.4.172.209 --bridge br0 --user richard --pass zzyyaa --addpkg openssh-server --addpkg unattended-upgrades --addpkg linux-image-generic --addpkg nfs-common --addpkg vim -d /opt/kvm/xxxx-dev --mirror http://apt-mirror.zzzzz.com.au/pub/ubuntu/ --mem=2048 --rootsize=8192 --cpus=1

Haven't posted on launchpad before, so if I'm missing any crucial details I should be including, please let me know.

altivec (altivec) wrote :

My configuration might be a little off-topic, but I'm experiencing a similar problem so I thought I'd leave a comment here anyway.

I'm using qemu 0.13 on a Debian Squeeze host (kernel 2.6.32-5-amd64) running on a 64-bit Intel Xeon. The guest is a 64-bit Win 2008 R2. When I run the VM using qemu-system-x86_64 with -enable-kvm and no -smp options, the guest boots and installs fine. When I enable more than one core through -smp, the guest boots but completely freezes very soon. It seems to hang always, but not exactly at the same point. After the guest hangs, the qemu process keeps using 100% CPU. The same behavior happens on a different 64-bit Intel Xeon host running Debian Lenny with a custom 2.6.36 kernel. Also, the same behavior happens with qemu 0.14-rc1. I haven't tested yet if a Linux guest kernel hangs, but will do that.

Serge Hallyn (serge-hallyn) wrote :

Hi Brian,

have you had a chance to test with the new kernel?

Changed in qemu-kvm (Ubuntu):
importance: Undecided → Medium
Brian Knoll (brianknoll) wrote :

Hi Serge,

Yes, that PPA kernel does indeed fix it. I used the PPA kernel in the guest and it made everything work perfectly. So I'm thinking one of those fixes you mentioned solves the problem.

Note that I am still using the standard Lucid kernel in the host, but I don't think that's relevant since it seems like the issue is in the guest-mode kernel, not the host. In any case, I think it would be very helpful to have those fixes backported to the Lucid kernel, so those running the LTS in their guests can have stable SMP configurations.

Thanks for all of your help!

Serge Hallyn (serge-hallyn) wrote :

Unfortunately those commits do not cleanly cherrypick. I'll try to rewrite them from scratch, time permitting.

agent 8131 (agent-8131) wrote :

I believe this bug has been affecting me for some time. Just to be clear on my situation:

* Host system running Ubuntu 10.04 64-bit
* Guest systems running Ubuntu 10.04 64-bit
* Guests freeze a large percentage of time, though not always.
* They always freeze immediately after the "Freeing initrd memory" line
* They work fine if set to "-smp 1"
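
The symptom listed here matches the console output quoted earlier in this report, so a captured serial console log could be screened for it automatically. A minimal sketch, assuming the guest console has been saved to text and the guest was given enough time to get past early boot; the helper name stalled_after_initrd is hypothetical.

```python
HANG_MARKER = "Freeing initrd memory"


def stalled_after_initrd(console_text):
    """Return True if the last non-empty console line is the initrd message.

    A log that ends on this line only suggests a hang when the guest had
    plenty of time to continue booting before the log was captured.
    """
    lines = [ln.strip() for ln in console_text.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].startswith(HANG_MARKER)


if __name__ == "__main__":
    sample = (
        "BIOS EDD facility v0.16 2004-Jun-25, 0 devices found\n"
        "EDD information not available.\n"
        "Freeing initrd memory: 11176k freed\n"
    )
    print(stalled_after_initrd(sample))
```

Combined with a boot loop, this gives a rough hang rate for "-smp 1" versus higher CPU counts.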

summary: - KVM SMP Linux Guests Hang
+ KVM SMP Linux Guests Hang on AMD
Serge Hallyn (serge-hallyn) wrote :

@Brian,

I've uploaded a kernel .deb which (hopefully correctly) re-implements the two fixes which looked most likely to address the bug, at linux-image-2.6.32-29-generic_2.6.32-29.58ubuntu1_amd64.deb.

I've verified that I can at least install that kernel package on a lucid intel system and continue to use KVM. Could you test to see if it fixes your problems on AMD?

Brian Knoll (brianknoll) wrote :

@Serge,

Thanks for the work on these patches. I did look at the PPA and all I see are the 2.6.38 packages, and since the 2.6.32-29 kernel isn't in the mainstream repository yet I am unsure of where to get it. Could you please point me to where I can get this new kernel with the fixes, so I can test it? Thanks!

Serge Hallyn (serge-hallyn) wrote :

@Brian,

jinkeys, sorry, I don't know what I was thinking. The kernel is at

http://people.canonical.com/~serge/linux-image-2.6.32-29-generic_2.6.32-29.58ubuntu1_amd64.deb

Brian Knoll (brianknoll) wrote :

Hi Serge,

Yes, this kernel does fix the problem! Thank you!! I was able to boot successfully into the kernel with SMP enabled on the systems that couldn't do this before.

Let me know if there is additional testing I can do to help you get this into the Lucid stream.

Thanks again for all of your help, Serge. Great work!

Serge Hallyn (serge-hallyn) wrote :

Great, thanks Brian. I'll forward the patch along with SRU justification to the kernel team.

affects: qemu-kvm (Ubuntu) → linux (Ubuntu)
description: updated
Brian Knoll (brianknoll) wrote :

I am concerned that this patch possibly didn't make it into the -server kernel (or possibly the -virtual kernel), only the -generic kernel:

-rw-r--r-- 1 root root 4052960 2011-02-28 18:37 vmlinuz-2.6.32-29-generic
-rw-r--r-- 1 root root 4110656 2011-02-11 16:52 vmlinuz-2.6.32-29-server

I think it's important for this patch to be included in both, as I actually originally saw the problem in the -server kernel. I have to check the behavior as soon as I can, and report back, but I suspect the 2.6.32-29-server kernel did not get these patches and still has the bad behavior. Also, possibly, the -virtual kernel.

Brian Knoll (brianknoll) wrote :

I have tested this on the latest -server kernel and the problem is indeed NOT fixed. It looks like the patch only made it into the -generic kernel. It does work in the -generic kernel, but since only the -generic kernel was patched, anyone using -server will still experience this issue.

To clarify, my concern is that anyone using the -server or -virtual kernels will still have this problem, and I suspect that many people running virtual machines are probably using one of those two kernels.

Serge Hallyn (serge-hallyn) wrote :

Thanks, Brian. This fix is awaiting review from the kernel team, so should not yet be even in -generic. After review it will go into all the flavours.

Changed in linux (Ubuntu Lucid):
importance: Undecided → Medium
status: New → Triaged
Serge Hallyn (serge-hallyn) wrote :

(Note that an updated and split-up version of the patch in comment #27 is in the hands of the kernel team.)

Stefan Bader (smb) wrote :

I would assume that, since the requested patches are upstream already in the 2.6.37 timeframe, this problem is fixed already in Natty.

Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Lucid):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
status: Triaged → In Progress
Stefan Bader (smb) wrote :

Noticed that this had already been committed and would be in one of the next proposed kernels. An attempt was made to submit it to upstream stable, but that does not seem to be getting anywhere, despite some OKs in review.

Changed in linux (Ubuntu Lucid):
status: In Progress → Fix Committed
tags: added: testcase

FYI, +1 on this even when using the patched kernels:

Host system AMD Ubuntu 10.04 2.6.32-34-server

Guest system Ubuntu 10.04, have tried several kernels, including:

2.6.38-11-server (From backport as per above)
2.6.32-34-server
2.6.32-33-server
2.6.32-29-generic (From Deb package above)
2.6.32-24-server

In all cases I see performance reduced to a crawl and an eventual lock on the guest when using the SMP flag.

I've spent a few hours trying different kernels, this is a pre-production system with a few beta users on it so although I can do a reasonable amount of messing around I do need to leave it working sometimes!

I can however easily create new images and play with those, which I'm about to do - see if it's something that's been installed that's causing the problem and not the kernel.

If you want me to test anything happy to assist!

Update to above - I've just installed a fresh 10.04 guest and tried 2.6.32-33-server and 2.6.32-34-server, with no problem running the guest with smp greater than 1. Must be something else I've installed.

Stefan Bader (smb) wrote :

At least this specific problem should be no issue anymore. The patches went into 2.6.32-32.62. Unfortunately they were missing the magic to automatically cause the bug report to be closed. Doing that now manually.
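
To check whether a given lucid kernel is at or past that upload, its version can be compared numerically. A minimal sketch, assuming the full Ubuntu version string is available (for example the ProcVersionSignature value shown in the bug description); the names parse_ubuntu_kernel and lucid_has_fix are hypothetical. It only covers archive kernels in the 2.6.32 series, so a custom build such as the 2.6.32-29.58ubuntu1 test package from this thread would be reported as unfixed even though it carries the patch.

```python
import re

# First lucid upload that contains the patches, per the comment above.
FIXED_LUCID = (2, 6, 32, 32, 62)

_VERSION_RE = re.compile(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)")


def parse_ubuntu_kernel(version):
    """Extract (major, minor, patch, abi, upload) from an Ubuntu kernel string."""
    match = _VERSION_RE.search(version)
    if match is None:
        raise ValueError(f"no Ubuntu kernel version found in {version!r}")
    return tuple(int(group) for group in match.groups())


def lucid_has_fix(version):
    """Return True if a lucid 2.6.32 kernel is at or after 2.6.32-32.62."""
    parsed = parse_ubuntu_kernel(version)
    if parsed[:3] != (2, 6, 32):
        raise ValueError("only meaningful for the lucid 2.6.32 series")
    return parsed >= FIXED_LUCID


if __name__ == "__main__":
    for version in ("2.6.32-28.55-server", "2.6.32-32.62"):
        print(version, lucid_has_fix(version))
```

Note that the short form printed by "uname -r" (for example 2.6.32-28-server) lacks the upload number, so the helper rejects it rather than guessing.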

Changed in linux (Ubuntu Lucid):
status: Fix Committed → Fix Released