VM is suspended after live migrate in Karmic

Bug #448674 reported by EAB
This bug affects 8 people
Affects: libvirt (Ubuntu)
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

Ubuntu Karmic 9.10
libvirt-bin 0.7.0-1ubuntu10
qemu-kvm 0.11.0-0ubuntu1
2.6.31-13-server
VM running Ubuntu Jaunty 9.04

On hostA:
virsh migrate fqdn.com qemu+ssh://hostb.fqdn.com/system
Migration completed in about 8 seconds.

Virsh tells me the VM is running:
virsh list | grep fqdn.com
Connecting to uri: qemu:///system
  1 fqdn.com running

The VM seems to be frozen after migration on hostB.
After executing the following on hostB, the VM works fine:
virsh suspend fqdn.com
virsh resume fqdn.com

It's expected behavior that the VM is suspended before migration, but it should be resumed automatically when the migration is completed.
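Until that resume happens automatically, the workaround can be scripted. A minimal sketch (hedged and untested; it simply chains the commands from this report, so adjust the domain name and target URI to your setup):

#!/bin/sh
# Hypothetical wrapper: migrate, then suspend/resume on the target
# host to un-freeze the guest, as described above.
DOMAIN=fqdn.com
TARGET=qemu+ssh://hostb.fqdn.com/system

virsh migrate "$DOMAIN" "$TARGET" && \
virsh --connect "$TARGET" suspend "$DOMAIN" && \
virsh --connect "$TARGET" resume "$DOMAIN"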

Revision history for this message
Chuck Short (zulcss) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. Please answer these questions:
1. Is this reproducible?
2. If so, what specific steps should we take to recreate this bug? Be as detailed as possible.
This will help us to find and resolve the problem.

Changed in libvirt (Ubuntu):
importance: Undecided → Low
status: New → Incomplete
Revision history for this message
EAB (erwin-true) wrote :

Hosts:
CPU: Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
RAM: 2GB
Disk: Gbit NFS-mount on NetApp FAS3040 (/etc/libvirt/qemu)
10.0.40.100:/vol/hl/disk_images /etc/libvirt/qemu/disks nfs rsize=32768,wsize=32768,hard,intr,tcp,timeo=600,rw 0 0

Installed both hosts with Ubuntu Jaunty 9.04.
aptitude install libvirt-bin qemu kvm host sysstat iptraf iptables portmap nfs-common realpath bridge-utils vlan ubuntu-virt-server python-vm-builder whois postfix hdparm

After some testing with migration (all failed because of several errors/bugs) I upgraded to Ubuntu Karmic 9.10 Beta.

cat /etc/network/interfaces:
auto lo
iface lo inet loopback

auto eth1
iface eth1 inet manual
        up ifconfig eth1 0.0.0.0 up
        up ip link set eth1 promisc on

auto eth1.1503
iface eth1.1503 inet manual
        up ifconfig eth1.1503 0.0.0.0 up
        up ip link set eth1.1503 promisc on

auto br_extern
iface br_extern inet static
        address 123.123.32.252 # HOSTA
        address 123.123.32.253 # HOSTB
        network 123.123.32.0
        netmask 255.255.252.0
        broadcast 123.123.35.255
        gateway 123.123.32.1
        bridge_ports eth1.1503
        bridge_stp off

/etc/resolv.conf is correct
/etc/hosts is correct
Hostnames are correct and resolvable
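For reference, name resolution can be double-checked on each host (a hedged example using the `host` utility installed above; the hostname is the one from this report):

getent hosts hostb.fqdn.com
host hostb.fqdn.com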

VM running Ubuntu Jaunty 9.04:
fqdn.com.xml:
<?xml version="1.0"?>
<domain type="kvm">
  <name>fqdn.com</name>
  <uuid>70a1c1f2-9a3e-4ee5-9f95-69e7e2682e15</uuid>
  <memory>1048576</memory>
  <currentMemory>1048576</currentMemory>
  <vcpu>1</vcpu>
  <features>
    <acpi/>
    <apic/>
    <pae/>
  </features>
  <os>
    <type>hvm</type>
    <boot dev="cdrom"/>
    <boot dev="hd"/>
  </os>
  <clock offset="utc"/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/kvm</emulator>
    <disk type="file" device="disk">
      <source file="/etc/libvirt/qemu/disks/1378/fqdn.com/disk0.qcow2"/>
      <target dev="hda" bus="ide"/>
      <driver cache="writethrough"/>
    </disk>
    <interface type="bridge">
      <mac address="56:16:43:76:ab:09"/>
      <source bridge="br_extern"/>
    </interface>
    <disk type="file" device="cdrom">
      <target dev="hdc" bus="ide"/>
      <readonly/>
    </disk>
    <input type="mouse" bus="ps2"/>
    <graphics type="vnc" port="-1" listen="127.0.0.1"/>
  </devices>
</domain>

Define instance:
/usr/bin/virsh define /etc/libvirt/qemu/xml/1378/fqdn.com.xml

Start instance:
/usr/bin/virsh start fqdn.com

ps auxf | grep kvm:
/usr/bin/kvm -S -M pc-0.11 -m 1024 -smp 1 -name fqdn.com -uuid 70a1c1f2-9a3e-4ee5-9f95-69e7e2682e15 -monitor unix:/var/run/libvirt/qemu/fqdn.com.monitor,server,nowait -boot dc -drive file=/etc/libvirt/qemu/disks/1378/fqdn.com/disk0.qcow2,if=ide,index=0,boot=on -drive file=,if=ide,media=cdrom,index=2 -net nic,macaddr=56:16:43:76:ab:09,vlan=0,name=nic.0 -net tap,fd=17,vlan=0,name=tap.0 -serial none -parallel none -usb -vnc 127.0.0.1:0 -vga cirrus

Migrate instance:
/usr/bin/virsh migrate fqdn.com qemu+ssh://hostb.fqdn.com/system

Migration completes, but the instance seems to be suspended.
On hostB, to resume the instance:
/usr/bin/virsh...

Revision history for this message
Dmitry Ljautov (dljautov) wrote :

I have reproduced the bug.
I have two hosts, "asus" and "kvm", with Karmic as the host OS (everything is OK on Jaunty).
# uname -a
Linux kvm 2.6.31-14-generic #48-Ubuntu SMP Fri Oct 16 14:05:01 UTC 2009 x86_64 GNU/Linux

There's no problem with DNS: "asus" and "kvm" resolve correctly on both hosts.
Both hosts have (a verification sketch follows this list):
1.
listen_tls = 0
listen_tcp = 1
auth_tcp = "none"
in /etc/libvirt/libvirtd.conf
2.
libvirtd_opts="-d -l"
in /etc/default/libvirt-bin
3.
AppArmor turned off with the command `sudo invoke-rc.d apparmor stop`
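After changing those two files, libvirtd needs a restart for the TCP listener to take effect. A minimal verification sketch (hedged; it assumes the Karmic init script is named libvirt-bin and reuses the hostnames from this comment):

sudo invoke-rc.d libvirt-bin restart
# the daemon should now answer over plain TCP:
virsh --connect qemu+tcp://kvm/system list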

I have a fresh install of XP as the guest (also tried Windows 2008 x64, with the same results).

# virsh --connect=qemu+tcp://kvm/system list
Connecting to uri: qemu+tcp://kvm/system
 Id Name State
----------------------------------
  5 xp running

It answers pings (an RDP session works too), and of course it works through VNC.

When I try to migrate it:
# virsh --connect=qemu+tcp://kvm/system migrate --live xp qemu+tcp://asus/system

I get the following in /var/log/syslog and /var/log/libvirt/qemu/xp.log (time on both hosts is synchronized):

Oct 29 12:31:39 asus kernel: [ 7868.432787] device vnet0 entered promiscuous mode
Oct 29 12:31:39 asus kernel: [ 7868.434144] breth0: port 2(vnet0) entering learning state

==> /var/log/libvirt/qemu/xp.log <==
LC_ALL=C LD_LIBRARY_PATH=/usr/local/lib PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin HOME=/root USER=root LOGNAME=root /usr/bin/kvm -S -M pc-0.11 -m 512 -smp 1 -name xp -uuid 02f32130-6933-6594-544e-7b12fa1bbd34 -monitor unix:/var/run/libvirt/qemu/xp.monitor,server,nowait -boot c -drive file=/mnt/nfs/images/xp.img,if=ide,index=0,boot=on -drive file=/mnt/nfs/iso/R10.iso,if=ide,media=cdrom,index=2 -net nic,macaddr=54:52:00:17:57:79,vlan=0,name=nic.0 -net tap,fd=18,vlan=0,name=tap.0 -serial pty -parallel none -usb -usbdevice tablet -vnc 0.0.0.0:0 -k en-us -vga cirrus -incoming tcp:0.0.0.0:49154
char device redirected to /dev/pts/0

==> /var/log/syslog <==
Oct 29 12:31:48 asus kernel: [ 7877.430637] breth0: port 2(vnet0) entering forwarding state
Oct 29 12:31:49 asus kernel: [ 7878.472528] vnet0: no IPv6 routers present

==> /var/log/syslog <==
Oct 29 12:33:06 kvm kernel: [ 4912.152966] breth0: port 2(vnet0) entering disabled state
Oct 29 12:33:06 kvm kernel: [ 4912.192109] device vnet0 left promiscuous mode
Oct 29 12:33:06 kvm kernel: [ 4912.192112] breth0: port 2(vnet0) entering disabled state

Just after migration, the XP guest hangs (no response to keyboard or mouse in the VNC console) and `ping xp` gets no reply anymore.

# virsh --connect=qemu+tcp://kvm/system list
Connecting to uri: qemu+tcp://kvm/system
 Id Name State
----------------------------------

# virsh --connect=qemu+tcp://asus/system list
Connecting to uri: qemu+tcp://asus/system
 Id Name State
----------------------------------
  2 xp running

But if we do:

# virsh --connect=qemu+tcp://asus/system suspend xp
Connecting to uri: qemu+tcp://asus/system
Domain xp suspended

# virsh --connect=qemu+tcp://asus/system resume xp
Connecting to uri: qemu+tcp://asus/system
Domain xp resumed

XP comes alive in VNC and starts answering ICMP requests again (or RDP sessions continue working -- no matte...

Chuck Short (zulcss)
Changed in libvirt (Ubuntu):
importance: Low → Medium
status: Incomplete → Confirmed
Revision history for this message
EAB (erwin-true) wrote :

Seems to be a known issue and patches are available:
https://www.redhat.com/archives/libvir-list/2009-October/msg00019.html

Revision history for this message
Dmitry Ljautov (dljautov) wrote :

By the way, `virsh save` is _very slow_ on Karmic (~1 MB of RAM per second).
Is it the same bug or not?

Revision history for this message
Tessa (unit3) wrote :

I'm seeing behaviour that looks like this on karmic/amd64, only a suspend/resume doesn't "fix" the VM. It stays hard locked. I can connect to it on the virtual serial console or via VNC, and it shows what was there before the migration, but it never unlocks and starts working.

Revision history for this message
Dmitry Ljautov (dljautov) wrote :

What guest OS are you running?

I've just rolled back my host OSes from Karmic to Jaunty, and found that migration works with Windows XP/2003 guests but fails for Ubuntu and CentOS guests (the guests hang after migration). I'll try to reproduce it with the same guests on Karmic later...

Revision history for this message
EAB (erwin-true) wrote :

I tested migrations on Karmic with guest OSes Ubuntu Hardy, Ubuntu Jaunty, and Ubuntu Karmic.
The guests hang, and suspend+resume fixes this.

Revision history for this message
Tessa (unit3) wrote :

This was with a hardy/amd64 guest OS. I haven't tried any other guests, because the bulk of our VMs are supposed to be LTS installs.

Revision history for this message
Jordan Desroches (jordan-d-desroches) wrote :

Some host and guest updates ago, suspend/resume worked on my Karmic 64 hosts with a variety of guests, including Windows 2008 R2, Windows XP, and various Ubuntu releases. Now, when I try to suspend and resume, the machine reboots upon resume instead of resuming.

virsh # version
Compiled against library: libvir 0.7.0
Using library: libvir 0.7.0
Using API: QEMU 0.7.0
Running hypervisor: QEMU 0.11.0

$ uname -a
Linux kvm1 2.6.31-15-server #50-Ubuntu SMP Tue Nov 10 15:50:36 UTC 2009 x86_64 GNU/Linux

Revision history for this message
Dmitry Ljautov (dljautov) wrote :

I've just tested live migration (as I wrote above) with Karmic hosts and a Karmic guest. The guest still hangs after migration, but virsh suspend + virsh resume on the destination host lets the guest continue working. The bug is still reproducible on Karmic...

# uname -a
Linux kvm 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
# virsh version
Connecting to uri: qemu:///system
Compiled against library: libvir 0.7.0
Using library: libvir 0.7.0
Using API: QEMU 0.7.0
Running hypervisor: QEMU 0.11.0

Revision history for this message
Mark Burgo (burgo-mark) wrote :

Is there any status update on this?

Will Launchpad be updating the libvirt packages with the patches described above? Or have they been released in a different repo?

Revision history for this message
Tessa (unit3) wrote :

OK, updated to the latest libvirt packages from PPA:dnjl/virtualization. Now the migrate-then-suspend/resume suggestion works for me. Still less than ideal, since it kills the whole "live migration" thing, but at least it doesn't totally kill my VMs anymore.
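For anyone wanting to try the same packages, adding that PPA on Karmic looks roughly like this (a sketch, assuming add-apt-repository from python-software-properties is installed; the PPA name is the one mentioned above):

sudo add-apt-repository ppa:dnjl/virtualization
sudo apt-get update
sudo apt-get install libvirt-bin libvirt0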

Revision history for this message
EAB (erwin-true) wrote :

Finished some new tests.

The test is pretty much the same as the bug description and comment 2 (https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/448674/comments/2), only hostB is Lucid.

Brought HostA up-to-date:
Ubuntu Karmic 9.10
libvirt-bin 0.7.0-1ubuntu13/.1
qemu-kvm 0.11.0-0ubuntu6.3
2.6.31-16-server

Upgraded HostB to:
Ubuntu Lucid 10.04 (development branch)
libvirt-bin 0.7.2-4ubuntu5
qemu-kvm 0.11.0-0ubuntu6.3
2.6.32-10-server

VM running Ubuntu Jaunty 9.04

- Karmic -> Lucid : Migration works without suspend/resume workaround.

- Lucid -> Lucid : Migration works without suspend/resume workaround.

For fun:
- Lucid -> Karmic (i.e. back again): Migration works, but the suspend/resume workaround is needed. The instance is migrated, but all partitions are gone, so I/O errors appear and everything crashes ;)

Revision history for this message
Tessa (unit3) wrote :

Here's an interesting addition to this problem:

In my cluster, one system is based on a Core 2 Duo CPU, while the other is based on an i7 920 CPU. When the VM is originally started on the i7, migration doesn't work. When it's originally started on the Core 2, it does. From this, I'm guessing that the VM uses CPU features only available on the newer CPU when starting, and when it then migrates to the older CPU some of those instructions fail, which causes the lockups I've been seeing.

I imagine this is also why this hasn't been reported a ton: people with matched systems wouldn't see the problem.

Can anyone else still experiencing this confirm or deny mismatched CPUs?
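If the mismatched-CPU theory holds, one possible mitigation is to pin the guest to the older CPU model so it never sees the i7-only features. Note this is an assumption: the <cpu> element requires a newer libvirt (0.7.5+) than the 0.7.0 shipped in Karmic. A sketch for the domain XML:

<cpu match="exact">
  <!-- restrict guest-visible CPU features to the core2duo baseline -->
  <model>core2duo</model>
</cpu>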

Revision history for this message
Jordan Desroches (jordan-d-desroches) wrote :

For better or worse, I've been having this problem across four identical machines, each with dual quad-core processors:

$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
stepping : 6
cpu MHz : 2000.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5653.12
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:

Revision history for this message
EAB (erwin-true) wrote :

Migrating between these CPU types:
Testserver01: 2 X Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
Productionserver01: 16 X Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
Works for me with:
Karmic -> Lucid
Lucid -> Lucid

This was on 2010-01-13.
Now migration from Karmic to Lucid fails. There have been a lot of KVM/QEMU updates on Lucid in the last few days.

Revision history for this message
EAB (erwin-true) wrote :

Migrating from Karmic -> Karmic seems to have been working for some time now.
This bug can be closed.

Revision history for this message
Mark Burgo (burgo-mark) wrote :

EAB --> Migrating from Karmic -> Karmic seems to have been working for some time now.
               This bug can be closed.

Can you tell me where the updated libvirt packages are that make this work, as it is still broken on my servers?

2x Dell PE 805, dual 6-core AMD Opterons
      64 GB RAM
      SAN-attached

Also, when was this fixed?

Forgot to add: libvirt is 0.7.0-1ubuntu13.1.

Revision history for this message
EAB (erwin-true) wrote :

Ah, my bad.
It is indeed not working without the suspend/resume workaround.
I used a bash script that contained the suspend/resume workaround and was not aware of that.

Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Based on comment #14, this appears to be fixed in Lucid. Please reopen if you're experiencing this problem there.

Revision history for this message
Mark Burgo (burgo-mark) wrote :

Sorry, but I disagree.

  Lucid is only in beta; Karmic needs to be fixed! Lucid will not be released for another 30 days. But since Lucid will be released before a fix for Karmic is complete, I guess all of us will need to wait.

Revision history for this message
Dustin Kirkland  (kirkland) wrote : Re: [Bug 448674] Re: VM is suspended after live migrate in Karmic

Mark-

Can you reproduce this problem in Lucid?

Lucid is still open for development, whereas Karmic is not.

If you want to ensure that this gets fixed in Ubuntu, it would be best
to test it in Lucid.

Given the nature of the bug, I don't think it meets the SRU
requirements defined in:
 * https://wiki.ubuntu.com/StableReleaseUpdates

Revision history for this message
EAB (erwin-true) wrote :

In Karmic there is a workaround.
In Lucid this problem is not reproducible.

I also think it doesn't meet the SRU requirements.

Six months ago, when I reported this bug, I hoped it would be fixed within a couple of weeks; now I would rather wait a month and keep testing all the needed features in Lucid (as I have been doing for some months already).
The migrate feature in Karmic works 9 times out of 10 with the workaround (the failures come with random errors).
In Lucid I have migrated 6 VMs hundreds of times without failures, so KVM/QEMU/libvirt is much more stable in Lucid.

Why do I also want to upgrade to Lucid? Because Lucid is LTS and has features like KSM:
http://www.linux-kvm.com/content/using-ksm-kernel-samepage-merging-kvm

It's absolutely worth waiting now.
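As an aside, KSM on Lucid's 2.6.32 kernel is controlled through sysfs; a hedged sketch for checking and enabling it:

cat /sys/kernel/mm/ksm/run                  # 1 means KSM is enabled
echo 1 | sudo tee /sys/kernel/mm/ksm/run    # enable page merging
cat /sys/kernel/mm/ksm/pages_sharing        # pages currently being shared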

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

Marking Fix Released per reporter's feedback.

Changed in libvirt (Ubuntu):
status: Confirmed → Fix Released
Revision history for this message
Mark Burgo (burgo-mark) wrote :

Well, a question then:

Karmic is to be supported for another year, correct?

This is a bug in Karmic, correct?

The workaround is to suspend and then resume the VM, which removes live migration from the mix if I have to do the suspend and resume. Why not just backport libvirt 0.7.2 from the working Lucid version to Karmic, so that live migration works as advertised?

Lucid is still at Beta 1; while I will try it for you, it is not usable in production, because too many of the Pacemaker utilities are not in the Lucid release at this time, requiring PPAs for fully functional systems.

Whether the bug meets the SRU requirements is not the question. The question is that we have a bug for which Red Hat released patches months ago, and it has not been fixed in an Ubuntu release that is to be supported for another year. This needs to be fixed, as not everyone will install Lucid the day it is released. I enjoy working with Ubuntu, and this is a problem that should be corrected on the Karmic platform before the release of Lucid.

Unless you have a full workaround that does not require someone connecting and suspending/resuming the migrated VM, it is not a valid workaround; we need the VM to migrate without human intervention.

Thank you

P.S. Lucid Beta 1 is being installed as we speak on a set of test boxes to attempt the test. However, as I stated above, not all of the pacemaker+openais/heartbeat packages are in the beta release, so it is not even a valid environment to run anything on at this time.

Revision history for this message
EAB (erwin-true) wrote :

Mark, I fully agree.
It should be fixed in Karmic.

I'm not going to use Lucid in production for the next 3-4 months.
It has to prove itself stable first.

Is it so hard to fix this bug? Probably it's just not high on the list to be fixed.

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

Mark, EAB,

Dustin already commented on the status of this bug for Karmic and believes it does not qualify for an SRU. Please feel free to read https://wiki.ubuntu.com/StableReleaseUpdates and, if you would like, submit a debdiff and open a task against Karmic.

Revision history for this message
Mark Burgo (burgo-mark) wrote :

Jamie,

       This bug was originally filed on 2009-10-13; Karmic was released on 2009-10-29. This should have been fixed.

       It was confirmed on 2009-10-20 and again on 2009-10-29, then marked as medium importance and confirmed.

        I understand that you don't want to fix it, so I will now wait a month and see if it works correctly in Lucid. Remember that the change from Alpha 2 to Alpha 3 broke it again. Now Beta 1 breaks pacemaker+openais/heartbeat; I hope everything is included in Beta 2 and the RC when they come out.

The status of this right now is that it is broken and will not be fixed. I will get the current version of libvirt and build my own packages from source, as I had to do 5 years ago.

Thank you (hope Lucid is operational, or we will need to move on)

Revision history for this message
Jamie Strandboge (jdstrand) wrote :

Mark, I understand your frustration; however, keep in mind that the final week before release will only take the highest-impact bug fixes, to reduce the chance of regressions in the final release. Bugs are deferred to '-updates' all the time in the week(s) prior to release for exactly this reason.

Also note that the bug is fixed in Lucid (or should be), so there was progress on this bug. The question is whether to fix a previous release, which was discussed before. Rather than hoping Lucid is fixed, I highly recommend trying Lucid out, confirming this is fixed, and reporting any other bugs you might find.

While I do not recommend running the development release on a production machine at this time, Beta 2 is next week, and the focus of development at this point in the cycle is stability and bug fixes. If there are bugs in Lucid, now is the best time to report them so we can get them fixed.

Revision history for this message
TomaszChmielewski (mangoo-wpkg) wrote :

I see this issue with 10.10. But I also see it when:

- KVM guest is saved
- KVM guest is restored

Although "virsh list" shows the guest is running, it is not. I have to suspend / resume the guest to make it run again.

It happens with ~50% of save/restores.
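For anyone trying to reproduce this, a minimal save/restore sequence with the suspend/resume workaround from this bug (a sketch; the guest name and state-file path are hypothetical):

virsh save guest /var/tmp/guest.state
virsh restore /var/tmp/guest.state
# if the guest comes back frozen, as described above:
virsh suspend guest
virsh resume guest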

Revision history for this message
frankie (frankie-etsetb) wrote :

Hi. My KVM domains didn't migrate either, until I noticed I was missing the package "kvm-pxe" on the destination server.
Now it works like a charm with Ubuntu Server 10.10 64-bit.
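If you suspect the same missing package, a quick hedged check on the destination host:

dpkg -s kvm-pxe >/dev/null 2>&1 || sudo apt-get install kvm-pxe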
