VM is suspended after live migrate in Karmic
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libvirt (Ubuntu) | Fix Released | Medium | Unassigned |
Bug Description
Ubuntu Karmic 9.10
libvirt-bin 0.7.0-1ubuntu10
qemu-kvm 0.11.0-0ubuntu1
2.6.31-13-server
VM running Ubuntu Jaunty 9.04
On hostA:
virsh migrate fqdn.com qemu+ssh:
Migration completed in about 8 seconds.
Virsh tells me the VM is running:
virsh list | grep fqdn.com
Connecting to uri: qemu:///system
1 fqdn.com running
The VM seems to be frozen after migration on hostB.
After executing this on hostB the VM is working fine:
virsh suspend fqdn.com
virsh resume fqdn.com
It is expected behavior that the VM is suspended during migration, but it should be resumed automatically once the migration completes.
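The suspend/resume workaround above can be scripted. A minimal sketch, assuming a hypothetical helper named `migrate_with_workaround` (not part of virsh) and that the destination URI is passed explicitly:

```shell
#!/bin/sh
# Hypothetical helper: live-migrate a domain, then apply the
# suspend/resume workaround on the destination host.
migrate_with_workaround() {
    dom="$1"    # domain name, e.g. fqdn.com
    dest="$2"   # destination URI, e.g. qemu+ssh://hostb/system
    virsh migrate --live "$dom" "$dest" || return 1
    # Workaround for this bug: the guest arrives frozen, so
    # suspend and resume it once on the destination.
    virsh --connect "$dest" suspend "$dom"
    virsh --connect "$dest" resume "$dom"
}
```

This merely automates the manual steps from the report; it does not fix the underlying libvirt behavior.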
Chuck Short (zulcss) wrote : | #1 |
Changed in libvirt (Ubuntu): | |
importance: | Undecided → Low |
status: | New → Incomplete |
EAB (erwin-true) wrote : | #2 |
Hosts:
CPU: Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
RAM: 2GB
Disk: Gbit NFS-mount on NetApp FAS3040 (/etc/libvirt/qemu)
10.0.40.
Installed both hosts with Ubuntu Jaunty 9.04.
aptitude install libvirt-bin qemu kvm host sysstat iptraf iptables portmap nfs-common realpath bridge-utils vlan ubuntu-virt-server python-vm-builder whois postfix hdparm
After some testing with migration (all failed because of several errors/bugs) I upgraded to Ubuntu Karmic 9.10 Beta.
cat /etc/network/
auto lo
iface lo inet loopback
auto eth1
iface eth1 inet manual
up ifconfig eth1 0.0.0.0 up
up ip link set eth1 promisc on
auto eth1.1503
iface eth1.1503 inet manual
up ifconfig eth1.1503 0.0.0.0 up
up ip link set eth1.1503 promisc on
auto br_extern
iface br_extern inet static
address 123.123.32.252 # HOSTA
address 123.123.32.253 # HOSTB
network 123.123.32.0
netmask 255.255.252.0
broadcast 123.123.35.255
gateway 123.123.32.1
bridge_stp off
/etc/resolv.conf is correct
/etc/hosts is correct
Hostnames are correct and resolvable
VM running Ubuntu Jaunty 9.04:
fqdn.com.xml:
<?xml version="1.0"?>
<domain type="kvm">
<name>
<uuid>
<memory>
<currentMemor
<vcpu>1</vcpu>
<features>
<acpi/>
<apic/>
<pae/>
</features>
<os>
<type>
<boot dev="cdrom"/>
<boot dev="hd"/>
</os>
<clock offset="utc"/>
<on_poweroff>
<on_reboot>
<on_crash>
<devices>
<emulator>
<disk type="file" device="disk">
<source file="/
<target dev="hda" bus="ide"/>
<driver cache="
</disk>
<interface type="bridge">
<mac address=
<source bridge=
</interface>
<disk type="file" device="cdrom">
<target dev="hdc" bus="ide"/>
<readonly/>
</disk>
<input type="mouse" bus="ps2"/>
<graphics type="vnc" port="-1" listen=
</devices>
</domain>
Define instance:
/usr/bin/virsh define /etc/libvirt/
Start instance:
/usr/bin/virsh start fqdn.com
ps auxf | grep kvm:
/usr/bin/kvm -S -M pc-0.11 -m 1024 -smp 1 -name fqdn.com -uuid 70a1c1f2-
drive file=/etc/
,name=tap.0 -serial none -parallel none -usb -vnc 127.0.0.1:0 -vga cirrus
Migrate instance:
/usr/bin/virsh migrate fqdn.com qemu+ssh:
Migration will complete but the instance seems to be suspended.
On HostB to resume the instance:
/usr/bin/virsh...
Dmitry Ljautov (dljautov) wrote : | #3 |
I have reproduced the bug.
I have two hosts, "asus" and "kvm", with Karmic as the host OS (it was OK on Jaunty).
# uname -a
Linux kvm 2.6.31-14-generic #48-Ubuntu SMP Fri Oct 16 14:05:01 UTC 2009 x86_64 GNU/Linux
There's no problem with DNS: "asus" and "kvm" resolve correctly on both hosts.
Both hosts have:
1.
listen_tls = 0
listen_tcp = 1
auth_tcp = "none"
in /etc/libvirt/
2.
libvirtd_opts="-d -l"
in /etc/default/
3.
Turned off apparmor with command `sudo invoke-rc.d apparmor stop`
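For reference, the three changes above amount to something like the following. This is a sketch only; the truncated paths are assumed to be the standard libvirt config files (`/etc/libvirt/libvirtd.conf` and `/etc/default/libvirt-bin`), which is an assumption, not confirmed by the report:

```shell
# In /etc/libvirt/libvirtd.conf (assumed file for the truncated path):
#   listen_tls = 0
#   listen_tcp = 1
#   auth_tcp = "none"
#
# In /etc/default/libvirt-bin (assumed file), make libvirtd listen (-l):
#   libvirtd_opts="-d -l"
#
# Then restart the daemon and stop AppArmor as in step 3:
#   sudo invoke-rc.d libvirt-bin restart
#   sudo invoke-rc.d apparmor stop
```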
I have fresh installed XP as guest (also tried with Win 2008 x64 with same results).
# virsh --connect=
Connecting to uri: qemu+tcp:
Id Name State
-------
5 xp running
It pings (rdp session work too), and of course it works through vnc.
When I try to migrate it:
# virsh --connect=
I get this in /var/log/syslog and /var/log/
Oct 29 12:31:39 asus kernel: [ 7868.432787] device vnet0 entered promiscuous mode
Oct 29 12:31:39 asus kernel: [ 7868.434144] breth0: port 2(vnet0) entering learning state
==> /var/log/
LC_ALL=C LD_LIBRARY_
char device redirected to /dev/pts/0
==> /var/log/syslog <==
Oct 29 12:31:48 asus kernel: [ 7877.430637] breth0: port 2(vnet0) entering forwarding state
Oct 29 12:31:49 asus kernel: [ 7878.472528] vnet0: no IPv6 routers present
==> /var/log/syslog <==
Oct 29 12:33:06 kvm kernel: [ 4912.152966] breth0: port 2(vnet0) entering disabled state
Oct 29 12:33:06 kvm kernel: [ 4912.192109] device vnet0 left promiscuous mode
Oct 29 12:33:06 kvm kernel: [ 4912.192112] breth0: port 2(vnet0) entering disabled state
And just after migration the xp guest hangs (it doesn't respond to keyboard or mouse in the VNC console), and `ping xp` no longer gets a reply.
# virsh --connect=
Connecting to uri: qemu+tcp:
Id Name State
-------
# virsh --connect=
Connecting to uri: qemu+tcp:
Id Name State
-------
2 xp running
But if we do:
# virsh --connect=
Connecting to uri: qemu+tcp:
Domain xp suspended
# virsh --connect=
Connecting to uri: qemu+tcp:
Domain xp resumed
XP comes alive again in VNC and starts answering ICMP requests (or RDP sessions continue working -- no matte...
Changed in libvirt (Ubuntu): | |
importance: | Low → Medium |
status: | Incomplete → Confirmed |
EAB (erwin-true) wrote : | #4 |
Seems to be a known issue and patches are available:
https:/
Dmitry Ljautov (dljautov) wrote : | #5 |
Btw, virsh save is _very slow_ on Karmic (~1 MB of RAM per second).
Is it the same bug or not?
Tessa (unit3) wrote : | #6 |
I'm seeing behaviour that looks like this on karmic/amd64, only a suspend/resume doesn't "fix" the VM. It stays hard locked. I can connect to it on the virtual serial console or via VNC, and it shows what was there before the migration, but it never unlocks and starts working.
Dmitry Ljautov (dljautov) wrote : | #7 |
What guest OS are you running?
I've just rolled back my host OSes from Karmic to Jaunty, and found that migration works OK with Windows XP/2003 guests but fails for Ubuntu and CentOS guests (the guests hang after migration). I'll try to reproduce it with the same guests on Karmic later...
EAB (erwin-true) wrote : | #8 |
I tested migrations on Karmic with guest OSes Ubuntu Hardy, Ubuntu Jaunty, and Ubuntu Karmic.
The guests hang, and suspend+resume fixes this.
Tessa (unit3) wrote : | #9 |
This was with a hardy/amd64 guest OS. I haven't tried any other guests, because the bulk of our VMs are supposed to be LTS installs.
Jordan Desroches (jordan-d-desroches) wrote : | #10 |
Some host and guest updates ago, suspend/resume worked on my Karmic 64-bit hosts with a variety of guests, including Windows 2008 R2, Windows XP, and various Ubuntu releases. Now, when I try to suspend and resume, the machine reboots upon resume instead of resuming.
virsh # version
Compiled against library: libvir 0.7.0
Using library: libvir 0.7.0
Using API: QEMU 0.7.0
Running hypervisor: QEMU 0.11.0
$ uname -a
Linux kvm1 2.6.31-15-server #50-Ubuntu SMP Tue Nov 10 15:50:36 UTC 2009 x86_64 GNU/Linux
Dmitry Ljautov (dljautov) wrote : | #11 |
I've just tested live migration (as I wrote above) with Karmic hosts and a Karmic guest. The guest still hangs after migration, but virsh suspend + virsh resume on the destination host lets the guest continue working. The bug is still reproducible on Karmic...
# uname -a
Linux kvm 2.6.31-16-generic #53-Ubuntu SMP Tue Dec 8 04:02:15 UTC 2009 x86_64 GNU/Linux
# virsh version
Connecting to uri: qemu:///system
Compiled against library: libvir 0.7.0
Using library: libvir 0.7.0
Using API: QEMU 0.7.0
Running hypervisor: QEMU 0.11.0
Mark Burgo (burgo-mark) wrote : | #12 |
Is there any status update on this?
Will Launchpad be updating the libvirt packages with the patches described above? Or have they been released in a different repo?
Tessa (unit3) wrote : | #13 |
Ok, updated to the latest libvirt packages from PPA:dnjl/
EAB (erwin-true) wrote : | #14 |
Finished some new tests.
The test is pretty much the same as the bug description and comment 2 (https:/
Brought HostA up-to-date:
Ubuntu Karmic 9.10
libvirt-bin 0.7.0-1ubuntu13.1
qemu-kvm 0.11.0-0ubuntu6.3
2.6.31-16-server
Upgraded HostB to:
Ubuntu Lucid 10.04 (development branch)
libvirt-bin 0.7.2-4ubuntu5
qemu-kvm 0.11.0-0ubuntu6.3
2.6.32-10-server
VM running Ubuntu Jaunty 9.04
- Karmic -> Lucid : Migration works without suspend/resume workaround.
- Lucid -> Lucid : Migration works without suspend/resume workaround.
For fun:
- Lucid -> Karmic (so back) : Migration works but the suspend/resume workaround is needed. The instance is migrated, but all partitions are gone, so I/O errors occur and everything crashes ;)
Tessa (unit3) wrote : | #15 |
Here's an interesting addition to this problem:
In my cluster, one system is based on a core2duo CPU, while the other is based on an i7 920 CPU. When the VM is originally started on the i7, migration doesn't work. When it's originally started on the core2, it does. From this, I'm guessing that the VM tries to use more CPU features available on the newer CPU when starting, and then when it tries to migrate to the older CPU some of those instructions fail, and that causes the lock ups I've been seeing.
I imagine this is also why this hasn't been reported a ton, since for people with matched systems they wouldn't see the problem.
Can anyone else still experiencing this confirm or deny that their CPUs are mismatched?
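One way to check the mismatched-CPU theory above is to diff the `flags` lines from /proc/cpuinfo on both hosts. A sketch under that assumption; `missing_flags` is a hypothetical helper, not an existing tool:

```shell
#!/bin/sh
# Hypothetical helper: print CPU flags present on the source host but
# missing on the destination -- features a guest started on the newer
# CPU may rely on that the older CPU cannot provide.
missing_flags() {
    src_flags="$1"
    dst_flags="$2"
    for f in $src_flags; do
        case " $dst_flags " in
            *" $f "*) ;;          # flag exists on the destination too
            *) echo "$f" ;;       # flag is missing on the destination
        esac
    done
}
# Gather each host's flags with, e.g.:
#   grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2
```

Any flags printed (e.g. sse4_1 on an i7 but not a Core 2) would be candidates for the post-migration lockup described above.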
Jordan Desroches (jordan-d-desroches) wrote : | #16 |
For better or worse, I've been having this problem across four identical machines, each with dual quad-core processors:
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
stepping : 6
cpu MHz : 2000.000
cache size : 6144 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 10
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority
bogomips : 5653.12
clflush size : 64
cache_alignment : 64
address sizes : 38 bits physical, 48 bits virtual
power management:
EAB (erwin-true) wrote : | #17 |
Migrating between these CPU types:
Testserver01: 2 X Intel(R) Core(TM)2 CPU 6300 @ 1.86GHz
Productionserver01: 16 X Intel(R) Xeon(R) CPU X5570 @ 2.93GHz
Works for me with:
Karmic -> Lucid
Lucid -> Lucid
This was on 2010-01-13.
Now migration from Karmic to Lucid fails. There have been a lot of KVM/QEMU updates on Lucid in the last few days.
EAB (erwin-true) wrote : | #18 |
Migrating from Karmic -> Karmic seems to work for some time now.
This bug can be closed
Mark Burgo (burgo-mark) wrote : | #19 |
EAB --> Migrating from Karmic -> Karmic seems to work for some time now.
This bug can be closed
Can you tell me where the updated libvirt packages are that makes this work as it is still broken on my servers.
2- Dell PE 805 Dual 6 core AMD Opteron's
64 gig ram
SAN Attached
Also when was this fixed
Mark Burgo (burgo-mark) wrote : | #20 |
Forgot to add libvirt is 0.7.0-1ubuntu13.1
EAB (erwin-true) wrote : | #21 |
Ah, my bad.
It's indeed not working without the suspend-resume workaround.
I used a bash script which contained the suspend-resume workaround; I was not aware of that.
Dustin Kirkland (kirkland) wrote : | #22 |
Based on comment #14, this appears to be fixed in Lucid. Please reopen if you're experiencing this problem there.
Mark Burgo (burgo-mark) wrote : | #23 |
Sorry, but I disagree.
Lucid is only in beta; Karmic needs to be fixed! Lucid will not be released for 30 days. But since Lucid will be released before a fix for Karmic is complete, I guess all of us will need to wait.
Dustin Kirkland (kirkland) wrote : Re: [Bug 448674] Re: VM is suspended after live migrate in Karmic | #24 |
Mark-
Can you reproduce this problem in Lucid?
Lucid is still open for development, whereas Karmic is not.
If you want to ensure that this gets fixed in Ubuntu, it would be best
to test it in Lucid.
Given the nature of the bug, I don't think it meets the SRU
requirements defined in:
* https:/
EAB (erwin-true) wrote : | #25 |
In Karmic there is a workaround.
In Lucid this problem is not reproducible.
I also think it doesn't meet the SRU requirements.
Six months ago, when I reported this bug, I hoped it would be fixed within a couple of weeks; now I'd rather wait a month and test all the needed features again and again in Lucid (I've already been doing that for some months).
The migrate feature in Karmic works 9 out of 10 times with the workaround (the failures come with random errors).
In Lucid I have migrated 6 VMs hundreds of times without failures, so KVM/QEMU/libvirt is much more stable in Lucid.
Why do I also want to upgrade to Lucid? Because Lucid is LTS and has features like KSM:
http://
It's absolutely worth waiting now.
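As an aside, KSM status on a host can be inspected through the kernel's standard sysfs interface under /sys/kernel/mm/ksm. A sketch; `ksm_summary` is a hypothetical helper, and the base path is parameterized so it can be pointed at a test directory:

```shell
#!/bin/sh
# Hypothetical helper: summarize KSM state from its sysfs interface
# (/sys/kernel/mm/ksm on a stock kernel with KSM compiled in).
ksm_summary() {
    base="${1:-/sys/kernel/mm/ksm}"
    if [ ! -r "$base/run" ]; then
        echo "KSM not available"
        return 1
    fi
    echo "run=$(cat "$base/run") pages_shared=$(cat "$base/pages_shared")"
}
```

`run=1` means the KSM daemon is actively merging pages; `pages_shared` counts the deduplicated pages, which is where the memory savings for identical VMs come from.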
Jamie Strandboge (jdstrand) wrote : | #26 |
Marking Fix Released per reporter's feedback.
Changed in libvirt (Ubuntu): | |
status: | Confirmed → Fix Released |
Mark Burgo (burgo-mark) wrote : | #27 |
Well, a question then:
Karmic is to be supported for another year, correct?
This is a bug with Karmic, correct?
The workaround is to suspend and then resume the VM; this removes live migration from the mix if I have to do the suspend and resume. Why not just backport the working libvirt 0.7.2 from Lucid to Karmic so that live migration works as advertised?
Lucid is still at beta 1. While I will try it for you, it is not usable in production, because too many of the pacemaker utilities are not in the Lucid release at this time, requiring PPAs to be installed for fully functional systems.
Whether the bug meets the SRU requirements is not the question. The question is that we have a bug, on a release that is to be supported for another year, for which Red Hat released the patches months ago, and it has not been fixed. This needs to be fixed, as not everyone will install Lucid the day it is released. I enjoy working with Ubuntu, and this is a problem that must be corrected on the Karmic platform before the release of Lucid.
Unless you have a full workaround that does not require someone connecting and suspending and resuming the migrated VM, it is not a valid workaround. We need the VM to migrate without human intervention.
Thank You
P.S. Lucid beta 1 is being installed as we speak on a set of test boxes to attempt the test. However, as I stated above, not all of the pacemaker+
EAB (erwin-true) wrote : | #28 |
Mark , I fully agree.
It should be fixed in Karmic.
I'm not going to use Lucid in production for the next 3-4 months.
It has to prove itself stable first.
Is it so hard to fix this bug? It's probably just not high on the list.
Jamie Strandboge (jdstrand) wrote : | #29 |
Mark, EAB,
Dustin already commented on the status of this bug for Karmic and believes it does not qualify for an SRU. Please feel free to read https:/
Mark Burgo (burgo-mark) wrote : | #30 |
Jamie,
This bug was originally filed on 10-13-2009; Karmic was released on 10-29-2009. This should have been fixed.
It was confirmed on 10-20-2009 and again on 10-29-2009, then marked as medium importance and confirmed.
I understand that you don't want to fix it, so I will now wait a month and see if it is correct in Lucid. Remember that the change from alpha 2 to alpha 3 broke it again. Now beta 1 breaks pacemaker+
The status right now is that it is broken and will not be fixed. I will get the current version of libvirt and build my own packages from source, as I had to do 5 years ago.
Thank you (Hope Lucid is operational or we will need to move on)
Jamie Strandboge (jdstrand) wrote : | #31 |
Mark, I understand your frustration, however keep in mind that the final week before Release will only have the highest impact bug fixes to reduce the chance of regression in the final release. Bugs are deferred to '-updates' all the time in the week(s) prior to release for exactly this reason. Also note that the bug is fixed in Lucid (or should be), so there was progress on this bug. The question is whether to fix a previous release, which was discussed before. Rather than hoping Lucid is fixed, I highly recommend trying Lucid out and confirming this is fixed and report any other bugs you might find. While I do not recommend running the development release on a production machine at this time, Beta-2 is next week and the focus of development at this point in the cycle is stability and bug fixes. If there are bugs in Lucid, now is the best time to report them so we can get them fixed.
TomaszChmielewski (mangoo-wpkg) wrote : | #32 |
I see this issue with 10.10. But I also see it when:
- a KVM guest is saved
- the KVM guest is restored
Although "virsh list" shows the guest is running, it is not. I have to suspend/resume the guest to make it run again.
It happens with ~50% of save/restores.
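The save/restore variant described above can get the same suspend/resume treatment as migration. A sketch; `restore_with_workaround` is a hypothetical helper, not an existing virsh command:

```shell
#!/bin/sh
# Hypothetical helper: restore a saved guest and apply the suspend/resume
# workaround, since "virsh list" may report "running" while the guest is
# actually frozen after a restore.
restore_with_workaround() {
    dom="$1"      # domain name
    img="$2"      # state file produced by "virsh save"
    virsh restore "$img" || return 1
    virsh suspend "$dom"
    virsh resume "$dom"
}
```

Since the report says only ~50% of restores hang, the unconditional suspend/resume is wasteful but harmless on the restores that came up healthy.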
frankie (frankie-etsetb) wrote : | #33 |
Hi. My KVM domains didn't migrate either, until I noticed I was missing the package "kvm-pxe" on the destination server.
Now it works like a charm with Ubuntu Server 10.10 64-bit.