libvirt: KVM live migration failed due to VIR_DOMAIN_XML_MIGRATABLE flag

Bug #1362929 reported by Qin Zhao
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann

Bug Description

OS version: RHEL 6.5
libvirt version: libvirt-0.10.2-29.el6_5.9.x86_64

When I attempt to live migrate my KVM instance using the latest Juno code on RHEL 6.5, I see this nova-compute error on the source compute node:

2014-08-27 09:24:41.836 26638 ERROR nova.virt.libvirt.driver [-] [instance: 1b1618fa-ddbd-4fce-aa04-720a72ec7dfe] Live Migration failure: unsupported configuration: Target CPU model SandyBridge does not match source (null)

And this libvirt error on source compute node:

2014-08-27 09:32:24.955+0000: 17721: error : virCPUDefIsEqual:753 : unsupported configuration: Target CPU model SandyBridge does not match source (null)

After looking into the code, I noticed that https://review.openstack.org/#/c/73428/ adds the VIR_DOMAIN_XML_MIGRATABLE flag when dumping the instance XML. With this flag, the KVM instance XML includes full CPU information like this:
  <cpu mode='host-model' match='exact'>
    <model fallback='allow'>SandyBridge</model>
    <vendor>Intel</vendor>

Without this flag, the XML does not include that CPU information:
  <cpu mode='host-model'>
    <model fallback='allow'/>
    <topology sockets='1' cores='1' threads='1'/>
  </cpu>

The CPU models of my source and destination servers are identical, so I suspect this is a side effect of https://review.openstack.org/#/c/73428/. When libvirtd runs virDomainDefCheckABIStability(), its source domain XML does not include the CPU model info, so the check fails.
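
The mismatch can be illustrated with a small Python sketch. This is not libvirt's code: cpu_model and check_cpu_abi are hypothetical helpers that only loosely mimic the comparison reported by virCPUDefIsEqual, and the first XML fragment is the truncated snippet from above, closed with </cpu> solely so that it parses.

```python
import xml.etree.ElementTree as ET

# XML as dumped with VIR_DOMAIN_XML_MIGRATABLE (fragment from the report,
# closed with </cpu> here only so that it parses).
MIGRATABLE_XML = """
<cpu mode='host-model' match='exact'>
  <model fallback='allow'>SandyBridge</model>
  <vendor>Intel</vendor>
</cpu>
"""

# XML as held by the source libvirtd (CPU model not expanded).
SOURCE_XML = """
<cpu mode='host-model'>
  <model fallback='allow'/>
  <topology sockets='1' cores='1' threads='1'/>
</cpu>
"""


def cpu_model(cpu_xml):
    """Return the text of <model>, or None when it is empty or absent."""
    model = ET.fromstring(cpu_xml).find("model")
    if model is None or not (model.text or "").strip():
        return None
    return model.text.strip()


def check_cpu_abi(source_xml, target_xml):
    """Return None if the CPU models match, else an error string."""
    src, dst = cpu_model(source_xml), cpu_model(target_xml)
    if src != dst:
        return "Target CPU model %s does not match source %s" % (
            dst, src if src is not None else "(null)")
    return None


print(check_cpu_abi(SOURCE_XML, MIGRATABLE_XML))
```

Comparing the unexpanded source XML against the migratable XML yields the same kind of message as in the logs, even though both describe the same physical CPU.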

After I removed the code change from https://review.openstack.org/#/c/73428/ on my compute node, this libvirt check error no longer occurred.

Tags: libvirt
Revision history for this message
Qin Zhao (zhaoqin) wrote :

I believe live migration of qemu instances will not have this problem, because the qemu domain XML does not include this CPU information.

tags: added: libvirt
Revision history for this message
Qin Zhao (zhaoqin) wrote :

I also encounter the same issue on a PowerKVM compute node.

Revision history for this message
Alex Xu (xuhj) wrote :

@Qin, I tested this as well. It works on my machine.

My env is:

OS: ubuntu 14.04
libvirt version: 1.2.2-0ubuntu13.1.2

Revision history for this message
Sean Dague (sdague) wrote :

Seems like a legit bug, though the really old version of libvirt might be the issue.

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Qin Zhao (zhaoqin) wrote :

@Alex, Please keep your test environment. I hope to compare it with mine tomorrow. Thanks!

Revision history for this message
Qin Zhao (zhaoqin) wrote :

I compared the source code of libvirt 1.2.2 and libvirt 0.10.2. Now I think we hit this bug: https://bugzilla.redhat.com/show_bug.cgi?id=994364

The newer libvirt code inserts the CPU information into the source domain XML before calling virDomainDefCheckABIStability() to compare it with the XML passed to migrateToURI2(), so the check does not fail.

Revision history for this message
Qin Zhao (zhaoqin) wrote :

I guess this problem also exists on RHEL 7.0, which ships libvirt 1.1.1-29.

Is there any way to work around this problem in the Nova driver code?

Revision history for this message
Qin Zhao (zhaoqin) wrote :

RHEL 7.0 should already include this libvirt patch.

https://bugzilla.redhat.com/show_bug.cgi?id=1076503

Revision history for this message
Qin Zhao (zhaoqin) wrote :

I reported a bug to Red Hat to request a backport of the libvirt patch to RHEL 6.5/6.6/6.7:

https://bugzilla.redhat.com/show_bug.cgi?id=1141838

Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like this introduced the change: https://review.openstack.org/#/c/73428/

If live migration is broken on RHEL 6.5 because we don't have a new enough version of libvirt or qemu on RHEL 6.5, we should put some conditional logic in the libvirt driver code probably.

Changed in nova:
milestone: none → juno-rc1
importance: Medium → High
Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm a little confused. Reading https://bugzilla.redhat.com/show_bug.cgi?id=994364, it sounds like the fix is in qemu, but the comments above make it sound like something has to be patched into libvirt.

Apparently I don't have access to see https://bugzilla.redhat.com/show_bug.cgi?id=1141838.

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm a little more confused now, looking at the patch here:

https://review.openstack.org/#/c/73428/

There is a check in place based on whether the version of libvirt in use has the VIR_DOMAIN_XML_MIGRATABLE flag available. If it does not, the code calls _check_graphics_addresses_can_live_migrate, which checks whether your console addresses are set to acceptable values:

LOCAL_ADDRS = ('0.0.0.0', '127.0.0.1', '::', '::1')

And if not, you get a migration error.

So from reading that it sounds like you should be able to do a live migration with older libvirt as long as you have your console addresses set correctly.
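
A minimal sketch of that kind of address check, for illustration only. The function name and signature below are hypothetical and this is not Nova's actual _check_graphics_addresses_can_live_migrate; only LOCAL_ADDRS is taken from the patch as quoted above.

```python
LOCAL_ADDRS = ('0.0.0.0', '127.0.0.1', '::', '::1')


def graphics_addresses_can_live_migrate(listen_addrs):
    """Return True if every console listen address is a local/any address.

    listen_addrs is assumed to be an iterable of the configured VNC/SPICE
    listen addresses as strings (a hypothetical input shape).
    """
    return all(addr in LOCAL_ADDRS for addr in listen_addrs)


print(graphics_addresses_can_live_migrate(['0.0.0.0', '127.0.0.1']))
print(graphics_addresses_can_live_migrate(['0.0.0.0', '192.168.1.10']))
```

An address bound to a specific host IP fails the check, because that address would not exist on the destination node.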

Revision history for this message
Matt Riedemann (mriedem) wrote :

Oh I think I get it now, we're going down the path that we have a new enough libvirt so the VIR_DOMAIN_XML_MIGRATABLE flag is set but it still fails, which is https://bugzilla.redhat.com/show_bug.cgi?id=1076503. That's fixed in RHEL 7 but not RHEL 6.5 so we're still exposed.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Thinking out loud: could we handle the error and then try the path where VIR_DOMAIN_XML_MIGRATABLE isn't set? That will fail if the console addresses aren't set correctly (i.e. bug 1279563), but at least we'd have the check here in that case:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py?id=2014.2.b3#n4933

Revision history for this message
Solly Ross (sross-7) wrote :

That's not a bad idea, although for actual errors this might make debugging a bit more confusing.

Revision history for this message
Qin Zhao (zhaoqin) wrote :

If we can add a condition that detects the affected libvirt and prevents setting the VIR_DOMAIN_XML_MIGRATABLE flag, we can work around this issue in the Nova driver code. But I do not have a good idea for how to do that. Checking the version number would be too ugly. I am still hoping to get some input from danpb and sross.

Revision history for this message
Qin Zhao (zhaoqin) wrote :

@sross, if your code change can take effect on RHEL 6.5 and make VNC listen on the right IP address, that would be the best result. Since RDO also needs to support RHEL 6.5, I think it makes sense to request a backport of the libvirt patch. Do you agree? Will you share your position in BZ 1141838?

Revision history for this message
Matt Riedemann (mriedem) wrote :

@Solly, yeah, checking for just VIR_ERR_CONFIG_UNSUPPORTED in the libvirtError is super duper generic, since that's used for pretty much all failures in virDomainDefCheckABIStability which calls virCPUDefIsEqual and that raises the 'Target CPU model x does not match source y'. I guess we could narrow the scope of the hack in the libvirt driver by just checking on that string in the error.
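
One way the narrowing could look, as a rough sketch. The helper name is_cpu_model_mismatch is hypothetical, and the constant below is only a stand-in for libvirt.VIR_ERR_CONFIG_UNSUPPORTED so that the example runs without the libvirt bindings; real code would take the code and message from the libvirtError.

```python
import re

# Stand-in for libvirt.VIR_ERR_CONFIG_UNSUPPORTED (assumed value, used only
# so this sketch is self-contained; use the libvirt constant in real code).
VIR_ERR_CONFIG_UNSUPPORTED = 67

# Matches the message text seen in the logs in this report.
CPU_MISMATCH_RE = re.compile(r"Target CPU model \S+ does not match source")


def is_cpu_model_mismatch(error_code, error_message):
    """Return True only for the CPU model mismatch flavor of the error."""
    return (error_code == VIR_ERR_CONFIG_UNSUPPORTED
            and CPU_MISMATCH_RE.search(error_message) is not None)


msg = ("unsupported configuration: Target CPU model SandyBridge "
       "does not match source (null)")
print(is_cpu_model_mismatch(VIR_ERR_CONFIG_UNSUPPORTED, msg))
```

Matching on message text is brittle (it depends on wording and locale), which is the trade-off against the broader error-code-only check.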

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/123811

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/123811
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=867bdedf81533f283aae4de4488d54c254bb7f07
Submitter: Jenkins
Branch: master

commit 867bdedf81533f283aae4de4488d54c254bb7f07
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 24 11:21:59 2014 -0700

    Fallback to legacy live migration if config error

    Commit ea7da5152cdca7ba674e2137c3899909995e2287 added a path to using
    migrateToURI2 for live migration if the version of libvirt used has the
    VIR_DOMAIN_XML_MIGRATABLE flag set.

    However, a bug in older versions of libvirt causes the live migration to
    fail because it's incorrectly validating the old and new domain xml's
    for ABI stability.

    Not all distros are running with the patched version of libvirt so add a
    check in place such that if we fail live migration on the new path with
    VIR_ERR_CONFIG_UNSUPPORTED, assume it's due to this issue and attempt
    the legacy migrateToURI call.

    Closes-Bug: #1362929
    Related-Bug: #1279563

    Change-Id: Ie82566121c2ed3a6d55919bc111358f4129cb404
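
The control flow described in the commit message can be sketched roughly as follows. This is not the merged Nova code: FakeLibvirtError stands in for libvirt.libvirtError, the constant is a stand-in for libvirt.VIR_ERR_CONFIG_UNSUPPORTED, and the two callables are hypothetical wrappers around the migrateToURI2 and migrateToURI calls.

```python
# Stand-in for libvirt.VIR_ERR_CONFIG_UNSUPPORTED (assumed value, used only
# so this sketch runs without the libvirt bindings).
VIR_ERR_CONFIG_UNSUPPORTED = 67


class FakeLibvirtError(Exception):
    """Minimal stand-in for libvirt.libvirtError."""

    def __init__(self, code, message):
        super().__init__(message)
        self._code = code

    def get_error_code(self):
        return self._code


def live_migrate(migrate_with_xml, migrate_legacy):
    """Try the migrateToURI2 path; fall back to migrateToURI on config error.

    Both arguments are zero-argument callables (hypothetical wrappers).
    Returns the name of the path that was used.
    """
    try:
        migrate_with_xml()
        return "migrateToURI2"
    except FakeLibvirtError as ex:
        # Any other failure is a real error and is re-raised unchanged.
        if ex.get_error_code() != VIR_ERR_CONFIG_UNSUPPORTED:
            raise
    migrate_legacy()
    return "migrateToURI"


def failing_new_path():
    raise FakeLibvirtError(
        VIR_ERR_CONFIG_UNSUPPORTED,
        "unsupported configuration: Target CPU model SandyBridge "
        "does not match source (null)")


print(live_migrate(failing_new_path, lambda: None))
```

With a new path that raises the config error, the sketch reports that the legacy path was used; errors with any other code propagate to the caller.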

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-rc1 → 2014.2
Revision history for this message
Qin Zhao (zhaoqin) wrote :

libvirt patch is backported to libvirt-0.10.2-46.el6_6.2.x86_64 for RHEL 6.6

https://bugzilla.redhat.com/show_bug.cgi?id=1155564

Revision history for this message
Qin Zhao (zhaoqin) wrote :

libvirt patch is shipped by libvirt-0.10.2-47.el6

https://bugzilla.redhat.com/show_bug.cgi?id=1141838

Revision history for this message
Matt Riedemann (mriedem) wrote :

I believe this is actually fixed in RHEL 6.5 here: https://rhn.redhat.com/errata/RHSA-2014-1873.html

Revision history for this message
Qin Zhao (zhaoqin) wrote :

@Matt, this bug is NOT fixed by libvirt-0.10.2-46.el6. The patch is included in libvirt-0.10.2-47.el6, which has not been published.

Revision history for this message
Daniel Berrange (berrange) wrote :

So there's confusion due to the two different release streams here.

The current RHEL6 release is RHEL-6.6, and the next planned release will be RHEL-6.7

The libvirt-0.10.2-47.el6 package is scheduled for RHEL-6.7, hence is not available yet as 6.7 isn't released

The very same fix though is included in a bugfix update for the 6.6.z channel, as libvirt-0.10.2-46.el6_6.2

So currently users should make sure they use libvirt-0.10.2-46.el6_6.2 release from the 6.6.z channels.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Actually this is the BZ that says the fix is in RHEL 6.5:

https://bugzilla.redhat.com/show_bug.cgi?id=1155563

The package is supposed to be libvirt-0.10.2-29.el6_5.13 on RHEL 6.5.

Revision history for this message
Qin Zhao (zhaoqin) wrote :

I tested libvirt-0.10.2-46.el6_6.2 and libvirt-0.10.2-29.el6_5.13 on RHEL 6.5 this morning. The migration bug is fixed.

That means that even if an end user creates an instance on RHEL 6.5 whose VNC IP is not 0.0.0.0, they will be able to migrate the instance after updating libvirt, and the instance's VNC IP will be changed to the destination compute node's IP.

@danpb, @matt, thanks for your clarification!
