Live migration fails despite matching CPUs

Bug #1898715 reported by Andrew Bonney
This bug affects 3 people
Affects                   Status        Importance  Assigned to       Milestone
OpenStack Compute (nova)  Fix Released  High        Andrew Bonney
Train                     Fix Released  High        Stephen Finucane
Ussuri                    Fix Released  High        Stephen Finucane
Victoria                  Fix Released  High        Stephen Finucane

Bug Description

Having upgraded to Ussuri, we've noted that live migrations now always fail across our hosts with newer Intel CPUs (identified by libvirt as Cascadelake-Server-noTSX).

When processing the CPU's features, the calls Nova makes to libvirt appear to return an XML segment in which each feature carries a 'policy' attribute that may be set to 'disable'. When Nova interprets this XML (see https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L699 and https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L670) the 'policy' attribute does not appear to be handled, so more CPU features are recorded against the hypervisor than it really has.
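
To illustrate the suspected behaviour, here is a minimal Python sketch. It is not Nova's code: the function names are hypothetical, and the XML is an assumed example (only the model name comes from this report; the feature list is invented for illustration and is not taken from the attached debug output). It contrasts collecting every feature name with collecting only those not flagged policy='disable'.

```python
import xml.etree.ElementTree as ET

# Illustrative XML only: the model name comes from the report, but the
# feature list is an assumption, not a copy of the real libvirt output.
SAMPLE_CPU_XML = """
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Cascadelake-Server-noTSX</model>
  <feature policy='require' name='x2apic'/>
  <feature policy='require' name='hypervisor'/>
  <feature policy='disable' name='mpx'/>
</cpu>
"""


def naive_features(cpu_xml):
    """Collect every <feature> name, ignoring the policy attribute."""
    root = ET.fromstring(cpu_xml)
    return {f.get("name") for f in root.findall("feature")}


def enabled_features(cpu_xml):
    """Collect feature names, skipping those flagged policy='disable'."""
    root = ET.fromstring(cpu_xml)
    return {
        f.get("name")
        for f in root.findall("feature")
        if f.get("policy") != "disable"
    }
```

With this sample, naive_features() includes the disabled feature while enabled_features() leaves it out, which is the kind of over-reporting described above.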

When a live migration is scheduled, these additional feature requirements are passed to the remote host, which compares them with its own features and concludes that the hosts are incompatible, despite the CPUs being identical. As a result we're currently unable to live migrate any VMs between hosts which use these CPUs.
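
The effect on migration can be modelled roughly as a subset check. This is a simplified sketch under stated assumptions, not how Nova or libvirt actually perform the comparison; the helper names and feature names are hypothetical. It only shows why a source that over-reports its features looks incompatible with an identical destination.

```python
def missing_features(required, available):
    """Return the required features the destination does not offer, sorted."""
    return sorted(set(required) - set(available))


def can_migrate(required, available):
    """Simplified model: migration is allowed only if nothing is missing."""
    return not missing_features(required, available)


# Hypothetical feature sets for two hosts with identical CPUs.
actual = {"x2apic", "hypervisor"}
# The source wrongly records a disabled feature as present.
over_reported = actual | {"mpx"}
```

In this model can_migrate(over_reported, actual) is false even though both hosts have the same real feature set, while can_migrate(actual, actual) is true.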

Further debug output is included in http://paste.openstack.org/show/798740/

Nova stable/ussuri 7d556106bfd3e64860dc26226d364876e8bce43c
Ubuntu 18.04
libvirt 6.0.0-0ubuntu8.2~cloud0

Revision history for this message
Andrew Bonney (andrewbonney) wrote :

For reference we've temporarily worked around this with the following patch. Whilst this won't be comprehensive enough to contribute, it has proved sufficient to resolve the issue across our deployments.

https://github.com/bbc/nova/commit/9d11ce63640dc08d2c69ff2176156d2887f1039f

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Hi, Andrew

Thanks for the excellent bug report, including the patch that worked for you. Good sleuthing there!

As you might've discovered, policy='require' means, in libvirt parlance, "guest creation will fail unless a given CPU feature is supported by the host CPU _or_ the hypervisor is able to emulate it."

For example, the 'x2apic' feature is emulated by QEMU even if the host CPU does not support it:

    <feature policy='require' name='x2apic'/>

If your small patch solves it for you, feel free to submit it as an upstream fix, if you have time.

PS: Paste-bins expire, so I took the liberty of adding the content of your paste-bin as a plain text attachment.

Changed in nova:
assignee: nobody → Kashyap Chamarthy (kashyapc)
Revision history for this message
Andrew Bonney (andrewbonney) wrote :

Sure, I've just submitted https://review.opendev.org/757577

Changed in nova:
status: New → In Progress
tags: added: libvirt
Changed in nova:
assignee: Kashyap Chamarthy (kashyapc) → Andrew Bonney (andrewbonney)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/758760

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/758761

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/758763

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/757577
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eeeca4ceff576beaa8558360c8a6a165d716f996
Submitter: Zuul
Branch: master

commit eeeca4ceff576beaa8558360c8a6a165d716f996
Author: Andrew Bonney <email address hidden>
Date: Tue Oct 6 14:42:38 2020 +0100

    Handle disabled CPU features to fix live migration failures

    When performing a live migration between hypervisors running
    libvirt, where one or more CPU features are disabled, nova does
    not take account of these. This results in migration failures
    as none of the available hypervisor targets appear compatible.

    This patch ensures that the libvirt 'disable' policy is taken
    account of, at least in a basic sense, by explicitly ignoring
    items flagged in this way when enumerating CPU features.

    Closes-Bug: #1898715
    Change-Id: Iaf14ca97cfac99dd280d1114123f2d4bb6292b63

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/victoria)

Reviewed: https://review.opendev.org/758760
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=45a4110d20a574d0c43431a4b87497920c9cbe06
Submitter: Zuul
Branch: stable/victoria

commit 45a4110d20a574d0c43431a4b87497920c9cbe06
Author: Andrew Bonney <email address hidden>
Date: Tue Oct 6 14:42:38 2020 +0100

    Handle disabled CPU features to fix live migration failures

    When performing a live migration between hypervisors running
    libvirt, where one or more CPU features are disabled, nova does
    not take account of these. This results in migration failures
    as none of the available hypervisor targets appear compatible.

    This patch ensures that the libvirt 'disable' policy is taken
    account of, at least in a basic sense, by explicitly ignoring
    items flagged in this way when enumerating CPU features.

    Closes-Bug: #1898715
    Change-Id: Iaf14ca97cfac99dd280d1114123f2d4bb6292b63
    (cherry picked from commit eeeca4ceff576beaa8558360c8a6a165d716f996)

tags: added: in-stable-victoria
Lee Yarwood (lyarwood)
Changed in nova:
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 21.1.2

This issue was fixed in the openstack/nova 21.1.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.6.0

This issue was fixed in the openstack/nova 20.6.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.0.0.0rc1

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.
