Live migration fails despite matching CPUs

Bug #1898715 reported by Andrew Bonney on 2020-10-06
This bug affects 3 people
Affects                   Importance  Assigned to
OpenStack Compute (nova)  High        Andrew Bonney
Train                     High        Stephen Finucane
Ussuri                    High        Stephen Finucane
Victoria                  High        Stephen Finucane

Bug Description

Having upgraded to Ussuri, we've noted that live migrations now always fail across our hosts with newer Intel CPUs (identified by libvirt as Cascadelake-Server-noTSX).

When processing the CPU's features, the calls Nova makes to libvirt appear to return an XML segment in which each feature carries a 'policy' attribute that may be set to 'disable'. When Nova interprets this XML (see https://github.com/openstack/nova/blob/master/nova/virt/libvirt/host.py#L699 and https://github.com/openstack/nova/blob/master/nova/virt/libvirt/config.py#L670), the 'policy' attribute does not appear to be handled, so more CPU features are recorded against the hypervisor than it really has.

When a live migration is scheduled, these additional feature requirements are passed to the remote host, which compares them with its own running features and reports them as incompatible, despite the CPUs being identical. As a result, we're currently unable to live migrate any VMs between hosts which use these CPUs.
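To illustrate the failure mode described above, here is a minimal Python sketch. It is not Nova's code: the XML fragment, the placeholder feature names `feat-a` and `feat-b`, and the set-based compatibility check are all assumptions made for illustration (the real comparison is performed through libvirt). It only shows how collecting every feature name while ignoring 'policy' can make two identical hosts look incompatible.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down CPU description. 'feat-a' and 'feat-b' are
# placeholder names, not real CPU flags.
CPU_XML = """
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Cascadelake-Server-noTSX</model>
  <feature policy='require' name='x2apic'/>
  <feature policy='require' name='feat-a'/>
  <feature policy='disable' name='feat-b'/>
</cpu>
"""


def feature_names_ignoring_policy(cpu_xml):
    """Collect every feature name, regardless of its 'policy' attribute."""
    root = ET.fromstring(cpu_xml)
    return {feature.get("name") for feature in root.findall("feature")}


# What the source host ends up advertising when 'policy' is ignored.
advertised = feature_names_ignoring_policy(CPU_XML)

# What an identical destination host actually runs with (assumed for the
# sketch: everything except the feature marked 'disable').
destination_running = {"x2apic", "feat-a"}

# Simplified stand-in for the compatibility check: every advertised
# feature must be available on the destination.
missing = advertised - destination_running
compatible = not missing

print(sorted(missing))  # ['feat-b']
print(compatible)       # False
```

The only feature reported as missing is the one the source host had disabled in the first place, which matches the symptom of identical CPUs being rejected.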

Further debug output is included in http://paste.openstack.org/show/798740/

Nova stable/ussuri 7d556106bfd3e64860dc26226d364876e8bce43c
Ubuntu 18.04
libvirt 6.0.0-0ubuntu8.2~cloud0

Andrew Bonney (andrewbonney) wrote :

For reference we've temporarily worked around this with the following patch. Whilst this won't be comprehensive enough to contribute, it has proved sufficient to resolve the issue across our deployments.

https://github.com/bbc/nova/commit/9d11ce63640dc08d2c69ff2176156d2887f1039f

Kashyap Chamarthy (kashyapc) wrote :

Hi, Andrew

Thanks for the excellent bug report, including the patch that worked for you. Good sleuthing there!

As you might've discovered, policy='require' means, in libvirt parlance, "guest creation will fail unless a given CPU feature is supported by the host CPU _or_ the hypervisor is able to emulate it."

For example, the 'x2apic' feature is emulated by QEMU even if the host does not support it:

    <feature policy='require' name='x2apic'/>

If your small patch solves it for you, feel free to submit it as an upstream fix, if you have time.

PS: Paste-bins expire, so I took the liberty of adding the content of your paste-bin as a plain text attachment.
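For readers unfamiliar with the 'policy' attribute discussed in this comment, the following Python sketch summarises how libvirt's documented feature policies ('force', 'require', 'optional', 'disable', 'forbid') are commonly understood. The function name, its parameters, and the `GuestCreationError` exception are hypothetical helpers invented for this illustration; they are not part of libvirt or Nova.

```python
class GuestCreationError(Exception):
    """Hypothetical error standing in for libvirt refusing to start a guest."""


def guest_sees_feature(policy, host_supports, hypervisor_emulates=False):
    """Return whether the guest CPU would expose a feature under a policy.

    Raises GuestCreationError where guest creation would be refused.
    """
    if policy == "force":
        # Reported to the guest regardless of host support.
        return True
    if policy == "require":
        # Host support or hypervisor emulation is needed, as in the
        # x2apic example above.
        if host_supports or hypervisor_emulates:
            return True
        raise GuestCreationError("required feature is unavailable")
    if policy == "optional":
        # Exposed only if the host supports it.
        return host_supports
    if policy == "disable":
        # Never exposed to the guest.
        return False
    if policy == "forbid":
        # Guest creation fails if the host has the feature.
        if host_supports:
            raise GuestCreationError("forbidden feature is present on host")
        return False
    raise ValueError(f"unknown policy: {policy!r}")


# x2apic-style case: not supported by the host, but emulated.
print(guest_sees_feature("require", host_supports=False,
                         hypervisor_emulates=True))  # True

# A feature marked 'disable' is hidden even when the host has it.
print(guest_sees_feature("disable", host_supports=True))  # False
```

The second call is the case relevant to this bug: a feature listed with policy='disable' should not be counted among the hypervisor's features.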

Changed in nova:
assignee: nobody → Kashyap Chamarthy (kashyapc)
Andrew Bonney (andrewbonney) wrote :

Sure, I've just submitted https://review.opendev.org/757577

Changed in nova:
status: New → In Progress
tags: added: libvirt
Changed in nova:
assignee: Kashyap Chamarthy (kashyapc) → Andrew Bonney (andrewbonney)

Reviewed: https://review.opendev.org/757577
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=eeeca4ceff576beaa8558360c8a6a165d716f996
Submitter: Zuul
Branch: master

commit eeeca4ceff576beaa8558360c8a6a165d716f996
Author: Andrew Bonney <email address hidden>
Date: Tue Oct 6 14:42:38 2020 +0100

    Handle disabled CPU features to fix live migration failures

    When performing a live migration between hypervisors running
    libvirt, where one or more CPU features are disabled, nova does
    not take account of these. This results in migration failures
    as none of the available hypervisor targets appear compatible.

    This patch ensures that the libvirt 'disable' policy is taken
    account of, at least in a basic sense, by explicitly ignoring
    items flagged in this way when enumerating CPU features.

    Closes-Bug: #1898715
    Change-Id: Iaf14ca97cfac99dd280d1114123f2d4bb6292b63
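
The approach the commit message describes, skipping features flagged 'disable' when enumerating CPU features, can be sketched roughly as follows. This is a standalone illustration using the Python standard library, not the actual patch; the function name `enabled_cpu_features`, the XML fragment, and the placeholder feature name `feat-off` are assumptions.

```python
import xml.etree.ElementTree as ET


def enabled_cpu_features(cpu_xml):
    """Return feature names from a libvirt-style <cpu> element,
    skipping any feature whose policy is 'disable'."""
    root = ET.fromstring(cpu_xml)
    features = set()
    for feature in root.findall("feature"):
        if feature.get("policy") == "disable":
            # Explicitly ignore features the hypervisor has turned off.
            continue
        features.add(feature.get("name"))
    return features


# Hypothetical input; 'feat-off' is a placeholder, not a real CPU flag.
EXAMPLE_XML = """
<cpu>
  <model>Cascadelake-Server-noTSX</model>
  <feature policy='require' name='x2apic'/>
  <feature policy='disable' name='feat-off'/>
</cpu>
"""

print(sorted(enabled_cpu_features(EXAMPLE_XML)))  # ['x2apic']
```

As the commit message notes, this handles the 'disable' policy only in a basic sense; other policy values are passed through unchanged in this sketch.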

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/758760
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=45a4110d20a574d0c43431a4b87497920c9cbe06
Submitter: Zuul
Branch: stable/victoria

commit 45a4110d20a574d0c43431a4b87497920c9cbe06
Author: Andrew Bonney <email address hidden>
Date: Tue Oct 6 14:42:38 2020 +0100

    Handle disabled CPU features to fix live migration failures

    When performing a live migration between hypervisors running
    libvirt, where one or more CPU features are disabled, nova does
    not take account of these. This results in migration failures
    as none of the available hypervisor targets appear compatible.

    This patch ensures that the libvirt 'disable' policy is taken
    account of, at least in a basic sense, by explicitly ignoring
    items flagged in this way when enumerating CPU features.

    Closes-Bug: #1898715
    Change-Id: Iaf14ca97cfac99dd280d1114123f2d4bb6292b63
    (cherry picked from commit eeeca4ceff576beaa8558360c8a6a165d716f996)

tags: added: in-stable-victoria
Lee Yarwood (lyarwood) on 2021-02-03
Changed in nova:
importance: Undecided → High

This issue was fixed in the openstack/nova 21.1.2 release.

This issue was fixed in the openstack/nova 20.6.0 release.

This issue was fixed in the openstack/nova 23.0.0.0rc1 release candidate.
