Live-migrating an instance from 'Queens' (CentOS-7) to 'Train' (CentOS-8) fails during libvirt's compareCPU() check

Bug #1913716 reported by Kashyap Chamarthy
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Kashyap Chamarthy

Bug Description

[This bug was originally reported by Lukas Bezdicka when testing Red
Hat's OpenStack (OSP), but it should be reproducible in an upstream
context as well. I'm writing this report based on the root cause
analysis in the environment where the bug occurred. Thanks to Daniel
Berrangé for the debugging help!]

Description
-----------

Live-migrating a guest from a 'Queens' compute node (running CentOS 7)
to a 'Train' compute node (running CentOS 8) fails with:

-----------------------------------------------------------------------
[...]
 _compare_cpu /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8559
2021-01-26 23:30:25.169 7 ERROR nova.virt.libvirt.driver [req-774be110-7fb6-4865-a177-d624a821cf9e 19ec0130b8714aac8c64a5c2ee5b914b 352675f5f34d45d59bdd61fde58e4bd0 - default default] CPU doesn't have compatibility.

0

Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server [req-774be110-7fb6-4865-a177-d624a821cf9e 19ec0130b8714aac8c64a5c2ee5b914b 352675f5f34d45d59bdd61fde58e4bd0 - default default] Exception during message handling: nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.

[...]

2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server block_migration, disk_over_commit)
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8258, in check_can_live_migrate_destination
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server self._compare_cpu(None, source_cpu_info, instance)
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py", line 8575, in _compare_cpu
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server raise exception.InvalidCPUInfo(reason=m % {'ret': ret, 'u': u})
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.rpc.server
[...]
-----------------------------------------------------------------------

Environment
-----------

The bug was found while testing in a nested KVM environment, running on
Intel hardware (Xeon(R) Gold 5218R CPU @ 2.10GHz), with the entire
OpenStack setup in VMs. So the Nova instances themselves are nested
guests.

  - Source: a CentOS-7 compute node (a level-1 guest), running OpenStack
    'Queens'

  - Destination: a CentOS-8 compute node (a level-1 guest), running
    OpenStack 'Train'

Steps to reproduce
------------------

Live-migrate a guest from the source host to the destination host.

Expected result
---------------

Live migration should've succeeded.

Actual result
-------------

Live migration fails during the compareCPU() check on the destination
host with:

[...]
 _compare_cpu /usr/lib/python3.6/site-packages/nova/virt/libvirt/driver.py:8559
2021-01-26 23:30:25.169 7 ERROR nova.virt.libvirt.driver [req-774be110-7fb6-4865-a177-d624a821cf9e 19ec0130b8714aac8c64a5c2ee5b914b 352675f5f34d45d59bdd61fde58e4bd0 - default default] CPU doesn't have compatibility.
[...]

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Root cause analysis
-------------------

Generate two files called "guest-full.xml" and "hostcaps-full.xml" as follows:

  - guest-full.xml: this is generated in two steps: (1) from the
    nova-compute.log, pick the <cpu> ... </cpu> element that Nova
    unsuccessfully supplied to compareCPU() API and place it into a file
    called "guest.xml"; then (2) run: `virsh cpu-baseline --features
    guest.xml > guest-full.xml`

  - hostcaps-full.xml: this is generated by running `virsh
    domcapabilities |& tee hostcaps-full.xml` on the baremetal host.

Then compare the two, and the root cause is the following difference:

    $> diff -u guest-full.xml hostcaps-full.xml | grep -E '^(-|\+)'
    --- guest-full.xml 2021-01-27 17:22:20.655831989 +0000
    +++ hostcaps-full.xml 2021-01-27 17:22:27.262779417 +0000
    - <model fallback='forbid'>Skylake-Server-IBRS</model>
    + <model fallback='forbid'>Cascadelake-Server</model>
    + <feature policy='require' name='acpi'/>
    + <feature policy='require' name='arch-capabilities'/>
    + <feature policy='require' name='dca'/>
    + <feature policy='require' name='ds'/>
    + <feature policy='require' name='ds_cpl'/>
    + <feature policy='require' name='dtes64'/>
    + <feature policy='require' name='est'/>
    - <feature policy='require' name='hypervisor'/>
    - <feature policy='require' name='ibpb'/>
    + <feature policy='require' name='ht'/>
    + <feature policy='require' name='ibrs-all'/>
    + <feature policy='require' name='intel-pt'/>
    + <feature policy='require' name='invtsc'/>
    + <feature policy='require' name='mds-no'/>
    + <feature policy='require' name='monitor'/>
    + <feature policy='require' name='pbe'/>
    + <feature policy='require' name='pdcm'/>
    + <feature policy='require' name='rdctl-no'/>
    + <feature policy='require' name='skip-l1dfl-vmentry'/>
    + <feature policy='require' name='smx'/>
    + <feature policy='require' name='tm'/>
    + <feature policy='require' name='tm2'/>
    - <feature policy='require' name='umip'/>
    + <feature policy='require' name='tsx-ctrl'/>
    + <feature policy='require' name='xtpr'/>

The three features that the guest CPU requires but that are missing
from the host's list ['hypervisor', 'ibpb', 'umip'] are what cause the
compareCPU() method to return failure.
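
The same comparison can be sketched in Python instead of `diff`. This is only an illustration, not part of the original analysis: the helper name `required_features` is hypothetical, and the two XML snippets are trimmed subsets of the files above (the model names and a few of the features from the diff), not the full contents. Like the diff, it compares only the explicitly listed features, which is why the guest XML was first expanded with `virsh cpu-baseline --features`.

```python
import xml.etree.ElementTree as ET


def required_features(cpu_xml):
    """Return the names of all <feature> elements not marked as disabled.

    Hypothetical helper for illustration; it looks at every <feature>
    element in the document, wherever it is nested.
    """
    root = ET.fromstring(cpu_xml)
    return {
        feature.get("name")
        for feature in root.iter("feature")
        if feature.get("policy", "require") != "disable"
    }


# Trimmed, illustrative subsets of guest-full.xml and hostcaps-full.xml.
GUEST_XML = """
<cpu mode='custom' match='exact'>
  <model fallback='forbid'>Skylake-Server-IBRS</model>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='ibpb'/>
  <feature policy='require' name='umip'/>
</cpu>
"""

HOST_XML = """
<cpu>
  <model fallback='forbid'>Cascadelake-Server</model>
  <feature policy='require' name='ht'/>
  <feature policy='require' name='ibrs-all'/>
  <feature policy='require' name='tsx-ctrl'/>
</cpu>
"""

# Features the guest requires that the host side does not list.
missing_on_host = required_features(GUEST_XML) - required_features(HOST_XML)
print(sorted(missing_on_host))  # ['hypervisor', 'ibpb', 'umip']
```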

                - - -

The reason for the above difference (I'm slightly paraphrasing Daniel
Berrangé from an OSP bug):

    This difference reflects the design limitations of the original
    compareCPU() [libvirt] API that Nova is using. This API compares
    against the host physical CPUID. There are features in this CPUID
    that KVM doesn't expose, and there are also features KVM emulates
    which are not in the host CPUID. The latter is what's causing the
    problem.

So the recommendation from the libvirt folks is:

  - If Nova simply didn't call compareCPU at all, then libvirt would do
    the CPU comparison itself during migration and "do the right thing".

  - If Nova absolutely must do a CPU comparison itself, then it needs to
    change to use compareHypervisorCPU() instead, which reflects the
    CPUID that KVM is actually able to expose.

    (More on this here:
    https://op...


summary: - Live-migrating a guest from 'Queens' to 'Train' fails during libvirt's
- compareCPU() check
+ Live-migrating an instance from 'Queens' (CentOS-7) to 'Train'
+ (CentOS-8) fails during libvirt's compareCPU() check
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Changed in nova:
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Kashyap Chamarthy (kashyapc)
tags: added: live-migration
tags: added: libvirt
Revision history for this message
chengsheng (chengsheng) wrote :

The original purpose of the check was to find an available host for live migration among all available hosts.
If the check is deleted, the live migration will be aborted on the first unsupported node, which deviates from what is actually needed.

Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

Chengsheng, on further thought, I agree: even if libvirt would do the "right thing" when Nova does not call compareCPU(), that has the unwanted side effect of aborting on the first unsupported host.

So we should indeed not drop the CPU comparison altogether, but go with the improved API, compareHypervisorCPU().
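
A minimal sketch of what such a pre-check could look like with the libvirt Python bindings follows. It is not Nova's implementation. The helper names `guest_cpu_fits_host` and `pick_destinations` are hypothetical, and `FakeConn` is a stand-in so the sketch runs without libvirt. I am assuming that passing None for the emulator, architecture, machine type and virt type lets libvirt use the host defaults. The result constants mirror libvirt's virCPUCompareResult enum, where 0 (the value in the log above) means "incompatible".

```python
# Values of libvirt's virCPUCompareResult enum, copied here so the sketch
# runs without libvirt installed.
VIR_CPU_COMPARE_ERROR = -1
VIR_CPU_COMPARE_INCOMPATIBLE = 0
VIR_CPU_COMPARE_IDENTICAL = 1
VIR_CPU_COMPARE_SUPERSET = 2


def guest_cpu_fits_host(conn, guest_cpu_xml):
    """Return True if the hypervisor behind `conn` can provide the guest CPU.

    `conn` is expected to behave like a libvirt.virConnect object.
    """
    # Assumption: None for emulator/arch/machine/virttype means "host default".
    ret = conn.compareHypervisorCPU(None, None, None, None, guest_cpu_xml, 0)
    return ret in (VIR_CPU_COMPARE_IDENTICAL, VIR_CPU_COMPARE_SUPERSET)


def pick_destinations(conns_by_host, guest_cpu_xml):
    """Filter candidate hosts up front instead of failing on the first one."""
    return [
        host
        for host, conn in conns_by_host.items()
        if guest_cpu_fits_host(conn, guest_cpu_xml)
    ]


class FakeConn:
    """Stand-in for a libvirt connection that returns a canned result."""

    def __init__(self, result):
        self._result = result

    def compareHypervisorCPU(self, emulator, arch, machine, virttype,
                             xml_cpu, flags=0):
        return self._result


candidates = {
    "dest-a": FakeConn(VIR_CPU_COMPARE_INCOMPATIBLE),
    "dest-b": FakeConn(VIR_CPU_COMPARE_SUPERSET),
}
print(pick_destinations(candidates, "<cpu/>"))  # ['dest-b']
```

On a real compute node the connection would come from something like `libvirt.open("qemu:///system")` rather than `FakeConn`.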

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Kashyap Chamarthy <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/772917
Reason: In favor of the fuller series: https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/838926

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/838926
Committed: https://opendev.org/openstack/nova/commit/267a40663cd8d0b94bbc5ebda4ece55a45753b64
Submitter: "Zuul (22348)"
Branch: master

commit 267a40663cd8d0b94bbc5ebda4ece55a45753b64
Author: Kashyap Chamarthy <email address hidden>
Date: Thu Jan 28 16:35:10 2021 +0100

    libvirt: Add a workaround to skip compareCPU() on destination

    Nova's use of libvirt's compareCPU() API served its purpose
    over the years, but its design limitations break live migration in
    subtle ways. For example, the compareCPU() API compares against the
    host physical CPUID. Some of the features from this CPUID are not
    exposed by KVM, and then there are some features that KVM emulates that
    are not in the host CPUID. The latter can cause bogus live migration
    failures.

    With QEMU >=2.9 and libvirt >= 4.4.0, libvirt will do the right thing in
    terms of CPU compatibility checks on the destination host during live
    migration. Nova satisfies these minimum version requirements by a good
    margin. So, provide a workaround to skip the CPU comparison check on
    the destination host before migrating a guest, and let libvirt handle it
    correctly. This workaround will be removed once Nova replaces the older
    libvirt APIs with their newer and improved counterparts[1][2].

                    - - -

    Note that Nova's libvirt driver calls compareCPU() in another method,
    _check_cpu_compatibility(); I did not remove its usage yet, as it needs
    more careful combing of the code, and then:

      - where possible, remove the usage of compareCPU() altogether, and
        rely on libvirt doing the right thing under the hood; or

      - where Nova _must_ do the CPU comparison checks, switch to the better
        libvirt CPU APIs -- baselineHypervisorCPU() and
        compareHypervisorCPU() -- that are described here[1]. This is work
        in progress[2].

    [1] https://opendev.org/openstack/nova-specs/commit/70811da221035044e27
    [2] https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration

    Change-Id: I444991584118a969e9ea04d352821b07ec0ba88d
    Closes-Bug: #1913716
    Signed-off-by: Kashyap Chamarthy <email address hidden>
    Signed-off-by: Balazs Gibizer <email address hidden>
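
The control flow that the commit message describes can be sketched as below. This is an illustration only, not the merged code: `check_destination_cpu` and the local `InvalidCPUInfo` class are hypothetical stand-ins, and the flag name `skip_cpu_compare_on_dest` is my assumption about the `[workarounds]` option this change adds, since the commit message above does not name it. Check the Nova release notes for the actual name.

```python
class InvalidCPUInfo(Exception):
    """Stand-in for nova.exception.InvalidCPUInfo."""


def check_destination_cpu(skip_cpu_compare_on_dest, compare_cpu, guest_cpu_xml):
    """Run the pre-migration CPU check unless the workaround is enabled.

    `compare_cpu` is any callable returning a virCPUCompareResult-style
    integer (<= 0 means error or incompatible). Returns True if the
    comparison ran and passed, False if it was skipped.
    """
    if skip_cpu_compare_on_dest:
        # Workaround enabled: leave the compatibility check to libvirt,
        # which performs it during the migration itself.
        return False
    ret = compare_cpu(guest_cpu_xml)
    if ret <= 0:
        raise InvalidCPUInfo("CPU doesn't have compatibility (ret=%d)" % ret)
    return True


# With the workaround on, a comparison that would fail is never attempted.
print(check_destination_cpu(True, lambda xml: 0, "<cpu/>"))  # False
```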

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/nova/+/845045

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/nova/+/845045
Committed: https://opendev.org/openstack/nova/commit/277f88e3872ea41bce02d09c4537946a74d74533
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 277f88e3872ea41bce02d09c4537946a74d74533
Author: Kashyap Chamarthy <email address hidden>
Date: Thu Jan 28 16:35:10 2021 +0100

    libvirt: Add a workaround to skip compareCPU() on destination

    Nova's use of libvirt's compareCPU() API served its purpose
    over the years, but its design limitations break live migration in
    subtle ways. For example, the compareCPU() API compares against the
    host physical CPUID. Some of the features from this CPUID are not
    exposed by KVM, and then there are some features that KVM emulates that
    are not in the host CPUID. The latter can cause bogus live migration
    failures.

    With QEMU >=2.9 and libvirt >= 4.4.0, libvirt will do the right thing in
    terms of CPU compatibility checks on the destination host during live
    migration. Nova satisfies these minimum version requirements by a good
    margin. So, provide a workaround to skip the CPU comparison check on
    the destination host before migrating a guest, and let libvirt handle it
    correctly. This workaround will be removed once Nova replaces the older
    libvirt APIs with their newer and improved counterparts[1][2].

                    - - -

    Note that Nova's libvirt driver calls compareCPU() in another method,
    _check_cpu_compatibility(); I did not remove its usage yet, as it needs
    more careful combing of the code, and then:

      - where possible, remove the usage of compareCPU() altogether, and
        rely on libvirt doing the right thing under the hood; or

      - where Nova _must_ do the CPU comparison checks, switch to the better
        libvirt CPU APIs -- baselineHypervisorCPU() and
        compareHypervisorCPU() -- that are described here[1]. This is work
        in progress[2].

    [1] https://opendev.org/openstack/nova-specs/commit/70811da221035044e27
    [2] https://review.opendev.org/q/topic:bp%252Fcpu-selection-with-hypervisor-consideration

    Change-Id: I444991584118a969e9ea04d352821b07ec0ba88d
    Closes-Bug: #1913716
    Signed-off-by: Kashyap Chamarthy <email address hidden>
    Signed-off-by: Balazs Gibizer <email address hidden>
    (cherry picked from commit 267a40663cd8d0b94bbc5ebda4ece55a45753b64)

tags: added: in-stable-yoga
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 26.0.0.0rc1

This issue was fixed in the openstack/nova 26.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.1.0

This issue was fixed in the openstack/nova 25.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/871975

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/xena)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/871975
Reason: stable/xena branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/xena if you want to further work on this patch.
