compareHypervisorCPU() incompatibility during live migration

Bug #2039803 reported by Balazs Gibizer
This bug affects 16 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Medium
Assigned to: Unassigned

Bug Description

Description
===========
Live migration fails with

Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 9636, in check_can_live_migrate_destination
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server self._compare_cpu(None, source_cpu_info, instance)
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 10013, in _compare_cpu
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server raise exception.InvalidCPUInfo(reason=m % {'ret': ret, 'u': u})
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.

[...]

2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 9640, in check_can_live_migrate_destination
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server raise exception.MigrationPreCheckError(reason=e)
2023-10-17 08:28:15.301 2 ERROR oslo_messaging.rpc.server nova.exception.MigrationPreCheckError: Migration pre-check error: Unacceptable CPU info: CPU doesn't have compatibility.

If skip_cpu_compare_on_dest is set to True, the live migration succeeds. So the issue seems to be only in the check nova does, and the hypervisors are actually compatible.

Steps to reproduce
==================
* boot a simple cirros VM
* openstack server migrate --live --block-migration <vm>

Environment
===========
OpenStack: 2023.1.
libvirt version: 9.5.0
QEMU: 8.1.0
Hypervisors: two centos stream 9 VMs with nested KVM enabled
nova compute is configured with cpu_mode=host-model

Triage
======

During the pre_live_migration check running on the destination node, nova sees that the guest has no vcpu_model set in the DB and therefore falls back to a host CPU model based comparison [1]. The host cpu_info used there is collected with getCapabilities() from libvirt [2], and on this system that returns SandyBridge. On the other hand, the guest VM is running as Broadwell (note that nova is configured with cpu_mode=host-model), and virsh domcapabilities also returns Broadwell as the host model.
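
To check for the same mismatch on another host, the model reported by the two XML documents can be compared directly. The following is only a minimal sketch, not nova's code: it parses trimmed, hand-written XML samples (modelled on the values in this report) instead of calling libvirt, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative samples (assumption); real output from
# `virsh capabilities` and `virsh domcapabilities` has many more elements.
CAPABILITIES_XML = """
<capabilities>
  <host>
    <cpu>
      <arch>x86_64</arch>
      <model>SandyBridge</model>
    </cpu>
  </host>
</capabilities>
"""

DOMCAPABILITIES_XML = """
<domainCapabilities>
  <cpu>
    <mode name='host-passthrough' supported='yes'/>
    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Broadwell</model>
    </mode>
  </cpu>
</domainCapabilities>
"""


def host_model_from_capabilities(xml_text):
    """Return the host CPU model found in capabilities XML."""
    root = ET.fromstring(xml_text)
    return root.findtext("./host/cpu/model")


def host_model_from_domcapabilities(xml_text):
    """Return the host-model CPU found in domain capabilities XML."""
    root = ET.fromstring(xml_text)
    return root.findtext("./cpu/mode[@name='host-model']/model")


caps_model = host_model_from_capabilities(CAPABILITIES_XML)
domcaps_model = host_model_from_domcapabilities(DOMCAPABILITIES_XML)
# True when the two libvirt views disagree about the host CPU model.
models_differ = caps_model != domcaps_model
```

On a real hypervisor the two strings would come from `virsh capabilities` and `virsh domcapabilities` respectively.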

There are two reasons for the failure:
1) nova uses getCapabilities() to determine the host CPU model but uses the model from domCapabilities for a guest VM using host-model. According to the libvirt maintainers, nova should never use getCapabilities() for anything any more.

2) nova falls back to a host CPU based comparison if the guest vcpu_model is not filled in the nova DB. But for live migration the guest CPU model should be available, as the guest exists and is running on the source node.

[1] https://github.com/openstack/nova/blob/a869ab17c095cbff2c942ab94247b0c30723b230/nova/virt/libvirt/driver.py#L9960-L9975
[2] https://github.com/openstack/nova/blob/a869ab17c095cbff2c942ab94247b0c30723b230/nova/virt/libvirt/host.py#L793-L796
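
The fallback from reason 2 can be illustrated with a simplified sketch. This is not the code at [1]; `CpuSpec` and `cpu_for_comparison` are hypothetical names used only to show which CPU definition ends up being compared.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CpuSpec:
    """Hypothetical stand-in for the CPU definition passed to the comparison."""
    model: str
    source: str  # "guest" or "host": where the definition came from


def cpu_for_comparison(guest_vcpu_model: Optional[str], host_model: str) -> CpuSpec:
    """Pick the CPU definition to compare against the destination host."""
    if guest_vcpu_model is None:
        # Fallback path: no guest model recorded, so the source host's
        # model is compared instead of the guest's.
        return CpuSpec(model=host_model, source="host")
    return CpuSpec(model=guest_vcpu_model, source="guest")


# Values from this report: the DB has no vcpu_model, getCapabilities()
# reports SandyBridge, while the guest actually runs as Broadwell.
fallback = cpu_for_comparison(None, "SandyBridge")
preferred = cpu_for_comparison("Broadwell", "SandyBridge")
```

In the failing case the comparison runs on the `fallback` value, even though the guest's real model is the one in `preferred`.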

Revision history for this message
sean mooney (sean-k-mooney) wrote :

setting this to medium instead of high as we have a workaround config option that will provide an escape hatch for those who encounter this on upgrade. however, this is important to fix and backport.

the bug is caused because we are using a different method to select the guest CPU when creating the VM vs when live migrating the VM

we should use the same method in both cases, and since we have moved to using domcaps as the new way for guest CPU selection we should consolidate on that for CPU compare too.
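
One way such a consolidated check could look is sketched below. It is an assumption, not the proposed fix: the guest CPU XML is handed to compareHypervisorCPU() on the destination, with the argument order assumed to follow the libvirt-python binding. `guest_cpu_is_compatible` and `FakeConnection` are hypothetical names, and the fake exists only so the sketch runs without a hypervisor.

```python
# Values of libvirt's virCPUCompareResult enum, repeated here so the
# sketch runs without the libvirt bindings installed.
VIR_CPU_COMPARE_ERROR = -1
VIR_CPU_COMPARE_INCOMPATIBLE = 0
VIR_CPU_COMPARE_IDENTICAL = 1
VIR_CPU_COMPARE_SUPERSET = 2


def guest_cpu_is_compatible(conn, guest_cpu_xml, arch="x86_64",
                            machine=None, virttype="kvm"):
    """Ask the destination hypervisor whether it can run the guest CPU.

    `conn` is expected to behave like a libvirt.virConnect opened on the
    destination host.
    """
    ret = conn.compareHypervisorCPU(
        None, arch, machine, virttype, guest_cpu_xml, 0)
    return ret in (VIR_CPU_COMPARE_IDENTICAL, VIR_CPU_COMPARE_SUPERSET)


class FakeConnection:
    """Test double (hypothetical) that only accepts the models it was given."""

    def __init__(self, supported_models):
        self.supported_models = set(supported_models)

    def compareHypervisorCPU(self, emulator, arch, machine, virttype,
                             xmlCPU, flags=0):
        # Crude substring match on <model>NAME</model>; good enough for a demo.
        for model in self.supported_models:
            if f">{model}<" in xmlCPU:
                return VIR_CPU_COMPARE_SUPERSET
        return VIR_CPU_COMPARE_INCOMPATIBLE


guest_xml = "<cpu mode='custom' match='exact'><model>Broadwell</model></cpu>"
dest = FakeConnection({"Broadwell", "SandyBridge"})
compatible = guest_cpu_is_compatible(dest, guest_xml)
```

The point of the sketch is that the CPU definition being compared is the guest's, i.e. the same one used when the VM was created.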

Changed in nova:
importance: Undecided → Medium
status: New → Triaged
tags: added: libvirt live-migration
Revision history for this message
Kashyap Chamarthy (kashyapc) wrote :

I just did a quick audit of getCapabilities() in Nova upstream (`git describe`: 28.0.0-26-g6094a682917), and here it is.

We have to carefully comb through each of these usages and explore what can be replaced, all without introducing any regressions or other subtle bugs.

!IMPORTANT! We have to be careful to only touch the usage where the host CPU definition from getCapabilities() is being used, rather than eliminating the use of getCapabilities() altogether!

    * * *

Usage of getCapabilities
------------------------

The following is the usage of getCapabilities(), either directly or via
the wrapper method, get_capabilities() in libvirt/host.py:

  - get_domain_capabilities()

  - get_canonical_machine_type()

  - has_hyperthreading()

  - supports_uefi()

  - supports_secure_boot()

  - supports_amd_sev()

  - get_cpu_model_names()

    "Get the cpu models based on host CPU arch -- :returns: a list of
    cpu models which supported by the given CPU arch"

And the usage of host.get_capabilities() in libvirt/driver.py:

  - _do_quality_warnings()

  - _get_guest_cpu_model_config()

  - _match_cpu_model_by_flags()

  - _get_guest_cpu_config()

  - _get_host_sysinfo_serial_hardware()

  - _check_uefi_support()

  - _configure_guest_by_virt_type()

  - _guest_needs_pcie()

    * * *

I think we can start with evaluating these methods and see what can be replaced there:

  - get_cpu_model_names()
  - _get_guest_cpu_model_config()
  - _get_guest_cpu_config()
  - _match_cpu_model_by_flags()
  - And probably _configure_guest_by_virt_type()
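
For anyone wanting to repeat this kind of audit, a rough sketch using Python's `ast` module is below. It is not the method used for the list above. `find_callers` is a hypothetical helper, and `SAMPLE` is made-up code that only imitates the shape of libvirt/host.py.

```python
import ast

# Method names to look for: the libvirt call and nova's wrapper.
TARGETS = {"getCapabilities", "get_capabilities"}


def find_callers(source):
    """Map each function name to the target calls made inside it."""
    tree = ast.parse(source)
    callers = {}
    for func in ast.walk(tree):
        if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        # Note: calls inside nested functions are also attributed to
        # the enclosing function.
        for node in ast.walk(func):
            if (isinstance(node, ast.Call)
                    and isinstance(node.func, ast.Attribute)
                    and node.func.attr in TARGETS):
                callers.setdefault(func.name, set()).add(node.func.attr)
    return callers


# Made-up sample source, not nova's actual code.
SAMPLE = '''
class Host:
    def get_capabilities(self):
        return self.get_connection().getCapabilities()

    def has_hyperthreading(self):
        return bool(self.get_capabilities().host.cpu)

    def get_hostname(self):
        return self.get_connection().getHostname()
'''

result = find_callers(SAMPLE)
```

Run against the real files, this would only give a starting list; each hit still needs a manual look to see whether the host CPU definition is what is being consumed.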

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/899185

Changed in nova:
status: Triaged → In Progress
Revision history for this message
sean mooney (sean-k-mooney) wrote :

adding an extra datapoint for this bug based on 2023.1
https://paste.opendev.org/show/b8BM3HtnTKlQEqtv6i4I/

Revision history for this message
Lars Erik Pedersen (pedersen-larserik) wrote :

I'm having issues with SapphireRapids in nova 27.1.0:

virsh capabilities:
https://paste.opendev.org/show/bpuK1gMJ2Q3BbC7SwObA/

virsh domcapabilities --arch x86_64 --machine q35
https://paste.opendev.org/show/b1FFU0u0UjDj3hpjgsSa/

Revision history for this message
sean mooney (sean-k-mooney) wrote :

so just some quick observations

virsh capabilities is identifying the closest host model as SapphireRapids-noTSX

which is not an upstream CPU model; i believe this is a downstream Canonical/Ubuntu CPU model

virsh capabilities shows that model requests

 <feature name='tsx-ctrl'/>

so that looks like a bug on their end.

the domcapabilities API does not list SapphireRapids-noTSX or even SapphireRapids as a supported CPU model
it believes Icelake-Server is the closest match because of tsx

so it feels like there is a distro-specific issue happening in that specific case.
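
Checking whether a given model is listed by domcapabilities can be scripted. The sketch below is only an illustration: the XML is a trimmed, hand-written sample (an assumption, not the contents of the paste above) and `custom_models` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written sample; not the contents of the paste above.
DOMCAPS_XML = """
<domainCapabilities>
  <cpu>
    <mode name='host-model' supported='yes'>
      <model fallback='forbid'>Icelake-Server</model>
    </mode>
    <mode name='custom' supported='yes'>
      <model usable='yes'>Icelake-Server</model>
      <model usable='no'>EPYC</model>
    </mode>
  </cpu>
</domainCapabilities>
"""


def custom_models(xml_text):
    """Return {model name: usable?} for the 'custom' CPU mode."""
    root = ET.fromstring(xml_text)
    models = {}
    for model in root.findall("./cpu/mode[@name='custom']/model"):
        models[model.text] = model.get("usable") == "yes"
    return models


models = custom_models(DOMCAPS_XML)
# False here: the sample does not list SapphireRapids at all.
listed = "SapphireRapids" in models
```

A model that is missing from this list altogether, rather than listed with usable='no', suggests the libvirt/QEMU build on that host does not know the model.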
