2023.1: Live migration fails with new compare_hypervisor_cpu method

Bug #2023035 reported by Maxim Monin
This bug affects 9 people
Affects                    Status  Importance  Assigned to  Milestone
OpenStack Compute (nova)   New     Undecided   Unassigned

Bug Description

We have a test setup with 3 identical compute nodes. The setup was recently upgraded from Zed to OpenStack 2023.1, with nova 27.0.0, libvirt 8.0.0, and qemu 6.2.

During live migration we always get this error:

2023-06-05 13:33:06.158 1458525 ERROR nova.virt.libvirt.driver [None req-6295f150-f0cf-41d1-8dd4-60c0ac32223f 59de9e2e2a8a413384be5ee27e027fc1 185764021e19409dae135a967f032fa4 - - b589410bd7e14872bf3ac74c45057691 b589410bd7e14872bf3ac74c45057691] CPU doesn't have compatibility.

0

Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server [None req-6295f150-f0cf-41d1-8dd4-60c0ac32223f 59de9e2e2a8a413384be5ee27e027fc1 185764021e19409dae135a967f032fa4 - - b589410bd7e14872bf3ac74c45057691 b589410bd7e14872bf3ac74c45057691] Exception during message handling: nova.exception.MigrationPreCheckError: Migration pre-check error: Unacceptable CPU info: CPU doesn't have compatibility.

0

Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 9637, in check_can_live_migrate_destination
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server self._compare_cpu(None, source_cpu_info, instance)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 10014, in _compare_cpu
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server raise exception.InvalidCPUInfo(reason=m % {'ret': ret, 'u': u})
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server nova.exception.InvalidCPUInfo: Unacceptable CPU info: CPU doesn't have compatibility.
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server 0
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server Refer to http://libvirt.org/html/libvirt-libvirt-host.html#virCPUCompareResult
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 65, in wrapped
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server raise self.value
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/exception_wrapper.py", line 63, in wrapped
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/utils.py", line 1439, in decorated_function
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 213, in decorated_function
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server self.force_reraise()
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server raise self.value
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 203, in decorated_function
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/compute/manager.py", line 8408, in check_can_live_migrate_destination
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server dest_check_data = self.driver.check_can_live_migrate_destination(ctxt,
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server File "/usr/lib/python3/dist-packages/nova/virt/libvirt/driver.py", line 9641, in check_can_live_migrate_destination
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server raise exception.MigrationPreCheckError(reason=e)
2023-06-05 13:33:06.264 1458525 ERROR oslo_messaging.rpc.server nova.exception.MigrationPreCheckError: Migration pre-check error: Unacceptable CPU info: CPU doesn't have compatibility.

The Zed setup works fine with these nodes.

More info:

1. https://github.com/openstack/nova/commit/468b03e0ee4a917ae26106f6e57081bcd9e7a65b
If I revert the libvirt driver to the old compare_cpu call instead of compare_hypervisor_cpu:
ret = self._host.compare_cpu(cpu_xml)
#ret = self._host.compare_hypervisor_cpu(cpu_xml)
live migration works.
ret = self._host.compare_cpu(cpu_xml) always returns 2 (VIR_CPU_COMPARE_SUPERSET).
ret = self._host.compare_hypervisor_cpu(cpu_xml) always returns 0 (VIR_CPU_COMPARE_INCOMPATIBLE).

2. Disabling the CPU comparison by changing nova.conf:

[workarounds]
skip_cpu_compare_on_dest = True

makes live migration work ok.

3. The input XML passed to the CPU comparison looks like:
<cpu>
  <arch>x86_64</arch>
  <model>Skylake-Client-noTSX-IBRS</model>
  <vendor>Intel</vendor>
  <topology sockets="1" cores="4" threads="2"/>
  <feature name="3dnowprefetch"/>
  <feature name="abm"/>
  <feature name="acpi"/>
  <feature name="adx"/>
  <feature name="aes"/>
  <feature name="apic"/>
  <feature name="arat"/>
  <feature name="arch-capabilities"/>
  <feature name="avx"/>
  <feature name="avx2"/>
  <feature name="bmi1"/>
  <feature name="bmi2"/>
  <feature name="clflush"/>
  <feature name="clflushopt"/>
  <feature name="cmov"/>
  <feature name="cx16"/>
  <feature name="cx8"/>
  <feature name="de"/>
  <feature name="ds"/>
  <feature name="ds_cpl"/>
  <feature name="dtes64"/>
  <feature name="erms"/>
  <feature name="est"/>
  <feature name="f16c"/>
  <feature name="fma"/>
  <feature name="fpu"/>
  <feature name="fsgsbase"/>
  <feature name="fxsr"/>
  <feature name="ht"/>
  <feature name="intel-pt"/>
  <feature name="invpcid"/>
  <feature name="invtsc"/>
  <feature name="lahf_lm"/>
  <feature name="lm"/>
  <feature name="mca"/>
  <feature name="mce"/>
  <feature name="md-clear"/>
  <feature name="mmx"/>
  <feature name="monitor"/>
  <feature name="movbe"/>
  <feature name="mpx"/>
  <feature name="msr"/>
  <feature name="mtrr"/>
  <feature name="nx"/>
  <feature name="pae"/>
  <feature name="pat"/>
  <feature name="pbe"/>
  <feature name="pcid"/>
  <feature name="pclmuldq"/>
  <feature name="pdcm"/>
  <feature name="pdpe1gb"/>
  <feature name="pge"/>
  <feature name="pni"/>
  <feature name="popcnt"/>
  <feature name="pse"/>
  <feature name="pse36"/>
  <feature name="rdrand"/>
  <feature name="rdseed"/>
  <feature name="rdtscp"/>
  <feature name="rsba"/>
  <feature name="sep"/>
  <feature name="smap"/>
  <feature name="smep"/>
  <feature name="smx"/>
  <feature name="spec-ctrl"/>
  <feature name="ss"/>
  <feature name="ssbd"/>
  <feature name="sse"/>
  <feature name="sse2"/>
  <feature name="sse4.1"/>
  <feature name="sse4.2"/>
  <feature name="ssse3"/>
  <feature name="stibp"/>
  <feature name="syscall"/>
  <feature name="tm"/>
  <feature name="tm2"/>
  <feature name="tsc"/>
  <feature name="tsc-deadline"/>
  <feature name="tsc_adjust"/>
  <feature name="vme"/>
  <feature name="vmx"/>
  <feature name="x2apic"/>
  <feature name="xgetbv1"/>
  <feature name="xsave"/>
  <feature name="xsavec"/>
  <feature name="xsaveopt"/>
  <feature name="xsaves"/>
  <feature name="xtpr"/>
</cpu>
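
To compare two such documents feature by feature, the model and feature names can be pulled out with the Python standard library. This is only a sketch: the helper name `cpu_features` and the shortened sample XML are made up for illustration and are not part of nova or libvirt.

```python
import xml.etree.ElementTree as ET


def cpu_features(cpu_xml: str) -> tuple[str, set[str]]:
    """Return (model, feature names) from a libvirt-style <cpu> document.

    Hypothetical helper for illustration; not part of nova or libvirt.
    """
    root = ET.fromstring(cpu_xml.strip())
    model = root.findtext("model", default="")
    # Collect the name attribute of every <feature> child, skipping unnamed ones.
    features = {f.get("name") for f in root.findall("feature") if f.get("name")}
    return model, features


# Shortened sample in the same shape as the XML above.
SAMPLE = """
<cpu>
  <arch>x86_64</arch>
  <model>Skylake-Client-noTSX-IBRS</model>
  <vendor>Intel</vendor>
  <feature name="acpi"/>
  <feature name="ds"/>
  <feature name="vmx"/>
</cpu>
"""

model, features = cpu_features(SAMPLE)
print(model, sorted(features))
```

Running the same helper on the XML from two sources and taking the set difference shows exactly which feature names appear in one but not the other.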

nova.conf
[libvirt]
cpu_mode = host-model
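
When comparing nodes, it can help to read back the two options mentioned in this report (`[libvirt] cpu_mode` and `[workarounds] skip_cpu_compare_on_dest`). A rough sketch follows. Note the assumptions: nova itself reads its configuration through oslo.config rather than configparser, and the helper name `cpu_compare_settings` is invented here.

```python
import configparser


def cpu_compare_settings(conf_text: str) -> dict:
    """Extract the two options discussed in this report from nova.conf text.

    Hypothetical helper; a missing option is reported as None rather than
    guessing nova's built-in default.
    """
    # strict=False tolerates repeated keys; interpolation=None leaves '%' alone.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return {
        "cpu_mode": parser.get("libvirt", "cpu_mode", fallback=None),
        "skip_cpu_compare_on_dest": parser.get(
            "workarounds", "skip_cpu_compare_on_dest", fallback=None
        ),
    }


EXAMPLE = """
[libvirt]
cpu_mode = host-model

[workarounds]
skip_cpu_compare_on_dest = True
"""

print(cpu_compare_settings(EXAMPLE))
```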

Andrew Bonney (andrewbonney) wrote :

We are seeing the same issue having upgraded from Zed to 2023.1. This applies to freshly created VMs and to ones which existed prior to the upgrade in the same way. We have not changed libvirt or qemu versions as part of the upgrade.

libvirt: 8.0.0-1ubuntu7.6~cloud0
qemu: 1:4.2-3ubuntu6.27
nova: b9089ac

Andrew Bonney (andrewbonney) wrote :

A little more information from our case. I extracted the guest CPU and host CPU data from the _compare_cpu method; they are identical and are as follows:

<cpu><arch>x86_64</arch><model>Skylake-Server-IBRS</model><vendor>Intel</vendor><topology sockets="1" cores="18" threads="2"/><feature name="3dnowprefetch"/><feature name="abm"/><feature name="acpi"/><feature name="adx"/><feature name="aes"/><feature name="apic"/><feature name="arat"/><feature name="avx"/><feature name="avx2"/><feature name="avx512bw"/><feature name="avx512cd"/><feature name="avx512dq"/><feature name="avx512f"/><feature name="avx512vl"/><feature name="bmi1"/><feature name="bmi2"/><feature name="clflush"/><feature name="clflushopt"/><feature name="clwb"/><feature name="cmov"/><feature name="cx16"/><feature name="cx8"/><feature name="dca"/><feature name="de"/><feature name="ds"/><feature name="ds_cpl"/><feature name="dtes64"/><feature name="erms"/><feature name="est"/><feature name="f16c"/><feature name="fma"/><feature name="fpu"/><feature name="fsgsbase"/><feature name="fxsr"/><feature name="hle"/><feature name="ht"/><feature name="intel-pt"/><feature name="invpcid"/><feature name="invtsc"/><feature name="lahf_lm"/><feature name="lm"/><feature name="mca"/><feature name="mce"/><feature name="md-clear"/><feature name="mmx"/><feature name="monitor"/><feature name="movbe"/><feature name="mpx"/><feature name="msr"/><feature name="mtrr"/><feature name="nx"/><feature name="pae"/><feature name="pat"/><feature name="pbe"/><feature name="pcid"/><feature name="pclmuldq"/><feature name="pdcm"/><feature name="pdpe1gb"/><feature name="pge"/><feature name="pku"/><feature name="pni"/><feature name="popcnt"/><feature name="pse"/><feature name="pse36"/><feature name="rdrand"/><feature name="rdseed"/><feature name="rdtscp"/><feature name="rtm"/><feature name="sep"/><feature name="smap"/><feature name="smep"/><feature name="smx"/><feature name="spec-ctrl"/><feature name="ss"/><feature name="ssbd"/><feature name="sse"/><feature name="sse2"/><feature name="sse4.1"/><feature name="sse4.2"/><feature name="ssse3"/><feature name="stibp"/><feature name="syscall"/><feature 
name="tm"/><feature name="tm2"/><feature name="tsc"/><feature name="tsc-deadline"/><feature name="tsc_adjust"/><feature name="vme"/><feature name="vmx"/><feature name="x2apic"/><feature name="xgetbv1"/><feature name="xsave"/><feature name="xsavec"/><feature name="xsaveopt"/><feature name="xsaves"/><feature name="xtpr"/></cpu>

As a result, even calling self._host.compare_hypervisor_cpu with the host CPU XML results in a return code stating it is incompatible.

The same occurs if passing this XML to virsh at the command line:

# virsh cpu-compare cpu.xml
Host CPU is a superset of CPU described in cpu.xml

# virsh hypervisor-cpu-compare cpu.xml
CPU described in cpu.xml is incompatible with the CPU provided by hypervisor on the host
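
These two virsh verdicts line up with the numeric codes quoted in the bug description: "superset" is 2 and "incompatible" is 0 in libvirt's virCPUCompareResult enum. Below is a small sketch for decoding such codes when reading logs. The class and function names are invented for illustration, and the `ret > 0` acceptance rule is an assumption about how a caller might treat the codes, not a quote of nova's code.

```python
from enum import IntEnum


class CPUCompareResult(IntEnum):
    """Mirror of the values in libvirt's virCPUCompareResult enum."""

    ERROR = -1
    INCOMPATIBLE = 0
    IDENTICAL = 1
    SUPERSET = 2


def describe(ret: int) -> str:
    """Turn a raw comparison return code into a readable name."""
    try:
        return CPUCompareResult(ret).name
    except ValueError:
        return f"UNKNOWN({ret})"


def is_acceptable(ret: int) -> bool:
    """Assumed policy: only IDENTICAL and SUPERSET count as compatible."""
    return ret > 0


for code in (2, 0):
    print(code, describe(code), is_acceptable(code))
```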

Could it be that the host CPU model data is inappropriate for use with this new method? I'm not too familiar with libvirt, but 'virsh capabilities' lists far fewer features against the host CPU:

    <cpu>
      <arch>x86_64</arch>
      <model>Skylake-Server-IBRS</model>
      <vendor>Intel</vendor>...

Maxim Monin (maximmonin) wrote :

You are right. I ran some tests, and the format of the cpu2.xml file differs from that of the cpu.xml file:

```
virsh cpu-compare cpu.xml
Host CPU is a superset of CPU described in cpu.xml

virsh hypervisor-cpu-compare cpu.xml
CPU described in cpu.xml is incompatible with the CPU provided by hypervisor on the host

virsh hypervisor-cpu-compare cpu.xml --error
error: Failed to compare hypervisor CPU with cpu.xml
error: the CPU is incompatible with host CPU: Host CPU does not provide required features: ds, acpi, ht, tm, pbe, dtes64, monitor, ds_cpl, smx, est, tm2, xtpr, intel-pt

virsh domcapabilities > cpu2.xml
virsh hypervisor-cpu-compare cpu2.xml
CPU described in cpu2.xml is identical to the CPU provided by hypervisor on the host

virsh cpu-compare cpu2.xml
CPU described in cpu2.xml is incompatible with host CPU

virsh cpu-compare cpu2.xml --error
error: Failed to compare host CPU with cpu2.xml
error: the CPU is incompatible with host CPU: Host CPU does not provide required features: hypervisor, umip, ibpb, ibrs, amd-stibp, amd-ssbd, skip-l1dfl-vmentry, pschange-mc-no
```
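
To see how the two comparisons disagree, the feature lists can be extracted from the `--error` lines above and compared as sets. A minimal sketch, assuming the message keeps the "does not provide required features:" wording shown above. The function and variable names are made up for illustration.

```python
import re

# Matches the tail of the virsh '--error' message quoted above.
_MISSING = re.compile(r"does not provide required features:\s*(.+)$")


def missing_features(error_line: str) -> list[str]:
    """Extract the feature names from a virsh '--error' message line."""
    match = _MISSING.search(error_line.strip())
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


# Output of 'virsh hypervisor-cpu-compare cpu.xml --error' from above.
HYPERVISOR_ERR = (
    "error: the CPU is incompatible with host CPU: Host CPU does not provide "
    "required features: ds, acpi, ht, tm, pbe, dtes64, monitor, ds_cpl, smx, "
    "est, tm2, xtpr, intel-pt"
)
# Output of 'virsh cpu-compare cpu2.xml --error' from above.
HOST_ERR = (
    "error: the CPU is incompatible with host CPU: Host CPU does not provide "
    "required features: hypervisor, umip, ibpb, ibrs, amd-stibp, amd-ssbd, "
    "skip-l1dfl-vmentry, pschange-mc-no"
)

missing_in_hypervisor = set(missing_features(HYPERVISOR_ERR))
missing_in_host = set(missing_features(HOST_ERR))
print(sorted(missing_in_hypervisor))
print(sorted(missing_in_host))
print(missing_in_hypervisor & missing_in_host)  # overlap between the two lists
```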

Amit Uniyal (auniyal) wrote :
tags: added: libvirt live-migration
Andrew Bonney (andrewbonney) wrote :

As requested during the most recent Nova meeting, I've tried reverting 468b03e0ee4a917ae26106f6e57081bcd9e7a65b from stable/2023.1 and confirmed that this does resolve the issue. This matches what was found in the first post of this bug report.

Andrew Bonney (andrewbonney) wrote :

I should also add that in our case we have not explicitly configured an option for 'cpu_mode' in nova.conf.

Balazs Gibizer (balazs-gibizer) wrote :

This looks similar to https://bugs.launchpad.net/nova/+bug/2039803, but in that bug cpu_mode=host-model was set.

Andrew Bonney (andrewbonney) wrote :

Yes, I spotted that. I'll happily test out patches linked to that issue to see if this issue is also resolved.

Maxim Monin (maximmonin) wrote :

Still waiting for a fix.

Tyler Stachecki (tstachecki) wrote :

Not sure if this helps anyone* -- but:
Saw a similar backtrace today after upgrading to a newer version of qemu/libvirt.

Observed that Nova was attempting to live-migrate a CPU model that was incompatible with the destination hypervisor and was matching on "exact". So in this case, the failure was actually warranted from the perspective of the destination hypervisor.

I have to look further at our configuration to understand why ComputeCapabilitiesFilter did not filter out the migration before it got this far.

Gökhan (skylightcoder) wrote :

We are seeing the same issue after upgrading from Yoga to 2023.1, and we are also waiting for a fix. As a quick workaround, https://github.com/bbc/nova/commit/159869cde16fbd3e780a2a5bfa59e999890e6511 worked for us.
