Live-migrating an instance from 'Queens' (CentOS-7) to 'Train' (CentOS-8) fails during libvirt's compareCPU() check
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Kashyap Chamarthy |
Bug Description
[This bug was originally reported by Lukas Bezdicka when testing Red
Hat's OpenStack (OSP); but this should be reproducible in upstream
context as well. I'm writing this report based on the root cause
analysis in the environment where the bug occurred. Thanks to Daniel
Berrangé for the debugging help!]
Description
-----------
Live-migrating a guest from a 'Queens' compute node (running CentOS 7) to
a 'Train' compute node (running CentOS 8) fails with:
-------
[...]
_compare_cpu /usr/lib/
2021-01-26 23:30:25.169 7 ERROR nova.virt.
0
Refer to http://
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
[...]
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
2021-01-26 23:30:25.242 7 ERROR oslo_messaging.
[...]
-------
Environment
-----------
The bug was reported by testing in a nested KVM environment, running on
Intel hardware (Xeon(R) Gold 5218R CPU @ 2.10GHz), with the entire
OpenStack setup in VMs. So the Nova instances themselves will be nested
guests.
- Source: a CentOS-7 compute node (a level-1 guest), running OpenStack
'Queens'
- Destination: a CentOS-8 compute node (a level-1 guest), running
OpenStack 'Train'
Steps to reproduce
------------------
Live-migrate a guest from the source host to the destination host.
Expected result
---------------
Live migration should've succeeded.
Actual result
-------------
Live migration fails during compareCPU() check on the destination host
with:
[...]
_compare_cpu /usr/lib/
2021-01-26 23:30:25.169 7 ERROR nova.virt.
[...]
Root cause analysis
-------------------
Generate two files called "guest-full.xml" and "hostcaps-full.xml" as follows:
- guest-full.xml: this is generated in two steps: (1) from the
nova-compute.log, pick the <cpu> ... </cpu> element that Nova
unsuccessfully supplied to compareCPU() API and place it into a file
called "guest.xml"; then (2) run: `virsh cpu-baseline --features
guest.xml > guest-full.xml`
- hostcaps-full.xml: this is generated by running `virsh
domcapabilities |& tee hostcaps-full.xml` on the baremetal host.
Then compare the two, and the root cause is the following difference:
$> diff -u guest-full.xml hostcaps-full.xml | grep -E '^(-|\+)'
--- guest-full.xml 2021-01-27 17:22:20.655831989 +0000
+++ hostcaps-full.xml 2021-01-27 17:22:27.262779417 +0000
- <model fallback='forbid'>Skylake-Server-IBRS</model>
+ <model fallback='forbid'>Cascadelake-Server</model>
+ <feature policy='require' name='acpi'/>
+ <feature policy='require' name='arch-capabilities'/>
+ <feature policy='require' name='dca'/>
+ <feature policy='require' name='ds'/>
+ <feature policy='require' name='ds_cpl'/>
+ <feature policy='require' name='dtes64'/>
+ <feature policy='require' name='est'/>
- <feature policy='require' name='hypervisor'/>
- <feature policy='require' name='ibpb'/>
+ <feature policy='require' name='ht'/>
+ <feature policy='require' name='ibrs-all'/>
+ <feature policy='require' name='intel-pt'/>
+ <feature policy='require' name='invtsc'/>
+ <feature policy='require' name='mds-no'/>
+ <feature policy='require' name='monitor'/>
+ <feature policy='require' name='pbe'/>
+ <feature policy='require' name='pdcm'/>
+ <feature policy='require' name='rdctl-no'/>
+ <feature policy='require' name='skip-l1dfl-vmentry'/>
+ <feature policy='require' name='smx'/>
+ <feature policy='require' name='tm'/>
+ <feature policy='require' name='tm2'/>
- <feature policy='require' name='umip'/>
+ <feature policy='require' name='tsx-ctrl'/>
+ <feature policy='require' name='xtpr'/>
Those three features ['hypervisor', 'ibpb', 'umip'], which the guest CPU
requires but the host capabilities lack, are what cause the compareCPU()
method to return failure.
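The set difference behind those three features can be sketched in Python. This is only a minimal illustration, not what libvirt does internally: the two XML snippets are abbreviated stand-ins for guest-full.xml and hostcaps-full.xml, 'example-shared-feature' is a made-up placeholder for the features both files have in common, and the helper names are hypothetical. It also ignores any features implied by the CPU model name itself.

```python
import xml.etree.ElementTree as ET

# Abbreviated stand-in for guest-full.xml (assumption: only a few
# <feature> elements are reproduced here).
GUEST_XML = """
<cpu>
  <model fallback='forbid'>Skylake-Server-IBRS</model>
  <feature policy='require' name='hypervisor'/>
  <feature policy='require' name='ibpb'/>
  <feature policy='require' name='umip'/>
  <feature policy='require' name='example-shared-feature'/>
</cpu>
"""

# Abbreviated stand-in for hostcaps-full.xml (same caveat as above).
HOST_XML = """
<cpu>
  <model fallback='forbid'>Cascadelake-Server</model>
  <feature policy='require' name='ht'/>
  <feature policy='require' name='ibrs-all'/>
  <feature policy='require' name='tsx-ctrl'/>
  <feature policy='require' name='example-shared-feature'/>
</cpu>
"""


def required_features(cpu_xml):
    """Return the names of all <feature policy='require'> elements."""
    root = ET.fromstring(cpu_xml)
    # iter() walks the whole tree, so nesting depth does not matter.
    return {
        feature.get("name")
        for feature in root.iter("feature")
        if feature.get("policy") == "require"
    }


def guest_only_features(guest_xml, host_xml):
    """Features the guest requires that the host description lacks."""
    return sorted(required_features(guest_xml) - required_features(host_xml))


print(guest_only_features(GUEST_XML, HOST_XML))
# prints ['hypervisor', 'ibpb', 'umip']
```

Pointing the two helpers at the real files (read with open(...).read()) should list the same '-' lines that the diff above shows, assuming the files are well-formed XML.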
- - -
The reason for the above difference (I'm slightly paraphrasing Daniel
Berrangé from an OSP bug):
This difference reflects the design limitations of the original
compareCPU() [libvirt] API that Nova is using. This API compares
against the host physical CPUID. There are features in this CPUID
that KVM doesn't expose, and there are also features KVM emulates
which are not in the host CPUID. The latter is what's causing the
problem.
So the recommendation from the libvirt folks is:
- If Nova simply didn't call compareCPU at all, then libvirt would do
the CPU comparison itself during migration and "do the right thing".
- If Nova absolutely must do a CPU comparison itself, then it needs to
change to use compareHypervisorCPU() instead, which reflects the
CPUID that KVM is actually able to expose
(More on this here: /op...
https:/
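A minimal sketch of the second recommendation, assuming the libvirt-python bindings, where a connection object exposes compareCPU(xml, flags) and compareHypervisorCPU(emulator, arch, machine, virttype, xml, flags). The helper name guest_cpu_fits_host is hypothetical, and passing None for the first four arguments (assumed here to let libvirt pick its defaults) is a choice made for illustration, not necessarily what the actual Nova fix does.

```python
# Result codes mirroring libvirt's virCPUCompareResult enum.
VIR_CPU_COMPARE_ERROR = -1
VIR_CPU_COMPARE_INCOMPATIBLE = 0
VIR_CPU_COMPARE_IDENTICAL = 1
VIR_CPU_COMPARE_SUPERSET = 2


def guest_cpu_fits_host(conn, guest_cpu_xml, use_hypervisor_api=True):
    """Return True if the host can run a guest with the given <cpu> XML.

    conn is expected to behave like a libvirt-python virConnect object.
    With use_hypervisor_api=True the check is made against what the
    hypervisor (KVM) can expose; otherwise it falls back to the older
    compareCPU(), which compares against the host physical CPUID.
    """
    if use_hypervisor_api:
        # Assumption: None for emulator, arch, machine and virttype
        # leaves the choice of defaults to libvirt.
        result = conn.compareHypervisorCPU(
            None, None, None, None, guest_cpu_xml, 0)
    else:
        result = conn.compareCPU(guest_cpu_xml, 0)
    return result in (VIR_CPU_COMPARE_IDENTICAL, VIR_CPU_COMPARE_SUPERSET)
```

With a real connection, for example one obtained from libvirt.open('qemu:///system'), the call would be guest_cpu_fits_host(conn, guest_cpu_xml), where guest_cpu_xml is the <cpu> element Nova would otherwise hand to compareCPU().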