default cpu (qemu64) no more capable of nesting

Bug #1868692 reported by Christian Ehrhardt  on 2020-03-24
Affects                     Importance   Assigned to
Release Notes for Ubuntu    Undecided    Christian Ehrhardt
libvirt (Ubuntu)            Undecided    Unassigned
qemu (Ubuntu)               Undecided    Christian Ehrhardt

Bug Description

TL;DR: this is the time to decide whether to drop debian/patches/ubuntu/expose-vmx_qemu64cpu.patch or to update it.

Default nesting issue:
uvt-kvm create --memory 2048 --cpu 4 --disk 16 --password=ubuntu focal-kvm release=focal arch=amd64 label=daily
Default CPU used is:
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='require' name='vmx'/> <-- even has VMX enabled
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='lahf_lm'/>
    <feature policy='disable' name='svm'/>
  </cpu>
Guest:
uvt-kvm create --disk 5 --machine-type ubuntu --password=ubuntu focal-2nd-lvm release=focal arch=amd64 label=daily

It comes down to a module that cannot be loaded in the level 1 guest:

$ sudo modprobe kvm_intel
modprobe: ERROR: could not insert 'kvm_intel': Input/output error

Try the same with host-passthrough to check if it is the (default) cpu type

  <cpu mode='host-passthrough' check='none'/>

$ kvm-ok
INFO: /dev/kvm exists
KVM acceleration can be used

  <cpu mode='host-model' check='none'/>

Even adapting the qemu64 type to represent the features of Haswell didn't work.
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='lahf_lm'/>
    <feature policy='disable' name='svm'/>
    <feature policy='require' name='aes'/>
    <feature policy='require' name='avx'/>
    <feature policy='require' name='avx2'/>
    <feature policy='require' name='bmi1'/>
    <feature policy='require' name='bmi2'/>
    <feature policy='require' name='erms'/>
    <feature policy='require' name='fma'/>
    <feature policy='require' name='fsgsbase'/>
    <feature policy='require' name='invpcid'/>
    <feature policy='require' name='movbe'/>
    <feature policy='require' name='pcid'/>
    <feature policy='require' name='pclmuldq'/>
    <feature policy='require' name='popcnt'/>
    <feature policy='require' name='rdtscp'/>
    <feature policy='require' name='smep'/>
    <feature policy='require' name='spec-ctrl'/>
    <feature policy='require' name='sse4.1'/>
    <feature policy='require' name='sse4.2'/>
    <feature policy='require' name='ssse3'/>
    <feature policy='require' name='tsc-deadline'/>
    <feature policy='require' name='xsave'/>
    <feature policy='require' name='ss'/>
    <feature policy='require' name='vme'/>
    <feature policy='require' name='pat'/>
    <feature policy='require' name='rdrand'/>
    <feature policy='require' name='f16c'/>
    <feature policy='require' name='arat'/>
    <feature policy='require' name='tsc_adjust'/>
    <feature policy='require' name='umip'/>
    <feature policy='require' name='md-clear'/>
    <feature policy='require' name='stibp'/>
    <feature policy='require' name='arch-capabilities'/>
    <feature policy='require' name='ssbd'/>
    <feature policy='require' name='xsaveopt'/>
    <feature policy='require' name='pdpe1gb'/>
    <feature policy='require' name='abm'/>
    <feature policy='require' name='ibpb'/>
    <feature policy='require' name='amd-ssbd'/>
    <feature policy='require' name='skip-l1dfl-vmentry'/>
  </cpu>

The reason is that VMX is now set via subfeatures, so even with the same "input" definition the guest loses features.

60a63,68
> tpr_shadow
> vnmi
> flexpriority
> ept
> vpid
> ept_ad

This depends only on the userspace stack (the qemu upgrade), due to this change:
https://git.qemu.org/?p=qemu.git;a=commit;h=0723cc8a5558c94388db75ae1f4991314914edd3

Even the same commandline will deliver different results:

Eoan vs Focal
E:
-cpu qemu64,vmx=on,x2apic=on,hypervisor=on,lahf_lm=on,svm=off,aes=on,avx=on,avx2=on,bmi1=on,bmi2=on,erms=on,fma=on,fsgsbase=on,invpcid=on,movbe=on,pcid=on,pclmuldq=on,popcnt=on,rdtscp=on,smep=on,spec-ctrl=on,sse4.1=on,sse4.2=on,ssse3=on,tsc-deadline=on,xsave=on,ss=on,vme=on,pat=on,rdrand=on,f16c=on,arat=on,tsc_adjust=on,umip=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaveopt=on,pdpe1gb=on,abm=on,ibpb=on,amd-ssbd=on
F:
-cpu qemu64,vmx=on,x2apic=on,hypervisor=on,lahf-lm=on,svm=off,aes=on,avx=on,avx2=on,bmi1=on,bmi2=on,erms=on,fma=on,fsgsbase=on,invpcid=on,movbe=on,pcid=on,pclmulqdq=on,popcnt=on,rdtscp=on,smep=on,spec-ctrl=on,sse4.1=on,sse4.2=on,ssse3=on,tsc-deadline=on,xsave=on,ss=on,vme=on,pat=on,rdrand=on,f16c=on,arat=on,tsc-adjust=on,umip=on,md-clear=on,stibp=on,arch-capabilities=on,ssbd=on,xsaveopt=on,pdpe1gb=on,abm=on,ibpb=on,amd-ssbd=on

Just remaining differences:
-lahf_lm=on
+lahf-lm=on
-pclmuldq=on
+pclmulqdq=on
-tsc_adjust=on
+tsc-adjust=on
=> args renamed
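
To confirm that the two command lines differ only by renamed arguments, they can be compared after normalizing the names. The following is a sketch under assumptions: the function names are hypothetical, the alias map contains only the one spelling change visible in the diff above, and the sample strings are shortened versions of the real -cpu arguments.

```python
def parse_cpu_arg(arg):
    """Split 'model,feat=on,...' into (model, {feature: value})."""
    model, *props = arg.split(",")
    features = dict(p.split("=", 1) for p in props)
    return model, features


# Assumption: alias map built only from the renames visible in the diff above.
ALIASES = {"pclmuldq": "pclmulqdq"}


def normalize(name):
    """Map a property name to one spelling (underscores become hyphens)."""
    name = name.replace("_", "-")
    return ALIASES.get(name, name)


def same_features(a, b):
    """True if both -cpu strings name the same model and feature settings."""
    model_a, feats_a = parse_cpu_arg(a)
    model_b, feats_b = parse_cpu_arg(b)
    norm_a = {normalize(k): v for k, v in feats_a.items()}
    norm_b = {normalize(k): v for k, v in feats_b.items()}
    return model_a == model_b and norm_a == norm_b


# Shortened samples (assumption); the real strings are listed above.
eoan = "qemu64,vmx=on,lahf_lm=on,pclmuldq=on,tsc_adjust=on,svm=off"
focal = "qemu64,vmx=on,lahf-lm=on,pclmulqdq=on,tsc-adjust=on,svm=off"
print(same_features(eoan, focal))
```

If this reports equality for the full strings, the difference in guest CPU flags has to come from qemu itself and not from what libvirt passes in.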

But CPU flags change a lot:
-tpr_shadow
-vnmi
-flexpriority
-ept
-vpid
-ept_ad

Due to the commit above our old Delta in debian/patches/ubuntu/expose-vmx_qemu64cpu.patch which exposed VMX by default on qemu64 (for ease of use) isn't working as-is anymore.

We'll need to either accept the degradation (to be closer to upstream), which will also be an upgrade regression for some users, or fix the bug by changing the patch to match what was added to the kvm64 type in the commit above.

+ /* VMX features from Cedar Mill/Prescott */
+ .features[FEAT_VMX_ENTRY_CTLS] = VMX_VM_ENTRY_IA32E_MODE,
+ .features[FEAT_VMX_EXIT_CTLS] = VMX_VM_EXIT_ACK_INTR_ON_EXIT,
+ .features[FEAT_VMX_MISC] = MSR_VMX_MISC_ACTIVITY_HLT,
+ .features[FEAT_VMX_PINBASED_CTLS] = VMX_PIN_BASED_EXT_INTR_MASK |
+ VMX_PIN_BASED_NMI_EXITING,
+ .features[FEAT_VMX_PROCBASED_CTLS] = VMX_CPU_BASED_VIRTUAL_INTR_PENDING |
+ VMX_CPU_BASED_USE_TSC_OFFSETING | VMX_CPU_BASED_HLT_EXITING |
+ VMX_CPU_BASED_INVLPG_EXITING | VMX_CPU_BASED_MWAIT_EXITING |
+ VMX_CPU_BASED_RDPMC_EXITING | VMX_CPU_BASED_RDTSC_EXITING |
+ VMX_CPU_BASED_CR8_LOAD_EXITING | VMX_CPU_BASED_CR8_STORE_EXITING |
+ VMX_CPU_BASED_TPR_SHADOW | VMX_CPU_BASED_MOV_DR_EXITING |
+ VMX_CPU_BASED_UNCOND_IO_EXITING | VMX_CPU_BASED_USE_IO_BITMAPS |
+ VMX_CPU_BASED_MONITOR_EXITING | VMX_CPU_BASED_PAUSE_EXITING,
         .xlevel = 0x80000008,
         .model_id = "Common KVM processor"


Changed in qemu (Ubuntu):
status: New → Triaged
assignee: nobody → Christian Ehrhardt  (paelzer)

With the fix applied, the most reduced case is:

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='lahf_lm'/>
    <feature policy='disable' name='svm'/>
  </cpu>

Which ends up as:
 -cpu qemu64

That is without VMX (as expected)

When adding VMX as a feature (enabling it didn't work before):

  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='require' name='vmx'/>

=> Nested KVM works
=> It got a smaller subset of cpuflags than in the past - just vmx + tpr_shadow

And if started without any <cpu> tag at all it gets:
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='lahf_lm'/>
    <feature policy='disable' name='svm'/>
  </cpu>

So VMX is not auto-added here as it was in the past.
We can enable VMX now (ok for upgraders of old guests), but the result of the same XML/commandline is still too different.

Maybe I need to re-add the main flag as well, let's do a rebuild overnight.

With the current build:
sudo qemu-system-x86_64 --enable-kvm --nographic --nodefaults -S -qmp-pretty stdio
{"execute":"qmp_capabilities"}
{"execute":"query-cpu-definitions"}
        {
            "name": "qemu64",
            "typename": "qemu64-x86_64-cpu",
            "unavailable-features": [
                "svm"
            ],
            "static": false,
            "migration-safe": true
        },
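
Picking one model out of the long query-cpu-definitions reply is easier with a small filter. This is a sketch only: `find_model` is a hypothetical helper, and the sample reply is reduced to the single qemu64 entry shown above, wrapped in the usual QMP "return" key.

```python
import json


def find_model(reply_text, name):
    """Return the CPU definition entry with the given name, or None."""
    reply = json.loads(reply_text)
    for entry in reply.get("return", []):
        if entry.get("name") == name:
            return entry
    return None


# Reduced sample reply (assumption): only the qemu64 entry from above.
sample_reply = """
{"return": [
  {"name": "qemu64", "typename": "qemu64-x86_64-cpu",
   "unavailable-features": ["svm"], "static": false, "migration-safe": true}
]}
"""
entry = find_model(sample_reply, "qemu64")
print(entry["unavailable-features"])
```

With the real reply saved to a file, the same function can be used to check whether vmx ever shows up under "unavailable-features" for a given model.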

vmx isn't in /usr/share/libvirt/cpu_map/x86_qemu64.xml, but it wasn't there before either and got auto-added. So let's recheck the new build before tweaking too many knobs at once (XML changes can be done at runtime without a rebuild).

With the next build I got something even closer to the old behavior.

No CPU tag gave me:
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>qemu64</model>
    <feature policy='require' name='vmx'/>
    <feature policy='require' name='x2apic'/>
    <feature policy='require' name='hypervisor'/>
    <feature policy='require' name='lahf_lm'/>
    <feature policy='disable' name='svm'/>
  </cpu>

I'm able to disable it via
   <feature policy='disable' name='vmx'/>
which becomes
   -cpu qemu64,vmx=off

With that change the former upgrade bug would be fixed.
The one decision we have to make is which features we announce on this cpu type.
In the past it took all the features it could, and so could differ per host cpu. Now it takes a rather conservative, small set (but enough to do nesting).
The new default is more compatible, and that is what the default type is for.
For a multitude of features it was always better to pick a defined (modern) named type.
Therefore the best choice is to keep this small set. For people who upgrade on newer systems this is a slight loss in guest cpu features, but only of VMX speedup bits. There is no set of features we could choose that avoids this; if we picked more, older chips would run into issues.

There seems to be no perfect answer, but the current patch at least fixes the identified bigger upgrade issue and is the most compatible one. The more recommended modern named types (also the default in any higher level management tool) provide a way to get consistent and feature rich guest CPUs.

I wanted another round of tests, and it seems that was a good idea.
With the subset of VMX features it fails to start the nested guest.

KVM: entry failed, hardware error 0x80000021

That is a known issue on older HW due to a lack of VMX features and bugs in the kernel around it.
It came back and was fixed multiple times over the generations of VMX support.

Since I aligned my changes with those in kvm64, I tried that type to see whether, on the HW/kernel I have, it crashes as well.

As expected, by default it has only the VMX subfeatures enabled, but not the feature itself. That is the same as I saw on qemu64 before I added back the main flag.
Enabling vmx via the XML like:
  <feature policy='require' name='vmx'/>
Gives a qemu commandline like:
  -cpu kvm64,vmx=on

With that the kvm module can be loaded in the guest and it works.
So it seems I can only make qemu64 handle VMX by default the same way the kvm64 type does; otherwise things break (through mismatched sub/main features).

Old upgraders will have qemu64 + vmx=on in the XML.
We can save those with the qemu fix.

After this fix we will be less broken than we are right now (vmx can be enabled on qemu64 and upgraders mostly work), but not 100% the same (you need to enable VMX on the cmdline or in the XML).

To further smooth this we might add vmx to /usr/share/libvirt/cpu_map/x86_qemu64.xml.
Giving that a try after a bunch of rebuilds.

Changed in libvirt (Ubuntu):
status: New → Triaged

The libvirt type doesn't add vmx=on (as expected since it defines the existing features).

Test:
- no type - qemu64 type without vmx=on, no nesting as-is
- qemu64 type - qemu64 type without vmx=on, no nesting as-is
- qemu64 with vmx - nested working
- kvm64 type - kvm64 type without vmx=on, no nesting as-is
- kvm64 type with vmx - nested working

So with the fixes in qemu we can:
- again enable VMX on this cpu type
- nesting works (not running into the crashes mentioned further above)

It does not yet make nesting available if you don't do anything.
Since overall this change makes things (e.g. exposed guest features) safer, let's add a comment to the release notes. That might be a good chance to mention host-model and named CPU types anyway, but otherwise keep it at that level.

Changed in libvirt (Ubuntu):
status: Triaged → Invalid
Changed in ubuntu-release-notes:
assignee: nobody → Christian Ehrhardt  (paelzer)

TODO: I want to do some further tests on this with the intended uploads

Q: Did I see this crash appear and vanish by accident, and is it still there when the VMX flag is not set? This uses the code that no longer adds the VMX flag (and therefore doesn't enable it by default).

- qemu64
  - nested can't be used
- qemu64,vmx=on
  - nested can be used
  => Still Crashes!
- kvm64
  - nested can't be used
- kvm64,vmx=on
  - nested can be used
  - does not crash

So qemu64 has a deficiency (as intended by upstream) and is unable to run VMX.
Our former change adding the full VMX just had it set vmx=on in the cmdline and thereby exposed the bug quicker. We need to identify the bug in qemu64 OR use this chance to stop our qemu64 from being different.

qemu64 also appears as an AMD/SVM cpu. I think it really is time to stop our qemu64 type from being special, mention that (and better alternatives) in the release notes, and close this issue.

All tools other than uvtool already have a modern default.
- virt-manager and such use a host-model like named cpu-type by default
- openstack uses named types as well
- multipass uses host-passthrough (no migration in mind)

The only thing open would be uvtool and to eventually resolve that I filed bug 1869185

I recently saw a bug report that had a very similar signature but used host-passthrough. I was concerned enough about that failure to double check whether it happens on the new builds as well.

host-model gave me
  <cpu mode='custom' match='exact' check='full'>
    <model fallback='forbid'>Haswell-noTSX-IBRS</model>
    <vendor>Intel</vendor>
    ... long list of extra features ...
=> no crash

host-passthrough
=> worked as well

Ok, so far so good - the report must really have been buggy HW/FW then (they resolved it by disabling some kvm_intel options). But that re-check was needed to feel a bit safer.

I'll run a full regression check (which found the initial issue) again before upload.

Tests are good still/again - the one failure is known and completely unrelated to Focal

prep (x86_64) : Pass 20 F/S/N 0/0/0 - RC 0 (15 min 55036 lin)
migrate (x86_64) : Pass 288 F/S/N 0/0/0 - RC 0 (60 min 214809 lin)
cross (x86_64) : Pass 24 F/S/N 0/1/3 - RC 0 (51 min 47738 lin)
misc (x86_64) : Pass 103 F/S/N 0/0/0 - RC 0 (39 min 72305 lin)

prep (s390x) : Pass 20 F/S/N 0/0/0 - RC 0 (14 min 43627 lin)
migrate (s390x) : Pass 268 F/S/N 0/5/0 - RC 0 (66 min 161708 lin)
cross (s390x) : Pass 19 F/S/N 1/1/2 - RC 1 (47 min 45401 lin)
misc (s390x) : Pass 67 F/S/N 0/0/0 - RC 0 (24 min 32119 lin)

And we got the FFe ack.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qemu - 1:4.2-3ubuntu4

---------------
qemu (1:4.2-3ubuntu4) focal; urgency=medium

  * d/p/ubuntu/lp-1835546-*: backport the s390x protvirt feature (LP: #1835546)
  * remove d/p/ubuntu/expose-vmx_qemu64cpu.patch: Stop adding VMX to qemu64
    to avoid broken nesting (LP: #1868692)

 -- Christian Ehrhardt <email address hidden> Fri, 20 Mar 2020 08:02:16 +0100

Changed in qemu (Ubuntu):
status: Triaged → Fix Released
Changed in ubuntu-release-notes:
status: New → In Progress
Changed in ubuntu-release-notes:
status: In Progress → Fix Released