VMware nested under KVM stopped working on 5.19 kernel on Ryzen: Invalid VMCB.

Bug #2008583 reported by dfdfdf
This bug affects 6 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

My setup:

Host: Ubuntu Linux 22.04.1, kernel 5.15, Ryzen 5900x
Hypervisor: KVM
Windows VM with VMware Workstation 16

Running nested VMs in VMware works OK.

It fails once the Linux host kernel is updated to 5.19 (22.04.2), with:

MONITOR PANIC: Invalid VMCB.
VMware Workstation unrecoverable error: (vcpu-0)
vcpu-0:Invalid VMCB.

If I update to Workstation 17, it fails with:

2023-02-24T15:21:35.112Z In(05) vmx The following features are required for SVM support in VMware Workstation; however, these features are not available on this host:
2023-02-24T15:21:35.112Z In(05) vmx Flush by ASID.
This host supports AMD-V, but the AMD-V implementation is incompatible with VMware Workstation.
VMware Workstation does not support the user level monitor on this host.
Module 'MonitorMode' power on failed.

The same VM works OK on a 5.19 kernel on my Intel PC.

So the problem is specific to the combination of the updated Linux kernel, AMD virtualization, and VMware.

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 2008583

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
dfdfdf (lindt)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Juan S. Perales (jsperales) wrote :

Hi Everyone!

Found the same bug on my system:

Host: Ubuntu 22.04.2, kernel 5.15, Ryzen 5950x
Hypervisor: KVM
VM: ESXi 7.0u3

ESXi works fine and I'm able to start VMs on it (nested virtualization)

After updating the kernel to 5.19 or 6.0-oem I get this error anytime I try to start a VM on the ESXi:
vcpu-0:Invalid VMCB

I also found the same error/problems on other forums after the kernel update (it seems to affect only AMD platforms).

dfdfdf (lindt) wrote :

OK, I've checked the Ubuntu mainline kernel builds as of today:
WORKS  5.15.110 linux-image-unsigned-5.15.110-0515110-generic_5.15.110-0515110.202304302037_amd64.deb
BROKEN 5.16-rc1 linux-image-unsigned-5.16.0-051600rc1-generic_5.16.0-051600rc1.202111142330_amd64.deb

So a change merged for 5.16-rc1 broke this.

PS: it's still broken on 6.3.1.

dfdfdf (lindt) wrote :

OK, found the culprit:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=174a921b6975ef959dd82ee9e8844067a62e3ec1

"nSVM: Check for reserved encodings of TLB_CONTROL in nested VMCB"

VMware errors out when it runs a nested VMCB with tlb_ctl = 3 and KVM's new check rejects it.

I'll email the author and ask for a revert or a fix.
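
For anyone who wants to see the logic at a glance, here is a minimal Python model of the check as I read that commit. It is only a sketch: the real code is C in KVM's nested SVM code, the function name below is made up, and the accepted set (0 and 1 only) is my paraphrase of the commit. The four constants are the TLB_CONTROL encodings from the kernel's SVM header.

```python
# Hypothetical model, not kernel code. Encodings of the VMCB TLB_CONTROL field.
TLB_CONTROL_DO_NOTHING = 0
TLB_CONTROL_FLUSH_ALL_ASID = 1
TLB_CONTROL_FLUSH_ASID = 3        # needs the "Flush by ASID" CPU feature
TLB_CONTROL_FLUSH_ASID_LOCAL = 7  # needs the "Flush by ASID" CPU feature

# Assumption: the check only lets through the encodings that do not
# depend on Flush by ASID, since KVM does not offer that feature to the
# nested hypervisor on the affected kernels.
ACCEPTED_TLB_CTL = {TLB_CONTROL_DO_NOTHING, TLB_CONTROL_FLUSH_ALL_ASID}


def nested_tlb_ctl_ok(tlb_ctl: int) -> bool:
    """Return True if the nested VMCB's tlb_ctl value would pass the check."""
    return tlb_ctl in ACCEPTED_TLB_CTL


if __name__ == "__main__":
    for value in (0, 1, 3, 7):
        verdict = "accepted" if nested_tlb_ctl_ok(value) else "rejected"
        print(f"tlb_ctl={value}: {verdict}")
```

If this reading is right, it would also explain the Workstation 17 message about "Flush by ASID" not being available: VMware wants that feature, and with it missing the value 3 is treated as invalid.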

dfdfdf (lindt) wrote :

You can check it with bpftrace on kernel 5.19.0-41-generic #42~22.04.1-Ubuntu:
sudo bpftrace -e 'kprobe:__nested_vmcb_check_controls { printf("tlb_ctl: %d\n", *((uint8 *)arg1+60) )}'
The output ends like this:
...
tlb_ctl: 1
tlb_ctl: 0
tlb_ctl: 0
tlb_ctl: 0
tlb_ctl: 0
tlb_ctl: 0
tlb_ctl: 0
tlb_ctl: 0
tlb_ctl: 3 <<<<<<<<<<
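
To pick the offending calls out of a longer trace, a small script along these lines could help. It is a sketch that assumes the output format of the bpftrace one-liner above (one "tlb_ctl: N" per line) and assumes 0 and 1 are the only values the check accepts; the function name is made up for this example.

```python
import re

# Matches lines such as "tlb_ctl: 3" (trailing text is ignored).
TLB_CTL_LINE = re.compile(r"\s*tlb_ctl:\s*(\d+)")


def unsupported_tlb_ctl(lines, accepted=(0, 1)):
    """Return (line_number, value) for every tlb_ctl value not in `accepted`.

    `accepted` is an assumption about which values pass the kernel check.
    Lines that do not look like trace output are skipped.
    """
    hits = []
    for number, line in enumerate(lines, start=1):
        match = TLB_CTL_LINE.match(line)
        if match and int(match.group(1)) not in accepted:
            hits.append((number, int(match.group(1))))
    return hits


if __name__ == "__main__":
    sample = ["...", "tlb_ctl: 1", "tlb_ctl: 0", "tlb_ctl: 3 <<<<<<<<<<"]
    print(unsupported_tlb_ctl(sample))  # prints [(4, 3)]
```

Saving the bpftrace output to a file and passing its lines to the function gives the same result for a full trace.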

Ricardo Esteves (mvrk-k) wrote :

Hi,

Does anyone know if there is already a fix for this?
I have the same problem with kernel 6.2 on an AMD EPYC 7313P CPU.

Nicolas Ballet (continuum) wrote :

Hi, I’ve got the exact same problem on an AMD EPYC 7763.
I can run a nested ESXi 7.0+ (and start VMs on top of it) with kernel 5.15.
It no longer works with kernel 6.2 (I haven’t tried 5.19).
I can’t even virtualize ESXi 8.0; it fails with: this host supports AMD-V, but the AMD-V implementation is incompatible

I was also able to reproduce the issue with Debian 11 and 12.
Any help would be greatly appreciated, thanks

Nicolas Ballet (continuum) wrote :

Sorry for the noob question but how do I compile the custom kernel?
What is the fix please?

Any help would be greatly appreciated.
Thank you.

NoOverflow (nooverflow) wrote :

Can confirm, bug occurs with my setup too:

Host: CentOS Stream 9 (5.14.0-352.el9.x86_64)
Hypervisor: KVM (OpenStack)
VM: ESXi 7.0u3

Patching the commit @lindt found and recompiling a custom kernel allowed me to start a VM in an ESXi VM nested in OpenStack.

Be aware that I needed to delete the previously crashing VM in ESXi and create a new one. Failing to do so resulted in ESXi crashing completely and restarting.

Can provide additional logs or data if needed.

Ozzie (ozzie72) wrote :

So what's the current status of this bug? Is it now fixed in kernel 6.8-rc1 and above? Do older kernels have to be patched?

Mattia Rizzolo (mapreri) wrote :

HWE is not handled by the backports team, so I'm unsubscribing us.

dfdfdf (lindt) wrote :

Still not fixed in jammy as of 6.5.0-35.
