VMs running under qemu+kvm can't use xsaves processor feature

Bug #1950706 reported by Aaron Vodney
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  Undecided   Unassigned
qemu (Ubuntu)   Invalid    Undecided   Unassigned

Bug Description

Virtual machines run under qemu with KVM acceleration and `-cpu host` can't use the xsaves feature of the processor. An example guest dmesg log is attached; see the errors about "unchecked MSR access error". Using a `-cpu` model that doesn't advertise xsaves, such as `EPYC`, fixes the problem in the guest. The host CPU does support xsaves, and Ubuntu on the host doesn't seem to have any issues with it.
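
For anyone reproducing this, a minimal sketch of one way to confirm whether xsaves is advertised to a Linux guest (or the host) by reading the `flags` line of /proc/cpuinfo. The helper names `cpu_flags` and `has_flag` are made up for illustration and are not part of any tool mentioned in this report.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text):
    """Return the feature flags of the first CPU found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Match the plain "flags" key only, not e.g. "vmx flags".
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_flag(cpuinfo_text, flag):
    """True if the given flag (e.g. 'xsaves') is advertised."""
    return flag in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    # Assumption: running on Linux, where /proc/cpuinfo exists.
    text = Path("/proc/cpuinfo").read_text()
    print("xsaves advertised:", has_flag(text, "xsaves"))
```

Run inside the guest, this only shows what the guest is told about the CPU; it says nothing about whether the MSR accesses actually succeed, which is what the dmesg errors are about.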

The attached guest dmesg is from a guest running Linux mainline 5.15, but the issue happens with every kernel version I've tried, including CentOS on 4.18.0.

The host CPU is a Ryzen 5 3600.

Nested guests also have trouble running inside a guest started with `-cpu host` or even `-cpu EPYC`. They tend to start, but freeze at random times before they finish booting. Not sure if it's related. Nested guests do seem to be able to run with `-cpu kvm64`. There are no dmesg errors when a nested guest freezes.

The Ubuntu host version:
Description: Ubuntu 20.04.3 LTS
Release: 20.04

# apt-cache policy qemu-system-x86
qemu-system-x86:
  Installed: 1:4.2-3ubuntu6.18
  Candidate: 1:4.2-3ubuntu6.18
  Version table:
 *** 1:4.2-3ubuntu6.18 500
        500 http://us.archive.ubuntu.com/ubuntu focal-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     1:4.2-3ubuntu6.17 500
        500 http://us.archive.ubuntu.com/ubuntu focal-security/main amd64 Packages
     1:4.2-3ubuntu6 500
        500 http://us.archive.ubuntu.com/ubuntu focal/main amd64 Packages

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: qemu (not installed)
ProcVersionSignature: Ubuntu 5.4.0-89.100-generic 5.4.143
Uname: Linux 5.4.0-89-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu27.21
Architecture: amd64
CasperMD5CheckResult: pass
Date: Fri Nov 12 00:19:10 2021
InstallationDate: Installed on 2021-10-20 (22 days ago)
InstallationMedia: Ubuntu-Server 20.04.3 LTS "Focal Fossa" - Release amd64 (20210824)
MachineType: System manufacturer System Product Name
ProcEnviron:
 SHELL=/bin/bash
 LANG=en_US.UTF-8
 TERM=screen-256color
 XDG_RUNTIME_DIR=<set>
 PATH=(custom, no user)
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.4.0-89-generic root=UUID=15509878-abfa-45b1-b8c2-c99db1ececf7 ro
SourcePackage: qemu
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 06/15/2021
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 4002
dmi.board.asset.tag: Default string
dmi.board.name: PRIME X570-P
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev X.0x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr4002:bd06/15/2021:svnSystemmanufacturer:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnPRIMEX570-P:rvrRevX.0x:cvnDefaultstring:ct3:cvrDefaultstring:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

Aaron Vodney (tavi-vi) wrote :

I upgraded to 21.04 to see if it's fixed there, and it is. So this may just need to be closed, considering the imminent retiring of 20.04.

Christian Ehrhardt  (paelzer) wrote :

If there is an individual fix that does not regress other functionality, we can consider backporting it to 20.04, so it depends on what this is about.

If OTOH you just want 20.04 in general but a newer stack of the virtualization software to run your newer HW you might consider trying [1].

I have not seen the very same on other EPYC Hardware last time I used it, but such issues often depend on the specific CPU and firmware levels.

If you could help isolate what we are actually looking for, that would be great.
To do that I'd ask you to start with the PPA [1] I linked and the HWE kernels [2].
With those, please upgrade kernel / qemu / libvirt individually. Hopefully just one of them is needed to fix this for you; in that case, that is the component we need to search for individual changes, and we can then provide you builds of it for 20.04+fix to test.

[1]: https://launchpad.net/~canonical-server/+archive/ubuntu/server-backports
[2]: https://wiki.ubuntu.com/Kernel/LTSEnablementStack

Changed in qemu (Ubuntu):
status: New → Incomplete
Aaron Vodney (tavi-vi) wrote :

I installed the HWE kernel on 20.04, and the errors in guests are gone. Nested virtualization also works on the HWE kernel.
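
When comparing results like this, it helps to be sure which kernel series a host is actually booted into. Below is a rough sketch that parses the running kernel release; it assumes the Focal GA kernel is the 5.4 series (as in the `Uname` line of this report) and treats anything else as "not GA". The function names are hypothetical.

```python
import platform
import re


def kernel_series(release):
    """Return (major, minor) parsed from a release string such as '5.4.0-89-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_focal_ga_kernel(release):
    # Assumption: the Focal GA kernel is the 5.4 series; HWE kernels are newer.
    return kernel_series(release) == (5, 4)


if __name__ == "__main__":
    rel = platform.release()
    verdict = "5.4 (GA) series" if is_focal_ga_kernel(rel) else "not the 5.4 series"
    print(f"{rel}: {verdict}")
```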

Christian Ehrhardt  (paelzer) wrote :

Thanks for trying these different things, Aaron.
That test result means we are actually looking for a kernel commit/fix.

The HWE kernel is, among other things, meant to provide new features and better support for new hardware - so in your case you can just continue to use that.

@Kernel Team - I'd still ask you to check if there is an individual patch for xsaves on those chips that would make sense to backport to the non-HWE Focal kernel. I've added a bug task for that.

Changed in qemu (Ubuntu):
status: Incomplete → Invalid
Changed in linux (Ubuntu):
status: New → Confirmed
Christian Ehrhardt  (paelzer) wrote :

To help this a bit, the two commits I've seen (there might be others) are
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=52297436199dd
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7204160eb7809

Of those, the former has been in Focal's 5.4 since Ubuntu-kvm-5.4.0-1036.37.
The latter is missing.

Unless the kernel team says other commits are needed, making this too complex, maybe a Focal 5.4 kernel with the second commit backported would allow Aaron to test whether this commit is enough on his HW?

OTOH, the second commit should actually only cache the value, so maybe other commits are needed anyway. I haven't had more time to look for less obvious commits, and the kernel team will be better and faster at doing so anyway.
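
For checking which of these upstream commits a given distro kernel tree already carries, here is a hedged sketch. Backported commits get new hashes, so it scans commit message text (for example the output of `git log --format=%B` over the range of interest) for the reference markers commonly used in stable and Ubuntu kernel commits. The marker list is an assumption and may not cover every style, and the function names are illustrative only.

```python
import re

# Marker styles commonly seen in stable and Ubuntu kernel commit messages.
# Assumption: this list is not exhaustive.
_MARKERS = (
    r"commit\s+([0-9a-f]{7,40})\s+upstream",
    r"\[\s*Upstream commit\s+([0-9a-f]{7,40})\s*\]",
    r"\((?:cherry picked|backported) from commit\s+([0-9a-f]{7,40})",
)


def referenced_upstream_commits(log_text):
    """Collect every upstream hash referenced by a known marker."""
    found = set()
    for pattern in _MARKERS:
        for sha in re.findall(pattern, log_text, flags=re.IGNORECASE):
            found.add(sha.lower())
    return found


def is_backported(log_text, upstream_sha):
    """True if some marker references upstream_sha (abbreviated hashes allowed)."""
    wanted = upstream_sha.lower()
    return any(
        sha.startswith(wanted) or wanted.startswith(sha)
        for sha in referenced_upstream_commits(log_text)
    )
```

With the log text of the Focal 5.4 tree in `log_text`, `is_backported(log_text, "7204160eb7809")` returning False would match the observation above that the second commit is missing; a miss can also just mean the backport used a marker style not listed here.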

Jeffrey Hugo (quic-jhugo) wrote :

This issue is still present today, on 5.4.0-153.170.
It seems like the resolution has stalled. What can be done to get a fix out?
