KVM VM with GPU passthrough won't start

Bug #2107285 reported by Oliver Bambrough
This bug affects 1 person
Affects                  Status  Importance  Assigned to  Milestone
linux-firmware (Ubuntu)  New     Undecided   Unassigned

Bug Description

Host OS:
Ubuntu 24.04.2 LTS
Kernel 6.11.0-21-generic
CPU: AMD Ryzen 9 5900X
BIOS firmware version: F2
GPU 1: AMD Radeon RX 6400 (Used by Host OS)
GPU 2: AMD Radeon RX 6800 (Used by VMs via GPU passthrough, at PCI address 10:00.0)

$ apt-cache policy linux-firmware
linux-firmware:
  Installed: 20240318.git3b128b60-0ubuntu2.11
  Candidate: 20240318.git3b128b60-0ubuntu2.11
  Version table:
 *** 20240318.git3b128b60-0ubuntu2.11 500
        500 http://us.archive.ubuntu.com/ubuntu noble-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu noble-security/main amd64 Packages
        100 /var/lib/dpkg/status
     20240318.git3b128b60-0ubuntu2 500
        500 http://us.archive.ubuntu.com/ubuntu noble/main amd64 Packages

What should have happened:

VM with GPU passthrough should start

What happened instead:

The VM with GPU passthrough wouldn't start. I tried running 'lspci -nns 0000:10:00.0', but it hung the terminal. Virtual Machine Manager then showed that it couldn't connect to the KVM daemon. I rebooted the Host OS, but 'lspci -nns 0000:10:00.0' hung again and I still couldn't start the VM with GPU passthrough.
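
As a rough cross-check that does not go through lspci, the driver bound to the device can be read from the sysfs `driver` symlink. This is a minimal sketch and was not tried in this report; whether it avoids the hang described above is an assumption. `bound_driver` and `SYSFS_ROOT` are hypothetical names introduced only for illustration.

```shell
#!/usr/bin/env bash
# Print the kernel driver currently bound to a PCI device.
# SYSFS_ROOT is a hypothetical override (for testing); it defaults to /sys.
bound_driver() {
    local addr="$1"
    local dev="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr"

    if [ ! -d "$dev" ]; then
        echo "no such PCI device: $addr" >&2
        return 1
    fi

    if [ -L "$dev/driver" ]; then
        # The symlink target ends in the driver name, e.g. .../drivers/vfio-pci
        basename "$(readlink "$dev/driver")"
    else
        # No driver symlink means no driver is bound right now
        echo "none"
    fi
}

# Example (address taken from this report):
#   bound_driver 0000:10:00.0
```

For passthrough, the expected answer for 0000:10:00.0 would be vfio-pci rather than amdgpu.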

Extra info:

After installing updates to the Host OS on 2025-04-10, VMs without GPU passthrough worked fine. On 2025-04-12 I tried to start a VM with GPU passthrough but it wouldn't start.

On 2025-04-10 one of the Host OS updates was linux-firmware:amd64 (20240318.git3b128b60-0ubuntu2.10 -> 20240318.git3b128b60-0ubuntu2.11).

I wanted to test downgrading linux-firmware back to version 2.10, but that version is no longer available from the archive. I was able to find, on Launchpad, the files that were in the 2.10 and 2.11 versions of linux-firmware, and I compared the amdgpu firmware files between the two versions. I overwrote the files in /lib/firmware/amdgpu on my Host OS with the ones from 2.10 that differed and rebooted. After that, the VM with GPU passthrough was able to start (and the lspci command worked).

The list of amdgpu firmware files I overwrote was:

gc_11_5_1_imu.bin.zst
gc_11_5_1_me.bin.zst
gc_11_5_1_mec.bin.zst
gc_11_5_1_mes1.bin.zst
gc_11_5_1_mes_2.bin.zst
gc_11_5_1_pfp.bin.zst
gc_11_5_1_rlc.bin.zst
isp_4_1_1.bin.zst
psp_14_0_1_ta.bin.zst
psp_14_0_1_toc.bin.zst
sdma_6_1_1.bin.zst
vcn_4_0_6_1.bin.zst
vcn_4_0_6.bin.zst
vpe_6_1_1.bin.zst
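
The comparison described above can be sketched as follows. This is an illustration, not the exact procedure used in this report. It assumes the two package versions have already been unpacked into separate directories (for example with `dpkg-deb -x`), and the function name and directory arguments are hypothetical.

```shell
#!/usr/bin/env bash
# List amdgpu firmware files that differ between two unpacked package trees.
# $1 = amdgpu firmware directory from the older package (hypothetical path)
# $2 = amdgpu firmware directory from the newer package (hypothetical path)
changed_amdgpu_firmware() {
    local old="$1"
    local new="$2"
    local f name

    for f in "$new"/*; do
        [ -f "$f" ] || continue
        name=$(basename "$f")
        # cmp -s prints nothing; a non-zero status means the file
        # differs or does not exist in the older tree
        if ! cmp -s "$f" "$old/$name" 2>/dev/null; then
            echo "$name"
        fi
    done
}

# Example (directory names are placeholders):
#   changed_amdgpu_firmware fw-2.10/lib/firmware/amdgpu fw-2.11/lib/firmware/amdgpu
```

Files reported by this function are the candidates to copy back from the older tree. Files that exist only in the newer package are reported as well, so the output needs a manual look before anything is overwritten.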
---
ProblemType: Bug
ApportVersion: 2.28.1-0ubuntu3.5
Architecture: amd64
CRDA: N/A
CasperMD5CheckResult: pass
CurrentDesktop: ubuntu:GNOME
Dependencies: firmware-sof-signed 2023.12.1-1ubuntu1.4
DistroRelease: Ubuntu 24.04
InstallationDate: Installed on 2024-06-01 (326 days ago)
InstallationMedia: Ubuntu 24.04 LTS "Noble Numbat" - Release amd64 (20240424)
MachineType: Gigabyte Technology Co., Ltd. X570S AORUS PRO AX
Package: linux-firmware 20240318.git3b128b60-0ubuntu2.11
PackageArchitecture: amd64
ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
ProcFB: 0 amdgpudrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-6.11.0-21-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro quiet splash amd_iommu=on iommu=pt vt.handoff=7
ProcVersionSignature: Ubuntu 6.11.0-21.21~24.04.1-generic 6.11.11
RelatedPackageVersions:
 linux-restricted-modules-6.11.0-21-generic N/A
 linux-backports-modules-6.11.0-21-generic N/A
 linux-firmware 20240318.git3b128b60-0ubuntu2.11
Tags: noble wayland-session
Uname: Linux 6.11.0-21-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip kvm libvirt libvirt-dnsmasq lpadmin plugdev storage sudo users
_MarkForUpload: True
dmi.bios.date: 07/08/2021
dmi.bios.release: 5.17
dmi.bios.vendor: American Megatrends International, LLC.
dmi.bios.version: F2
dmi.board.asset.tag: Default string
dmi.board.name: X570S AORUS PRO AX
dmi.board.vendor: Gigabyte Technology Co., Ltd.
dmi.board.version: x.x
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInternational,LLC.:bvrF2:bd07/08/2021:br5.17:svnGigabyteTechnologyCo.,Ltd.:pnX570SAORUSPROAX:pvr-CF:rvnGigabyteTechnologyCo.,Ltd.:rnX570SAORUSPROAX:rvrx.x:cvnDefaultstring:ct3:cvrDefaultstring:skuDefaultstring:
dmi.product.family: X570 MB
dmi.product.name: X570S AORUS PRO AX
dmi.product.sku: Default string
dmi.product.version: -CF
dmi.sys.vendor: Gigabyte Technology Co., Ltd.

Roger Knecht (rogerknecht) wrote :

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 2107285

Oliver Bambrough (obambrough) wrote : AlsaInfo.txt

apport information

tags: added: apport-collected noble wayland-session
description: updated

Oliver Bambrough (obambrough) wrote : AudioDevicesInUse.txt

apport information

Oliver Bambrough (obambrough) wrote : CurrentDmesg.txt

apport information

Oliver Bambrough (obambrough) wrote : IwConfig.txt

apport information

Oliver Bambrough (obambrough) wrote : Lspci.txt

apport information

Oliver Bambrough (obambrough) wrote : Lspci-vt.txt

apport information

Oliver Bambrough (obambrough) wrote : Lsusb.txt

apport information

Oliver Bambrough (obambrough) wrote : Lsusb-t.txt

apport information

Oliver Bambrough (obambrough) wrote : Lsusb-v.txt

apport information

Oliver Bambrough (obambrough) wrote : ProcCpuinfo.txt

apport information

Oliver Bambrough (obambrough) wrote : ProcCpuinfoMinimal.txt

apport information

Oliver Bambrough (obambrough) wrote : ProcInterrupts.txt

apport information

Oliver Bambrough (obambrough) wrote : ProcModules.txt

apport information

Oliver Bambrough (obambrough) wrote : RfKill.txt

apport information

Oliver Bambrough (obambrough) wrote : UdevDb.txt

apport information

Oliver Bambrough (obambrough) wrote : WifiSyslog.txt

apport information

Oliver Bambrough (obambrough) wrote : acpidump.txt

apport information

Oliver Bambrough (obambrough) wrote :

@rogerknecht I restored the amdgpu firmware files to those from linux-firmware:amd64 20240318.git3b128b60-0ubuntu2.11 (the version that breaks GPU passthrough VMs) and then ran the apport-collect command.

Oliver Bambrough (obambrough) wrote :

I'm no longer sure the problem is related to the amdgpu driver. After reverting my changes back to the most recent firmware, I ran the apport-collect command and it failed, hanging at the lspci command. I rebooted and retried apport-collect, which succeeded (those are the files posted above). The current amdgpu firmware files weren't the actual problem, because on that boot the lspci command worked with them and I was also able to run a VM with GPU passthrough. The logs posted above from apport-collect may therefore not be that valuable, since everything was working on that boot.

It must be an intermittent issue that I first noticed on 2025-04-12. I've reviewed my logs for each boot and thought the issue was related to timing, with the GPU at PCI 10:00.0 being initialized before the driverctl command applied the vfio-pci driver. On the most recent reboot, however, I saw the amdgpu driver initialize the GPU and then driverctl replace it while logging that it had failed (which I had never seen before when reviewing 30 boots), yet the lspci command succeeded and the VM with GPU passthrough worked. Here's an example of what I thought was the issue in the logs:

Apr 26 14:49:12 dark kernel: [drm] Initialized amdgpu 3.59.0 for 0000:10:00.0 on minor 2
Apr 26 14:49:12 dark kernel: amdgpu 0000:10:00.0: [drm] fb1: amdgpudrmfb frame buffer device
Apr 26 14:49:04 dark systemd[1]: Starting driverctl@pci-0000:10:00.1.service - Load the driverctl override for pci-0000:10:00.1...
Apr 26 14:49:04 dark (udev-worker)[880]: controlC1: /usr/lib/udev/rules.d/78-sound-card.rules:5 Failed to write ATTR{/sys/devices/pci0000:00/0000:00:03.1/0000:0e:00.0/0000:0f:00.0/0000:10:00.1/sound/card1/controlC1/../uevent}, ignoring: No such file or directory
Apr 26 14:49:12 dark driverctl[1940]: /usr/sbin/driverctl: line 72: /sys//devices/pci0000:00/0000:00:03.1/0000:0e:00.0/0000:0f:00.0/0000:10:00.0/driver/unbind: Permission denied
Apr 26 14:49:12 dark driverctl[1940]: driverctl: unbinding 0000:10:00.0 failed
Apr 26 14:49:12 dark kernel: amdgpu 0000:10:00.0: amdgpu: amdgpu: finishing device.
Apr 26 14:49:12 dark kernel: [drm] amdgpu: ttm finalized
Apr 26 14:49:12 dark kernel: vfio-pci 0000:10:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
Apr 26 14:49:06 dark systemd[1]: Starting qemu-kvm.service - QEMU KVM preparation - module, ksm, hugepages...
Apr 26 14:49:06 dark systemd[1]: Finished qemu-kvm.service - QEMU KVM preparation - module, ksm, hugepages.
Apr 26 14:49:06 dark systemd[1]: Finished driverctl@pci-0000:10:00.1.service - Load the driverctl override for pci-0000:10:00.1.
Apr 26 14:49:09 dark systemd[1]: Starting driverctl@pci-0000:10:00.0.service - Load the driverctl override for pci-0000:10:00.0...
Apr 26 14:49:09 dark systemd[1]: driverctl@pci-0000:10:00.0.service: Main process exited, code=exited, status=1/FAILURE
Apr 26 14:49:09 dark systemd[1]: driverctl@pci-0000:10:00.0.service: Failed with result 'exit-code'.
Apr 26 14:49:09 dark systemd[1]: Failed to start driverctl@pci-0000:10:00.0.service - Load the driverctl override for pci-0000:10:00.0.
Apr 26 14:49:12 dark systemd[1]: Starting driverctl@pci-0000:10:00.0.service - Load the driverctl override for p...
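
The excerpt above is not in timestamp order, which makes the amdgpu and driverctl sequence harder to follow. A minimal sketch for viewing only the lines that mention the passthrough device, ordered by time, is shown below. It assumes syslog-style lines as in the excerpt, all from the same day, and `device_timeline` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Read syslog-style lines on stdin, keep those mentioning a PCI address
# prefix, and order them by the HH:MM:SS field (third field).
# Assumes every line is from the same day; -s keeps the original order
# for lines that share a timestamp.
device_timeline() {
    local addr="$1"   # e.g. 0000:10:00 matches both functions .0 and .1
    grep -F -- "$addr" | LC_ALL=C sort -s -k3,3
}

# Example, using the journal of the current boot:
#   journalctl -b | device_timeline 0000:10:00
```

Lines that do not contain the address, such as "[drm] amdgpu: ttm finalized", are dropped by this filter, so the full journal is still needed for context.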
