pc-q35-4.1 and AMD Navi 5700/XT incompatible

Bug #1851972 reported by Marshall Porter on 2019-11-10
This bug affects 2 people
Affects Status Importance Assigned to Milestone

Bug Description


I am not sure if this qualifies as a "bug"; it is be more of an unknown issue with default settings. However, since the default value of q35 default_kernel_irqchip_split was changed seemingly due to similar user feedback, I thought this was important to share..

AMD Navi 5700/XT vfio-pci passthrough seems incompatible with one/multiple settings in pc-q35-3.1 and higher. The workaround for me is that pc-q35-3.0 still works fine passing through the GPU and official drivers can load/install fine.

The default/generic GPU drivers in a Fedora 30 or Windows 1903 guest do work; the monitor displays the desktop in a 800x600 resolution and things are rendered fine.. so the basic functionality of the card seems fine with pc-q35-4.1.

But attempting to use the official open source AMD driver with the card resulted in a hung kernel for the Fedora 30 guest.. and a BSOD on the Windows 1903 guest immediately during driver install.

I do not see any errors in Qemu command output.. did not investigate other logs or KVM etc, because I am not sure what to look for or how to go about it. Also not sure which combination of the latest q35 default settings are valid combinations to try either, because it seems that multiple things have changed related to pcie-root-port defaults and other machine options. I am happy to run tests and provide feedback if that helps identify the issue.

I am using "Linux arch 5.4.0-rc6-mainline" kernel on ArchLinux host with AMD Navi reset pci quirk patch applied.

My working Qemu command line is this:

QEMU_PA_SERVER=/run/user/1000/pulse/native \
/usr/bin/qemu-system-x86_64 \
-name windows \
-m 16g \
-accel kvm \
-machine pc-q35-3.0,accel=kvm,pflash0=ovmf0,pflash1=ovmf1 \
-blockdev node-name=ovmf0,driver=file,filename=/virt/qemu/roms/OVMF_CODE.fd,read-only=on \
-blockdev node-name=ovmf1,driver=file,filename=/virt/qemu/machines/windows/OVMF_VARS.fd \
-boot menu=on \
-global kvm-pit.lost_tick_policy=discard \
-no-hpet \
-rtc base=utc,clock=host,driftfix=slew \
-cpu host,kvm=off,hv_vendor_id=RedHatRedHat,hv_spinlocks=0x1fff,hv_vapic,hv_time,hv_reset,hv_vpindex,hv_runtime,hv_relaxed,hv_synic,hv_stimer \
-smp sockets=1,cores=4,threads=1 \
-nodefaults \
-netdev bridge,br=br0,id=net0 \
-device virtio-net-pci,netdev=net0,addr=19.0,mac=52:54:00:12:34:77 \
-device virtio-scsi-pci \
-blockdev raw,node-name=disk0,cache.direct=off,discard=unmap,file.driver=file,file.aio=threads,file.filename=/virt/qemu/machines/windows/os.raw \
-device scsi-hd,drive=disk0,rotation_rate=1 \
-blockdev raw,node-name=disk1,cache.direct=off,discard=unmap,file.driver=file,file.aio=threads,file.filename=/virt/qemu/machines/windows/data.raw \
-device scsi-hd,drive=disk1,rotation_rate=1 \
-drive index=0,if=ide,media=cdrom,readonly,file=/virt/qemu/isos/Win10_1903_V2_English_x64.iso \
-drive index=1,if=ide,media=cdrom,readonly,file=/virt/qemu/isos/virtio-win-0.1.173.iso \
-device ich9-intel-hda,addr=1b.0 \
-device hda-output \
-monitor stdio \
-display none \
-vga none \
-device pcie-root-port,id=pcierp0,chassis=1,slot=1,addr=1c.0,disable-acs=on,multifunction=on \
-device pcie-root-port,id=pcierp1,chassis=2,slot=2,addr=1c.1,disable-acs=on \
-device x3130-upstream,bus=pcierp0,id=pcieu0 \
-device xio3130-downstream,bus=pcieu0,id=pcied0,chassis=11,slot=11 \
-device vfio-pci,host=03:00.0,bus=pcied0,addr=00.0,multifunction=on \
-device vfio-pci,host=03:00.1,bus=pcied0,addr=00.1 \
-device qemu-xhci,addr=1d.0 \
-device usb-host,vendorid=0x258a,productid=0x0001 \
-device usb-host,vendorid=0x1bcf,productid=0x0005 ;

Thank you!

Paolo Bonzini commented on IRC: AMD avic requires kernel_irqchip=split.

Can you try using it? (released QEMU uses -machine ...,kernel_irqchip=split, git QEMU expects -accel kernel_irqchip=split).

Marshall Porter (meoporter) wrote :

Hi Philippe, thanks for replying.

The 'kernel_irqchip' parameter is a bit confusing to me. It looks like the documentation was updated from it defaulted to 'off' as a -machine parameter, to now it will default to 'on' as an -accel parameter.

This bug described how the value for 'default_kernel_irqchip_split' parameter had been changed to 'true' in Q35 version 4.0, but then set back to 'false' after discovering that it caused issues for Nvidia gpu passthrough and other things: https://bugs.launchpad.net/qemu/+bug/1826422

However, my problems with the AMD gpu passthrough are present when switching between Q35 3.0 (which does work) and 3.1 (which does not work), both of which would still have 'default_kernel_irqchip_split' set to false.. so it does not seem to me to be related to 'kernel_irqchip'.

Q35 version 3.1 did introduce many other changes:

static void pc_q35_3_1_machine_options(MachineClass *m)
    pcmc->do_not_add_smb_acpi = true;
    m->smbus_no_migration_support = true;
    m->alias = NULL;
    pcmc->pvh_enabled = false;

GlobalProperty hw_compat_3_1[] = {
    { "pcie-root-port", "x-speed", "2_5" },
    { "pcie-root-port", "x-width", "1" },

I thought maybe those could cause the AMD Navi gpu problems, but I am not that knowledgeable about these settings.

Also I do not have the AMD Navi gpu conveniently available anymore to test.

Joey Adams (joeyadams3-14159) wrote :

Commit 11bc4a13 (Nov 13, 2019, merged after v4.2.0-rc5) moved the kernel-irqchip parameter to -accel, but I think the default was inadvertently changed to off. The documentation was changed to say the default is on, but the code change seems to have done the opposite.

I found this when I tested my Windows Server 2016 VMs with the last qemu from git. Windows boots and runs very slowly unless I add either <ioapic driver='kvm'/> (kernel_irqchip=on) or <timer name="hypervclock" present="yes"/> to the libvirt config. Using the qemu installed with Ubuntu 19.10 (version 4.0.0), I can reproduce the slowness by explicitly adding kernel_irqchip=off.

- Host CPU: Ryzen 3950X (16 core, 32 thread)
- Host RAM: 64 GiB
- Host OS: Ubuntu 19.10 64-bit, kernel version 5.5.0-rc4 (commit 738d2902773e + ACS override patch)
- Guest CPU: host-passthrough, 16 vcpus (8 cores, 2 threads, topoext).
- Guest RAM: 12 GiB
- Guest machine type: pc-i440fx-4.0 (BIOS boot)
- Guest OS: Windows Server 2016, build 1607

Joey Adams (joeyadams3-14159) wrote :

Commit d1972be13f ("accel/kvm: Make "kernel_irqchip" default on") fixes the default mixup I described above. This isn't related to Marshall's issue as it involves qemu 3.0 vs 3.1, but at least it cleans up some confusion.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers