Windows (10?) guest freezes entire host on shutdown if using PCI passthrough

Bug #1580459 reported by Jimi on 2016-05-11
This bug affects 6 people
Affects | Status | Importance | Assigned to | Milestone
QEMU | | Undecided | Unassigned |
libvirt | New | Undecided | Unassigned |
Arch Linux | New | Undecided | Unassigned |
Debian | New | Undecided | Unassigned |
Fedora | New | Undecided | Unassigned |

Bug Description

Problem: after a Windows VM that uses PCI passthrough (as we do for gaming graphics cards, sound cards, and in my case, a USB card) has been running for somewhere between 1 and 2 hours (the exact time isn't consistent) or longer, shutting down that guest will freeze the host computer right as the guest finishes shutting down, requiring a hard reboot. Unbinding (or in the other user's case, unbinding and THEN binding) any PCI device in sysfs, even one that has nothing to do with the VM, has the same effect as shutting down the VM (if the VM has been running long enough). So it's probably an issue related to unbinding and binding PCI devices.

There's a lot of info on this problem over at https://bbs.archlinux.org/viewtopic.php?id=206050
Here's a better-organized list of main details:
-at least 2 confirmed victims of this bug; 2 (including me) have provided lots of info in the link
-I'm on Arch Linux and the other one is on Gentoo (distro-nonspecific)
-issue affects my Windows 10 guest and others' Windows guests, but not my Arch Linux guest (the others don't have non-Windows guests to test)
-I'm using libvirt but the other user is not, so it's not an issue with libvirt
-It seems to be version-nonspecific, too. The lowest versions at which I either first noticed it or confirmed it by testing are Linux 4.1 and qemu 2.4.0. It persists in all releases of both since, including the newest ones.
-I can't track down exactly what package downgrade can fix it, as downgrading further than Linux 4.1 and qemu 2.4.0 requires Herculean and system-destroying changes such as downgrading ncurses, meaning I don't know whether it's a bug in QEMU, the Linux kernel, or some weird seemingly unrelated thing.
-According to the other user, "graphics intensive gameplay (GTA V) can cause the crash to happen sooner," as soon as "15 minutes"
-Also, "bringing up a second passthrough VM with separate hardware will cause the same crash," and "bringing up another VM before the two-hour mark will not result in a crash," further cementing that it's triggered by the un/binding of PCI devices.
-This is NOT related to the very similar bug that can be worked around by not passing through the HDMI device or sound card. Even when we removed all traces of any sort of sound card from the VM, it still had the same behavior.

kachaffeous (murknfools) wrote :

I am seeing this issue on Arch also. I also tried Fedora 24 to see if it was an Arch-only issue.

If I start a VM and stop it shortly after everything works fine.

If I start a VM and game for a while, on VM shutdown the host will totally lock. Tailing the journal to see if anything gets logged shows nothing (hangs before any errors are logged). Have to hard power cycle PC to regain use.

I'm willing to do any test to try to figure this out.

Hardware details:
i7-5820K 3.3 GHz (hex core)
12g ram
ASRock X99 Extreme4 LGA2011
GTX 970 nvidia drivers (pass thru card) using Display port
Asus Rog Swift 27"

Jimi (jimijames-bove) wrote :

Oh, I should post my hardware:

i7-5820K (also) (4/6 cores (8/12 threads) being passed to VMs)
12GB RAM (also) (8GB being passed to VMs)
MSI X99 SLI Plus (though I don't use SLI)
NVidia GTX 960 2GB pass-thru (also had this problem on a GTX 660 before that died)
GT 740 host card, using nouveau when VMs are running

We have some pretty similar hardware there.

kachaffeous (murknfools) wrote :

Here is my startup script.

#!/bin/bash

echo "Starting virtual machine..."

cp /usr/share/edk2.git/ovmf-x64/OVMF_VARS-pure-efi.fd /tmp/my_vars.fd

sudo \
 qemu-system-x86_64 \
  -name "Windows 10" \
  -enable-kvm \
  -m 12288 \
  -cpu host,kvm=off \
  -smp threads=2,cores=4,sockets=1 \
  -vga none \
  -soundhw hda \
  -net nic -net bridge,br=br0 \
  -usb -usbdevice host:1af3:0001 -usbdevice host:04d9:2221 -usbdevice host:046d:0a4d \
  -device vfio-pci,host=01:00.0,multifunction=on \
  -device vfio-pci,host=01:00.1 \
  -drive if=pflash,format=raw,readonly,file=/usr/share/edk2.git/ovmf-x64/OVMF_CODE-pure-efi.fd \
  -drive if=pflash,format=raw,file=/tmp/my_vars.fd \
  -boot order=cd \
  -device virtio-scsi-pci,id=scsi \
  -drive file=/home/jason/kvm/win.img,id=disk,format=qcow2,if=none,cache=writeback -device scsi-hd,drive=disk

exit 0

Jimi (jimijames-bove) wrote :

I should also post my "scripts" (libvirt XML files in my case):

But, since the Windows VM and Linux VM are completely identical beyond the OS that's installed, I don't think our VM configurations have anything to do with this bug. I mean, they aren't completely identical right now because I removed the HDMI sound card from the Linux VM in favor of PulseAudio "network" streaming, but I did that recently and they had the same behavior or lack thereof before I did that.

Jimi (jimijames-bove) wrote :

Also, yeah, the Linux one is called SteamOS, but it is actually just an almost identical install of Arch. SteamOS wasn't playing nice with most of my hardware when I tried to install it.

Gandalf-The-Red (brl75) wrote :

I think this is what's happening to me on my Windows 8.1 VM, although it might be slightly different.

Just about everything you guys talked about applies, except I don't have to shut down for it to freeze up in my case (although if it's on for long enough and I shut it off, it freezes). It freezes up on its own, seemingly at random, taking the host with it.

First happened to me on a freshly installed Arch (Antergos), then tried it on Debian after updating my kernel from 4.3 to 4.5 (there was a bug that made the VM excruciatingly slow before 4.4) and it happened again.

My hardware:

i7 5820k
8GB Ram (Upgrading to 32GB when the ram I ordered gets here)
MSI X99S SLI Plus
AMD Radeon R9-270X (Host GPU using "radeon" drivers)
AMD Radeon HD 6950 1GB (Passthrough GPU)

Interesting that aside from the GPUs (which I'm pretty sure aren't the problem) we all have very similar hardware.

When I get some free time I'll try to replicate this bug on another OS but I have a feeling I'll just get the same result. I just want to see if it'll happen no matter what distro I use.

Jimi (jimijames-bove) wrote :

I doubt you have a different issue. My VM has randomly hanged my computer without a shut down a few times during the life of this bug, and there are two very possible ways it could happen: the VM suddenly crashed, making a situation similar to it shutting down, or something in your host caused some PCI device to be bound or unbound to a driver.

Gandalf-The-Red (brl75) wrote :

I see, it's definitely the same issue then.

Could it be something to do with our hardware unbinding and binding pci devices or something of the sort? I sort of doubt it but it is strange someone else with a more different CPU/mobo combo hasn't reported this problem yet.

That being said, we have a very small sample size so I don't know if that means anything.

Jimi (jimijames-bove) wrote :

Whoops, I clicked the wrong button and added the wrong thing for Arch Linux, and I don't know how to delete it. (new to launchpad here)

Changed in archlinux-lp:
status: New → Invalid
no longer affects: archlinux-lp
Jimi (jimijames-bove) wrote :

OK, I figured out how to delete it.


I am having the exact same issue!

My Setup:

Model: unRaid 6.2 Beta
M/B: ASUSTeK Computer INC. - Z8P(N)E-D12(X)
CPU: Intel® Xeon® CPU X5690 @ 3.47GHz
HVM: Enabled
IOMMU: Enabled
Cache: 384 kB, 1536 kB, 12288 kB
Memory: 32768 MB (max. installable capacity 96 GB)
Network: bond0: fault-tolerance (active-backup), mtu 1500
 eth0: 100Mb/s, Full Duplex, mtu 1500
 eth1: 1000Mb/s, Full Duplex, mtu 1500
Kernel: Linux 4.4.6-unRAID x86_64
OpenSSL: 1.0.2g

<?xml version="1.0" standalone="yes" ?>
<!-- generated by lshw-unknown -->
<!-- GCC 5.3.0 -->
<!-- Linux 4.4.6-unRAID #1 SMP PREEMPT Fri Mar 25 21:34:35 PDT 2016 x86_64 -->
<!-- GNU libc 2 (glibc 2.23) -->
<list>
<node id="computer" claimed="true" class="system" handle="DMI:0001">
 <description>Desktop Computer</description>
 <product>System Product Name (To Be Filled By O.E.M.)</product>
 <vendor>System manufacturer</vendor>
 <version>System Version</version>
 <serial>[REMOVED]</serial>
 <width units="bits">4294967295</width>
 <configuration>
  <setting id="boot" value="normal" />
  <setting id="chassis" value="desktop" />
  <setting id="family" value="To Be Filled By O.E.M." />
  <setting id="sku" value="To Be Filled By O.E.M." />
  <setting id="uuid" value="[REMOVED]" />
 </configuration>
 <capabilities>
  <capability id="smbios-2.6" >SMBIOS version 2.6</capability>
  <capability id="dmi-2.6" >DMI version 2.6</capability>
  <capability id="smp" >Symmetric Multi-Processing</capability>
 </capabilities>
  <node id="core" claimed="true" class="bus" handle="DMI:0002">
   <description>Motherboard</description>
   <product>Z8P(N)E-D12(X)</product>
   <vendor>ASUSTeK Computer INC.</vendor>
   <physid>0</physid>
   <version>Rev 1.0xG</version>
   <serial>[REMOVED]</serial>
   <slot>To Be Filled By O.E.M.</slot>
    <node id="firmware" claimed="true" class="memory" handle="">
     <description>BIOS</description>
     <vendor>American Megatrends Inc.</vendor>
     <physid>0</physid>
     <version>1302</version>
     <date>06/25/2012</date>
     <size units="bytes">65536</size>
     <capacity units="bytes">2031616</capacity>
     <capabilities>
      <capability id="isa" >ISA bus</capability>
      <capability id="pci" >PCI bus</capability>
      <capability id="pnp" >Plug-and-Play</capability>
      <capability id="upgrade" >BIOS EEPROM can be upgraded</capability>
      <capability id="shadowing" >BIOS shadowing</capability>
      <capability id="escd" >ESCD</capability>
      <capability id="cdboot" >Booting from CD-ROM/DVD</capability>
      <capability id="bootselect" >Selectable boot path</capability>
      <capability id="socketedrom" >BIOS ROM is socketed</capability>
      <capability id="edd" >Enhanced Disk Drive extensions</capability>
      <capability id="int13floppy1200" >5.25&quot; 1.2MB floppy</capability>
      <capability id="int13floppy720" >3.5&quot; 720KB floppy</capability>
      <capability id="int13floppy2880" >3.5&quot; 2.88MB floppy</capability>
      <capability id="int5printscreen" >Print Screen key</capability>
      <capability id="int9keyboard" >i8042 keyboard controller</capability>
      <capability id="int14serial" >INT14 serial line control</capability>
      <capability id...


I have 2 running virtual machines.

1. Ubuntu Server 16.04 acting as a headless game server
2. Windows 10 Pro used for gaming and other daily activities

I too can start/stop the Win 10 VM for a period of time after a cold boot, but if it is logged in for a certain period of time, when I go to shut it down the entire system will freeze. I can reboot the Ubuntu server at will. It too has an SSD being passed thru.

Win 10 VM
<domain type='kvm' id='3'>
  <name>csmccarronwx00</name>
  <uuid>82c5e4f6-6991-cd5f-8207-49db04386cc9</uuid>
  <description>csmccarronwx00 i440fx-2.5 OVMF</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>11010048</memory>
  <currentMemory unit='KiB'>11010048</currentMemory>
  <memoryBacking>
    <nosharepages/>
    <locked/>
  </memoryBacking>
  <vcpu placement='static'>12</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <vcpupin vcpu='2' cpuset='7'/>
    <vcpupin vcpu='3' cpuset='19'/>
    <vcpupin vcpu='4' cpuset='8'/>
    <vcpupin vcpu='5' cpuset='20'/>
    <vcpupin vcpu='6' cpuset='9'/>
    <vcpupin vcpu='7' cpuset='21'/>
    <vcpupin vcpu='8' cpuset='10'/>
    <vcpupin vcpu='9' cpuset='22'/>
    <vcpupin vcpu='10' cpuset='11'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <emulatorpin cpuset='1-3,13-15'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.5'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/82c5e4f6-6991-cd5f-8207-49db04386cc9_VARS-pure-efi.fd</nvram>
    <boot dev='cdrom'/>
    <boot dev='hd'/>
    <bootmenu enable='yes' timeout='3000'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='6' threads='2'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/ISO/virtio-win-0.1.117.iso'/>
      <backingStore/>
      <target dev='hda' bus='sata'/>
      <readonly/>
      <alias name='sata0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <disk type='file' device='cdrom'>
      <driver name='qemu' type='raw'/>
      <source file='/mnt/user/ISO/Windows10Pro_TH2.iso'/>
      <backingStore/>
      <target dev='hdb' bus='sata'/>
      <readonly/>
      <alias name='sata0-0-1'/>
      <address type='drive' controller='0' bus='0' target='0' unit='1'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source dev='/dev/disk/by-id/ata-Samsung_SSD_850_PRO_256GB_S1SUNSAG361503D'/>
      <backingStore/>
      <target dev='hdc' bus='virtio'/>
     ...

Jimi (jimijames-bove) wrote :

Well, now we finally know that it isn't the i7-5820K's or X99 chipset's or LGA 2011 socket's faults.

I have tried everything to keep it from happening but have had no success. The likelihood of an entire system lockup depends on how long the Win 10 VM has been on. I personally have not timed it, but usually I can shut down/restart without problems for about an hour, maybe more.

My Ubuntu VM is not affected by this issue. I am passing thru 4 vcpus and an SSD that the VM boots from.

What can we do to help troubleshoot this issue? I find it strange that the problem happens at VM power off and not while the VM is in use. What happens at VM power off that can lock the entire system up and cause CPU stall errors?

The posts in syslog vary from time to time but they all end in cpu stalls.

Additional syslog images (4)

I am not having any issues with my drives during normal operation on the server. I only see the ata errors when the system locks up.

If there is something I can do please let me know. I have been trying to figure this out for over a month now but have had no luck.

Jimi (jimijames-bove) wrote :

Remember, I think we've done enough testing to know that it isn't specifically the VM shutting down that causes this, but the binding or unbinding of PCI devices in sysfs, which is something a VM will do on shutdown if you're passing hardware into it. It *is* caused by the VM running for more than an hour, but it is *not* technically caused by the shutdown itself. I titled it as a shutdown issue because that's pretty much the only situation anybody's going to notice this problem, and we need to be Google-friendly.
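For anyone who wants to see the trigger spelled out, here is a minimal sketch of what unbinding and binding a PCI device through sysfs looks like. The function names and the SYSFS_ROOT override are made up for illustration (the override only exists so the functions can be pointed at a scratch directory); the driver link and the unbind and bind files are the standard sysfs interface. On an affected host, running the unbind step against the real /sys after the VM has been up long enough is expected to freeze the machine, so read this as a description of the mechanism, not something to run casually.

```shell
# Root of sysfs. Overridable (an assumption for safe experiments) so the
# functions can be tried against a scratch directory instead of the real /sys.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the name of the driver currently bound to a PCI device,
# or print nothing if no driver is bound.
#   $1: full PCI address, e.g. 0000:01:00.0
pci_driver_of() {
    dev="$1"
    link="$SYSFS_ROOT/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        basename "$(readlink "$link")"
    fi
}

# Unbind a device from whatever driver holds it (needs root on the real /sys).
# WARNING: on an affected host this is the step that freezes it.
pci_unbind() {
    dev="$1"
    if [ -n "$(pci_driver_of "$dev")" ]; then
        echo "$dev" > "$SYSFS_ROOT/bus/pci/devices/$dev/driver/unbind"
    fi
}

# Bind a device to a named driver, e.g. vfio-pci.
#   $1: full PCI address   $2: driver name
pci_bind() {
    dev="$1"
    drv="$2"
    echo "$dev" > "$SYSFS_ROOT/bus/pci/drivers/$drv/bind"
}
```

As root, `pci_unbind 0000:01:00.0` followed by `pci_bind 0000:01:00.0 vfio-pci` would exercise the same unbind/bind path discussed here; the address is only an example taken from the startup script earlier in the thread.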

Has anyone found a way to shut down/restart the VM without causing a system lockup, or is this just the way it is until a fix is found?

James Newman (jdnewman85) wrote :

I've got the same issue. Pretty much just as it has been described by everyone else. Same on shutdown or certain events. Same for delay. Similar setups and hardware/software. (X99, Arch, Qemu, libvirt, pcie passthrough, windows 10, etc...) I've attached my system info (Hardware, lscpu, Archlinux package versions, qemu/libvirt xml files).

Brand new pc build, super fresh and clean system and images. Run 2 different Windows 10 vms, and occasionally another Arch vm for some game server stuffs.

What is the proper way of going about troubleshooting such things? Is there a way to enable a kernel debug mode or anything? I develop software and hardware, and am a novice linux user, just haven't ever troubleshot a hard lock like this. Willing to help if anyone can give me some direction. :)

James Newman (jdnewman85) wrote :

Unsure how to edit a post.

Also wanted to say, I can provide BIOS settings later, and any kernel logs if anyone wants. Wanted to note though that I am using UEFI with GPT-style partitioning. I'm using btrfs for the host fs. OVMF for guests (see package list in my system info for versioning). Guest main drive images are qcow2. Some SATA hard drives with NTFS partitions are passed through for each guest's additional storage. systemd-boot as the boot manager.

Can't think of much else, but hoping to get this fixed up.

Jimi (jimijames-bove) wrote :

Well, that's a bunch more stuff ruled out. My host is a BIOS with MBR partitioning, using ext4, and the images are all raw. For each guest, there's an image of the OS (so the C: drive on Windows and the / partition on Linux) on my SSD, and Windows also has a bigger image on my HDD (drive D:). I don't pass in any storage media; just the video card, its HDMI soundcard, and a USB card.

Jimi, does your HDMI sound lag? I am using a USB sound card and tried switching to the GTX970 sound, and I got horrible lag; it sounds like the sound is in slow motion. It was completely unusable.

Chris

Jimi (jimijames-bove) wrote :

I know it didn't with the GTX 660. It worked perfectly fine. But, I went fully into Steam streaming everything before I got the 960, so the 960 could have that issue for all I know.

jimrif (jimrif) wrote :

I have been able to stop this from happening by recompiling my kernel without SND support. If you can live without sound in your host (it is still there in your guest if you pass through the sound device of your card) then try removing SND support from your host's kernel. You can also try blacklisting the snd module and snd-hda-intel instead of removing them from your kernel if they are modules. I have not had a crash from a shutdown in a couple of months after removing SND from my host's kernel. In my mind that points more of a finger at the idea that the root of the problem has to do with binding/unbinding of the device.
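If you go the blacklisting route, a minimal sketch of generating the modprobe configuration might look like this. The function name and the file name in the usage line are hypothetical; the `blacklist <module>` line format is the regular modprobe.d syntax. Keep in mind that, as far as I know, a blacklist entry only stops a module from being loaded automatically through its aliases, so a module pulled in as a dependency of something else can still load.

```shell
# Print modprobe.d "blacklist" lines for the given module names.
# modprobe treats "-" and "_" in module names alike; dashes are normalised
# to underscores here purely for consistency.
blacklist_conf() {
    for mod in "$@"; do
        printf 'blacklist %s\n' "$(printf '%s' "$mod" | sed 's/-/_/g')"
    done
}
```

Something like `blacklist_conf snd snd-hda-intel | sudo tee /etc/modprobe.d/blacklist-snd.conf`, followed by a host reboot, would apply it.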

Chris, for your HDMI sound issue there are a couple of things that might help. I would have that issue immediately if I was using a certain virtual network card in the guest. Using virtio as your network driver helps quite a bit, however it would still mess up on me every now and again. In order to fix everything, I switched it over to MSI signalling from IRQ on the sound device in Windows 10. I also switched the graphics card driver over to MSI and have to switch them each time one of the nVidia drivers gets an update.

Jimi (jimijames-bove) wrote :

Hm. Sound was the issue in that other bug. Have you already confirmed that you don't have that other, similar bug? If you undo all the other fixes you've done, including enabling SND again, does the VM still crash if you have NO sound device assigned to it at all, whether it be a pass-thru device or a virtual one?

jimrif (jimrif) wrote :

I'm not really sure what the other similar bug was, but what I was experiencing was a Win10 VM locking up the host machine upon shutdown of the VM after several minutes of gaming (or even several hours of youtube/netflix). It didn't happen all of the time, but most of the time after the VM had been up for a while.

I am positive that recompiling without SND support is when the host stopped crashing upon shutdown of the Windows 10 VM as I was only doing one change at a time. I had the issue for many months before removing CONFIG_SND. Since then, 2 months ago, I've upgraded qemu, libvirt, the kernel and win10 updates, including the nVidia drivers. I'm not really wanting to compile SND back in as my server is also doing a lot more than just hosting a Win10 VM and I don't want it to crash without anyone else trying the fix. If others try removing SND and continue to have the issue, I will recompile to help troubleshoot but I am very confident that is what stopped my system from locking up when shutting down a Windows 10 VM. If I were to take a guess, my guess is that just removing snd-hda-intel would do the trick.

My hardware is an X99 board, an i7-5820K, and an nVidia 980 graphics card being passed through to the guest. The host video card is a cheap 1x radeon with HDMI sound.

I will try to blacklist the sound module in the unRaid kernel. Waiting on instructions on how to do it.

Chris

Jimi (jimijames-bove) wrote :

If your Windows VM has, and always has had, a sound card passed in (like the .1 address of your video card), then we can't know for sure that you don't have that other bug. In that other bug, you can fix the crash by not passing in any sound cards, real or virtual, to the VM. It's definitely not the same bug as this one.

Well, for now my issue is resolved. This morning, when I was shutting down my unRaid server to blacklist the Intel sound module, snd-hda-intel, I first stopped my Ubuntu VM and my two dockers, then logged out of unRaid. I then proceeded to shut down my Windows 10 VM and, like magic, it shut down nicely without locking up the entire system. Also, I found out from unRaid tech support that the unRaid kernel does not include any sound modules and it was not necessary to blacklist them.

So this is what I have changed since the last lockup last Thursday night.

1. Removed the NVIDIA Audio hardware from the VM Setup. I did this because the sound was lagging horribly and I could not figure out how to fix it. So I removed the sound hardware and I am now using a USB sound card that is plugged into the USB3 PCI-Express card that is being passed to the VM.
2. I enabled MSI Interrupts on the GPU using this URL as my reference.
    http://lime-technology.com/wiki/index.php/UnRAID_6/VM_Guest_Support#Enable_MSI_for_Interrupts_to_Fix_HDMI_Audio_Support

I should also mention that while I have the system NIC, USB1, and USB2 virtual modules mapped, they are disabled in the VM. I did this to improve latency issues inside the VM. I am using a wireless NIC plugged into the USB3 PCI-Express card and I do not require USB1 or USB2. These changes were made on Thursday prior to the last lockup, so while I do believe they have helped overall latency, they had no effect on the system locking up.

USB3 card is handling Logitech G910 keyboard, WOW MMO Legendary Gaming Mouse, ASUS XONARU3 Sound Card, ASUS USB-AC56 Wireless NIC, and a USB Mouse.

I still would like to add the NVIDIA sound card back into the VM, and when I do I will enable MSI interrupts. My goal is to not have to use the USB sound card.

See next post for current VM setup.


Current VM Config

<domain type='kvm' id='1'>
  <name>csmccarronwx00</name>
  <uuid>82c5e4f6-6991-cd5f-8207-49db04386cc9</uuid>
  <description>csmccarronwx00 i440fx-2.5 OVMF</description>
  <metadata>
    <vmtemplate xmlns="unraid" name="Windows 10" icon="windows.png" os="windows10"/>
  </metadata>
  <memory unit='KiB'>10485760</memory>
  <currentMemory unit='KiB'>10485760</currentMemory>
  <memoryBacking>
    <nosharepages/>
  </memoryBacking>
  <vcpu placement='static'>12</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='6'/>
    <vcpupin vcpu='1' cpuset='18'/>
    <vcpupin vcpu='2' cpuset='7'/>
    <vcpupin vcpu='3' cpuset='19'/>
    <vcpupin vcpu='4' cpuset='8'/>
    <vcpupin vcpu='5' cpuset='20'/>
    <vcpupin vcpu='6' cpuset='9'/>
    <vcpupin vcpu='7' cpuset='21'/>
    <vcpupin vcpu='8' cpuset='10'/>
    <vcpupin vcpu='9' cpuset='22'/>
    <vcpupin vcpu='10' cpuset='11'/>
    <vcpupin vcpu='11' cpuset='23'/>
    <emulatorpin cpuset='1-3,13-15'/>
  </cputune>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.5'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x64/OVMF_CODE-pure-efi.fd</loader>
    <nvram>/etc/libvirt/qemu/nvram/82c5e4f6-6991-cd5f-8207-49db04386cc9_VARS-pure-efi.fd</nvram>
    <boot dev='hd'/>
    <boot dev='cdrom'/>
    <bootmenu enable='yes' timeout='3000'/>
  </os>
  <features>
    <acpi/>
    <apic/>
    <hyperv>
      <relaxed state='on'/>
      <vapic state='on'/>
      <spinlocks state='on' retries='8191'/>
      <vendor id='none'/>
    </hyperv>
  </features>
  <cpu mode='host-passthrough'>
    <topology sockets='1' cores='6' threads='2'/>
  </cpu>
  <clock offset='localtime'>
    <timer name='hypervclock' present='yes'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/local/sbin/qemu</emulator>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source dev='/dev/disk/by-id/ata-Samsung_SSD_850_PRO_256GB_S1SUNSAG361503D'/>
      <backingStore/>
      <target dev='hdc' bus='sata'/>
      <alias name='sata0-0-2'/>
      <address type='drive' controller='0' bus='0' target='0' unit='2'/>
    </disk>
    <disk type='block' device='disk'>
      <driver name='qemu' type='raw' cache='writeback'/>
      <source dev='/dev/disk/by-id/ata-SanDisk_Ultra_II_480GB_161322801967'/>
      <backingStore/>
      <target dev='hdd' bus='sata'/>
      <alias name='sata0-0-3'/>
      <address type='drive' controller='0' bus='0' target='0' unit='3'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <alias name='usb'/>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0' multifunction='on'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <alias name='usb'/>
      <master startpor...


SYSLINUX.CFG

default /syslinux/menu.c32
menu title Lime Technology, Inc.
prompt 0
timeout 50
label unRAID OS
  kernel /bzimage
  append isolcpus=4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 pci-stub.ids=1b6f:7052,10de:13c2,10de:0fbb intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream initrd=/bzroot
label unRAID OS GUI Mode
  menu default
  kernel /bzimage
  append isolcpus=4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 pci-stub-ids=1b6f:7052,10de:13c2,10de:0fbb intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 pcie_acs_override=downstream initrd=/bzroot,/bzroot-gui
label unRAID OS Safe Mode (no plugins, no GUI)
  kernel /bzimage
  append initrd=/bzroot unraidsafemode
label Memtest86+
  kernel /memtest

pci-stub-ids=1b6f:7052,10de:13c2,10de:0fbb
1b6f:7052 = Etron Technology, Inc. EJ188/EJ198 USB 3.0 Host Controller
10de:13c2 = NVIDIA Corporation GM204 [GeForce GTX 970]
10de:0fbb = NVIDIA Corporation GM204 High Definition Audio Controller
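One thing worth double-checking in a config like the one above is the exact spelling of the option: the first menu entry uses `pci-stub.ids=` while the GUI Mode entry and the summary line use `pci-stub-ids=`. As far as I know, the kernel expects the `module.parameter=` form, so the spelling without the dot is probably not picked up by pci-stub. The sketch below (function name made up for illustration) lists the IDs that a given command line actually hands to pci-stub, reading /proc/cmdline when no argument is given.

```shell
# Print the vendor:device IDs passed to pci-stub on a kernel command line,
# one per line. Only the "module.parameter=" spelling is accepted.
#   $1: command line text (optional; defaults to the running kernel's)
pci_stub_ids() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Intentionally unquoted: split the command line into words.
    for word in $cmdline; do
        case "$word" in
            pci-stub.ids=*|pci_stub.ids=*)
                echo "${word#*=}" | tr ',' '\n'
                ;;
        esac
    done
}
```

Running `pci_stub_ids` with no argument on the booted host shows what the current boot entry passed; empty output means no `pci-stub.ids=` option was found.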

James Newman (jdnewman85) wrote :

So guys, new information.

I was having trouble getting the HTC Vive passed through in host mode. The thing shows up as 10+ devices! I've also some logitech webcams that don't seem to work via usb host passthrough. So I gave windows my entire usb controller (only 1 for all my ports on this mobo). Since then, I haven't noticed an issue. Furthermore, waaaay more stable overall. I used to get random blue screens.

I'm going to order a usb3 pcie card for my other windows host. For now, I'm using a remote desktop connection to it for IO.

Anyway, still tinkering. I'm curious if anyone having the issues would try with no usb 'host' passthrough?

Jimi (jimijames-bove) wrote :

I've been not using USB host passthrough this whole time, as my PCI USB3 card covers that need pretty well. Speaking of those cards, for those of you who also use one, does it work perfectly? If so, I'd like to know its model so I can go buy it, because while my card works, about 50% of the time I try to use it, I get some bad output when I run "dmesg | grep -i vfio" (the standard spam when a device doesn't get passed through properly that's full of messages related to power management) and the VM doesn't seem to have any access to it. When this happens, I have to restart the whole host to get another 50% chance at using the card.

Peter Maloney (peter-maloney) wrote :

FYI I had a similar issue years ago until I figured out that adding the vgarom file fixes it, eg.:

     -device vfio-pci,host=04:00.0,bus=root.1,multifunction=on,x-vga=on,addr=0.0,romfile=Sapphire.R7260X.1024.131106.rom

For radeon, you can look in /sys. eg. we see /sys/devices/pci0000:00/0000:00:0b.0/0000:04:00.0/rom, and first we `echo 1 > rom` to prevent "invalid argument" error, and then `cat rom > ~/yourfile.rom` and you have it.

For nouveau, you have to bind nouveau driver (rather than vfio-pci) and you can find it somewhere like /sys/kernel/debug/dri/0/vbios.rom
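The sysfs steps for the radeon case can be wrapped up roughly like this. It is only a sketch: the function name is hypothetical, it needs root on a real system, and it assumes the card exposes a `rom` file in its sysfs device directory as described above.

```shell
# Copy a PCI device's option ROM out of sysfs.
#   $1: device directory in sysfs, e.g. /sys/bus/pci/devices/0000:04:00.0
#   $2: output file
dump_pci_rom() {
    devdir="$1"
    out="$2"
    [ -e "$devdir/rom" ] || { echo "no rom file in $devdir" >&2; return 1; }
    echo 1 > "$devdir/rom" || return 1   # enable reading (avoids "invalid argument")
    cat "$devdir/rom" > "$out"
    status=$?
    echo 0 > "$devdir/rom"               # disable reading again
    return "$status"
}
```

For example, `dump_pci_rom /sys/bus/pci/devices/0000:04:00.0 ~/yourfile.rom`, with the result then passed to QEMU through the romfile= option shown above.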

Jimi (jimijames-bove) wrote :

Can someone else please confirm that? I can't test it because nouveau doesn't support the GTX 960 yet. If it turns out solid, then I could just ask EVGA support for the rom file.

kachaffeous (murknfools) wrote :

I just added the romfile argument to mine, will report back later tonight. (Don't want to reboot now, as my machine will hang and I'm at work)

Jimi (jimijames-bove) wrote :

I got impatient and got the rom file from EVGA and loaded it in, but for me and my GTX 960, I get no graphical output when it's loaded. I don't know anything beyond that. I don't get any error messages in dmesg or anything--just no video output whatsoever. It was also strangely booting into the Tianocore UEFI command line instead of Windows, so there could be something else going on here for me that stayed broken after I removed the romfile option.

Jimi (jimijames-bove) wrote :

I managed to fix that issue and properly load the VM with the rom file (what had gone wrong was it inexplicably acted like it had no hard drives, until I restored the libvirt XML file from a backup). I got a good test out of it: played video games in Windows for 2 hours, with the rom file loaded. It still froze on shutdown. So that's confirmedly not a fix.

My system has been behaving well the last couple of weeks. I can reboot at will with no lockups. I am still not passing the NVIDIA sound card to the VM and have the GPU configured to use MSI interrupts. I am not passing the ROM for my GTX 970 GPU.

I know this is not related, but I was able to lock up the entire system by installing BOINC software and configuring it to use 100% of CPUs and CPU time. Backed those 2 settings down to 90% and no more lockups.

Jimi (jimijames-bove) wrote :

What are MSI interrupts and how did you configure your card to use them?

Apparently passthrough devices work better when using an MSI interrupt instead of a traditional interrupt.

See post 32 https://bugs.launchpad.net/qemu/+bug/1580459/comments/32 item 2.

2. I enabled MSI Interrupts on the GPU using this URL as my reference.
    http://lime-technology.com/wiki/index.php/UnRAID_6/VM_Guest_Support#Enable_MSI_for_Interrupts_to_Fix_HDMI_Audio_Support

Chris

Jimi (jimijames-bove) wrote :

I enabled MSI interrupts, and now for 2 nights in a row I gamed 2 hours straight and shut down the Windows VM without a freeze. Never in my 7 months of living with this bug have I gotten no freeze twice in a row. I think the MSI interrupts have fixed it for me, and no, I did not remove my HDMI sound card from the VM, so that wasn't part of the issue and should be safe to leave in for those who needed this fix. That's 2 people who this fix has worked for now. Hopefully it'll work for the rest of you, too. I'll post back if I ever get this freeze again after confirming it hasn't suddenly switched my hardware off MSI interrupts or anything.

Note: I didn't just make my video card use MSI interrupts. Most of the VM's hardware was already set to use them by default--namely the VirtIO stuff--and I set EVERYTHING else to also use it, which is the video card, its HDMI, the USB3 card, and the virtual USB2 controller that I don't need but libvirt refuses to remove. I figured that'd work out because the USB3 card is also PCIe, which works better with MSI, and the USB2 controller doesn't matter. So, if this doesn't fix it for you, try making every last MSI-capable device use MSI interrupts.

That's good to know. I want to re-enable my NVIDIA sound card as well.

Note: When you update the video card driver, it will disable the MSI interrupt so you will have to reenable it.

Clif Houck (clifhouck) wrote :

I was also experiencing the host hard locking when shutting down a Windows 10 guest with an NVIDIA GPU passed through, but the issue appears to be completely solved after switching the card to MSI mode in the Windows guest.

However, I would be interested in understanding *why* using the card in line-interrupt mode in the guest causes the host to lock up when the guest relinquishes control of the device. Is it a bug in qemu or vfio, or even the Linux kernel?

I don't know if it's relevant, but I've noticed that when the card is not being used by the guest it is listed as MSI: Enable- by lspci, suggesting that vfio is keeping the card in line-interrupt mode when not in use.

Jimi (jimijames-bove) wrote :

Oh, that is interesting. Using lspci -v on my computer reveals that Linux tends to default to enabling MSI on my PCIe devices that support it (since the common opinion is that it's better for PCIe), including all my graphics cards, so the fact that vfio-pci and Windows 10 both default to disabling it is pretty odd indeed.

Jimi (jimijames-bove) wrote :

(Forgot to clarify: yes, vfio-pci devices disable MSI by default for me just like for Clif Houck, but all other PCIe devices have it enabled.)
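
For anyone who wants to compare MSI state across all devices at once instead of eyeballing lspci output, here is a minimal sketch that looks for the same "MSI: Enable+" / "MSI: Enable-" markers mentioned above. The function names are made up for illustration, and it assumes the usual lspci layout (an unindented header line per device, indented capability lines below it); it is not something anyone in this thread has reported running.

```python
import re
import subprocess

# Matches capability lines such as "Capabilities: [68] MSI: Enable- Count=1/1 ..."
# and the MSI-X variant "Capabilities: [b0] MSI-X: Enable+ Count=8 Masked-".
MSI_RE = re.compile(r"\b(MSI(?:-X)?): Enable([+-])")


def parse_msi_state(lspci_text):
    """Map each device header line to a dict like {"MSI": False, "MSI-X": True}."""
    states = {}
    current = None
    for line in lspci_text.splitlines():
        if not line.strip():
            continue  # blank lines separate devices
        if not line[0].isspace():
            # Unindented lines start a new device, e.g. "01:00.0 VGA ...".
            current = line.strip()
            states[current] = {}
            continue
        match = MSI_RE.search(line)
        if match and current is not None:
            # "+" means the capability is currently enabled, "-" disabled.
            states[current][match.group(1)] = match.group(2) == "+"
    return states


def read_lspci():
    """Run lspci -vv; capability details are usually only visible to root."""
    result = subprocess.run(
        ["lspci", "-vv"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling parse_msi_state(read_lspci()) as root on the host, once with the VM stopped and once with it running, should show whether the passed-through devices flip between Enable- and Enable+.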

yanman (yanman) wrote :

Hi guys, not sure if I'm on the right track here but I think I'm experiencing the same issue. My install might be a bit of a mess combining bits from the VFIO Tips site and Ubuntu guides on GPU passthrough, but I *did* have it all working for a few hours at a stretch before I got this lock up.

The trouble with this is that after the host lockup, the Windows VM seems to corrupt the EFI config or something like that as I can never get it to boot again properly, even though the main partition seems fine when tested in a bootable WinPE distro.

I'd be happy to supply versions and configs to help if it's related however.

john doe (avenger337) wrote :

Enabling MSI interrupts works for me. One note is that Windows updates will sometimes revert the changes so if this starts breaking after an update you may need to re-apply the registry changes.
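
Since the registry changes may need re-applying after updates, a small generator for a .reg file could save some clicking. This is only a sketch under assumptions: it uses the MSISupported value under "Device Parameters\Interrupt Management\MessageSignaledInterruptProperties", which is the location Microsoft documents for enabling MSI, and the device instance path in the usage note is a placeholder that you would replace with the one Device Manager shows for your card. The function name is hypothetical.

```python
REG_HEADER = "Windows Registry Editor Version 5.00"
ENUM_ROOT = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Enum"
MSI_SUBKEY = (
    r"Device Parameters\Interrupt Management\MessageSignaledInterruptProperties"
)


def msi_reg_file(device_instance_paths, enable=True):
    """Build .reg file text that sets MSISupported for each device instance path."""
    value = 1 if enable else 0
    blocks = [REG_HEADER]
    for path in device_instance_paths:
        # Full key: Enum root + device instance path + MSI properties subkey.
        key = "\\".join([ENUM_ROOT, path, MSI_SUBKEY])
        blocks.append(f'[{key}]\n"MSISupported"=dword:{value:08x}')
    # A blank line separates the header and each key section.
    return "\n\n".join(blocks) + "\n"
```

For example, msi_reg_file([r"PCI\VEN_XXXX&DEV_XXXX&SUBSYS_XXXXXXXX&REV_XX\INSTANCE"]) returns text that can be saved as a .reg file and imported in the guest; the guest typically needs a reboot before the device switches modes.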

Clif Houck (clifhouck) wrote :

Updating NVIDIA drivers in the guest also seems to disable MSI for some reason. Oddly enough I did not run into the host hard locking though.

Jimi (jimijames-bove) wrote :

I haven't remembered to reset those interrupts in a year, but I also haven't remembered to update my drivers in about as long, so I could be still on the right setting. I've also been on AMD for that year, and I don't remember whether this bug applies to modern AMD cards.

Benjamin (omega52390) wrote :

I've been experiencing something that sounds very similar to what has been described in this bug report and want to see if you guys think it's the same issue. From a cold boot, everything is fine for a while and I can restart my VM without trouble. But after a long uptime or a stressful workload (mining or gaming), shutting down the VM makes all the host displays go to sleep and the system locks up, which I had been assuming was a display driver crash. I can also sometimes trigger the exact same lockup by calling lspci. Once such a lockup has happened, I have to hard reset.

Where this gets even weirder is that afterwards I get the same lockup during the startup process, around when Xorg loads. When this happens, I either have to leave my computer alone for around 30 minutes to an hour, or I can get it to boot by disabling the IOMMU with iommu=off as a kernel parameter; then, if I wait around 30 minutes to an hour, I can restart and it will boot fine again with iommu=pt. (I get a kernel panic if I don't use iommu=pt.)

Hardware
Ryzen R5 1600
ASRock AB350M Pro4
32 GB RAM
Host GPU: RX 580
Guest GPU: GTX 1070
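
When comparing boots with iommu=off and iommu=pt as described above, it can help to record which IOMMU options the running kernel was actually started with. A minimal sketch, assuming the options of interest are iommu, intel_iommu and amd_iommu; the function names are made up for illustration.

```python
IOMMU_PARAMS = ("iommu", "intel_iommu", "amd_iommu")


def iommu_options(cmdline):
    """Return the IOMMU-related name=value parameters from a kernel command line."""
    found = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        if sep and name in IOMMU_PARAMS:
            found[name] = value
    return found


def read_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with."""
    with open(path) as f:
        return f.read()
```

On the host, iommu_options(read_cmdline()) returns a dict of whichever of those parameters were set for the current boot.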
