Comment 1 for bug 1630304

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-04 18:34 EDT-------
Some observations:

1) QEMU appears to be sending the 'device-removed' event prematurely. The output below shows that the device's VFIO group FD is still held open by the QEMU process at the time it signals libvirt that the device unplug/cleanup has completed:

root@ltc-fire1:~# virsh event ltc-fire1-vm3-ubuntu-16.10 --event device-removed && lsof /dev/vfio/7
event 'device-removed' for domain ltc-fire1-vm3-ubuntu-16.10: hostdev0
events received: 1

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
qemu-syst 31231 libvirt-qemu 26u CHR 242,0 0t0 750 /dev/vfio/7
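
The same check can be made without lsof by walking /proc. This is only a minimal sketch, assuming a Linux /proc layout; the function name vfio_group_holders and its optional second argument (an alternate proc root, there purely so the function can be exercised against a fake tree) are hypothetical and not part of any existing tool:

```shell
# List PIDs that hold the given path open, by scanning /proc/<pid>/fd symlinks.
# Usage: vfio_group_holders /dev/vfio/7 [proc_root]
vfio_group_holders() {
    target=$1
    proc_root=${2:-/proc}
    for fd in "$proc_root"/[0-9]*/fd/*; do
        # Unreadable or vanished entries make readlink fail; skip them.
        link=$(readlink "$fd" 2>/dev/null) || continue
        if [ "$link" = "$target" ]; then
            pid=${fd#"$proc_root"/}
            echo "${pid%%/*}"
        fi
    done | sort -un
}
```

Run as root right after the event fires, "vfio_group_holders /dev/vfio/7" should print the QEMU PID for as long as the group FD is still open, and nothing once it has been released.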

2) In response to this event, libvirt issues the following sequence to rebind the VF:

echo $DEVID >/sys/bus/pci/drivers/vfio-pci/unbind
echo $DEVID >/sys/bus/pci/drivers_probe
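
To illustrate why the ordering matters, the same two writes could be gated on the VFIO group device no longer being open. This is a sketch for experimenting by hand, not what libvirt does, and it has not been verified to close the race described here. The names wait_until and rebind_when_released, the 10-second default timeout, and the overridable sysfs path are all assumptions made for illustration:

```shell
# Retry a command once per second until it succeeds or the timeout expires.
# Usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
wait_until() {
    remaining=$1
    shift
    until "$@"; do
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
}

# Rebind DEVID only after nothing holds GROUP_DEV open.
# Usage: rebind_when_released DEVID GROUP_DEV [SYSFS_PCI] [TIMEOUT]
# SYSFS_PCI defaults to /sys/bus/pci; it is overridable only for dry runs.
rebind_when_released() {
    devid=$1
    group_dev=$2
    sysfs_pci=${3:-/sys/bus/pci}
    timeout=${4:-10}
    # lsof exits non-zero when no process has the file open.
    wait_until "$timeout" sh -c '! lsof "$1" >/dev/null 2>&1' sh "$group_dev" || return 1
    echo "$devid" > "$sysfs_pci/drivers/vfio-pci/unbind"
    echo "$devid" > "$sysfs_pci/drivers_probe"
}
```

For the device in this report the call would be "rebind_when_released 0001:01:00.2 /dev/vfio/7". Note that the check only covers the group FD; it says nothing about whether the kernel has finished restoring the default DMA windows.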

3) On the VFIO side, this consistently leads to mlx5_core attempting to bind to the device while VFIO is still running its cleanup routines:

[ 120.099498] KVM guest htab at c000000f2b000000 (order 26), LPID 1
[ 120.208235] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..3fffffff pg=1000
[ 138.281730] pci 0001:01: 0.2: [PE# 005] Setting up window#1 800000000000000..8000001ffffffff pg=10000
[ 396.873573] vfio-pci 0001:01:00.2: No device request channel registered, blocked until released by user
[ 396.873791] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 396.873796] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[ 396.873908] mlx5_core 0001:01:00.2: enabling device (0000 -> 0002)
[ 396.873940] mlx5_core 0001:01:00.2: Using 32-bit DMA via iommu
[ 396.874034] mlx5_core 0001:01:00.2: firmware version: 12.17.1010

The full cleanup path should include something like:
[ 4762.425039] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 4762.425043] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[ 4762.432014] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..7fffffff pg=1000
[ 4762.432018] pci 0001:01: 0.2: [PE# 005] Enabling 64-bit DMA bypass

So the driver is attempting to enable the device before the default DMA windows have been restored.
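
A rough way to spot this ordering in a captured log is to track the window messages and flag any "enabling device" line that arrives after a removal but before window#0 is set up again. The sketch below assumes the message wording shown in the logs above; the function name check_dma_window_order is hypothetical, and the match is purely textual, so it cannot tell a guest window from the restored default one:

```shell
# Read kernel log lines on stdin. Exit 1 (and print the offending line)
# if a driver enables the device after DMA windows were removed and
# before window#0 has been set up again; exit 0 otherwise.
check_dma_window_order() {
    awk '
        /Removing DMA window/ { removed = 1 }
        /Setting up window#0/ { removed = 0 }
        /enabling device/ && removed { bad = 1; print "premature bind: " $0 }
        END { exit (bad ? 1 : 0) }
    '
}
```

Usage would be along the lines of "dmesg | check_dma_window_order", run after an unplug attempt.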

4) The sleep Carol inserted in the VFIO cleanup path (see above) seems to avoid the issue. This suggests that the reprobe doesn't run blindly but instead waits for a signal of some sort, and that without the explicit sleep this signal fires prematurely.

This probably needs to be addressed at multiple levels: a fix in QEMU to defer the device-deleted event until VFIO has cleaned up the device, and a fix in the VFIO path to avoid crashing the host if someone were to issue the reprobe manually while the device is still in use.

A possible workaround that might be worth trying in the meantime is specifying managed='no' in the device XML, which according to the libvirt documentation would prevent libvirt from automatically rebinding the device back to its default host driver after unplug. But I saw a mention that this might not be supported yet for KVM, so it's not a given.
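
For anyone wanting to try this, the helper below prints a hostdev element with managed='no' for a given PCI address. It is a sketch only: the function name hostdev_xml_unmanaged is made up for illustration, the address used is the one from the logs above, and whether the setting actually takes effect in this setup is untested.

```shell
# Print a <hostdev> element for PCI address DDDD:BB:SS.F with managed='no'.
# Usage: hostdev_xml_unmanaged 0001:01:00.2
hostdev_xml_unmanaged() {
    addr=$1
    domain=${addr%%:*}      # 0001
    rest=${addr#*:}         # 01:00.2
    bus=${rest%%:*}         # 01
    slotfn=${rest#*:}       # 00.2
    slot=${slotfn%%.*}      # 00
    fn=${slotfn#*.}         # 2
    cat <<EOF
<hostdev mode='subsystem' type='pci' managed='no'>
  <source>
    <address domain='0x$domain' bus='0x$bus' slot='0x$slot' function='0x$fn'/>
  </source>
</hostdev>
EOF
}
```

The output can be saved to a file and passed to "virsh attach-device". With managed='no' the host-side binding becomes the administrator's job: "virsh nodedev-detach" before attaching, and "virsh nodedev-reattach" once the device has really been released after unplug.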