Ubuntu 16.10 KVM: Issue doing hotplug detach to SRIOV VF

Bug #1630304 reported by bugproxy
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Opinion
Undecided
Unassigned

Bug Description

---Problem Description---
I can not get hotplug attach to work in Ubuntu but if I try to detach a CX4 VF from a guest I am getting some issues:
Like in this case:
[ 474.393308] vfio-pci 0001:01:00.3: No device request channel registered, blocked until released by user
[ 474.393543] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
[ 474.393553] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
[ 474.393906] mlx5_core 0001:01:00.3: enabling device (0000 -> 0002)
[ 474.393939] mlx5_core 0001:01:00.3: Using 32-bit DMA via iommu
[ 474.400360] pci 0001:01: 0.3: [PE# 006] Setting up window#0 0..7fffffff pg=1000
[ 474.400380] mlx5_core 0001:01:00.3: firmware version: 12.17.226
[ 474.401341] pci 0001:01: 0.3: [PE# 006] Enabling 64-bit DMA bypass
[ 474.402284] EEH: Frozen PE#6 on PHB#1 detected
[ 474.402475] EEH: PE location: Slot4, PHB location: N/A
[ 474.403699] EEH: This PCI device has failed 1 times in the last hour
[ 474.403700] EEH: Notify device drivers to shutdown
[ 474.403707] mlx5_core 0001:01:00.3: mlx5_pci_err_detected was called
[ 474.403711] mlx5_core 0001:01:00.3: 0001:01:00.3:mlx5_enter_error_state:115:(pid 779): start
[ 474.403870] mlx5_core 0001:01:00.3: 0001:01:00.3:mlx5_enter_error_state:120:(pid 779): end

One time I saw
SSep 13 09:41:32 ltc-fire1 kernel: [70437.943722] vfio-pci 0001:01:00.3: No device request channel registered, blocked until released by user
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944076] mlx5_core 0001:01:00.3: enabling device (0000 -> 0002)
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944110] mlx5_core 0001:01:00.3: Using 32-bit DMA via iommu
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944145] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944152] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944195] mlx5_core 0001:01:00.3: firmware version: 12.17.226
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944260] Unable to handle kernel paging request for data at address 0x00000000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944533] Faulting instruction address: 0xc0000000005b37e0
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944592] Oops: Kernel access of bad area, sig: 11 [#1]
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944636] SMP NR_CPUS=2048 NUMA PowerNV
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944851] Modules linked in: vfio_pci irqbypass vfio_iommu_spapr_tce vfio_virqfd vfio vfio_spapr_eeh xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp kvm_hv kvm_pr kvm ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE) ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx5_core(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) mlx4_en(OE) ib_addr(OE) ib_netlink(OE) mlx4_core(OE) mlx_compat(OE) bridge stp llc joydev input_leds mac_hid ofpart at24 cmdlinepart powernv_flash ipmi_powernv nvmem_core uio_pdrv_genirq opal_prd mtd ipmi_msghandler uio ibmpowernv powernv_rng binfmt_misc dm_multipath knem(OE) ip_tables x_tables autofs4 hid_generic usbhid hid uas usb_storage ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm ahci devlink libahci [last unloaded: mlx4_core]
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946007] CPU: 40 PID: 12501 Comm: libvirtd Tainted: G OE 4.7.0unofficial #5
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946074] task: c000000ec319a200 ti: c000000ec324c000 task.ti: c000000ec324c000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946140] NIP: c0000000005b37e0 LR: c0000000005ad070 CTR: 0000000000000000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946208] REGS: c000000ec324f100 TRAP: 0300 Tainted: G OE (4.7.0unofficial)
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946286] MSR: 900000010280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 84028844 XER: 20000000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] CFAR: c000000000008468 DAR: 0000000000000000 DSISR: 40000000 SOFTE: 0
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR00: c0000000005d19c8 c000000ec324f380 c0000000013bef00 0000000000000000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR04: ffffffffffffffff 0000000000000000 ffffffffffffffff 000000000000ffff
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR08: 0000000000000000 0000000000000000 ffffffffffffffff f000000003803140
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR12: 0000000024048840 c00000000fb96800 c000000ec0def080 c000000ec0def000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR16: 00003fff93ef0000 0000000000000001 0000000000000000 c000000ec0def100
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR20: c000000ec0def118 0000000000000000 0000000000010000 0000000000000000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR24: 0000000000010000 ffffffffffff0000 0000000000000000 000000000000ffff
Sep 13 09:41:32 ltc-fire1 kernel: [70437.946533] GPR28: 0000000000000000 ffffffffffffffff 0000000000010000 0000000000000000
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947437] NIP [c0000000005b37e0] _find_next_bit.part.0+0x10/0x90
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947496] LR [c0000000005ad070] bitmap_find_next_zero_area_off+0x70/0x110
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947553] Call Trace:
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947578] [c000000ec324f380] [c000000ec324f3c0] 0xc000000ec324f3c0 (unreliable)
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947659] [c000000ec324f3e0] [c0000000005d19c8] iommu_area_alloc+0x78/0x130
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947731] [c000000ec324f440] [c000000000041a0c] iommu_range_alloc+0x18c/0x410
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947812] [c000000ec324f500] [c000000000041f24] iommu_alloc+0x74/0x190
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947885] [c000000ec324f570] [c0000000000437b8] iommu_alloc_coherent+0x138/0x1e0
Sep 13 09:41:32 ltc-fire1 kernel: [70437.947968] [c000000ec324f620] [c000000000041538] dma_iommu_alloc_coherent+0x58/0x80
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948057] [c000000ec324f670] [d000000012e75eb0] mlx5_cmd_init+0x118/0x940 [mlx5_core]
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948146] [c000000ec324f740] [d000000012e722f0] mlx5_load_one+0x198/0x16c0 [mlx5_core]
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948235] [c000000ec324f8b0] [d000000012e73e5c] init_one+0x3b4/0x5e0 [mlx5_core]
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948314] [c000000ec324f950] [c0000000006042dc] local_pci_probe+0x6c/0x140
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948381] [c000000ec324f9e0] [c0000000006051b8] pci_device_probe+0x178/0x200
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948462] [c000000ec324fa40] [c0000000006f6cb0] driver_probe_device+0x220/0x580
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948542] [c000000ec324fad0] [c0000000006f348c] bus_for_each_drv+0x8c/0x100
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948611] [c000000ec324fb20] [c0000000006f67b0] __device_attach+0x140/0x220
Sep 13 09:41:32 ltc-fire1 kernel: [70437.948679] [c000000ec324fbb0] [c0000000006f4510] bus_rescan_devices_helper+0x70/0xc0

I think the key is that I am seeing this trace with 32bit dma which is maybe causing issue
Sep 13 09:41:32 ltc-fire1 kernel: [70437.943722] vfio-pci 0001:01:00.3: No device request channel registered, blocked until released by user
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944076] mlx5_core 0001:01:00.3: enabling device (0000 -> 0002)
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944110] mlx5_core 0001:01:00.3: Using 32-bit DMA via iommu
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944145] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944152] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
Sep 13 09:41:32 ltc-fire1 kernel: [70437.944195] mlx5_core 0001:01:00.3: firmware version: 12.17.226

---uname output---
Linux ltc-fire1 4.7.0unofficial #5 SMP Mon Sep 5 08:53:38 EDT 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = P8

---Debugger---
A debugger is not configured

---Steps to Reproduce---
 Im using a kernel 4.7 plus the qemu is from Mike Roth qemu with DDW.
To recreate just need to bring a guest with VF and then I did
virsh detach-device ltc-fire1-vm2-ubuntu-16.10 ./add_cx4vf.xml --live

This should be ship issue given that we now have whole sriov hotplug broken on Ubuntu 16.10

> Hi Carol,
>
> Just to confirm, have you hit this with the new QEMU package from Jon's PPA?
> It should be based on QEMU 2.6.1, which should have some fixes related to EEH

I have not tried. Will do during the day.

As far as the unplug issues, I'm unable to reproduce any of the host errors noted in the initial comments using the latest 2.6.1-based package. I'm only able to reproduce the following guest errors in dmesg. Carol, can you confirm whether or not this is also the case for you?

Here's what I'm seeing, but it's intermittent. Sometimes I see up to 7 successful unplugs before reproducing, but once I do reproduce from what I can tell the trace is always this:

[ 63.314271] Unable to handle kernel paging request for data at address 0x00000000
[ 63.327986] Faulting instruction address: 0xd000000000ea5cbc
[ 63.328388] Oops: Kernel access of bad area, sig: 11 [#1]
[ 63.328741] SMP NR_CPUS=2048 NUMA pSeries
[ 63.329085] Modules linked in: rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) ib_ipoib(OE) ib_cm(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE) ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE) sg ghash_generic gf128mul vmx_crypto ip_tables x_tables autofs4 ext4 crc16 jbd2 fscrypto mbcache sr_mod cdrom virtio_net virtio_blk ibmvscsi scsi_transport_srp mlx5_core(OE) crc32c_vpmsum ptp pps_core mlx_compat(OE) virtio_pci virtio_ring virtio
[ 63.332759] CPU: 14 PID: 5234 Comm: drmgr Tainted: G W OE 4.8.0-15-generic #16-Ubuntu
[ 63.332911] task: c000000003bb9780 task.stack: c000000003cd8000
[ 63.333025] NIP: d000000000ea5cbc LR: d000000000ea5cb4 CTR: c0000000002ddc10
[ 63.333151] REGS: c000000003cdb6d0 TRAP: 0300 Tainted: G W OE (4.8.0-15-generic)
[ 63.333305] MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 22000228 XER: 20000000
[ 63.333716] CFAR: c000000000142f3c DAR: 0000000000000000 DSISR: 40000000 SOFTE: 1
GPR00: d000000000ea5cb4 c000000003cdb950 d000000000f288e8 0000000000000000
GPR04: c0000000b3716b40 d000000000ea5b48 c0000000010f6a38 c0000000010d7b00
GPR08: c000000000fa7b00 0000000000000000 0000000000000000 d000000000eeff70
GPR12: c0000000002ddc10 c00000000fb87e00 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: c000000004ff5b58 c000000004ff5b20 0000000000000001 c000000004fc04e8
GPR28: c000000004fc0000 0000000000000000 c000000004fc0290 0000000000000000
[ 63.335484] NIP [d000000000ea5cbc] destroy_msg_cache+0x84/0x130 [mlx5_core]
[ 63.335609] LR [d000000000ea5cb4] destroy_msg_cache+0x7c/0x130 [mlx5_core]
[ 63.335713] Call Trace:
[ 63.335805] [c000000003cdb950] [d000000000ea5cb4] destroy_msg_cache+0x7c/0x130 [mlx5_core] (unreliable)
[ 63.336003] [c000000003cdb9a0] [d000000000ea66e8] mlx5_cmd_cleanup+0x40/0x80 [mlx5_core]
[ 63.336206] [c000000003cdb9d0] [d000000000ea1a70] mlx5_unload_one+0x248/0x420 [mlx5_core]
[ 63.336383] [c000000003cdba90] [d000000000ea2028] remove_one+0xb0/0x170 [mlx5_core]
[ 63.336539] [c000000003cdbb20] [c0000000005660e0] pci_device_remove+0x70/0x110
[ 63.336688] [c000000003cdbb60] [c000000000610e80] __device_release_driver+0xc0/0x1a0
[ 63.336828] [c000000003cdbb90] [c000000000610fa0] device_release_driver+0x40/0x70
[ 63.337015] [c000000003cdbbc0] [c00000000055a730] pci_stop_bus_device+0xe0/0x100
[ 63.337156] [c000000003cdbc00] [c00000000055a914] pci_stop_and_remove_bus_device_locked+0x34/0x60
[ 63.337321] [c000000003cdbc30] [c0000000005692b0] remove_store+0xb0/0xd0
[ 63.337458] [c000000003cdbc80] [c000000000608ef8] dev_attr_store+0x68/0xa0
[ 63.337571] [c000000003cdbcc0] [c0000000003cb3f0] sysfs_kf_write+0x80/0xb0
[ 63.337692] [c000000003cdbd00] [c0000000003ca25c] kernfs_fop_write+0x17c/0x250
[ 63.337825] [c000000003cdbd50] [c00000000030fc3c] __vfs_write+0x6c/0xe0
[ 63.337950] [c000000003cdbd90] [c000000000311144] vfs_write+0xd4/0x240
[ 63.338094] [c000000003cdbde0] [c000000000312e5c] SyS_write+0x6c/0x110
[ 63.338233] [c000000003cdbe30] [c0000000000095e0] system_call+0x38/0x108
[ 63.338364] Instruction dump:
[ 63.338429] 60420000 480498b9 e8410018 813e0060 7fa4eb78 7f83e378 7ffdfb78 3929ffff
[ 63.338697] 913e0060 4bfffe21 7fbff040 7fe3fb78 <ebff0000> 409effd0 3bde0078 7fbed840
[ 63.338925] ---[ end trace a4fbfe789eac3ffc ]---
[ 63.344490]

I'm using QEMU directly, and see the above after the device_del. I see no other errors in guest/host:

root@ltc-fire1:~/mdroth# ./vfio-bind 0001:01:00.2
unbinding 0001:01:00.2 via /sys/bus/pci/devices/0001:01:00.2/driver/unbind
binding 0001:01:00.2
echo 0x15b3 0x1014 >/sys/bus/pci/drivers/vfio-pci/new_id
root@ltc-fire1:~/mdroth# ./vfio-bind 0001:01:00.2

qemu-system-ppc64 --smp sockets=1,cores=2,threads=8 --enable-kvm -M pseries -m 3G,slots=32,maxmem=8G -device virtio-blk-pci,id=disk1,drive=drive1 -drive file=ubuntu16.10.base.img.4,aio=threads,cache=none,if=none,id=drive1 -device spapr-vscsi,id=scsi0,reg=0x2000 -monitor unix:/tmp/mdroth-vm4-hmp.sock,server,nowait -qmp unix:/tmp/mdroth-vm4-qmp.sock,server,nowait -device virtio-net-pci,mac=52:54:00:12:34:01,netdev=net1,id=nic1 -netdev user,hostfwd=tcp:0.0.0.0:2224-:22,id=net1 -device vfio-pci,host=0001:01:00.2,id=vfio0 -L /root/mdroth-qemu/pc-bios -boot c -vga none -nographic

root@ltc-fire1:~# rlwrap socat stdio unix-connect:/tmp/mdroth-vm4-hmp.sock
QEMU 2.6.1 monitor - type 'help' for more information
(qemu) device_del vfio0

I'm planning to do more debugging of the guest driver's mlx5_cmd_cleanup -> destroy_msg_cache cleanup path which seems to be triggering the error, but want to confirm I'm looking at the same issue that Carol is currently tracking.

> Sometimes I see that one, but if I do not see that one then I see issues at
> host which could be the stack trace that I put or an EEH. But the problems
> at host the EEH or the stack trace I always see that the enable device is
> happening before the 64bit dma is set back again to the card.

You mean the enable device for the card in the guest after hotplug? or is this in the unplug path?

Here's what I see when doing unplug:

<bind 0001:01: 0.2 to vfio-pci>
[ 8056.733150] pci 0001:01: 0.2: [PE# 005] Disabling 64-bit DMA bypass
[ 8056.733157] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 8056.733927] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..7fffffff pg=1000
[ 8057.039487] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 8057.073152] vfio-pci 0001:01:00.2: enabling device (0000 -> 0002)

<start guest>
[ 7805.107856] KVM guest htab at c000000f2b000000 (order 26), LPID 1
<guest immediately sets up default 32-bit window>
[ 7805.216462] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..3fffffff pg=1000
<guest kernel boots, enables 64-bit window>
[ 7815.386476] pci 0001:01: 0.2: [PE# 005] Setting up window#1 800000000000000..8000001ffffffff pg=10000
<unplug 0001:01: 0.2 (if guest kernel doesn't crash, if it crashes we never get past here and hang in guest>
[ 7906.714683] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 7906.714690] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
<host re-enables 32-bit window and re-enables 64-bit DMA bypass>
[ 7906.714722] window_size 0x80000000
[ 7906.715205] Calling iommu_init_table
[ 7906.721714] Calling pnv_pci_ioda2_set_window
[ 7906.721851] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..7fffffff pg=1000
[ 7906.722334] pci 0001:01: 0.2: [PE# 005] Enabling 64-bit DMA bypass

I'm unable to reproduce any host crashes in either case. If you're hitting an issue where maybe we're rebinding the vf back to mlx5_core in the host before the window re-init is complete then maybe that might explain what you're seeing. If that's the case it might be a race somewhere in libvirt when it "un-stubs"/re-assigns the device back to default mlx5_core driver. I recall some previous libvirt BZs around avoiding this sort of scenario but i'm not sure if they are still pending or not.

I'll do some more digging to see if I can reproduce without libvirt later today (perhaps by attempting to re-bind to mlx5_core immediately after unplug...in my scenario i think no such re-binding occurs so maybe that's why i don't see the host crashes (or other guest crashes...though it's also strange to me even in this race scenario that the guest would crash do to host race since drmgr fixes are already in place to make sure guest kernel has released the device before unplug completes in host/libvirt. maybe i just haven't seen the guest crashes yet and those are also a separate issue from the host ones))

I was consulting with Mike about this and the last experiment that wanted to try was to put a sleep in vfio so in that way qemu can do all the dma setup before libvirt does the enable device at the KVM host.
I put a sleep in drivers/vfio/vfio.c in function vfio_del_group_dev and that kind of helps in my case. Not sure if that will be the solution but it works for my case.
diff -Naurp vfio.c.save vfio.c
--- vfio.c.save 2016-09-28 10:11:30.047333880 -0500
+++ vfio.c 2016-09-30 13:57:15.376564520 -0500
@@ -34,6 +34,7 @@
 #include <linux/uaccess.h>
 #include <linux/vfio.h>
 #include <linux/wait.h>
+#include <linux/delay.h>

 #define DRIVER_VERSION "0.3"
 #define DRIVER_AUTHOR "Alex Williamson <email address hidden>"
@@ -910,6 +911,7 @@ void *vfio_del_group_dev(struct device *

        vfio_device_put(device);

+ msleep(2000);
        /*
         * If the device is still present in the group after the above
         * 'put', then it is in use and we need to request it from the

Then in dmesg I see
[ 1122.568409] pci 0001:01: 0.3: [PE# 006] Removing DMA window #0
[ 1122.568415] pci 0001:01: 0.3: [PE# 006] Removing DMA window #1
[ 1122.575117] pci 0001:01: 0.3: [PE# 006] Setting up window#0 0..7fffffff pg=1000
[ 1122.575121] pci 0001:01: 0.3: [PE# 006] Enabling 64-bit DMA bypass
[ 1124.584345] mlx5_core 0001:01:00.3: enabling device (0000 -> 0002)
[ 1124.584379] mlx5_core 0001:01:00.3: Using 64-bit DMA iommu bypass
[ 1124.584445] mlx5_core 0001:01:00.3: firmware version: 12.17.1010
[ 1125.062922] mlx5_core 0001:01:00.3: Assigned random MAC address f2:c8:5a:8b:2a:b2
[ 1125.108710] mlx5_core 0001:01:00.3 enP1p1s0f3: renamed from eth0

Mirroring bug to Launchpad for Canonical's awareness...

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-146212 severity-critical targetmilestone-inin1610
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-04 18:34 EDT-------
Some observations:

1) QEMU appears to be sending the 'device-removed' event prematurely. The below output shows that the device's VFIO group FD is still open by the QEMU process at the time it signals libvirt that the device unplug/cleanup has completed:

root@ltc-fire1:~# virsh event ltc-fire1-vm3-ubuntu-16.10 --event device-removed && lsof /dev/vfio/7
event 'device-removed' for domain ltc-fire1-vm3-ubuntu-16.10: hostdev0
events received: 1

COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
qemu-syst 31231 libvirt-qemu 26u CHR 242,0 0t0 750 /dev/vfio/7

2) In response to this event, libvirt issues the following sequence to rebind the VF:

echo $DEVID >/sys/bus/pci/drivers/vfio-pci/unbind
echo $DEVID >/sys/bus/pci/drivers_probe

3) On the VFIO side, this consistently leads to mlx5_core attempting to bind to the device while VFIO is still running it's cleanup routines:

[ 120.099498] KVM guest htab at c000000f2b000000 (order 26), LPID 1
[ 120.208235] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..3fffffff pg=1000
[ 138.281730] pci 0001:01: 0.2: [PE# 005] Setting up window#1 800000000000000..8000001ffffffff pg=10000
[ 396.873573] vfio-pci 0001:01:00.2: No device request channel registered, blocked until released by user
[ 396.873791] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 396.873796] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[ 396.873908] mlx5_core 0001:01:00.2: enabling device (0000 -> 0002)
[ 396.873940] mlx5_core 0001:01:00.2: Using 32-bit DMA via iommu
[ 396.874034] mlx5_core 0001:01:00.2: firmware version: 12.17.1010

The full cleanup path should include something like:
[ 4762.425039] pci 0001:01: 0.2: [PE# 005] Removing DMA window #0
[ 4762.425043] pci 0001:01: 0.2: [PE# 005] Removing DMA window #1
[ 4762.432014] pci 0001:01: 0.2: [PE# 005] Setting up window#0 0..7fffffff pg=1000
[ 4762.432018] pci 0001:01: 0.2: [PE# 005] Enabling 64-bit DMA bypass

So the driver is attempting to enable the device before the default DMA windows have been restored

4) The sleep Carol inserted above in VFIO cleanup path seems to avoid the issue. This suggests that the reprobe doesn't blindly run but instead waits for a signal of some sort, but that that signaling seems to happen prematurely without the explicit sleep.

This probably needs to be addressed at multiple levels, a fix in QEMU to defer the device-deleted event until VFIO has cleanup up the device, and a fix in VFIO path to avoid crashing the host if someone were to issue the reprobe manually while the device is still in use.

A possible workaround that might be worth trying in the meantime is specifying managed='no' in the device XML, which according to libvirt documentation would prevent libvirt from automatically rebinding the device back to default in the host after unplug. But I saw mention that maybe this wasn't supported yet for KVM, so it's not a given.

Revision history for this message
bugproxy (bugproxy) wrote : syslog

------- Comment (attachment only) From <email address hidden> 2016-10-10 06:08 EDT-------

bugproxy (bugproxy)
tags: added: severity-high
removed: severity-critical
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → nobody
Revision history for this message
Andrew Cloke (andrew-cloke) wrote :

Following the comment "Mirroring bug to Launchpad for Canonical's awareness..." marking as "opinion".

Changed in linux (Ubuntu):
status: New → Opinion
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-08-04 16:11 EDT-------
Michael sent an updated version for his patch to the QEMU community last week [0].

0. https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg08410.html

bugproxy (bugproxy)
tags: added: targetmilestone-inin1710
removed: targetmilestone-inin1610
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-11-21 11:31 EDT-------
The patches are upstream and will be included in QEMU 2.11:

1: commit 04162f8f4bcf8c9ae2422def4357289b44208c8c
Author: Michael Roth <email address hidden>
Date: Mon Oct 16 17:23:13 2017 -0500

qdev: store DeviceState's canonical path to use when unparenting

2: commit 2fc06c4ac65594ad248e9a9150ebdde9ff5a1253
Author: Michael Roth <email address hidden>
Date: Mon Oct 16 17:23:14 2017 -0500

Revert "qdev: Free QemuOpts when the QOM path goes away"

3: commit f7b879e072ae6839b1b1d1312f48fa7f256397e2
Author: Michael Roth <email address hidden>
Date: Mon Oct 16 17:23:15 2017 -0500

qdev: defer DEVICE_DEL event until instance_finalize()

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-02-23 06:46 EDT-------
Any updates?

As part of the bug backlog screening exercise, we are looking for updates on the bugs that are not witnessing activity for quite sometime.

Please post the latest status on this bug so that it can closed accordingly.

Thanks for your support!!

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-07-23 12:14 EDT-------
probably Ubuntu 18.04 has the changes so will close.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.