[Hyper-V] PCI Passthrough kernel hang and explicit barriers
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Medium | Unassigned |
Xenial | Fix Released | Medium | Unassigned |
Yakkety | Fix Released | Medium | Unassigned |
Bug Description
Two upstream commits (right now in Bjorn Helgaas's PCI tree, and heading to Linus's tree) address potential hangs in PCI passthrough. Please consider these upstream items for 16.10 and 16.04 (and HWE kernels based on lts-xenial).
PCI: hv: Report resources release after stopping the bus
Kernel hang is observed when pci-hyperv module is released with device
drivers still attached. E.g., when I do 'rmmod pci_hyperv' with BCM5720
device pass-through-ed (tg3 module) I see the following:
NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [rmmod:2104]
...
Call Trace:
[<ffffffffa06
[<ffffffffa06
[<ffffffffa06
[<ffffffffa06
[<ffffffffa06
[<ffffffffa06
...
[<ffffffffa06
[<ffffffff813
[<ffffffff814
[<ffffffff814
[<ffffffff813
[<ffffffff813
[<ffffffffa02
The problem seems to be that we report local resources release before
stopping the bus and removing devices from it and device drivers may try to
perform some operations with these resources on shutdown. Move resources
release report after we do pci_stop_
Signed-off-by: Vitaly Kuznetsov <email address hidden>
Signed-off-by: Bjorn Helgaas <email address hidden>
Acked-by: Jake Oshins <email address hidden>
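The ordering problem described in this commit message can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the pci-hyperv code: `struct bus_model`, `driver_remove()`, `stop_and_remove_bus()`, `report_resources_released()` and the two `teardown_*()` functions are hypothetical names invented for the example, and a counter stands in for the soft lockup that the real kernel shows.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: none of these names exist in the kernel. */
struct bus_model {
    bool resources_owned;  /* true until the release has been reported */
    bool driver_bound;     /* a device driver (e.g. a NIC driver) is attached */
    int  failed_accesses;  /* accesses attempted after resources were released */
};

/* A driver's remove path usually still touches the device on shutdown. */
static void driver_remove(struct bus_model *b)
{
    assert(b->driver_bound);
    if (!b->resources_owned)
        b->failed_accesses++;  /* in the real kernel this is where things go wrong */
    b->driver_bound = false;
}

/* Stands in for stopping the bus and removing its devices. */
void stop_and_remove_bus(struct bus_model *b)
{
    if (b->driver_bound)
        driver_remove(b);
}

/* Stands in for telling the host that local resources are released. */
void report_resources_released(struct bus_model *b)
{
    b->resources_owned = false;
}

/* Order described as problematic: release is reported first. */
int teardown_buggy(struct bus_model *b)
{
    report_resources_released(b);
    stop_and_remove_bus(b);
    return b->failed_accesses;
}

/* Order described as the fix: drivers detach while resources are still valid. */
int teardown_fixed(struct bus_model *b)
{
    stop_and_remove_bus(b);
    report_resources_released(b);
    return b->failed_accesses;
}
```

In this model the fixed order lets the driver's shutdown path run while the resources are still owned, which is the property the commit message asks for.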
PCI: hv: Add explicit barriers to config space access (pci/host-hv)
I'm trying to pass-through Broadcom BCM5720 NIC (Dell device 1f5b) on a
Dell R720 server. Everything works fine when the target VM has only one
CPU, but SMP guests reboot when the NIC driver accesses PCI config space
with hv_pcifront_
appears to be induced by the hypervisor and no crash is observed. Windows
event logs are not helpful at all ('Virtual machine ... has quit
unexpectedly'). The particular access point is always different and
putting debug between them (printk/mdelay/...) moves the issue further
away. The server model affects the issue as well: on Dell R420 I'm able to
pass-through BCM5720 NIC to SMP guests without issues.
While I'm obviously failing to reveal the essence of the issue I was able
to come up with a (possible) solution: if explicit barriers are added to
hv_pcifront_
The essential minimum is rmb() at the end of _hv_pcifront_
wmb() at the end of _hv_pcifront_
will be sufficient for all hardware. I suggest the following barriers:
1) wmb()/mb() between choosing the function and writing to its space.
2) mb() before releasing the spinlock in both _hv_pcifront_
_hv_
the space won't get re-ordered as drivers may count on that.
Config space access is not supposed to be performance-
explicit barriers should not cause any slowdown.
[bhelgaas: use Linux "barriers" terminology]
Signed-off-by: Vitaly Kuznetsov <email address hidden>
Signed-off-by: Bjorn Helgaas <email address hidden>
Acked-by: Jake Oshins <email address hidden>
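The barrier placement suggested in this commit message (one barrier between choosing the function and accessing its space, one before releasing the lock) can be sketched in userspace C. This is an analogy under loud assumptions, not the patch itself: `struct cfg_model`, `cfg_init()`, `cfg_read()` and `cfg_write()` are hypothetical names, an `atomic_flag` spinlock stands in for the kernel spinlock, and C11 `atomic_thread_fence()` stands in for the kernel's `wmb()`/`mb()`, which have different semantics against real MMIO and a hypervisor. Only the position of the barriers is meant to carry over.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define NUM_FUNCS 4   /* hypothetical number of functions */
#define CFG_WORDS 64  /* hypothetical config space size in 32-bit words */

/* Hypothetical model of a windowed config space: a selector picks the
 * function whose space is then read or written. */
struct cfg_model {
    atomic_flag lock;                      /* stands in for the kernel spinlock */
    _Atomic uint32_t selected;             /* stands in for the function selector */
    uint32_t space[NUM_FUNCS][CFG_WORDS];  /* stands in for the config spaces */
};

static void model_lock(struct cfg_model *m)
{
    while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
        ;  /* spin until the flag was previously clear */
}

static void model_unlock(struct cfg_model *m)
{
    atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

void cfg_init(struct cfg_model *m)
{
    atomic_flag_clear(&m->lock);
    atomic_init(&m->selected, 0);
    memset(m->space, 0, sizeof(m->space));
}

uint32_t cfg_read(struct cfg_model *m, uint32_t func, uint32_t word)
{
    uint32_t val;

    assert(func < NUM_FUNCS && word < CFG_WORDS);
    model_lock(m);
    atomic_store_explicit(&m->selected, func, memory_order_relaxed);
    /* Barrier 1: the selection must be ordered before the access. */
    atomic_thread_fence(memory_order_seq_cst);
    val = m->space[atomic_load_explicit(&m->selected, memory_order_relaxed)][word];
    /* Barrier 2: the access must complete before the lock is released. */
    atomic_thread_fence(memory_order_seq_cst);
    model_unlock(m);
    return val;
}

void cfg_write(struct cfg_model *m, uint32_t func, uint32_t word, uint32_t val)
{
    assert(func < NUM_FUNCS && word < CFG_WORDS);
    model_lock(m);
    atomic_store_explicit(&m->selected, func, memory_order_relaxed);
    /* Barrier 1: the selection must be ordered before the write. */
    atomic_thread_fence(memory_order_seq_cst);
    m->space[atomic_load_explicit(&m->selected, memory_order_relaxed)][word] = val;
    /* Barrier 2: the write must complete before the lock is released. */
    atomic_thread_fence(memory_order_seq_cst);
    model_unlock(m);
}
```

In plain userspace memory the lock alone already orders these accesses between threads, so the fences here are redundant; they are kept only to mark where the commit message proposes barriers for the real device path.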
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key kernel-hyper-v xenial yakkety
Changed in linux (Ubuntu):
status: Confirmed → Triaged
Changed in linux (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:
apport-collect 1581243
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.