device hotplug of vfio devices can lead to deadlock in vfio_pci_release

Bug #1792099 reported by Andy Whitcroft
This bug affects 1 person
Affects         | Status       | Importance | Assigned to    | Milestone
linux (Ubuntu)  | Fix Released | Critical   | Andy Whitcroft |
Bionic          | Fix Released | Critical   | Andy Whitcroft |

Bug Description

[Impact]

Attempts to hotplug devices shared with userspace (qemu) via vfio trigger a deadlock in the kernel. A reboot is required to recover.

[Test Case]

Set up a KVM instance with attached devices, then attempt to hotplug those devices using ipmitool.

[Regression Potential]

The change is to an uncommonly used driver. There are common code changes, but these are a no-op in the normal case, and basic operation should be easy to confirm.

[Other Info]

This fix has been verified by the reporter as fixing the deadlock.

===

We are seeing deadlocks during hotplug of devices under vfio.

According to the Linux kernel source code, vfio_pci_remove() and vfio_pci_release() can deadlock against each other on PCIe hotplug events. This can be avoided either by skipping the PCIe reset functionality or by calling device_unlock() in vfio_pci_remove() before calling vfio_del_group_dev().

Code flow on PCIe hotplug event:

Execution flow 1:
  device_release_driver() ( https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L935 )
   device_release_driver_internal() ( https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L908 )
   device_lock(dev); ( https://elixir.bootlin.com/linux/latest/source/drivers/base/dd.c#L915 )
   vfio_pci_remove() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/vfio/pci/vfio_pci.c#L392 )
     vfio_del_group_dev() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/vfio/vfio.c#L923 )
       send event request to user and wait for VFIO_PCI_DEVICE release in vfio_pci_release() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/vfio/vfio.c#L967 )

Execution flow 2 triggered by above step "send event request to user":
  vfio_pci_release() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/vfio/pci/vfio_pci.c#L392 )
    vfio_pci_disable() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/vfio/pci/vfio_pci.c#L302 )
      vfio_pci_try_bus_reset() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/vfio/pci/vfio_pci.c#L1346 )
        pci_try_reset_bus() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/pci/pci.c#L4981 )
          pci_bus_save_and_disable() ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/pci/pci.c#L4760 )
            pci_dev_lock(dev); ( https://elixir.bootlin.com/linux/v4.18.5/source/drivers/pci/pci.c#L4765 )

             DEADLOCK here, since the PCI device lock is already held by the device remove code path in drivers/base/dd.c (execution flow 1).
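
The lock ordering in the two flows above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: it is single-threaded, every name in it (model_dev, model_remove, model_release and the lock helpers) is hypothetical and not a kernel symbol, and it uses a trylock so that the stuck case is reported instead of hanging. It is not the actual fix applied to the kernel; it only shows why dropping the device lock around the userspace request lets the release path make progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a device and its lock; not kernel code. */
struct model_dev {
    bool locked;    /* true while the modelled device lock is held */
    bool released;  /* true once the modelled release path has finished */
};

static void model_lock(struct model_dev *d)
{
    assert(!d->locked);  /* in this model, locking twice would hang */
    d->locked = true;
}

static void model_unlock(struct model_dev *d)
{
    assert(d->locked);
    d->locked = false;
}

/* Stands in for pci_dev_lock(); a trylock so the model reports, not blocks. */
static bool model_trylock(struct model_dev *d)
{
    if (d->locked)
        return false;
    d->locked = true;
    return true;
}

/* Flow 2: userspace closes the device, and the reset needs the device lock. */
static bool model_release(struct model_dev *d)
{
    if (!model_trylock(d))
        return false;    /* the real code would block here forever */
    d->released = true;
    model_unlock(d);
    return true;
}

/*
 * Flow 1: the remove path takes the device lock, asks userspace to let go
 * of the device, and waits for the release. Returns true if the release
 * completed, false if it would have deadlocked.
 */
static bool model_remove(struct model_dev *d, bool drop_lock_for_request)
{
    bool ok;

    model_lock(d);                /* models device_lock(dev) in dd.c */
    if (drop_lock_for_request)
        model_unlock(d);          /* models the proposed device_unlock() */
    ok = model_release(d);        /* request sent, userspace releases */
    if (drop_lock_for_request)
        model_lock(d);            /* retake the lock before returning */
    model_unlock(d);
    return ok && d->released;
}
```

In this model, model_remove(&dev, false) returns false because the release path cannot take the lock held by the remove path, while model_remove(&dev, true) returns true because the lock is dropped for the duration of the request.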

Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: New → Confirmed
importance: Undecided → Critical
assignee: nobody → Andy Whitcroft (apw)
Changed in linux (Ubuntu Bionic):
importance: Undecided → Critical
assignee: nobody → Andy Whitcroft (apw)
status: New → In Progress
Andy Whitcroft (apw)
description: updated
Andy Whitcroft (apw)
summary: - vfio_pci_release hotplug deadlock
+ device hotplug of vfio devices can lead to deadlock in vfio_pci_release
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Seth Forshee (sforshee)
Changed in linux (Ubuntu):
status: Confirmed → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-36.39

---------------
linux (4.15.0-36.39) bionic; urgency=medium

  * CVE-2018-14633
    - iscsi target: Use hex2bin instead of a re-implementation

  * CVE-2018-17182
    - mm: get rid of vmacache_flush_all() entirely

linux (4.15.0-35.38) bionic; urgency=medium

  * linux: 4.15.0-35.38 -proposed tracker (LP: #1791719)

  * device hotplug of vfio devices can lead to deadlock in vfio_pci_release
    (LP: #1792099)
    - SAUCE: vfio -- release device lock before userspace requests

  * L1TF mitigation not effective in some CPU and RAM combinations
    (LP: #1788563)
    - x86/speculation/l1tf: Fix overflow in l1tf_pfn_limit() on 32bit
    - x86/speculation/l1tf: Fix off-by-one error when warning that system has too
      much RAM
    - x86/speculation/l1tf: Increase l1tf memory limit for Nehalem+

  * CVE-2018-15594
    - x86/paravirt: Fix spectre-v2 mitigations for paravirt guests

  * CVE-2017-5715 (Spectre v2 s390x)
    - KVM: s390: implement CPU model only facilities
    - s390: detect etoken facility
    - KVM: s390: add etoken support for guests
    - s390/lib: use expoline for all bcr instructions
    - s390: fix br_r1_trampoline for machines without exrl
    - SAUCE: s390: use expoline thunks for all branches generated by the BPF JIT

  * Ubuntu18.04.1: cpuidle: powernv: Fix promotion from snooze if next state
    disabled (performance) (LP: #1790602)
    - cpuidle: powernv: Fix promotion from snooze if next state disabled

  * Watchdog CPU:19 Hard LOCKUP when kernel crash was triggered (LP: #1790636)
    - powerpc: hard disable irqs in smp_send_stop loop
    - powerpc: Fix deadlock with multiple calls to smp_send_stop
    - powerpc: smp_send_stop do not offline stopped CPUs
    - powerpc/powernv: Fix opal_event_shutdown() called with interrupts disabled

  * Security fix: check if IOMMU page is contained in the pinned physical page
    (LP: #1785675)
    - vfio/spapr: Use IOMMU pageshift rather than pagesize
    - KVM: PPC: Check if IOMMU page is contained in the pinned physical page

  * Missing Intel GPU pci-id's (LP: #1789924)
    - drm/i915/kbl: Add KBL GT2 sku
    - drm/i915/whl: Introducing Whiskey Lake platform
    - drm/i915/aml: Introducing Amber Lake platform
    - drm/i915/cfl: Add a new CFL PCI ID.

  * CVE-2018-15572
    - x86/speculation: Protect against userspace-userspace spectreRSB

  * Support Power Management for Thunderbolt Controller (LP: #1789358)
    - thunderbolt: Handle NULL boot ACL entries properly
    - thunderbolt: Notify userspace when boot_acl is changed
    - thunderbolt: Use 64-bit DMA mask if supported by the platform
    - thunderbolt: Do not unnecessarily call ICM get route
    - thunderbolt: No need to take tb->lock in domain suspend/complete
    - thunderbolt: Use correct ICM commands in system suspend
    - thunderbolt: Add support for runtime PM

  * random oopses on s390 systems using NVMe devices (LP: #1790480)
    - s390/pci: fix out of bounds access during irq setup

  * [Bionic] Spectre v4 mitigation (Speculative Store Bypass Disable) support
    for arm64 using SMC firmware call to set a hardware chicken bit
    (LP: #1787993) // CVE-2018...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
Andy Whitcroft (apw)
tags: added: kernel-fixup-verification-needed-bionic
removed: verification-needed-bionic
Andy Whitcroft (apw) wrote :

This bug was erroneously marked for verification in bionic; verification is not required and verification-needed-bionic is being removed.

tags: added: verification-done-bionic
Brad Figg (brad-figg)
tags: added: cscc