kdump fail due to an IRQ storm

Bug #1797990 reported by Guilherme G. Piccoli on 2018-10-15
This bug affects 1 person
Affects           Importance   Assigned to
linux (Ubuntu)    High         Guilherme G. Piccoli
  Trusty          High         Guilherme G. Piccoli
  Xenial          High         Guilherme G. Piccoli
  Bionic          High         Guilherme G. Piccoli
  Cosmic          High         Guilherme G. Piccoli
Nominated for Disco by Guilherme G. Piccoli

Bug Description

[Impact]

 * A kexec/crash kernel might get stuck and fail to boot
   (for crash kernel, kdump fails to collect a crashdump)
   if a PCI device is buggy/stuck/looping and triggers a
   continuous flood of MSI(X) interrupts (that the kernel
   does not yet know about).

 * This fix made it possible to obtain crashdumps when
   debugging a heavy-load scenario, in which a heavily
   loaded network adapter kept triggering MSI-X interrupts
   even after panic()->kdump kicked in.

 * This fix disables MSI(X) in all PCI devices on early
   boot (this is OK as it's (re-)enabled normally later)
   with a kernel cmdline parameter (disabled by default).

[Test Case]

 * A synthetic test-case is not yet available, however,
   this particular system/workload triggered the problem
   consistently, and it was used for development/testing.

 * We'll update this bug once a synthetic test-case is
   available; we're working on patching QEMU for this.

 * $ cat /proc/cmdline
   <...> pci=clearmsi

   $ dmesg | grep 'Clearing MSI'
   [ 0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)

 * The comparison of 'dmesg -t | sort' has been reviewed
   between option disabled/enabled on boot & kexec modes,
   and only expected differences found (MHz, PIDs, MIPS).
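
   The /proc/cmdline check above can also be done programmatically. The following is a minimal sketch, assuming the option is passed as one entry of the comma-separated "pci=" list; has_pci_clearmsi() is a hypothetical helper written for illustration and is not part of the patch set.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the command line contains a "pci=" parameter whose
 * comma-separated option list includes "clearmsi".
 * Hypothetical helper for illustration only.
 */
bool has_pci_clearmsi(const char *cmdline)
{
    assert(cmdline != NULL);

    const char *p = cmdline;
    while (*p != '\0') {
        /* Skip whitespace between parameters. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;

        /* Find the end of the current parameter. */
        const char *end = p;
        while (*end != '\0' && *end != ' ' && *end != '\t' && *end != '\n')
            end++;

        if ((size_t)(end - p) > 4 && strncmp(p, "pci=", 4) == 0) {
            /* Walk the comma-separated options after "pci=". */
            const char *opt = p + 4;
            while (opt < end) {
                const char *comma = opt;
                while (comma < end && *comma != ',')
                    comma++;
                if ((size_t)(comma - opt) == strlen("clearmsi") &&
                    strncmp(opt, "clearmsi", (size_t)(comma - opt)) == 0)
                    return true;
                opt = (comma < end) ? comma + 1 : end;
            }
        }
        p = end;
    }
    return false;
}
```

   Feeding it the contents of /proc/cmdline should return true on a system booted with pci=clearmsi; the dmesg check is still needed to confirm that the quirk actually ran.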

[Regression Potential]

 * The potential area for regressions is early boot,
   particularly effects of applying quirks during PCI
   bus scan, which is changed/broader w/ these patches.

 * However, all quirks are applied based on PCI ID
   matching, so would only apply if actually targeting
   a new device.

 * Moreover, the new quirk is only applied based on
   a kernel cmdline parameter that is disabled by
   default, which further constrains when this is
   actually in effect.

[Other Info]

 * The patch series is still under review/discussion
   upstream, but it's relatively important for Ubuntu
   users at this point, and after internal discussions
   we decided to submit it for SRU.

 * These are links to the linux-pci archive with the
   patches [1, 2, 3]

   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://<email address hidden>/

   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://<email address hidden>/

   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://<email address hidden>/

[Original Description]

We have reports of a kdump failure in Ubuntu (on an x86 machine) that was narrowed down to an MSI IRQ storm coming from a PCI network device.

The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:

[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
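
To get a feel for the scale of the storm, the excerpt above allows a rough estimate. This is only a back-of-the-envelope sketch: it assumes the suppressed count accumulated over the interval between the last two timestamps shown, and suppressed_per_second() is a hypothetical helper.

```c
#include <assert.h>

/*
 * Average number of suppressed messages per second between two dmesg
 * timestamps (in seconds). Hypothetical helper for illustration only.
 */
double suppressed_per_second(unsigned long suppressed,
                             double t_start, double t_end)
{
    assert(t_end > t_start);
    return (double)suppressed / (t_end - t_start);
}
```

With the values from the log, suppressed_per_second(14053260UL, 342.266916, 347.258422) works out to roughly 2.8 million unhandled interrupts per second, all landing on the single CPU available to the kdump kernel.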

The root cause of the issue is that the kexec process that starts the kdump kernel does not ensure that PCI devices are reset and/or that their MSI capabilities are disabled, so a PCI device can produce a huge number of interrupts, which take up all the processing time of the CPU (especially since we restrict the kdump kernel to a single CPU).

This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.

Changed in linux (Ubuntu Bionic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Xenial):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Trusty):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
Changed in linux (Ubuntu Xenial):
importance: Undecided → High
Changed in linux (Ubuntu Trusty):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
status: New → Confirmed
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
Changed in linux (Ubuntu Trusty):
status: New → Confirmed
Guilherme G. Piccoli (gpiccoli) wrote :

During the investigation, we noticed that the PCI specification requires the MSI/MSI-X capability to be disabled at system boot/reset; from PCI Local Bus specification 3.0, sections 6.8.1.3 and 6.8.2.3: "[...] MSI Enable: This bit’s state after reset is 0 (MSI is disabled)."

The PCI layer in the Linux kernel ensures this bit is 0 during its initialization [0], but in our case that is too late, given that we had an IRQ storm during the early stages of the kdump kernel boot process.

The idea to resolve the issue was then to disable MSI/MSI-X early in boot, using the early-quirks infrastructure in arch/x86, which proved to be a successful approach.
Patches will be attached here soon.

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/probe.c?h=v4.18#n1511
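
For readers unfamiliar with the registers involved, here is a minimal sketch of what "clearing the MSI/MSI-X enable bits" means at the configuration-space level. It operates on an in-memory copy of a device's 256-byte configuration space rather than real hardware, and it is not the code from the patches; clear_msi_enable_bits() and the cfg_* helpers are hypothetical names. The register offsets and bit positions are the standard ones (capability IDs 0x05 for MSI and 0x11 for MSI-X, enable bit 0 and bit 15 of Message Control respectively).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_STATUS           0x06   /* Status register */
#define PCI_STATUS_CAP_LIST  0x10   /* device has a capability list */
#define PCI_CAPABILITY_LIST  0x34   /* pointer to first capability */
#define PCI_CAP_ID_MSI       0x05
#define PCI_CAP_ID_MSIX      0x11
#define MSI_FLAGS_ENABLE     0x0001 /* Message Control bit 0 */
#define MSIX_FLAGS_ENABLE    0x8000 /* Message Control bit 15 */

/* Little-endian 16-bit accessors on a simulated config space. */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void cfg_write16(uint8_t *cfg, unsigned off, uint16_t val)
{
    cfg[off] = (uint8_t)(val & 0xff);
    cfg[off + 1] = (uint8_t)(val >> 8);
}

/*
 * Walk the capability list of a 256-byte config space image and clear
 * the MSI and MSI-X enable bits. Returns how many capabilities had
 * their enable bit cleared. Hypothetical helper for illustration only.
 */
int clear_msi_enable_bits(uint8_t *cfg)
{
    assert(cfg != NULL);

    int cleared = 0;
    int ttl = 48; /* bound the walk in case the list is malformed */

    if (!(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    uint8_t pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && ttl-- > 0) {
        uint8_t id = cfg[pos];
        uint8_t next = cfg[pos + 1] & 0xfc;
        uint16_t mask = 0;

        if (id == 0xff)
            break;
        if (id == PCI_CAP_ID_MSI)
            mask = MSI_FLAGS_ENABLE;
        else if (id == PCI_CAP_ID_MSIX)
            mask = MSIX_FLAGS_ENABLE;

        if (mask) {
            /* Message Control sits 2 bytes into both capabilities. */
            uint16_t ctrl = cfg_read16(cfg, pos + 2);
            if (ctrl & mask) {
                cfg_write16(cfg, pos + 2, (uint16_t)(ctrl & ~mask));
                cleared++;
            }
        }
        pos = next;
    }
    return cleared;
}
```

Once the enable bit is cleared, the device is no longer allowed to signal MSI/MSI-X interrupts, which is what stops the storm before the kdump kernel has handlers installed. The regular PCI initialization re-enables MSI/MSI-X later for drivers that request it.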

Guilherme G. Piccoli (gpiccoli) wrote :

One problem faced during this approach was that the early-quirks code in x86 performs a recursive search in the PCI bus descending from the "first" bus 0000:00, and walking through all secondary busses by jumping between bridges. For a historical perspective on this code's evolution, see [0].

This is not enough in multi-processor systems, which may have multiple PCIe root complexes, exposing many root ports and so describing multiple hierarchy domains. The PCIe spec doesn't even guarantee that those hierarchies are capable of communicating; from PCIe spec 3.0, section 1.3.1: "[...] The capability to route peer-to-peer transactions between hierarchy domains through a Root Complex is optional and implementation dependent. For example, an implementation may incorporate a real or virtual Switch internally within the Root Complex to enable full peer-to-peer support in a software transparent way."

In practice, PCI devices under different host bridges (aka root complexes in PCIe terminology) are usually able to communicate with each other. But from a software perspective, what Linux sees are multiple PCI devices organized as a tree. The naive recursion from check_dev_quirk() in arch/x86 always starts from bus 0000:00, so it can't reach all root complexes.

To illustrate what this tree looks like with a single root bridge and with multiple root bridges, we'll attach the output of "lspci -t" for two systems next.
That said, we needed to change the bus scanning process to be comprehensive and walk through all buses. Good references on how the BIOS probes a multi-root-complex PCIe system (including the bus numbering rationale) are [1] and [2].

[0] The early PCI scan dates back to BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64" in October 2003. It initially restricted the search to the first 32 busses and slots. Due to a potential bug found in Nvidia chipsets, the scan was changed to run only on the first root bus: see commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks").
Finally, scanning of secondary busses reachable from the first bus was re-added by commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses").

[1] https://codywu2010.wordpress.com/2015/11/29/how-modern-multi-processor-multi-root-complex-system-assigns-pci-bus-number/

[2] PCI Firmware Specification and the ACPI spec.
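
The difference between the two scanning strategies can be shown with a toy model. This is only a sketch on a simulated topology, not the kernel's early-quirks code: struct sim_dev, scan_from_bus() and scan_all_buses() are hypothetical names, and the "secondary > bus" check is just a simple guard that keeps the recursion in this model from looping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One simulated PCI device. Hypothetical type for illustration only. */
struct sim_dev {
    int bus;        /* bus number the device sits on */
    bool is_bridge; /* true if this is a PCI-to-PCI bridge */
    int secondary;  /* bus number behind the bridge, if is_bridge */
};

/*
 * Recursive scan: visit every device on 'bus' and descend through
 * bridges into their secondary buses. Returns the number of devices
 * visited. Started at bus 0, it only sees one hierarchy.
 */
int scan_from_bus(const struct sim_dev *devs, size_t n, int bus)
{
    assert(devs != NULL);

    int visited = 0;
    for (size_t i = 0; i < n; i++) {
        if (devs[i].bus != bus)
            continue;
        visited++;
        /* Only descend to higher bus numbers so the walk terminates. */
        if (devs[i].is_bridge && devs[i].secondary > bus)
            visited += scan_from_bus(devs, n, devs[i].secondary);
    }
    return visited;
}

/*
 * Exhaustive scan: walk all 256 possible bus numbers, so devices under
 * every root complex are visited regardless of how they are connected.
 */
int scan_all_buses(const struct sim_dev *devs, size_t n)
{
    assert(devs != NULL);

    int visited = 0;
    for (int bus = 0; bus < 256; bus++)
        for (size_t i = 0; i < n; i++)
            if (devs[i].bus == bus)
                visited++;
    return visited;
}
```

For example, with two devices on bus 0x00 (one of them a bridge to bus 0x01), one device on bus 0x01, and a second root complex at bus 0x80 holding a bridge to bus 0x81 with one device behind it, scan_from_bus(devs, 5, 0) visits 3 devices while scan_all_buses(devs, 5) visits all 5. The two devices it misses are exactly the kind an early quirk would never be applied to when the scan starts only from bus 0000:00.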

description: updated
Guilherme G. Piccoli (gpiccoli) wrote :

Patches sent to the mailing lists today.

Guilherme G. Piccoli (gpiccoli) wrote :

Mailing list archive URL: https://marc.info/?l=linux-pci&m=153988799707413
(navigate using "next in list")

tags: added: patch
description: updated

Patch set v2 submitted to the kernel-team mailing list
for Xenial, Bionic, Cosmic, Disco.

[SRU X][PATCH v2 0/3] Add kernel parameter 'pci=clearmsi' to clear MSI(X)s early on boot
https://lists.ubuntu.com/archives/kernel-team/2018-November/096631.html

[SRU B][PATCH v2 0/3] Add kernel parameter 'pci=clearmsi' to clear MSI(X)s early on boot
https://lists.ubuntu.com/archives/kernel-team/2018-November/096635.html

[SRU C][PATCH v2 0/3] Add kernel parameter 'pci=clearmsi' to clear MSI(X)s early on boot
https://lists.ubuntu.com/archives/kernel-team/2018-November/096642.html

[D][PATCH v2 0/3] Add kernel parameter 'pci=clearmsi' to clear MSI(X)s early on boot
https://lists.ubuntu.com/archives/kernel-team/2018-November/096646.html

Attaching for documentation purposes,

Tarball with 'dmesg -t | sort' for boot/kexec & option disabled/enabled,
in Xenial, Bionic, Cosmic, Disco.

Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: Confirmed → Fix Committed
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-cosmic' to 'verification-done-cosmic'. If the problem still exists, change the tag 'verification-needed-cosmic' to 'verification-failed-cosmic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-cosmic
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic

Thanks Maurício for submitting the patches and taking care of the bug while I was out.

I've verified all the 3 releases (in fact, I've also verified Trusty HWE) with a similar
test as used by Maurício, "dmesg -t | sort" and the kernels are running fine.
During kdump, with the "pci=clearmsi" option, we can see the message:

"Clearing MSI/MSI-X enable bits early in boot (quirk)"
which shows that the quirk is working.

I'll attach the logs for documentation purposes.
Cheers,

Guilherme

Changed in linux (Ubuntu Trusty):
status: Confirmed → Won't Fix
tags: added: verification-done-bionic verification-done-cosmic verification-done-xenial
removed: verification-needed-bionic verification-needed-cosmic verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.18.0-12.13

---------------
linux (4.18.0-12.13) cosmic; urgency=medium

  * linux: 4.18.0-12.13 -proposed tracker (LP: #1802743)

  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  * CVE-2018-18955: nested user namespaces with more than five extents
    incorrectly grant privileges over inode (LP: #1801924) // CVE-2018-18955
    - userns: also map extents in the reverse map to kernel IDs

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

  * crash in ENA driver on removing an interface (LP: #1802341)
    - SAUCE: net: ena: fix crash during ena_remove()

  * Ubuntu 18.04.1 - [s390x] Kernel panic while stressing network bonding
    (LP: #1797367)
    - s390/qeth: reduce hard-coded access to ccw channels
    - s390/qeth: sanitize strings in debug messages

  * Add checksum offload and T...

Changed in linux (Ubuntu Cosmic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-42.45

---------------
linux (4.15.0-42.45) bionic; urgency=medium

  * linux: 4.15.0-42.45 -proposed tracker (LP: #1803592)

  * [FEAT] Guest-dedicated Crypto Adapters (LP: #1787405)
    - KVM: s390: reset crypto attributes for all vcpus
    - KVM: s390: vsie: simulate VCPU SIE entry/exit
    - KVM: s390: introduce and use KVM_REQ_VSIE_RESTART
    - KVM: s390: refactor crypto initialization
    - s390: vfio-ap: base implementation of VFIO AP device driver
    - s390: vfio-ap: register matrix device with VFIO mdev framework
    - s390: vfio-ap: sysfs interfaces to configure adapters
    - s390: vfio-ap: sysfs interfaces to configure domains
    - s390: vfio-ap: sysfs interfaces to configure control domains
    - s390: vfio-ap: sysfs interface to view matrix mdev matrix
    - KVM: s390: interface to clear CRYCB masks
    - s390: vfio-ap: implement mediated device open callback
    - s390: vfio-ap: implement VFIO_DEVICE_GET_INFO ioctl
    - s390: vfio-ap: zeroize the AP queues
    - s390: vfio-ap: implement VFIO_DEVICE_RESET ioctl
    - KVM: s390: Clear Crypto Control Block when using vSIE
    - KVM: s390: vsie: Do the CRYCB validation first
    - KVM: s390: vsie: Make use of CRYCB FORMAT2 clear
    - KVM: s390: vsie: Allow CRYCB FORMAT-2
    - KVM: s390: vsie: allow CRYCB FORMAT-1
    - KVM: s390: vsie: allow CRYCB FORMAT-0
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-1
    - KVM: s390: vsie: allow guest FORMAT-1 CRYCB on host FORMAT-2
    - KVM: s390: vsie: allow guest FORMAT-0 CRYCB on host FORMAT-2
    - KVM: s390: device attrs to enable/disable AP interpretation
    - KVM: s390: CPU model support for AP virtualization
    - s390: doc: detailed specifications for AP virtualization
    - KVM: s390: fix locking for crypto setting error path
    - KVM: s390: Tracing APCB changes
    - s390: vfio-ap: setup APCB mask using KVM dedicated function
    - s390/zcrypt: Add ZAPQ inline function.
    - s390/zcrypt: Review inline assembler constraints.
    - s390/zcrypt: Integrate ap_asm.h into include/asm/ap.h.
    - s390/zcrypt: fix ap_instructions_available() returncodes
    - s390/zcrypt: remove VLA usage from the AP bus
    - s390/zcrypt: Remove deprecated ioctls.
    - s390/zcrypt: Remove deprecated zcrypt proc interface.
    - s390/zcrypt: Support up to 256 crypto adapters.
    - [Config:] Enable CONFIG_S390_AP_IOMMU and set CONFIG_VFIO_AP to module.

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  * CVE-2018-18955: nested user namespaces with more than five extents
    incorrectly grant privileges over inode (LP: #1801924) // CVE-2018-18955
    - userns: also map extents in the reverse map to kernel IDs

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

 -- Thadeu Lima de Souza Cascardo <email address hidden> Thu, 15 Nov 2018 17:01:46 ...


Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-140.166

---------------
linux (4.4.0-140.166) xenial; urgency=medium

  * linux: 4.4.0-140.166 -proposed tracker (LP: #1802776)

  * Bypass of mount visibility through userns + mount propagation (LP: #1789161)
    - mount: Retest MNT_LOCKED in do_umount
    - mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts

  * kdump fail due to an IRQ storm (LP: #1797990)
    - SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
    - SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
    - SAUCE: x86/quirks: Scan all busses for early PCI quirks

  * crash in ENA driver on removing an interface (LP: #1802341)
    - SAUCE: net: ena: fix crash during ena_remove()

  * xenial guest on arm64 drops to busybox under openstack bionic-rocky
    (LP: #1797092)
    - [Config] CONFIG_PCI_ECAM=y
    - PCI: Provide common functions for ECAM mapping
    - PCI: generic, thunder: Use generic ECAM API
    - PCI, of: Move PCI I/O space management to PCI core code
    - PCI: Move ecam.h to linux/include/pci-ecam.h
    - PCI: Add parent device field to ECAM struct pci_config_window
    - PCI: Add pci_unmap_iospace() to unmap I/O resources
    - PCI/ACPI: Support I/O resources when parsing host bridge resources
    - [Config] CONFIG_ACPI_MCFG=y
    - PCI/ACPI: Add generic MCFG table handling
    - PCI: Refactor pci_bus_assign_domain_nr() for CONFIG_PCI_DOMAINS_GENERIC
    - PCI: Factor DT-specific pci_bus_find_domain_nr() code out
    - ARM64: PCI: Add acpi_pci_bus_find_domain_nr()
    - ARM64: PCI: ACPI support for legacy IRQs parsing and consolidation with DT
      code
    - ARM64: PCI: Support ACPI-based PCI host controller

  * [GLK/CLX] Enhanced IBRS (LP: #1786139)
    - x86/speculation: Remove SPECTRE_V2_IBRS in enum spectre_v2_mitigation
    - x86/speculation: Support Enhanced IBRS on future CPUs

  * Update ENA driver to version 2.0.1K (LP: #1798182)
    - net: ena: remove ndo_poll_controller
    - net: ena: fix warning in rmmod caused by double iounmap
    - net: ena: fix rare bug when failed restart/resume is followed by driver
      removal
    - net: ena: fix NULL dereference due to untimely napi initialization
    - net: ena: fix auto casting to boolean
    - net: ena: minor performance improvement
    - net: ena: complete host info to match latest ENA spec
    - net: ena: introduce Low Latency Queues data structures according to ENA spec
    - net: ena: add functions for handling Low Latency Queues in ena_com
    - net: ena: add functions for handling Low Latency Queues in ena_netdev
    - net: ena: use CSUM_CHECKED device indication to report skb's checksum status
    - net: ena: explicit casting and initialization, and clearer error handling
    - net: ena: limit refill Rx threshold to 256 to avoid latency issues
    - net: ena: change rx copybreak default to reduce kernel memory pressure
    - net: ena: remove redundant parameter in ena_com_admin_init()
    - net: ena: update driver version to 2.0.1
    - net: ena: fix indentations in ena_defs for better readability
    - net: ena: Fix Kconfig dependency on X86
    - net: ena: enable Low Latency Queues
    - net: ena: fix compilat...


Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released