Activity log for bug #1797990

Date Who What changed Old value New value Message
2018-10-15 22:35:04 Guilherme G. Piccoli bug added bug
2018-10-15 22:35:43 Guilherme G. Piccoli nominated for series Ubuntu Bionic
2018-10-15 22:35:43 Guilherme G. Piccoli nominated for series Ubuntu Trusty
2018-10-15 22:35:43 Guilherme G. Piccoli nominated for series Ubuntu Xenial
2018-10-15 22:35:43 Guilherme G. Piccoli nominated for series Ubuntu Dd-series
2018-10-15 22:35:43 Guilherme G. Piccoli nominated for series Ubuntu Cosmic
2018-10-16 18:39:20 Joseph Salisbury bug task added linux (Ubuntu Bionic)
2018-10-16 18:39:26 Joseph Salisbury bug task added linux (Ubuntu Cosmic)
2018-10-16 18:39:31 Joseph Salisbury bug task added linux (Ubuntu Trusty)
2018-10-16 18:39:37 Joseph Salisbury bug task added linux (Ubuntu Xenial)
2018-10-16 19:00:26 Guilherme G. Piccoli linux (Ubuntu Bionic): assignee Guilherme G. Piccoli (gpiccoli)
2018-10-16 19:00:28 Guilherme G. Piccoli linux (Ubuntu Xenial): assignee Guilherme G. Piccoli (gpiccoli)
2018-10-16 19:00:29 Guilherme G. Piccoli linux (Ubuntu Trusty): assignee Guilherme G. Piccoli (gpiccoli)
2018-10-16 19:00:34 Guilherme G. Piccoli linux (Ubuntu Bionic): importance Undecided High
2018-10-16 19:00:36 Guilherme G. Piccoli linux (Ubuntu Xenial): importance Undecided High
2018-10-16 19:00:38 Guilherme G. Piccoli linux (Ubuntu Trusty): importance Undecided High
2018-10-16 19:00:41 Guilherme G. Piccoli linux (Ubuntu Bionic): status New Confirmed
2018-10-16 19:00:43 Guilherme G. Piccoli linux (Ubuntu Xenial): status New Confirmed
2018-10-16 19:00:45 Guilherme G. Piccoli linux (Ubuntu Trusty): status New Confirmed
2018-10-16 19:37:58 Guilherme G. Piccoli attachment added lspci tree output of a single root bridge PCI topology https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5201885/+files/lspci_single_root.txt
2018-10-16 19:38:35 Guilherme G. Piccoli attachment added lspci tree output of a multi root bridge PCI topology https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5201886/+files/lspci_multi_root.txt
2018-10-16 19:49:21 Guilherme G. Piccoli description
Old value:
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
New value:
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
2018-10-16 19:55:19 Fabio Augusto Miranda Martins bug added subscriber Fabio Augusto Miranda Martins
2018-10-18 18:47:16 Guilherme G. Piccoli attachment added Patch 1: Scan all PCI busses https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5202664/+files/0001-x86-quirks-Scan-all-busses-for-early-PCI-quirks.patch
2018-10-18 18:48:07 Guilherme G. Piccoli attachment added Export pci capabilities function from AGP to early-PCI code https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5202666/+files/0002-x86-PCI-Export-find_cap-to-be-used-in-early-PCI-code.patch
2018-10-18 18:49:00 Guilherme G. Piccoli attachment added Parameter to enable quirk in early boot to disable MSIs on kexec https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5202667/+files/0003-x86-quirks-Add-parameter-to-clear-MSIs-early-on-boot.patch
2018-10-19 04:19:51 Ubuntu Foundations Team Bug Bot tags sts patch sts
2018-10-19 04:19:52 Ubuntu Foundations Team Bug Bot bug added subscriber Joseph Salisbury
2018-11-07 22:45:35 Mauricio Faria de Oliveira description
Old value:
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
New value:
[Impact]
 * A kexec/crash kernel might get stuck and fail to boot (for crash kernel, kdump fails to collect a crashdump) if a PCI device is buggy/stuck/looping and triggers a continuous flood of MSI(X) interrupts (that the kernel does not yet know about).
 * This fix allowed to obtain crashdumps when debugging a heavy-load scenario, in which a (heavy-loaded) network adapter wouldn't stop triggering MSI-X interrupts ever after panic()->kdump kicked in.
 * This fix disables MSI(X) in all PCI devices on early boot; this is OK as it's (re-)enabled normally later.
[Test Case]
 * A synthetic test-case is not yet available, however, this particular system/workload triggered the problem consistently, and it was used for development/testing.
 * We'll update this bug once a synthetic test-case is available; we're working on patching QEMU for this.
[Regression Potential]
 * The potential area for regressions is early boot, particularly effects of applying quirks during PCI bus scan, which is changed/broader w/ these patches.
 * However, all quirks are applied based on PCI ID matching, so would only apply if actually targeting a new device.
 * Moreover, the new quirk is only applied based on a kernel cmdline parameter that is disabled by default, which constraints even more when this is actually in effect.
[Other Info]
 * The patch series is still under review/discussion upstream, but it's relatively important for Ubuntu users at this point, and after internal discussions we decided to submit it for SRU.
 * These are links to the linux-pci archive with the patches [1, 2, 3]
   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
2018-11-07 22:55:19 Mauricio Faria de Oliveira description
Old value:
[Impact]
 * A kexec/crash kernel might get stuck and fail to boot (for crash kernel, kdump fails to collect a crashdump) if a PCI device is buggy/stuck/looping and triggers a continuous flood of MSI(X) interrupts (that the kernel does not yet know about).
 * This fix allowed to obtain crashdumps when debugging a heavy-load scenario, in which a (heavy-loaded) network adapter wouldn't stop triggering MSI-X interrupts ever after panic()->kdump kicked in.
 * This fix disables MSI(X) in all PCI devices on early boot; this is OK as it's (re-)enabled normally later.
[Test Case]
 * A synthetic test-case is not yet available, however, this particular system/workload triggered the problem consistently, and it was used for development/testing.
 * We'll update this bug once a synthetic test-case is available; we're working on patching QEMU for this.
[Regression Potential]
 * The potential area for regressions is early boot, particularly effects of applying quirks during PCI bus scan, which is changed/broader w/ these patches.
 * However, all quirks are applied based on PCI ID matching, so would only apply if actually targeting a new device.
 * Moreover, the new quirk is only applied based on a kernel cmdline parameter that is disabled by default, which constraints even more when this is actually in effect.
[Other Info]
 * The patch series is still under review/discussion upstream, but it's relatively important for Ubuntu users at this point, and after internal discussions we decided to submit it for SRU.
 * These are links to the linux-pci archive with the patches [1, 2, 3]
   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
New value:
[Impact]
 * A kexec/crash kernel might get stuck and fail to boot (for crash kernel, kdump fails to collect a crashdump) if a PCI device is buggy/stuck/looping and triggers a continuous flood of MSI(X) interrupts (that the kernel does not yet know about).
 * This fix allowed to obtain crashdumps when debugging a heavy-load scenario, in which a (heavy-loaded) network adapter wouldn't stop triggering MSI-X interrupts ever after panic()->kdump kicked in.
 * This fix disables MSI(X) in all PCI devices on early boot (this is OK as it's (re-)enabled normally later) with a kernel cmdline parameter (disabled by default).
[Test Case]
 * A synthetic test-case is not yet available, however, this particular system/workload triggered the problem consistently, and it was used for development/testing.
 * We'll update this bug once a synthetic test-case is available; we're working on patching QEMU for this.
[Regression Potential]
 * The potential area for regressions is early boot, particularly effects of applying quirks during PCI bus scan, which is changed/broader w/ these patches.
 * However, all quirks are applied based on PCI ID matching, so would only apply if actually targeting a new device.
 * Moreover, the new quirk is only applied based on a kernel cmdline parameter that is disabled by default, which constraints even more when this is actually in effect.
[Other Info]
 * The patch series is still under review/discussion upstream, but it's relatively important for Ubuntu users at this point, and after internal discussions we decided to submit it for SRU.
 * These are links to the linux-pci archive with the patches [1, 2, 3]
   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
2018-11-08 01:22:09 Mauricio Faria de Oliveira description
Old value:
[Impact]
 * A kexec/crash kernel might get stuck and fail to boot (for crash kernel, kdump fails to collect a crashdump) if a PCI device is buggy/stuck/looping and triggers a continuous flood of MSI(X) interrupts (that the kernel does not yet know about).
 * This fix allowed to obtain crashdumps when debugging a heavy-load scenario, in which a (heavy-loaded) network adapter wouldn't stop triggering MSI-X interrupts ever after panic()->kdump kicked in.
 * This fix disables MSI(X) in all PCI devices on early boot (this is OK as it's (re-)enabled normally later) with a kernel cmdline parameter (disabled by default).
[Test Case]
 * A synthetic test-case is not yet available, however, this particular system/workload triggered the problem consistently, and it was used for development/testing.
 * We'll update this bug once a synthetic test-case is available; we're working on patching QEMU for this.
[Regression Potential]
 * The potential area for regressions is early boot, particularly effects of applying quirks during PCI bus scan, which is changed/broader w/ these patches.
 * However, all quirks are applied based on PCI ID matching, so would only apply if actually targeting a new device.
 * Moreover, the new quirk is only applied based on a kernel cmdline parameter that is disabled by default, which constraints even more when this is actually in effect.
[Other Info]
 * The patch series is still under review/discussion upstream, but it's relatively important for Ubuntu users at this point, and after internal discussions we decided to submit it for SRU.
 * These are links to the linux-pci archive with the patches [1, 2, 3]
   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
New value:
[Impact]
 * A kexec/crash kernel might get stuck and fail to boot (for crash kernel, kdump fails to collect a crashdump) if a PCI device is buggy/stuck/looping and triggers a continuous flood of MSI(X) interrupts (that the kernel does not yet know about).
 * This fix allowed to obtain crashdumps when debugging a heavy-load scenario, in which a (heavy-loaded) network adapter wouldn't stop triggering MSI-X interrupts ever after panic()->kdump kicked in.
 * This fix disables MSI(X) in all PCI devices on early boot (this is OK as it's (re-)enabled normally later) with a kernel cmdline parameter (disabled by default).
[Test Case]
 * A synthetic test-case is not yet available, however, this particular system/workload triggered the problem consistently, and it was used for development/testing.
 * We'll update this bug once a synthetic test-case is available; we're working on patching QEMU for this.
 * $ dmesg | grep 'Clearing MSI'
   [ 0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)
 * The comparison of 'dmesg -t | sort' has been reviewed between option disabled/enabled on boot & kexec modes, and only expected differences found (MHz, PIDs, MIPS).
[Regression Potential]
 * The potential area for regressions is early boot, particularly effects of applying quirks during PCI bus scan, which is changed/broader w/ these patches.
 * However, all quirks are applied based on PCI ID matching, so would only apply if actually targeting a new device.
 * Moreover, the new quirk is only applied based on a kernel cmdline parameter that is disabled by default, which constraints even more when this is actually in effect.
[Other Info]
 * The patch series is still under review/discussion upstream, but it's relatively important for Ubuntu users at this point, and after internal discussions we decided to submit it for SRU.
 * These are links to the linux-pci archive with the patches [1, 2, 3]
   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
2018-11-08 16:29:51 Mauricio Faria de Oliveira description
Old value:
[Impact]
 * A kexec/crash kernel might get stuck and fail to boot (for crash kernel, kdump fails to collect a crashdump) if a PCI device is buggy/stuck/looping and triggers a continuous flood of MSI(X) interrupts (that the kernel does not yet know about).
 * This fix allowed to obtain crashdumps when debugging a heavy-load scenario, in which a (heavy-loaded) network adapter wouldn't stop triggering MSI-X interrupts ever after panic()->kdump kicked in.
 * This fix disables MSI(X) in all PCI devices on early boot (this is OK as it's (re-)enabled normally later) with a kernel cmdline parameter (disabled by default).
[Test Case]
 * A synthetic test-case is not yet available, however, this particular system/workload triggered the problem consistently, and it was used for development/testing.
 * We'll update this bug once a synthetic test-case is available; we're working on patching QEMU for this.
 * $ dmesg | grep 'Clearing MSI'
   [ 0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)
 * The comparison of 'dmesg -t | sort' has been reviewed between option disabled/enabled on boot & kexec modes, and only expected differences found (MHz, PIDs, MIPS).
[Regression Potential]
 * The potential area for regressions is early boot, particularly effects of applying quirks during PCI bus scan, which is changed/broader w/ these patches.
 * However, all quirks are applied based on PCI ID matching, so would only apply if actually targeting a new device.
 * Moreover, the new quirk is only applied based on a kernel cmdline parameter that is disabled by default, which constraints even more when this is actually in effect.
[Other Info]
 * The patch series is still under review/discussion upstream, but it's relatively important for Ubuntu users at this point, and after internal discussions we decided to submit it for SRU.
 * These are links to the linux-pci archive with the patches [1, 2, 3]
   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
New value:
[Impact]
 * A kexec/crash kernel might get stuck and fail to boot (for crash kernel, kdump fails to collect a crashdump) if a PCI device is buggy/stuck/looping and triggers a continuous flood of MSI(X) interrupts (that the kernel does not yet know about).
 * This fix allowed to obtain crashdumps when debugging a heavy-load scenario, in which a (heavy-loaded) network adapter wouldn't stop triggering MSI-X interrupts ever after panic()->kdump kicked in.
 * This fix disables MSI(X) in all PCI devices on early boot (this is OK as it's (re-)enabled normally later) with a kernel cmdline parameter (disabled by default).
[Test Case]
 * A synthetic test-case is not yet available, however, this particular system/workload triggered the problem consistently, and it was used for development/testing.
 * We'll update this bug once a synthetic test-case is available; we're working on patching QEMU for this.
 * $ cat /proc/cmdline
   <...> pci=clearmsi
   $ dmesg | grep 'Clearing MSI'
   [ 0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)
 * The comparison of 'dmesg -t | sort' has been reviewed between option disabled/enabled on boot & kexec modes, and only expected differences found (MHz, PIDs, MIPS).
[Regression Potential]
 * The potential area for regressions is early boot, particularly effects of applying quirks during PCI bus scan, which is changed/broader w/ these patches.
 * However, all quirks are applied based on PCI ID matching, so would only apply if actually targeting a new device.
 * Moreover, the new quirk is only applied based on a kernel cmdline parameter that is disabled by default, which constraints even more when this is actually in effect.
[Other Info]
 * The patch series is still under review/discussion upstream, but it's relatively important for Ubuntu users at this point, and after internal discussions we decided to submit it for SRU.
 * These are links to the linux-pci archive with the patches [1, 2, 3]
   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device. The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest.
2018-11-08 17:15:56 Mauricio Faria de Oliveira attachment added sf202166.dmesg.tar.xz https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5210405/+files/sf202166.dmesg.tar.xz
2018-11-12 05:05:02 Khaled El Mously linux (Ubuntu Xenial): status Confirmed Fix Committed
2018-11-12 05:05:05 Khaled El Mously linux (Ubuntu Bionic): status Confirmed Fix Committed
2018-11-12 05:05:09 Khaled El Mously linux (Ubuntu Cosmic): status Confirmed Fix Committed
2018-11-15 11:03:55 Brad Figg tags patch sts patch sts verification-needed-cosmic
2018-11-16 16:36:25 Brad Figg tags patch sts verification-needed-cosmic patch sts verification-needed-cosmic verification-needed-xenial
2018-11-16 18:15:10 Brad Figg tags patch sts verification-needed-cosmic verification-needed-xenial patch sts verification-needed-bionic verification-needed-cosmic verification-needed-xenial
2018-11-22 12:50:34 Mauricio Faria de Oliveira bug added subscriber Mauricio Faria de Oliveira
2018-11-23 19:04:50 Guilherme G. Piccoli linux (Ubuntu Trusty): status Confirmed Won't Fix
2018-11-23 19:05:45 Guilherme G. Piccoli tags patch sts verification-needed-bionic verification-needed-cosmic verification-needed-xenial patch sts verification-done-bionic verification-done-cosmic verification-done-xenial
2018-11-23 19:06:34 Guilherme G. Piccoli attachment added lp1797990_verification.tgz https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5215743/+files/lp1797990_verification.tgz
2018-12-03 08:49:32 Launchpad Janitor linux (Ubuntu Cosmic): status Fix Committed Fix Released
2018-12-03 08:49:32 Launchpad Janitor cve linked 2018-18653
2018-12-03 08:49:32 Launchpad Janitor cve linked 2018-18955
2018-12-03 08:49:32 Launchpad Janitor cve linked 2018-6559
2018-12-03 14:01:15 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2018-12-03 14:59:47 Launchpad Janitor linux (Ubuntu Xenial): status Fix Committed Fix Released
2019-01-17 13:16:31 Dan Streetman bug task added linux (Ubuntu Disco)
2019-01-17 13:16:51 Dan Streetman linux (Ubuntu Disco): status Confirmed Fix Released
2019-01-17 18:00:39 Joseph Salisbury removed subscriber Joseph Salisbury
2019-07-24 20:56:28 Brad Figg tags patch sts verification-done-bionic verification-done-cosmic verification-done-xenial cscc patch sts verification-done-bionic verification-done-cosmic verification-done-xenial
2021-07-01 22:31:33 Dexuan Cui bug added subscriber Dexuan Cui