2018-10-15 22:35:04 |
Guilherme G. Piccoli |
bug |
|
|
added bug |
2018-10-15 22:35:43 |
Guilherme G. Piccoli |
nominated for series |
|
Ubuntu Bionic |
|
2018-10-15 22:35:43 |
Guilherme G. Piccoli |
nominated for series |
|
Ubuntu Trusty |
|
2018-10-15 22:35:43 |
Guilherme G. Piccoli |
nominated for series |
|
Ubuntu Xenial |
|
2018-10-15 22:35:43 |
Guilherme G. Piccoli |
nominated for series |
|
Ubuntu Dd-series |
|
2018-10-15 22:35:43 |
Guilherme G. Piccoli |
nominated for series |
|
Ubuntu Cosmic |
|
2018-10-16 18:39:20 |
Joseph Salisbury |
bug task added |
|
linux (Ubuntu Bionic) |
|
2018-10-16 18:39:26 |
Joseph Salisbury |
bug task added |
|
linux (Ubuntu Cosmic) |
|
2018-10-16 18:39:31 |
Joseph Salisbury |
bug task added |
|
linux (Ubuntu Trusty) |
|
2018-10-16 18:39:37 |
Joseph Salisbury |
bug task added |
|
linux (Ubuntu Xenial) |
|
2018-10-16 19:00:26 |
Guilherme G. Piccoli |
linux (Ubuntu Bionic): assignee |
|
Guilherme G. Piccoli (gpiccoli) |
|
2018-10-16 19:00:28 |
Guilherme G. Piccoli |
linux (Ubuntu Xenial): assignee |
|
Guilherme G. Piccoli (gpiccoli) |
|
2018-10-16 19:00:29 |
Guilherme G. Piccoli |
linux (Ubuntu Trusty): assignee |
|
Guilherme G. Piccoli (gpiccoli) |
|
2018-10-16 19:00:34 |
Guilherme G. Piccoli |
linux (Ubuntu Bionic): importance |
Undecided |
High |
|
2018-10-16 19:00:36 |
Guilherme G. Piccoli |
linux (Ubuntu Xenial): importance |
Undecided |
High |
|
2018-10-16 19:00:38 |
Guilherme G. Piccoli |
linux (Ubuntu Trusty): importance |
Undecided |
High |
|
2018-10-16 19:00:41 |
Guilherme G. Piccoli |
linux (Ubuntu Bionic): status |
New |
Confirmed |
|
2018-10-16 19:00:43 |
Guilherme G. Piccoli |
linux (Ubuntu Xenial): status |
New |
Confirmed |
|
2018-10-16 19:00:45 |
Guilherme G. Piccoli |
linux (Ubuntu Trusty): status |
New |
Confirmed |
|
2018-10-16 19:37:58 |
Guilherme G. Piccoli |
attachment added |
|
lspci tree output of a single root bridge PCI topology https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5201885/+files/lspci_single_root.txt |
|
2018-10-16 19:38:35 |
Guilherme G. Piccoli |
attachment added |
|
lspci tree output of a multi root bridge PCI topology https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5201886/+files/lspci_multi_root.txt |
|
2018-10-16 19:49:21 |
Guilherme G. Piccoli |
description |
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
|
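The symptom quoted in the description above can be measured with a small filter. The following is a minimal sketch, not part of the bug report: the function name `count_unhandled_irqs` is hypothetical, and it assumes kernel log lines shaped like the three samples quoted in the description.

```shell
#!/bin/sh
# count_unhandled_irqs (hypothetical helper): read kernel log text on stdin
# and print two numbers: how many "No irq handler for vector" lines were
# seen, and the sum of the counts reported as "callbacks suppressed".
count_unhandled_irqs() {
    awk '
        /do_IRQ: .*No irq handler for vector/ { seen++ }
        /do_IRQ: [0-9]+ callbacks suppressed/ {
            # The suppressed count is the field right after "do_IRQ:".
            for (i = 1; i <= NF; i++)
                if ($i == "do_IRQ:") suppressed += $(i + 1)
        }
        END { printf "%d %d\n", seen, suppressed }
    '
}
```

Typical use would be `dmesg | count_unhandled_irqs` on the console output of the stuck kdump kernel; a large second number suggests an interrupt storm like the one reported here.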
2018-10-16 19:55:19 |
Fabio Augusto Miranda Martins |
bug |
|
|
added subscriber Fabio Augusto Miranda Martins |
2018-10-18 18:47:16 |
Guilherme G. Piccoli |
attachment added |
|
Patch 1: Scan all PCI busses https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5202664/+files/0001-x86-quirks-Scan-all-busses-for-early-PCI-quirks.patch |
|
2018-10-18 18:48:07 |
Guilherme G. Piccoli |
attachment added |
|
Export pci capabilities function from AGP to early-PCI code https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5202666/+files/0002-x86-PCI-Export-find_cap-to-be-used-in-early-PCI-code.patch |
|
2018-10-18 18:49:00 |
Guilherme G. Piccoli |
attachment added |
|
Parameter to enable quirk in early boot to disable MSIs on kexec https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5202667/+files/0003-x86-quirks-Add-parameter-to-clear-MSIs-early-on-boot.patch |
|
2018-10-19 04:19:51 |
Ubuntu Foundations Team Bug Bot |
tags |
sts |
patch sts |
|
2018-10-19 04:19:52 |
Ubuntu Foundations Team Bug Bot |
bug |
|
|
added subscriber Joseph Salisbury |
2018-11-07 22:45:35 |
Mauricio Faria de Oliveira |
description |
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
[Impact]
* A kexec/crash kernel might get stuck and fail to boot
(for crash kernel, kdump fails to collect a crashdump)
if a PCI device is buggy/stuck/looping and triggers a
continuous flood of MSI(X) interrupts (that the kernel
does not yet know about).
* This fix allowed to obtain crashdumps when debugging a
heavy-load scenario, in which a (heavy-loaded) network
adapter wouldn't stop triggering MSI-X interrupts ever
after panic()->kdump kicked in.
* This fix disables MSI(X) in all PCI devices on early
boot; this is OK as it's (re-)enabled normally later.
[Test Case]
* A synthetic test-case is not yet available, however,
this particular system/workload triggered the problem
consistently, and it was used for development/testing.
* We'll update this bug once a synthetic test-case is
available; we're working on patching QEMU for this.
[Regression Potential]
* The potential area for regressions is early boot,
particularly effects of applying quirks during PCI
bus scan, which is changed/broader w/ these patches.
* However, all quirks are applied based on PCI ID
matching, so would only apply if actually targeting
a new device.
* Moreover, the new quirk is only applied based on
a kernel cmdline parameter that is disabled by
default, which constraints even more when this
is actually in effect.
[Other Info]
* The patch series is still under review/discussion
upstream, but it's relatively important for Ubuntu
users at this point, and after internal discussions
we decided to submit it for SRU.
* These are links to the linux-pci archive with the
patches [1, 2, 3]
[1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
[2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
[3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
|
2018-11-07 22:55:19 |
Mauricio Faria de Oliveira |
description |
[Impact]
* A kexec/crash kernel might get stuck and fail to boot
(for crash kernel, kdump fails to collect a crashdump)
if a PCI device is buggy/stuck/looping and triggers a
continuous flood of MSI(X) interrupts (that the kernel
does not yet know about).
* This fix allowed to obtain crashdumps when debugging a
heavy-load scenario, in which a (heavy-loaded) network
adapter wouldn't stop triggering MSI-X interrupts ever
after panic()->kdump kicked in.
* This fix disables MSI(X) in all PCI devices on early
boot; this is OK as it's (re-)enabled normally later.
[Test Case]
* A synthetic test-case is not yet available, however,
this particular system/workload triggered the problem
consistently, and it was used for development/testing.
* We'll update this bug once a synthetic test-case is
available; we're working on patching QEMU for this.
[Regression Potential]
* The potential area for regressions is early boot,
particularly effects of applying quirks during PCI
bus scan, which is changed/broader w/ these patches.
* However, all quirks are applied based on PCI ID
matching, so would only apply if actually targeting
a new device.
* Moreover, the new quirk is only applied based on
a kernel cmdline parameter that is disabled by
default, which constraints even more when this
is actually in effect.
[Other Info]
* The patch series is still under review/discussion
upstream, but it's relatively important for Ubuntu
users at this point, and after internal discussions
we decided to submit it for SRU.
* These are links to the linux-pci archive with the
patches [1, 2, 3]
[1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
[2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
[3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
[Impact]
* A kexec/crash kernel might get stuck and fail to boot
(for crash kernel, kdump fails to collect a crashdump)
if a PCI device is buggy/stuck/looping and triggers a
continuous flood of MSI(X) interrupts (that the kernel
does not yet know about).
* This fix allowed to obtain crashdumps when debugging a
heavy-load scenario, in which a (heavy-loaded) network
adapter wouldn't stop triggering MSI-X interrupts ever
after panic()->kdump kicked in.
* This fix disables MSI(X) in all PCI devices on early
boot (this is OK as it's (re-)enabled normally later)
with a kernel cmdline parameter (disabled by default).
[Test Case]
* A synthetic test-case is not yet available, however,
this particular system/workload triggered the problem
consistently, and it was used for development/testing.
* We'll update this bug once a synthetic test-case is
available; we're working on patching QEMU for this.
[Regression Potential]
* The potential area for regressions is early boot,
particularly effects of applying quirks during PCI
bus scan, which is changed/broader w/ these patches.
* However, all quirks are applied based on PCI ID
matching, so would only apply if actually targeting
a new device.
* Moreover, the new quirk is only applied based on
a kernel cmdline parameter that is disabled by
default, which constraints even more when this
is actually in effect.
[Other Info]
* The patch series is still under review/discussion
upstream, but it's relatively important for Ubuntu
users at this point, and after internal discussions
we decided to submit it for SRU.
* These are links to the linux-pci archive with the
patches [1, 2, 3]
[1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
[2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
[3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
|
2018-11-08 01:22:09 |
Mauricio Faria de Oliveira |
description |
[Impact]
* A kexec/crash kernel might get stuck and fail to boot
(for crash kernel, kdump fails to collect a crashdump)
if a PCI device is buggy/stuck/looping and triggers a
continuous flood of MSI(X) interrupts (that the kernel
does not yet know about).
* This fix allowed to obtain crashdumps when debugging a
heavy-load scenario, in which a (heavy-loaded) network
adapter wouldn't stop triggering MSI-X interrupts ever
after panic()->kdump kicked in.
* This fix disables MSI(X) in all PCI devices on early
boot (this is OK as it's (re-)enabled normally later)
with a kernel cmdline parameter (disabled by default).
[Test Case]
* A synthetic test-case is not yet available, however,
this particular system/workload triggered the problem
consistently, and it was used for development/testing.
* We'll update this bug once a synthetic test-case is
available; we're working on patching QEMU for this.
[Regression Potential]
* The potential area for regressions is early boot,
particularly effects of applying quirks during PCI
bus scan, which is changed/broader w/ these patches.
* However, all quirks are applied based on PCI ID
matching, so would only apply if actually targeting
a new device.
* Moreover, the new quirk is only applied based on
a kernel cmdline parameter that is disabled by
default, which constraints even more when this
is actually in effect.
[Other Info]
* The patch series is still under review/discussion
upstream, but it's relatively important for Ubuntu
users at this point, and after internal discussions
we decided to submit it for SRU.
* These are links to the linux-pci archive with the
patches [1, 2, 3]
[1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
[2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
[3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
[Impact]
* A kexec/crash kernel might get stuck and fail to boot
(for crash kernel, kdump fails to collect a crashdump)
if a PCI device is buggy/stuck/looping and triggers a
continuous flood of MSI(X) interrupts (that the kernel
does not yet know about).
* This fix allowed to obtain crashdumps when debugging a
heavy-load scenario, in which a (heavy-loaded) network
adapter wouldn't stop triggering MSI-X interrupts ever
after panic()->kdump kicked in.
* This fix disables MSI(X) in all PCI devices on early
boot (this is OK as it's (re-)enabled normally later)
with a kernel cmdline parameter (disabled by default).
[Test Case]
* A synthetic test-case is not yet available, however,
this particular system/workload triggered the problem
consistently, and it was used for development/testing.
* We'll update this bug once a synthetic test-case is
available; we're working on patching QEMU for this.
* $ dmesg | grep 'Clearing MSI'
[ 0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)
* The comparison of 'dmesg -t | sort' has been reviewed
between option disabled/enabled on boot & kexec modes,
and only expected differences found (MHz, PIDs, MIPS).
[Regression Potential]
* The potential area for regressions is early boot,
particularly effects of applying quirks during PCI
bus scan, which is changed/broader w/ these patches.
* However, all quirks are applied based on PCI ID
matching, so would only apply if actually targeting
a new device.
* Moreover, the new quirk is only applied based on
a kernel cmdline parameter that is disabled by
default, which constraints even more when this
is actually in effect.
[Other Info]
* The patch series is still under review/discussion
upstream, but it's relatively important for Ubuntu
users at this point, and after internal discussions
we decided to submit it for SRU.
* These are links to the linux-pci archive with the
patches [1, 2, 3]
[1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
[2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
[3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
|
2018-11-08 16:29:51 |
Mauricio Faria de Oliveira |
description |
[Impact]
* A kexec/crash kernel might get stuck and fail to boot
(for crash kernel, kdump fails to collect a crashdump)
if a PCI device is buggy/stuck/looping and triggers a
continuous flood of MSI(X) interrupts (that the kernel
does not yet know about).
* This fix allowed to obtain crashdumps when debugging a
heavy-load scenario, in which a (heavy-loaded) network
adapter wouldn't stop triggering MSI-X interrupts ever
after panic()->kdump kicked in.
* This fix disables MSI(X) in all PCI devices on early
boot (this is OK as it's (re-)enabled normally later)
with a kernel cmdline parameter (disabled by default).
[Test Case]
* A synthetic test-case is not yet available, however,
this particular system/workload triggered the problem
consistently, and it was used for development/testing.
* We'll update this bug once a synthetic test-case is
available; we're working on patching QEMU for this.
* $ dmesg | grep 'Clearing MSI'
[ 0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)
* The comparison of 'dmesg -t | sort' has been reviewed
between option disabled/enabled on boot & kexec modes,
and only expected differences found (MHz, PIDs, MIPS).
[Regression Potential]
* The potential area for regressions is early boot,
particularly effects of applying quirks during PCI
bus scan, which is changed/broader w/ these patches.
* However, all quirks are applied based on PCI ID
matching, so would only apply if actually targeting
a new device.
* Moreover, the new quirk is only applied based on
a kernel cmdline parameter that is disabled by
default, which constraints even more when this
is actually in effect.
[Other Info]
* The patch series is still under review/discussion
upstream, but it's relatively important for Ubuntu
users at this point, and after internal discussions
we decided to submit it for SRU.
* These are links to the linux-pci archive with the
patches [1, 2, 3]
[1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
[2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
[3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
[Impact]
* A kexec/crash kernel might get stuck and fail to boot
(for crash kernel, kdump fails to collect a crashdump)
if a PCI device is buggy/stuck/looping and triggers a
continuous flood of MSI(X) interrupts (that the kernel
does not yet know about).
* This fix made it possible to obtain crashdumps when
debugging a heavy-load scenario, in which a (heavily
loaded) network adapter wouldn't stop triggering MSI-X
interrupts ever after panic()->kdump kicked in.
* This fix disables MSI(X) in all PCI devices on early
boot (this is OK as it's (re-)enabled normally later)
with a kernel cmdline parameter (disabled by default).
[Test Case]
* A synthetic test-case is not yet available, however,
this particular system/workload triggered the problem
consistently, and it was used for development/testing.
* We'll update this bug once a synthetic test-case is
available; we're working on patching QEMU for this.
* $ cat /proc/cmdline
<...> pci=clearmsi
$ dmesg | grep 'Clearing MSI'
[ 0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)
* The comparison of 'dmesg -t | sort' has been reviewed
between option disabled/enabled on boot & kexec modes,
and only expected differences found (MHz, PIDs, MIPS).
[Regression Potential]
* The potential area for regressions is early boot,
particularly effects of applying quirks during PCI
bus scan, which is changed/broader w/ these patches.
* However, all quirks are applied based on PCI ID
matching, so would only apply if actually targeting
a new device.
* Moreover, the new quirk is only applied based on
a kernel cmdline parameter that is disabled by
default, which constrains even more when this
is actually in effect.
[Other Info]
* The patch series is still under review/discussion
upstream, but it's relatively important for Ubuntu
users at this point, and after internal discussions
we decided to submit it for SRU.
* These are links to the linux-pci archive with the
patches [1, 2, 3]
[1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/
[2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/
[3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/
[Original Description]
We have reports of a kdump failure in Ubuntu (in x86 machine) that was narrowed down to a MSI irq storm coming from a PCI network device.
The bug manifests as a lack of progress in the boot process of the kdump kernel, and a storm of kernel messages like:
[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]
The root cause of the issue is that the kdump kernel kexec process does not ensure PCI devices are reset and/or MSI capabilities are disabled, so a PCI device could produce a huge amount of PCI irqs which would take all the processing time for the CPU (specially since we restrict the kdump kernel to use one single CPU only).
This was tested using upstream kernel version 4.18, and the problem reproduces.
In the specific test scenario, the PCI NIC was an "Intel 82599ES 10-Gigabit [8086:10fb]" that was used in SR-IOV PCI passthrough mode (vfio_pci), under high load on the guest. |
|
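The check in the [Test Case] section above (confirm that `pci=clearmsi` is on the kernel command line, then look for the quirk message) can be scripted. The following is a minimal sketch, not part of the bug report: the function name `has_clearmsi` is hypothetical, and it assumes that `clearmsi` may also appear alongside other comma-separated `pci=` sub-options.

```shell
#!/bin/sh
# has_clearmsi CMDLINE (hypothetical helper): succeed if any pci= option on
# the given kernel command line contains the "clearmsi" sub-option.
has_clearmsi() {
    for word in $1; do
        case "$word" in
            pci=*)
                # pci= takes a comma-separated list of sub-options.
                opts=${word#pci=}
                old_ifs=$IFS
                IFS=,
                for opt in $opts; do
                    if [ "$opt" = "clearmsi" ]; then
                        IFS=$old_ifs
                        return 0
                    fi
                done
                IFS=$old_ifs
                ;;
        esac
    done
    return 1
}

# Example use on a running system:
#   if has_clearmsi "$(cat /proc/cmdline)"; then
#       dmesg | grep 'Clearing MSI'
#   fi
```

Passing the command line as an argument, rather than reading /proc/cmdline inside the function, keeps the check usable on saved logs from the kdump kernel as well as on the live system.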
2018-11-08 17:15:56 |
Mauricio Faria de Oliveira |
attachment added |
|
sf202166.dmesg.tar.xz https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5210405/+files/sf202166.dmesg.tar.xz |
|
2018-11-12 05:05:02 |
Khaled El Mously |
linux (Ubuntu Xenial): status |
Confirmed |
Fix Committed |
|
2018-11-12 05:05:05 |
Khaled El Mously |
linux (Ubuntu Bionic): status |
Confirmed |
Fix Committed |
|
2018-11-12 05:05:09 |
Khaled El Mously |
linux (Ubuntu Cosmic): status |
Confirmed |
Fix Committed |
|
2018-11-15 11:03:55 |
Brad Figg |
tags |
patch sts |
patch sts verification-needed-cosmic |
|
2018-11-16 16:36:25 |
Brad Figg |
tags |
patch sts verification-needed-cosmic |
patch sts verification-needed-cosmic verification-needed-xenial |
|
2018-11-16 18:15:10 |
Brad Figg |
tags |
patch sts verification-needed-cosmic verification-needed-xenial |
patch sts verification-needed-bionic verification-needed-cosmic verification-needed-xenial |
|
2018-11-22 12:50:34 |
Mauricio Faria de Oliveira |
bug |
|
|
added subscriber Mauricio Faria de Oliveira |
2018-11-23 19:04:50 |
Guilherme G. Piccoli |
linux (Ubuntu Trusty): status |
Confirmed |
Won't Fix |
|
2018-11-23 19:05:45 |
Guilherme G. Piccoli |
tags |
patch sts verification-needed-bionic verification-needed-cosmic verification-needed-xenial |
patch sts verification-done-bionic verification-done-cosmic verification-done-xenial |
|
2018-11-23 19:06:34 |
Guilherme G. Piccoli |
attachment added |
|
lp1797990_verification.tgz https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1797990/+attachment/5215743/+files/lp1797990_verification.tgz |
|
2018-12-03 08:49:32 |
Launchpad Janitor |
linux (Ubuntu Cosmic): status |
Fix Committed |
Fix Released |
|
2018-12-03 08:49:32 |
Launchpad Janitor |
cve linked |
|
2018-18653 |
|
2018-12-03 08:49:32 |
Launchpad Janitor |
cve linked |
|
2018-18955 |
|
2018-12-03 08:49:32 |
Launchpad Janitor |
cve linked |
|
2018-6559 |
|
2018-12-03 14:01:15 |
Launchpad Janitor |
linux (Ubuntu Bionic): status |
Fix Committed |
Fix Released |
|
2018-12-03 14:59:47 |
Launchpad Janitor |
linux (Ubuntu Xenial): status |
Fix Committed |
Fix Released |
|
2019-01-17 13:16:31 |
Dan Streetman |
bug task added |
|
linux (Ubuntu Disco) |
|
2019-01-17 13:16:51 |
Dan Streetman |
linux (Ubuntu Disco): status |
Confirmed |
Fix Released |
|
2019-01-17 18:00:39 |
Joseph Salisbury |
removed subscriber Joseph Salisbury |
|
|
|
2019-07-24 20:56:28 |
Brad Figg |
tags |
patch sts verification-done-bionic verification-done-cosmic verification-done-xenial |
cscc patch sts verification-done-bionic verification-done-cosmic verification-done-xenial |
|
2021-07-01 22:31:33 |
Dexuan Cui |
bug |
|
|
added subscriber Dexuan Cui |