PCIe AER device recovery failed due to logic flaw
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Incomplete | Undecided | Unassigned |
Eoan | Won't Fix | Undecided | Unassigned |
Focal | Fix Committed | Undecided | Unassigned |
Bug Description
SRU Justification
Impact:
During PCI Express Downstream Port Containment (DPC) recovery,
certain types of failures do not recover due to a logic flaw
in pcie_do_recovery().
The upstream git commit log explains the change:
PCI/ERR: Update error status after reset_link()
Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
reset_link() to recover from fatal errors. But during fatal error
recovery, if the initial value of error status is PCI_ERS_
or PCI_ERS_
reset_link()) pcie_do_recovery() will report the recovery result as
failure. Update the status of error after reset_link().
You can reproduce this issue by triggering a SW DPC using "DPC Software
Trigger" bit in "DPC Control Register". You should see recovery failed
dmesg log as below:
pcieport 0000:00:16.0: DPC: containment event, status:0x1f27 source:0x0000
pcieport 0000:00:16.0: DPC: software trigger detected
pci 0000:04:00.0: AER: can't recover (no error_detected callback)
pcieport 0000:00:16.0: AER: device recovery failed
Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Link: https:/
[bhelgaas: split pci_channel_
Signed-off-by: Kuppuswamy Sathyanarayanan <email address hidden>
Signed-off-by: Bjorn Helgaas <email address hidden>
Acked-by: Keith Busch <email address hidden>
Cc: Ashok Raj <email address hidden>
Note that a second prerequisite patch is necessary as well. This patch,
commit b5dfbeacf74865a
Author: Kuppuswamy Sathyanarayanan <email address hidden>
Date: Fri Mar 27 17:33:24 2020 -0500
PCI/ERR: Combine pci_channel_
is a code readability change, and makes no functional changes.
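The flaw described in the commit log can be illustrated with a small user-space model. This is only a sketch, not the kernel code: the enum values, `reset_link_stub()`, and both `recover_*` functions are hypothetical stand-ins chosen for illustration, and the real pcie_do_recovery() does considerably more work. The point it shows is that without the fix the result of the link reset is checked but never stored, so a pre-reset status such as "no error_detected callback" still decides the final outcome.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for recovery results; NOT the kernel's identifiers. */
enum ers_result {
    ERS_CAN_RECOVER,  /* optimistic starting status */
    ERS_NO_HANDLER,   /* a device has no error_detected callback */
    ERS_FAILED,       /* the link reset itself failed */
    ERS_RECOVERED     /* recovery completed */
};

/* Hypothetical stub modelling a link reset that succeeds or fails. */
enum ers_result reset_link_stub(bool link_reset_ok)
{
    return link_reset_ok ? ERS_RECOVERED : ERS_FAILED;
}

/* Model of the flawed flow: the reset result is tested but discarded. */
bool recover_without_fix(enum ers_result detected, bool link_reset_ok)
{
    enum ers_result status = detected;  /* from the error_detected broadcast */

    if (reset_link_stub(link_reset_ok) != ERS_RECOVERED)
        return false;

    /* status still holds the pre-reset value at this point. */
    if (status == ERS_CAN_RECOVER)
        status = ERS_RECOVERED;

    return status == ERS_RECOVERED;
}

/* Model of the fixed flow: status is updated from the reset result. */
bool recover_with_fix(enum ers_result detected, bool link_reset_ok)
{
    enum ers_result status = detected;  /* from the error_detected broadcast */

    status = reset_link_stub(link_reset_ok);
    if (status != ERS_RECOVERED)
        return false;

    return status == ERS_RECOVERED;
}
```

In this model, `recover_without_fix(ERS_NO_HANDLER, true)` reports failure even though the link reset succeeded, while `recover_with_fix(ERS_NO_HANDLER, true)` reports success. A failed link reset is reported as a failure by both variants.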
Testcase:
On a system with DPC enabled, setpci may be used to set the DPC Software
Trigger bit (bit 6, value 0x40) in the DPC Control register of a suitable
PCIe device (a PCIe bridge, for example).
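For reference, the register arithmetic behind that step can be sketched as follows. The macro and function names are made up for this example, and the register is assumed to be handled as a 16-bit value; only the bit position (bit 6, value 0x40) comes from the test case above. The helper sets the trigger bit and leaves every other bit of the current register value untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 6 of the DPC Control register, per the test case; the name is ours. */
#define DPC_CTL_SW_TRIGGER ((uint16_t)(1u << 6))  /* 0x0040 */

/* Return the register value with the software trigger bit set,
 * preserving all other bits of the current value. */
uint16_t dpc_ctl_set_sw_trigger(uint16_t ctl)
{
    return (uint16_t)(ctl | DPC_CTL_SW_TRIGGER);
}
```

setpci can perform the same single-bit update directly, since it accepts a value:mask form in which only the bits set in the mask are changed.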
On a system lacking the fix, the output matches the dmesg log quoted in
the commit message above (i.e., it culminates in the "device recovery
failed" message). With the fix applied, the device recovers successfully
and logs a message of the form:
pcieport 0000:d9:01.0: AER: Device recovery successful
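When checking many machines, the two outcomes can be told apart mechanically by matching the message text. The following is a minimal sketch: the enum and function names are hypothetical, and it assumes the kernel log lines contain exactly the two strings quoted in this report, which may differ on other kernel versions.

```c
#include <assert.h>
#include <string.h>

/* Possible classifications of a single kernel log line (names are ours). */
enum recovery_outcome {
    RECOVERY_UNKNOWN,  /* line is not a recovery result message */
    RECOVERY_FAILED,   /* "AER: device recovery failed" */
    RECOVERY_OK        /* "AER: Device recovery successful" */
};

/* Classify one log line by the message strings quoted in this report. */
enum recovery_outcome classify_aer_line(const char *line)
{
    if (strstr(line, "AER: device recovery failed") != NULL)
        return RECOVERY_FAILED;
    if (strstr(line, "AER: Device recovery successful") != NULL)
        return RECOVERY_OK;
    return RECOVERY_UNKNOWN;
}
```

Feeding each line of the kernel log through `classify_aer_line()` after setting the trigger bit should yield `RECOVERY_FAILED` on an unfixed kernel and `RECOVERY_OK` on a fixed one.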
Regression Potential:
The risk of regression is low, as (a) the path in question currently does
not work, and (b) the changes are minimal, comprising only a housekeeping
change and the logically correct updating of a status variable that did
not previously occur.
Changed in linux (Ubuntu Eoan):
status: New → Fix Committed
Changed in linux (Ubuntu Focal):
status: New → Fix Committed
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 1873537
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.