PCIe AER device recovery failed due to logic flaw

Bug #1873537 reported by Jay Vosburgh
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Eoan
Undecided
Unassigned
Focal
Undecided
Unassigned

Bug Description

SRU Justification

Impact:

During PCI Express Downstream Port Containment (DPC) recovery,
certain types of failures do not recover due to a logic flaw
in pcie_do_recovery().

The upstream git commit log explains the change:

PCI/ERR: Update error status after reset_link()
Commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery") uses
reset_link() to recover from fatal errors. But during fatal error
recovery, if the initial value of error status is PCI_ERS_RESULT_DISCONNECT
or PCI_ERS_RESULT_NO_AER_DRIVER then even after successful recovery (using
reset_link()) pcie_do_recovery() will report the recovery result as
failure. Update the status of error after reset_link().

You can reproduce this issue by triggering a SW DPC using "DPC Software
Trigger" bit in "DPC Control Register". You should see recovery failed
dmesg log as below:

  pcieport 0000:00:16.0: DPC: containment event, status:0x1f27 source:0x0000
  pcieport 0000:00:16.0: DPC: software trigger detected
  pci 0000:04:00.0: AER: can't recover (no error_detected callback)
  pcieport 0000:00:16.0: AER: device recovery failed

Fixes: bdb5ac85777d ("PCI/ERR: Handle fatal error recovery")
Link: https://lore.kernel.org/r/a255fcb3a3fdebcd90f84e08b555f1786eb8eba2<email address hidden>
[bhelgaas: split pci_channel_io_frozen simplification to separate patch]
Signed-off-by: Kuppuswamy Sathyanarayanan <email address hidden>
Signed-off-by: Bjorn Helgaas <email address hidden>
Acked-by: Keith Busch <email address hidden>
Cc: Ashok Raj <email address hidden>

Note that a second prerequisite patch is necessary as well. This patch,

commit b5dfbeacf74865a8d62a4f70f501cdc61510f8e0
Author: Kuppuswamy Sathyanarayanan <email address hidden>
Date: Fri Mar 27 17:33:24 2020 -0500

    PCI/ERR: Combine pci_channel_io_frozen cases

is a code readability change, and makes no functional changes.

Testcase:

On a system with DPC enabled, setpci may be used to set the DPC Software
Trigger bit (bit 6, value 0x40) in the DPC Control register of a suitable
PCIe device (a PCIe bridge, for example).

On a system lacking the fix, the output will be as shown above (i.e.,
culminating in the "device recovery failed" message). With the fix
applied, the device successfully recovers, resulting in a message of the
form

pcieport 0000:d9:01.0: AER: Device recovery successful

Regression Potential:

The risk of regression is low, as (a) the path in question currently does
not work, and (b) the changes are minimal, comprising only a housekeeping
change and the logically correct updating of a status variable that did
not previously occur.

Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1873537

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Eoan):
status: New → Fix Committed
Changed in linux (Ubuntu Focal):
status: New → Fix Committed
Revision history for this message
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release

Changed in linux (Ubuntu Eoan):
status: Fix Committed → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers