[SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which causes Bus Fatal Error when rebooting system with BCM5720 NIC
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | Fix Released | Medium | Jeff Lane |
Focal | Fix Released | Medium | Jeff Lane |
Impish | Fix Released | Medium | Jeff Lane |
Jammy | Fix Released | Medium | Jeff Lane |
Bug Description
SRU Justification:
[IMPACT]
This was reported by a hardware partner. Their internal testing teams see it frequently, and customers who notice these messages in their logs are also reporting it, which is generating an unusually high volume of support calls from the field.
In 5.4, commit d60cd06331a3566 ("PM: ACPI: reboot: Use S5 for reboot") was introduced. According to the hardware partner, the following error messages are observed when rebooting a machine that uses the BCM5720 chipset, a widely used 1GbE controller found on LOMs, OCP NICs, and many PCIe NIC models:
[ 146.429212] shutdown[1]: Rebooting.
[ 146.435151] kvm: exiting hardware virtualization
[ 146.575319] megaraid_sas 0000:67:00.0: megasas_
[ 148.088133] [qede_unload:
[ 148.183618] qede 0000:31:00.1: Ending qede_remove successfully
[ 148.518541] [qede_unload:
[ 148.625066] qede 0000:31:00.0: Ending qede_remove successfully
[ 148.762067] ACPI: Preparing to enter system sleep state S5
[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5
[ 148.803731] {1}[Hardware Error]: event severity: recoverable
[ 148.810191] {1}[Hardware Error]: Error 0, type: fatal
[ 148.816088] {1}[Hardware Error]: section_type: PCIe error
[ 148.822391] {1}[Hardware Error]: port_type: 0, PCIe end point
[ 148.829026] {1}[Hardware Error]: version: 3.0
[ 148.834266] {1}[Hardware Error]: command: 0x0006, status: 0x0010
[ 148.841140] {1}[Hardware Error]: device_id: 0000:04:00.0
[ 148.847309] {1}[Hardware Error]: slot: 0
[ 148.852077] {1}[Hardware Error]: secondary_bus: 0x00
[ 148.857876] {1}[Hardware Error]: vendor_id: 0x14e4, device_id: 0x165f
[ 148.865145] {1}[Hardware Error]: class_code: 020000
[ 148.870845] {1}[Hardware Error]: aer_uncor_status: 0x00100000, aer_uncor_mask: 0x00010000
[ 148.879842] {1}[Hardware Error]: aer_uncor_severity: 0x000ef030
[ 148.886575] {1}[Hardware Error]: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.894823] tg3 0000:04:00.0: AER: aer_status: 0x00100000, aer_mask: 0x00010000
[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)
[ 148.910234] tg3 0000:04:00.0: AER: aer_layer=
[ 148.918806] tg3 0000:04:00.0: AER: aer_uncor_severity: 0x000ef030
[ 148.925558] tg3 0000:04:00.0: AER: TLP Header: 40000001 0000030f 90028090 00000000
[ 148.933984] reboot: Restarting system
[ 148.938319] reboot: machine restart
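For readers who want to check the AER fields in the log above by hand, here is a minimal sketch that decodes the uncorrectable status and mask values into bit names. The bit-to-name table follows the abbreviations the Linux AER driver prints (such as `UnsupReq`); the function name `decode_aer_uncor` is hypothetical and not part of any kernel or distro tool.

```python
# Bit positions of the PCIe AER Uncorrectable Error Status register,
# labeled with the short names the Linux AER driver uses in its output.
AER_UNCOR_BITS = {
    4: "DLP",         # Data Link Protocol Error
    5: "SDES",        # Surprise Down Error
    12: "TLP",        # Poisoned TLP
    13: "FCP",        # Flow Control Protocol Error
    14: "CmpltTO",    # Completion Timeout
    15: "CmpltAbrt",  # Completer Abort
    16: "UnxCmplt",   # Unexpected Completion
    17: "RxOF",       # Receiver Overflow
    18: "MalfTLP",    # Malformed TLP
    19: "ECRC",       # ECRC Error
    20: "UnsupReq",   # Unsupported Request
    21: "ACSViol",    # ACS Violation
}


def decode_aer_uncor(value):
    """Return (bit, name) pairs for every bit set in the register value."""
    return [
        (bit, AER_UNCOR_BITS.get(bit, "Unknown"))
        for bit in range(32)
        if value & (1 << bit)
    ]


# Values taken from the log: aer_uncor_status and aer_uncor_mask.
print(decode_aer_uncor(0x00100000))  # [(20, 'UnsupReq')]
print(decode_aer_uncor(0x00010000))  # [(16, 'UnxCmplt')]
```

The decoded status matches the `[20] UnsupReq (First)` line that the tg3 driver logs for device 0000:04:00.0.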
The hardware partner did some bisection and observed the following:
Kernel version | Fatal Error
---|---
5.4.0-42.46 | No
5.4.0-45.49 | No
5.4.0-47.51 | No
5.4.0-48.52 | No
5.4.0-51.56 | No
5.4.0-52.57 | No
5.4.0-53.59 | No
5.4.0-54.60 | No
5.4.0-58.64 | No
5.4.0-59.65 | Yes
5.4.0-60.67 | Yes
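The regression window can be read directly off these results. As a minimal sketch, the snippet below walks the partner's results in order and reports the last good and first bad kernel; the list literal simply transcribes the table, and the helper name `regression_window` is hypothetical.

```python
# (kernel version, fatal error observed), transcribed from the bisection results.
RESULTS = [
    ("5.4.0-42.46", False),
    ("5.4.0-45.49", False),
    ("5.4.0-47.51", False),
    ("5.4.0-48.52", False),
    ("5.4.0-51.56", False),
    ("5.4.0-52.57", False),
    ("5.4.0-53.59", False),
    ("5.4.0-54.60", False),
    ("5.4.0-58.64", False),
    ("5.4.0-59.65", True),
    ("5.4.0-60.67", True),
]


def regression_window(results):
    """Return (last_good, first_bad) for results ordered oldest to newest."""
    last_good = None
    for version, failed in results:
        if failed:
            return last_good, version
        last_good = version
    return last_good, None  # no failing version found


print(regression_window(RESULTS))  # ('5.4.0-58.64', '5.4.0-59.65')
```

This places the regression between 5.4.0-58.64 and 5.4.0-59.65.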
[FIX]
The fix is to apply this patch from upstream:
commit 9d3fcb28f9b9750
Author: Josef Bacik <email address hidden>
Date: Tue Mar 16 22:17:48 2021 -0400
Revert "PM: ACPI: reboot: Use S5 for reboot"
This reverts commit d60cd06331a3566
This patch causes a panic when rebooting my Dell Poweredge r440. I do
not have the full panic log as it's lost at that stage of the reboot and
I do not have a serial console. Reverting this patch makes my system
able to reboot again.
Example:
https:/
The hardware partner has preemptively pulled our 5.4 tree, applied the fix, and tested it in their labs. They determined that it resolves the issue.
[TEST CASE]
Install the patched kernel on a machine that uses a BCM5720 LOM, reboot the machine, and confirm that the errors no longer appear.
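One way to make this check repeatable is to scan captured console output for the two signatures seen in the failing log. This is a sketch only: it assumes the reboot console output has already been captured as text (for example from a serial console), and the function name `find_reboot_errors` and the patterns are illustrative choices based on the log in this report.

```python
import re

# Signatures taken from the failing log in this report.
ERROR_PATTERNS = [
    re.compile(r"\[Hardware Error\]"),  # APEI hardware error records
    re.compile(r"\btg3 \S+: AER:"),     # AER reports from the tg3 driver
]


def find_reboot_errors(log_text):
    """Return the log lines that match any of the known error signatures."""
    return [
        line
        for line in log_text.splitlines()
        if any(pattern.search(line) for pattern in ERROR_PATTERNS)
    ]


sample = "\n".join([
    "[ 148.762067] ACPI: Preparing to enter system sleep state S5",
    "[ 148.794638] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 5",
    "[ 148.902795] tg3 0000:04:00.0: AER: [20] UnsupReq (First)",
    "[ 148.933984] reboot: Restarting system",
])

errors = find_reboot_errors(sample)
print(len(errors))  # 2
```

On a patched kernel the returned list should be empty for the same reboot sequence.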
no longer affects: | linux (Ubuntu) |
affects: | linux → linux (Ubuntu) |
Changed in linux (Ubuntu Focal): | |
status: | New → In Progress |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Impish): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Jammy): | |
importance: | Undecided → Medium |
Changed in linux (Ubuntu Focal): | |
assignee: | nobody → Jeff Lane (bladernr) |
Changed in linux (Ubuntu Impish): | |
assignee: | nobody → Jeff Lane (bladernr) |
Changed in linux (Ubuntu Jammy): | |
assignee: | nobody → Jeff Lane (bladernr) |
Changed in linux (Ubuntu Impish): | |
status: | New → In Progress |
Changed in linux (Ubuntu Jammy): | |
status: | New → In Progress |
description: | updated |
description: | updated |
summary: |
- [Regression] Bus Fatal Error observed when reboot on BCM5720
+ [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720
summary: |
- [SRU][Regression] Bus Fatal Error observed when reboot on BCM5720
+ [SRU][Regression] Revert "PM: ACPI: reboot: Use S5 for reboot" which causes Bus Fatal Error when rebooting system with BCM5720 NIC
description: | updated |
Changed in linux (Ubuntu Focal): | |
status: | In Progress → Fix Committed |
tags: |
added: verification-done-focal
removed: verification-needed-focal
The original project wasn't correct; moving this to the kernel to be examined, as it is potentially a regression.