Comment 8 for bug 1771467

ryan (ryan-linux-kernel-bugs) wrote:

Created attachment 276079
lspci -vv

On HPE DL360 Gen9 servers, the kernel panics on shutdown/reboot with the output below. I've confirmed this on several DL360 Gen9s; other generations and/or products may also be affected, but I haven't been able to test other HP hardware yet.

[ 122.447111] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 122.447112] {1}[Hardware Error]: event severity: fatal
[ 122.447113] {1}[Hardware Error]: Error 0, type: fatal
[ 122.447114] {1}[Hardware Error]: section_type: PCIe error
[ 122.447115] {1}[Hardware Error]: port_type: 4, root port
[ 122.447116] {1}[Hardware Error]: version: 1.16
[ 122.447118] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 122.447119] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 122.447119] {1}[Hardware Error]: slot: 0
[ 122.447120] {1}[Hardware Error]: secondary_bus: 0x03
[ 122.447120] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 122.447121] {1}[Hardware Error]: class_code: 040600
[ 122.447122] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 122.447123] {1}[Hardware Error]: Error 1, type: fatal
[ 122.447123] {1}[Hardware Error]: section_type: PCIe error
[ 122.447124] {1}[Hardware Error]: port_type: 4, root port
[ 122.447125] {1}[Hardware Error]: version: 1.16
[ 122.447125] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 122.447126] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 122.447127] {1}[Hardware Error]: slot: 0
[ 122.447127] {1}[Hardware Error]: secondary_bus: 0x03
[ 122.447128] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 122.447129] {1}[Hardware Error]: class_code: 040600
[ 122.447130] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 122.447131] Kernel panic - not syncing: Fatal hardware error!
[ 122.447166] Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 122.459295] ERST: [Firmware Warn]: Firmware does not respond in time.
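
For anyone checking whether another machine hits the same error, the "Error N" records in the log above can be pulled out of a saved dmesg capture with a short script. This is only a rough sketch: it assumes the lines look exactly like the ones quoted here, and the function name and dictionary keys are made up for illustration.

```python
import re

# Strips the "[ 122.447118] {1}[Hardware Error]: " prefix from a log line.
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s*\{\d+\}\[Hardware Error\]:\s*(.*)$")
# "Error 0, type: fatal" opens a new record.
RECORD_START = re.compile(r"^Error (\d+), type: (\w+)$")
# "device_id: 0000:00:01.0" is the PCI address (domain:bus:device.function).
ADDRESS = re.compile(r"^device_id: ([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])$")
# "vendor_id: 0x8086, device_id: 0x2f02" carries the numeric IDs.
IDS = re.compile(r"^vendor_id: (0x[0-9a-f]{4}), device_id: (0x[0-9a-f]{4})$")
SECTION = re.compile(r"^section_type: (.+)$")


def parse_hardware_errors(log_text):
    """Return one dict per 'Error N' record found in log_text."""
    records = []
    for line in log_text.splitlines():
        prefix_match = PREFIX.match(line.strip())
        if not prefix_match:
            continue  # not a "[Hardware Error]" line
        body = prefix_match.group(1).strip()

        start = RECORD_START.match(body)
        if start:
            records.append({"index": int(start.group(1)),
                            "type": start.group(2)})
            continue
        if not records:
            continue  # header lines that precede the first record

        record = records[-1]
        section = SECTION.match(body)
        address = ADDRESS.match(body)
        ids = IDS.match(body)
        if section:
            record["section_type"] = section.group(1)
        elif address:
            record["address"] = address.group(1)
        elif ids:
            record["vendor_id"] = ids.group(1)
            record["device_id"] = ids.group(2)
    return records
```

Calling `parse_hardware_errors()` on the text of a saved log returns a list of records whose `address` field can be compared against the attached `lspci -vv` output to see which device the firmware is blaming.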

After that, on the next POST, the storage controller reports a failure but does eventually work:

Embedded RAID 1 : Smart Array P440ar Controller - (2048 MB, V6.30) 7 Logical
Drive(s) - Operation Failed
 - 1719-Slot 0 Drive Array - A controller failure event occurred prior
   to this power-up. (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.

The firmware is up to date (P89 01/22/2018, controller 6.30). Interestingly, on older firmware (circa 2016; I don't have an exact version), this manifested as a crash loop instead:

[529151.035267] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[529153.222883] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.222884] Do you have a strange power saving mode enabled?
[529153.222884] Dazed and confused, but trying to continue
[529153.554447] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554448] Do you have a strange power saving mode enabled?
[529153.554449] Dazed and confused, but trying to continue

I've narrowed it down to https://patchwork.kernel.org/patch/10027157/ as part of commit 1b6115fbe3b3db746d7baa11399dd617fc75e1c4; removing that line prevents the panic.