2018-08-10 15:35:21 |
Joseph Salisbury |
description |
Verified on multiple DL360 Gen9 servers with up-to-date firmware. Just before reboot or shutdown, the kernel panics as follows:
[ 289.093083] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 289.093085] {1}[Hardware Error]: event severity: fatal
[ 289.093087] {1}[Hardware Error]: Error 0, type: fatal
[ 289.093088] {1}[Hardware Error]: section_type: PCIe error
[ 289.093090] {1}[Hardware Error]: port_type: 4, root port
[ 289.093091] {1}[Hardware Error]: version: 1.16
[ 289.093093] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 289.093094] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 289.093095] {1}[Hardware Error]: slot: 0
[ 289.093096] {1}[Hardware Error]: secondary_bus: 0x03
[ 289.093097] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 289.093098] {1}[Hardware Error]: class_code: 040600
[ 289.093378] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 289.093380] {1}[Hardware Error]: Error 1, type: fatal
[ 289.093381] {1}[Hardware Error]: section_type: PCIe error
[ 289.093382] {1}[Hardware Error]: port_type: 4, root port
[ 289.093383] {1}[Hardware Error]: version: 1.16
[ 289.093384] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 289.093386] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 289.093386] {1}[Hardware Error]: slot: 0
[ 289.093387] {1}[Hardware Error]: secondary_bus: 0x03
[ 289.093388] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 289.093674] {1}[Hardware Error]: class_code: 040600
[ 289.093676] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 289.093678] Kernel panic - not syncing: Fatal hardware error!
[ 289.093745] Kernel Offset: 0x1cc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 289.105835] ERST: [Firmware Warn]: Firmware does not respond in time.
It does eventually restart after this. Then during the subsequent POST, the following warning appears:
Embedded RAID 1 : Smart Array P440ar Controller - (2048 MB, V6.30) 7 Logical
Drive(s) - Operation Failed
- 1719-Slot 0 Drive Array - A controller failure event occurred prior
to this power-up. (Previous lock up code = 0x13) Action: Install the
latest controller firmware. If the problem persists, replace the
controller.
The symptoms of the POST warning are described in https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c04805565 but the running storage controller firmware is much newer than the version named in that document's resolution.
Neither of these problems occurs during shutdown/reboot on the xenial kernel.
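For anyone comparing panics across several affected machines, the sketch below shows one way to split the "[Hardware Error]" lines into per-error records and to name the error bits set in a status word such as secondary_status. It is an illustration, not part of the kernel or of any HPE tool: the function names split_records and decode_status are hypothetical, the line layout is assumed to match the log above, and the bit names follow the standard PCI status register layout (bit 14 is "signaled" on the primary side and "received" on the secondary side).

```python
import re

# Prefix the kernel puts on each APEI/GHES line, e.g.
# "[ 289.093088] {1}[Hardware Error]: section_type: PCIe error"
PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s*\{\d+\}\[Hardware Error\]:\s*(.*)$")
# First line of each error section, e.g. "Error 0, type: fatal"
ERROR_START = re.compile(r"^Error (\d+), type: (\w+)$")

# Error bits of the PCI status / secondary status register.
STATUS_BITS = {
    8: "master data parity error",
    11: "signaled target abort",
    12: "received target abort",
    13: "received master abort",
    14: "system error (signaled/received)",
    15: "detected parity error",
}


def split_records(log_text):
    """Group the raw field lines under the 'Error N' header that precedes them."""
    records = []
    for line in log_text.splitlines():
        match = PREFIX.match(line.strip())
        if not match:
            continue  # not a [Hardware Error] line
        body = match.group(1).strip()
        start = ERROR_START.match(body)
        if start:
            records.append({
                "index": int(start.group(1)),
                "severity": start.group(2),
                "fields": [],
            })
        elif records:
            # Lines before the first "Error N" header are ignored.
            records[-1]["fields"].append(body)
    return records


def decode_status(value):
    """Return the names of the error bits set in a 16-bit status word."""
    return [name for bit, name in sorted(STATUS_BITS.items()) if value & (1 << bit)]
```

Under these assumptions, decode_status(0x2000) reports only "received master abort" for the secondary_status value in the log, which would be consistent with the bridge issuing a transaction to a device that no longer responded.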
FWIW, when running the older P89 BIOS (1.50 (07/20/2015) rather than 2.56 (01/22/2018)), the shutdown failure mode was a loop like so:
[529151.035267] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[529153.222883] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.222884] Do you have a strange power saving mode enabled?
[529153.222884] Dazed and confused, but trying to continue
[529153.554447] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554448] Do you have a strange power saving mode enabled?
[529153.554449] Dazed and confused, but trying to continue
[529153.554450] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554451] Do you have a strange power saving mode enabled?
[529153.554452] Dazed and confused, but trying to continue
[529153.554452] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554453] Do you have a strange power saving mode enabled?
[529153.554454] Dazed and confused, but trying to continue
[529153.554454] Uhhuh. NMI received for unknown reason 35 on CPU 0.
[529153.554455] Do you have a strange power saving mode enabled?
[529153.554456] Dazed and confused, but trying to continue
[529153.554457] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554458] Do you have a strange power saving mode enabled?
[529153.554458] Dazed and confused, but trying to continue
[529153.554459] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554460] Do you have a strange power saving mode enabled?
[529153.554460] Dazed and confused, but trying to continue
[529154.953916] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529154.953917] Do you have a strange power saving mode enabled?
[529154.953918] Dazed and confused, but trying to continue
But upgrading to 2.56 changes that to a kernel panic.
ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-signed-image-generic 4.15.0.21.22
ProcVersionSignature: Ubuntu 4.15.0-21.22-generic 4.15.17
Uname: Linux 4.15.0-21-generic x86_64
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 May 15 23:11 seq
crw-rw---- 1 root audio 116, 33 May 15 23:11 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Wed May 16 00:17:53 2018
HibernationDevice: RESUME=UUID=696e8063-c668-4c89-a478-bfc23a450369
InstallationDate: Installed on 2016-06-01 (713 days ago)
InstallationMedia: Ubuntu-Server 14.04.5 LTS "Trusty Tahr" - Beta amd64 (20160527)
MachineType: HP ProLiant DL360 Gen9
PciMultimedia:
ProcEnviron:
TERM=xterm-256color
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-21-generic root=UUID=6e6d422d-8ffb-4db3-b8c7-6c81e320b1b2 ro console=tty0 console=ttyS1,38400 nosplash console=ttyS1,38400 console=tty0 nosplash
RelatedPackageVersions:
linux-restricted-modules-4.15.0-21-generic N/A
linux-backports-modules-4.15.0-21-generic N/A
linux-firmware 1.173
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2018-05-09 (6 days ago)
dmi.bios.date: 01/22/2018
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.board.name: ProLiant DL360 Gen9
dmi.board.vendor: HP
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd01/22/2018:svnHP:pnProLiantDL360Gen9:pvr:rvnHP:rnProLiantDL360Gen9:rvr:cvnHP:ct23:cvr:
dmi.product.family: ProLiant
dmi.product.name: ProLiant DL360 Gen9
dmi.sys.vendor: HP |
== SRU Justification ==
A mainline commit introduced a regression in v4.15-rc1. The regression
causes a kernel panic during system shutdown. The commit listed below
fixes that regression. It was also cc'd to upstream stable, but it
has not yet landed in Bionic.
== Fix ==
0d98ba8d70b0 ("scsi: hpsa: disable device during shutdown")
== Regression Potential ==
Low. This patch fixes a current regression. It has been cc'd to
upstream stable, so it has had additional upstream review.
== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.
Verified on multiple DL360 Gen9 servers with up-to-date firmware. Just before reboot or shutdown, the kernel panics as follows:
[ 289.093083] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 289.093085] {1}[Hardware Error]: event severity: fatal
[ 289.093087] {1}[Hardware Error]: Error 0, type: fatal
[ 289.093088] {1}[Hardware Error]: section_type: PCIe error
[ 289.093090] {1}[Hardware Error]: port_type: 4, root port
[ 289.093091] {1}[Hardware Error]: version: 1.16
[ 289.093093] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 289.093094] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 289.093095] {1}[Hardware Error]: slot: 0
[ 289.093096] {1}[Hardware Error]: secondary_bus: 0x03
[ 289.093097] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 289.093098] {1}[Hardware Error]: class_code: 040600
[ 289.093378] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 289.093380] {1}[Hardware Error]: Error 1, type: fatal
[ 289.093381] {1}[Hardware Error]: section_type: PCIe error
[ 289.093382] {1}[Hardware Error]: port_type: 4, root port
[ 289.093383] {1}[Hardware Error]: version: 1.16
[ 289.093384] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 289.093386] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 289.093386] {1}[Hardware Error]: slot: 0
[ 289.093387] {1}[Hardware Error]: secondary_bus: 0x03
[ 289.093388] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 289.093674] {1}[Hardware Error]: class_code: 040600
[ 289.093676] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 289.093678] Kernel panic - not syncing: Fatal hardware error!
[ 289.093745] Kernel Offset: 0x1cc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 289.105835] ERST: [Firmware Warn]: Firmware does not respond in time.
It does eventually restart after this. Then during the subsequent POST, the following warning appears:
Embedded RAID 1 : Smart Array P440ar Controller - (2048 MB, V6.30) 7 Logical
Drive(s) - Operation Failed
- 1719-Slot 0 Drive Array - A controller failure event occurred prior
to this power-up. (Previous lock up code = 0x13) Action: Install the
latest controller firmware. If the problem persists, replace the
controller.
The symptoms of the POST warning are described in https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c04805565 but the running storage controller firmware is much newer than the version named in that document's resolution.
Neither of these problems occurs during shutdown/reboot on the xenial kernel.
FWIW, when running the older P89 BIOS (1.50 (07/20/2015) rather than 2.56 (01/22/2018)), the shutdown failure mode was a loop like so:
[529151.035267] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[529153.222883] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.222884] Do you have a strange power saving mode enabled?
[529153.222884] Dazed and confused, but trying to continue
[529153.554447] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554448] Do you have a strange power saving mode enabled?
[529153.554449] Dazed and confused, but trying to continue
[529153.554450] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554451] Do you have a strange power saving mode enabled?
[529153.554452] Dazed and confused, but trying to continue
[529153.554452] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554453] Do you have a strange power saving mode enabled?
[529153.554454] Dazed and confused, but trying to continue
[529153.554454] Uhhuh. NMI received for unknown reason 35 on CPU 0.
[529153.554455] Do you have a strange power saving mode enabled?
[529153.554456] Dazed and confused, but trying to continue
[529153.554457] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554458] Do you have a strange power saving mode enabled?
[529153.554458] Dazed and confused, but trying to continue
[529153.554459] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554460] Do you have a strange power saving mode enabled?
[529153.554460] Dazed and confused, but trying to continue
[529154.953916] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529154.953917] Do you have a strange power saving mode enabled?
[529154.953918] Dazed and confused, but trying to continue
But upgrading to 2.56 changes that to a kernel panic.
ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-signed-image-generic 4.15.0.21.22
ProcVersionSignature: Ubuntu 4.15.0-21.22-generic 4.15.17
Uname: Linux 4.15.0-21-generic x86_64
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 May 15 23:11 seq
crw-rw---- 1 root audio 116, 33 May 15 23:11 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Wed May 16 00:17:53 2018
HibernationDevice: RESUME=UUID=696e8063-c668-4c89-a478-bfc23a450369
InstallationDate: Installed on 2016-06-01 (713 days ago)
InstallationMedia: Ubuntu-Server 14.04.5 LTS "Trusty Tahr" - Beta amd64 (20160527)
MachineType: HP ProLiant DL360 Gen9
PciMultimedia:
ProcEnviron:
TERM=xterm-256color
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-21-generic root=UUID=6e6d422d-8ffb-4db3-b8c7-6c81e320b1b2 ro console=tty0 console=ttyS1,38400 nosplash console=ttyS1,38400 console=tty0 nosplash
RelatedPackageVersions:
linux-restricted-modules-4.15.0-21-generic N/A
linux-backports-modules-4.15.0-21-generic N/A
linux-firmware 1.173
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2018-05-09 (6 days ago)
dmi.bios.date: 01/22/2018
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.board.name: ProLiant DL360 Gen9
dmi.board.vendor: HP
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd01/22/2018:svnHP:pnProLiantDL360Gen9:pvr:rvnHP:rnProLiantDL360Gen9:rvr:cvnHP:ct23:cvr:
dmi.product.family: ProLiant
dmi.product.name: ProLiant DL360 Gen9
dmi.sys.vendor: HP |
|