Ubuntu
linux package

Reboot/shutdown kernel panic on HP DL360/DL380 Gen9 w/ bionic 4.15.0

Bug #1771467 reported by Ryan Finnie on 2018-05-16

This bug affects 10 people

	Status	Importance	Assigned to
Linux	Fix Released	Medium	linux-kernel-bugs #199779
linux (Ubuntu)	Fix Released	High	Unassigned
Bionic	Fix Released	High	Unassigned

Bug Description

== SRU Justification ==
A mainline commit introduced a regression in v4.15-rc1 that causes a
kernel panic during system shutdown. This commit fixes that regression.
It was also cc'd to upstream stable, but it has not yet landed in
Bionic.

== Fix ==
0d98ba8d70b0 ("scsi: hpsa: disable device during shutdown")

== Regression Potential ==
Low. This patch fixes a current regression. It has been cc'd to
upstream stable, so it has had additional upstream review.

== Test Case ==
A test kernel was built with this patch and tested by the original bug reporter.
The bug reporter states the test kernel resolved the bug.

Verified on multiple DL360 Gen9 servers with up to date firmware. Just before reboot or shutdown, there is the following panic:

[ 289.093083] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 289.093085] {1}[Hardware Error]: event severity: fatal
[ 289.093087] {1}[Hardware Error]: Error 0, type: fatal
[ 289.093088] {1}[Hardware Error]: section_type: PCIe error
[ 289.093090] {1}[Hardware Error]: port_type: 4, root port
[ 289.093091] {1}[Hardware Error]: version: 1.16
[ 289.093093] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 289.093094] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 289.093095] {1}[Hardware Error]: slot: 0
[ 289.093096] {1}[Hardware Error]: secondary_bus: 0x03
[ 289.093097] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 289.093098] {1}[Hardware Error]: class_code: 040600
[ 289.093378] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 289.093380] {1}[Hardware Error]: Error 1, type: fatal
[ 289.093381] {1}[Hardware Error]: section_type: PCIe error
[ 289.093382] {1}[Hardware Error]: port_type: 4, root port
[ 289.093383] {1}[Hardware Error]: version: 1.16
[ 289.093384] {1}[Hardware Error]: command: 0x6010, status: 0x0143
[ 289.093386] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 289.093386] {1}[Hardware Error]: slot: 0
[ 289.093387] {1}[Hardware Error]: secondary_bus: 0x03
[ 289.093388] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 289.093674] {1}[Hardware Error]: class_code: 040600
[ 289.093676] {1}[Hardware Error]: bridge: secondary_status: 0x2000, control: 0x0003
[ 289.093678] Kernel panic - not syncing: Fatal hardware error!
[ 289.093745] Kernel Offset: 0x1cc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 289.105835] ERST: [Firmware Warn]: Firmware does not respond in time.

It does eventually restart after this. Then during the subsequent POST, the following warning appears:

Embedded RAID 1 : Smart Array P440ar Controller - (2048 MB, V6.30) 7 Logical
Drive(s) - Operation Failed
- 1719-Slot 0 Drive Array - A controller failure event occurred prior
   to this power-up. (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.

The latter's symptoms are described in https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-c04805565 but the running storage controller firmware is much newer than the doc's resolution.

Neither of these problems occurs during shutdown/reboot on the xenial kernel.
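When comparing panics across several machines, it can help to pull the key/value pairs out of the `[Hardware Error]` lines rather than reading them by eye. The following is only a sketch: `hardware_error_fields` and `LINE_RE` are hypothetical names, the sample is a few lines copied from the log above, and the parsing assumes the `key: value, key: value` layout seen there.

```python
import re

# A few lines copied from the panic log above; a real run would read the
# whole console capture instead.
SAMPLE = """\
[ 289.093088] {1}[Hardware Error]: section_type: PCIe error
[ 289.093094] {1}[Hardware Error]: device_id: 0000:00:01.0
[ 289.093098] {1}[Hardware Error]: vendor_id: 0x8086, device_id: 0x2f02
[ 289.093678] Kernel panic - not syncing: Fatal hardware error!
"""

# Matches "[ <timestamp>] {<n>}[Hardware Error]: <rest of line>".
LINE_RE = re.compile(r"^\[\s*[\d.]+\]\s+\{\d+\}\[Hardware Error\]:\s*(.*)$")


def hardware_error_fields(log_text):
    """Collect (key, value) pairs from [Hardware Error] lines.

    Hypothetical helper. A list of pairs is returned instead of a dict
    because the log reuses "device_id" for both the PCI address and the
    numeric device ID.
    """
    fields = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not a [Hardware Error] line, e.g. the panic message
        for part in match.group(1).split(", "):
            key, sep, value = part.partition(": ")
            if sep:  # skip fragments without a "key: value" shape
                fields.append((key.strip(), value.strip()))
    return fields


for key, value in hardware_error_fields(SAMPLE):
    print(f"{key} = {value}")
```

Keeping the pairs in order makes it easy to diff the output from two hosts and confirm they report the same root port and vendor/device IDs.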

FWIW, when running on old P89 (1.50 (07/20/2015) vs 2.56 (01/22/2018)), the shutdown failure mode was a loop like so:

[529151.035267] NMI: IOCK error (debug interrupt?) for reason 75 on CPU 0.
[529153.222883] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.222884] Do you have a strange power saving mode enabled?
[529153.222884] Dazed and confused, but trying to continue
[529153.554447] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554448] Do you have a strange power saving mode enabled?
[529153.554449] Dazed and confused, but trying to continue
[529153.554450] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554451] Do you have a strange power saving mode enabled?
[529153.554452] Dazed and confused, but trying to continue
[529153.554452] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554453] Do you have a strange power saving mode enabled?
[529153.554454] Dazed and confused, but trying to continue
[529153.554454] Uhhuh. NMI received for unknown reason 35 on CPU 0.
[529153.554455] Do you have a strange power saving mode enabled?
[529153.554456] Dazed and confused, but trying to continue
[529153.554457] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554458] Do you have a strange power saving mode enabled?
[529153.554458] Dazed and confused, but trying to continue
[529153.554459] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529153.554460] Do you have a strange power saving mode enabled?
[529153.554460] Dazed and confused, but trying to continue
[529154.953916] Uhhuh. NMI received for unknown reason 25 on CPU 0.
[529154.953917] Do you have a strange power saving mode enabled?
[529154.953918] Dazed and confused, but trying to continue

But upgrading to 2.56 changes that to a kernel panic.

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-signed-image-generic 4.15.0.21.22
ProcVersionSignature: Ubuntu 4.15.0-21.22-generic 4.15.17
Uname: Linux 4.15.0-21-generic x86_64
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 May 15 23:11 seq
crw-rw---- 1 root audio 116, 33 May 15 23:11 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Wed May 16 00:17:53 2018
HibernationDevice: RESUME=UUID=696e8063-c668-4c89-a478-bfc23a450369
InstallationDate: Installed on 2016-06-01 (713 days ago)
InstallationMedia: Ubuntu-Server 14.04.5 LTS "Trusty Tahr" - Beta amd64 (20160527)
MachineType: HP ProLiant DL360 Gen9
PciMultimedia:

ProcEnviron:
TERM=xterm-256color
PATH=(custom, no user)
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-21-generic root=UUID=6e6d422d-8ffb-4db3-b8c7-6c81e320b1b2 ro console=tty0 console=ttyS1,38400 nosplash console=ttyS1,38400 console=tty0 nosplash
RelatedPackageVersions:
linux-restricted-modules-4.15.0-21-generic N/A
linux-backports-modules-4.15.0-21-generic N/A
linux-firmware 1.173
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: Upgraded to bionic on 2018-05-09 (6 days ago)
dmi.bios.date: 01/22/2018
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.board.name: ProLiant DL360 Gen9
dmi.board.vendor: HP
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd01/22/2018:svnHP:pnProLiantDL360Gen9:pvr:rvnHP:rnProLiantDL360Gen9:rvr:cvnHP:ct23:cvr:
dmi.product.family: ProLiant
dmi.product.name: ProLiant DL360 Gen9
dmi.sys.vendor: HP

Ryan Finnie (fo0bar) wrote on 2018-05-16:

CRDA.txt (454 bytes, text/plain; charset="utf-8")
CurrentDmesg.txt (85.9 KiB, text/plain; charset="utf-8")
Dependencies.txt (2.3 KiB, text/plain; charset="utf-8")
IwConfig.txt (601 bytes, text/plain; charset="utf-8")
Lspci.txt (151.2 KiB, text/plain; charset="utf-8")
Lsusb.txt (471 bytes, text/plain; charset="utf-8")
ProcCpuinfo.txt (22.5 KiB, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt (1.1 KiB, text/plain; charset="utf-8")
ProcInterrupts.txt (45.7 KiB, text/plain; charset="utf-8")
ProcModules.txt (4.4 KiB, text/plain; charset="utf-8")
UdevDb.txt (205.3 KiB, text/plain; charset="utf-8")
WifiSyslog.txt (105.2 KiB, text/plain; charset="utf-8")

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote on 2018-05-16: Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed

Joseph Salisbury (jsalisbury) wrote on 2018-05-16: Re: Reboot/shutdown kernel panic on HP DL360 Gen9 w/ bionic 4.15.0

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.17 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.17-rc5

Changed in linux (Ubuntu):
importance:	Undecided → High
tags:	added: kernel-da-key
Changed in linux (Ubuntu):
status:	Confirmed → Incomplete
Changed in linux (Ubuntu Bionic):
importance:	Undecided → High
status:	New → Incomplete

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

Created attachment 276079
lspci -vv

On HPE DL360 Gen9 (and possibly other gens and/or products; I haven't been able to test other HP hardware right now, but I do have several DL360 Gen9s I've confirmed on), upon shutdown/reboot, it will crash with:

[  122.447111] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[  122.447112] {1}[Hardware Error]: event severity: fatal
[  122.447113] {1}[Hardware Error]:  Error 0, type: fatal
[  122.447114] {1}[Hardware Error]:   section_type: PCIe error
[  122.447115] {1}[Hardware Error]:   port_type: 4, root port
[  122.447116] {1}[Hardware Error]:   version: 1.16
[  122.447118] {1}[Hardware Error]:   command: 0x6010, status: 0x0143
[  122.447119] {1}[Hardware Error]:   device_id: 0000:00:01.0
[  122.447119] {1}[Hardware Error]:   slot: 0
[  122.447120] {1}[Hardware Error]:   secondary_bus: 0x03
[  122.447120] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f02
[  122.447121] {1}[Hardware Error]:   class_code: 040600
[  122.447122] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[  122.447123] {1}[Hardware Error]:  Error 1, type: fatal
[  122.447123] {1}[Hardware Error]:   section_type: PCIe error
[  122.447124] {1}[Hardware Error]:   port_type: 4, root port
[  122.447125] {1}[Hardware Error]:   version: 1.16
[  122.447125] {1}[Hardware Error]:   command: 0x6010, status: 0x0143
[  122.447126] {1}[Hardware Error]:   device_id: 0000:00:01.0
[  122.447127] {1}[Hardware Error]:   slot: 0
[  122.447127] {1}[Hardware Error]:   secondary_bus: 0x03
[  122.447128] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x2f02
[  122.447129] {1}[Hardware Error]:   class_code: 040600
[  122.447130] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[  122.447131] Kernel panic - not syncing: Fatal hardware error!
[  122.447166] Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  122.459295] ERST: [Firmware Warn]: Firmware does not respond in time.

And after that, upon POST, the storage controller is not happy but does eventually work:

Embedded RAID 1 : Smart Array P440ar Controller - (2048 MB, V6.30) 7 Logical
Drive(s) - Operation Failed
 - 1719-Slot 0 Drive Array - A controller failure event occurred prior
   to this power-up.  (Previous lock up code = 0x13) Action: Install the
   latest controller firmware. If the problem persists, replace the
   controller.

Up to date firmware (P89 01/22/2018, controller 6.30).  Interestingly, on older (circa 2016 but I don't have an exact version) firmware, this manifested as a crash loop:

I've narrowed it down to https://patchwork.kernel.org/patch/10027157/ as part of commit 1b6115fbe3b3db746d7baa11399dd617fc75e1c4; removing that line prevents the panic.

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-21:

Can you test this patch?

https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/pci/hotplug?id=d22b362184553899f7d6b6760899a77d3b2d7c1b

There is a known Intel erratum that we missed.

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-21:

#10

Can you also share your dmesg?

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

#11

Created attachment 276103
4.17.0-rc5-next-20180517 dmesg

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

#12

Thanks, but same problem with that patch against 4.15. Even tried next-20180517 to be sure, no luck. dmesg against next-20180517 has been attached.

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-21:

#13

Cool, I had my suspicions. That's why I asked for dmesg. Your system doesn't seem to have the hotplug driver loaded. The bugfix above is valid only if you have the hotplug driver enabled. Something else must be happening.

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-21:

#14

It looks like PME is the only PCIe port service driver loaded. Can you empty out this line to see if it makes any difference? Then we can start going deeper based on your test result.

https://elixir.bootlin.com/linux/latest/ident/pcie_pme_remove

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

#15

Created attachment 276111
pcie_pme_remove removed, crash

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

#16

- .remove = pcie_pme_remove,

With that removed, the crash becomes:

[ 115.008578] kernel BUG at drivers/pci/msi.c:352!
[ 115.069730] invalid opcode: 0000 [#1] SMP PTI
[ 115.127399] CPU: 15 PID: 1 Comm: systemd-shutdow Not tainted 4.17.0-rc5-next-20180517-custom #1
[ 115.242735] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 01/22/2018
[ 115.351050] RIP: 0010:free_msi_irqs+0x17b/0x1b0
[ 115.410250] Code: 84 e1 fe ff ff 45 31 f6 eb 11 41 83 c6 01 44 39 73 14 0f 86 ce fe ff ff 8b 7b 10 44 01 f7 e8 7c f4 bb ff 48 83 78 70 00 74 e0 <0f> 0b 49 8d b5 a0 00 00 00 e8 b7 a0 bc ff e9 cf fe ff ff 48 8b 78
[...]

Full output attached.

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-21:

#17

Oops. Can you comment out this line only?

https://elixir.bootlin.com/linux/latest/source/drivers/pci/pcie/pme.c#L431

We have to call free_irq(). I was too aggressive with the problem.

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

#18

Commented out "pcie_pme_suspend(srv);", back to original Hardware Error crash.

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-21:

#19

Weird. I'll come up with a debug patch. Can you collect some more data as to what other systems see this issue in the meantime?

Since you are the first one to report the problem, there must be something unique about your setup.

Also, please attach sudo lspci -t output too.

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

#20

Sure. I'm seeing this on a set of 4 DL360 Gen9s, I believe they were all purchased at the same time around 2016. I'll look around for further machines I can test on, looking for:

1) DL360 Gen9s but not from the same batch as these
2) Previous gens (not sure we have any older ones)
3) DL380 Gen9

Attaching lspci -t.

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-21:

#21

Created attachment 276113
lspci -t

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-21:

#22

Created attachment 276115
debug_patch.patch

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-22:

#23

I was able to test on another DL360 Gen9 received about a year after the ones I discovered on, same problem. And a DL380 Gen9 with similar specs, also crashes. I was able to test on a DL380 Gen10, which did *not* crash. In summary:

Bad: DL360 Gen9 - BIOS P89 v2.56 (01/22/2018) - P440ar V6.30 (originals)
Bad: DL360 Gen9 - BIOS P89 v2.52 (10/25/2017) - P440ar V6.06 (newer)
Bad: DL380 Gen9 - BIOS P89 v2.52 (10/25/2017) - P440ar V6.06 (newer)
Good: DL380 Gen10 - U30 v1.32 (02/01/2018) - P408i-a 1.04-0 (even newer)

Attached is the output from your debug patch on the original test system.

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-22:

#24

Created attachment 276121
debug patch output

Ryan Finnie (fo0bar) wrote on 2018-05-22:

I tracked it down to https://patchwork.kernel.org/patch/10027157/ just before 4.15-rc1. This appears to affect all DL360/DL380 Gen9 I've encountered so far. Opened https://bugzilla.kernel.org/show_bug.cgi?id=199779 and currently working with Sinan Kaya to diagnose.

tags:	added: kernel-bug-exists-upstream
Changed in linux (Ubuntu):
status:	Incomplete → Confirmed
Changed in linux (Ubuntu Bionic):
status:	Incomplete → Confirmed

Ryan Finnie (fo0bar) on 2018-05-22

summary:

- Reboot/shutdown kernel panic on HP DL360 Gen9 w/ bionic 4.15.0
+ Reboot/shutdown kernel panic on HP DL360/DL380 Gen9 w/ bionic 4.15.0

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-22:

#25

Many thanks, let's try these tests. Debug prints are not giving me any clues. The error seems to be asynchronous to the code execution. We'll have to find out by trial and error which one is confusing the HW. My bet is on the first one followed by the third.

1. Comment out this line only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_core.c#L412

2. Comment out this line only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_pci.c#L148

3. Comment out the if block only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_pci.c#L142

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-22:

#26

Progress! #1 reboots correctly.

A) I had reverted out the debug print patch, want me to add it back? Does it give you any extra insight?
B) Should I move on to #2 and #3?

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-22:

#27

No, this is enough. We now understand that disabling the bus master bit in the command-control register of the root port is causing a crash on your system.

I suspect that the firmware is talking to the PCIe bus in parallel and by disabling the bus master bit, we are breaking the FW.
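To make "disabling the bus master bit" concrete: Bus Master Enable is bit 2 of the 16-bit PCI Command register, and clearing it stops the device from initiating transactions of its own (for a root port, from forwarding upstream requests from the devices behind it). The sketch below is pure bit arithmetic on a hypothetical register value and does not touch hardware; the constant names mirror the kernel's `PCI_COMMAND_*` definitions, and the helper functions are illustrative only.

```python
# Bit positions in the 16-bit PCI Command register.
PCI_COMMAND_IO = 0x0001      # I/O space enable
PCI_COMMAND_MEMORY = 0x0002  # memory space enable
PCI_COMMAND_MASTER = 0x0004  # bus master enable


def is_bus_master(command):
    """True if the bus master enable bit is set in `command`."""
    return bool(command & PCI_COMMAND_MASTER)


def clear_bus_master(command):
    """Return `command` with only the bus master enable bit cleared."""
    return command & ~PCI_COMMAND_MASTER & 0xFFFF


# Hypothetical value: I/O, memory and bus mastering all enabled.
before = PCI_COMMAND_IO | PCI_COMMAND_MEMORY | PCI_COMMAND_MASTER
after = clear_bus_master(before)

print(f"before: {before:#06x} bus master: {is_bus_master(before)}")
print(f"after:  {after:#06x} bus master: {is_bus_master(after)}")
```

The other enable bits are left untouched, which is why the port still decodes accesses addressed to it while no longer passing DMA upstream.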

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-22:

#28

Can you also attach the messages you are seeing during shutdown/reboot? The driver clean up order could be important too.

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-05-22:

#29

Created attachment 276123
shutdown log

[`dmesg -n debug` added, otherwise normal systemd-obfuscated user messages]

This particular test machine is a MAAS server, 4 interfaces, 2 bonds, 2 bridges. It normally runs a KVM instance directly, but I don't have it set up to autoboot to save time while testing.

Functionally, the other machines tested don't have a common operational trait: OpenStack "smoosh" (nova-compute + n-c-c + neutron + swift + ceph etc in LXDs), a straight Apache archive server, a standby firewall. Actually, they all appear to be at least partially utilizing 10gige interfaces (hopefully that's not a consideration since I'm not sure if I can pull a straight gigabit machine out of active use to test on short notice).

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-22:

#30

Can you apply debug_patch.patch, plus

1. Comment out this line only.

https://elixir.bootlin.com/linux/v4.17-rc6/source/drivers/pci/pcie/portdrv_core.c#L412

and collect shutdown log one more time.

I see quite a bit of driver shutdown activity from your network adapters. I want to see them in reference to the port service driver shutdown to see which one is happening first and last.

Joseph Salisbury (jsalisbury) on 2018-05-22

Changed in linux (Ubuntu Bionic):
status:	Confirmed → Triaged
Changed in linux (Ubuntu):
status:	Confirmed → Triaged

In Linux Kernel Bug Tracker #199779, bhelgaas (bhelgaas-linux-kernel-bugs) wrote on 2018-05-22:

#31

I am not yet convinced that it is necessary for pcie_port_device_remove() to call pci_disable_device() on PCIe Root Ports and Switch Ports during a reboot.

A similar question came during discussion of pciehp timeouts during shutdown [1]. Eric Biederman had a good response [2] that I haven't had time to assimilate yet.

[1] https://<email address hidden>
[2] https://<email address hidden>

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-05-22:

#32

I think the motivation is for rogue transactions from the devices not to hit the system memory while a new kernel is booted via kexec.

It is not an issue when IOMMU is not present since the second kernel that is booting doesn't share the same address space.

However, when an IOMMU is present, an adapter can corrupt the newly booting kernel. So, you ideally want to have the bus master bit cleared for a clean boot.

What is interesting is that kexec is already doing this job in pci_device_shutdown(). This extra clear is unnecessary. I'll post a patch to remove it.

Ryan Finnie (fo0bar) wrote on 2018-05-22:

A patch has been submitted to linux-pci, and I've confirmed this fix works: https://lkml.org/lkml/2018/5/22/817

In Linux Kernel Bug Tracker #199779, okaya (okaya-linux-kernel-bugs) wrote on 2018-06-11:

#33

change merged to the 4.18 kernel:

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git/commit/?id=0d98ba8d70b0070ac117452ea0b663e26bbf46bf

This issue can be closed.

Quiksmage (quicksilverinc06) wrote on 2018-06-12:

hi, I just updated to 18.04 today and have started to see this message on reboot.
I am also on an HP DL380 Gen9.

It looks like everything has already been found :).

Pardon me for asking, but how long does a fix like this (ballpark estimate) usually take to get into an OS update?

Ryan Finnie (fo0bar) wrote on 2018-06-19:

The fix was bikeshedded a tiny bit on LKML, but is now accepted upstream and AIUI will be in linux-next soon: https://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi.git/commit/?id=0d98ba8d70b0070ac117452ea0b663e26bbf46bf

This change is tested as backwards compatible with Ubuntu 4.15, and would be appreciated for SRU.

In Linux Kernel Bug Tracker #199779, ryan (ryan-linux-kernel-bugs) wrote on 2018-06-19:

#34

Ack, thank you for all your help.

Joseph Salisbury (jsalisbury) on 2018-06-20

Changed in linux (Ubuntu):
status:	Triaged → In Progress
Changed in linux (Ubuntu Bionic):
status:	Triaged → In Progress
Changed in linux (Ubuntu):
assignee:	nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Bionic):
assignee:	nobody → Joseph Salisbury (jsalisbury)

Bug Watch Updater (bug-watch-updater) on 2018-06-21

Changed in linux:
importance:	Unknown → Medium
status:	Unknown → Fix Released

Joseph Salisbury (jsalisbury) wrote on 2018-06-21:

#35

I built a test kernel with commit 0d98ba8d70b0070ac117452ea0b663e26bbf46bf. The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1771467

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-modules, linux-modules-extra and linux-image-unsigned .deb packages.

Thanks in advance!

Quiksmage (quicksilverinc06) wrote on 2018-06-24:

#36

I wouldn't mind testing, but I'm not sure how :)

I'm on 18.04 LTS
with 4.15.0-23-generic

If you can give me some commands to try (as well as a command to revert), I have no problem trying.

Thanks!

Andreas Bininda (bininda) wrote on 2018-08-10:

#37

We run a DL360pg8
Same problem at reboot with kernel 4.15.0-30 (hang)

We can confirm that the fix in 4.15.0-23 fixes the bug

Thanks!

Andreas Bininda (bininda) wrote on 2018-08-10:

#38

We downloaded the fix from

http://kernel.ubuntu.com/~jsalisbury/lp1771467

and tested it.

Joseph Salisbury (jsalisbury) wrote on 2018-08-10:

#39

Thanks for testing! I'll submit an SRU request for that commit.

Joseph Salisbury (jsalisbury) wrote on 2018-08-10:

#40

https://lists.ubuntu.com/archives/kernel-team/2018-August/094654.html

description:	updated

Steffen Neumann (sneumann) wrote on 2018-08-15:

#41

Hi, we've been hit on ProLiant BL460c Gen9 with H244br
07:00.0 Serial Attached SCSI controller: Hewlett-Packard Company Smart Array Gen9 Controllers
and happy to test if a kernel *.deb is available. Yours, Steffen

Frank Brendel (faunsen) wrote on 2018-08-22:

#42

Happily this fixes NMIs on DL380p Gen8 during reboot too.

Junien F (axino) wrote on 2019-01-15:

#43

What's the status of the SRU for this bug? Thanks!

Joseph Salisbury (jsalisbury) on 2019-01-15

Changed in linux (Ubuntu):
assignee:	Joseph Salisbury (jsalisbury) → nobody
status:	In Progress → Confirmed
Changed in linux (Ubuntu Bionic):
status:	In Progress → Confirmed
assignee:	Joseph Salisbury (jsalisbury) → nobody

Quiksmage (quicksilverinc06) wrote on 2019-03-01:

#44

Was this one forgotten, haha? If I can help in any way, please let me know.
Thanks!

Kai-Heng Feng (kaihengfeng) wrote on 2019-03-04:

#45

It's included since 4.15.0-34.37.
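A quick way to check whether a given Bionic kernel already carries the fix is to compare its Ubuntu version against 4.15.0-34.37. The sketch below assumes the input is a version-signature string like the `ProcVersionSignature` line in the report (a bare `uname -r` value lacks the upload number and is rejected); the function names are hypothetical, and the comparison is only meaningful within the Bionic 4.15 series.

```python
import re

# "4.15.0-34.37", the first Bionic kernel reported above to include the fix.
FIXED_IN = (4, 15, 0, 34, 37)


def parse_ubuntu_kernel_version(text):
    """Extract (major, minor, patch, abi, upload) from a string such as
    'Ubuntu 4.15.0-21.22-generic 4.15.17'. Hypothetical helper."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no Ubuntu kernel version found in {text!r}")
    return tuple(int(group) for group in match.groups())


def has_fix(text):
    """True if the version in `text` is at or after FIXED_IN."""
    return parse_ubuntu_kernel_version(text) >= FIXED_IN


# The kernel from the original report predates the fix.
print(has_fix("Ubuntu 4.15.0-21.22-generic 4.15.17"))
print(has_fix("4.15.0-34.37"))
```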

Changed in linux (Ubuntu):
status:	Confirmed → Fix Released
Changed in linux (Ubuntu Bionic):
status:	Confirmed → Fix Released

Brad Figg (brad-figg) on 2019-07-24

tags:	added: cscc

Remote bug watches

linux-kernel-bugs #199779
[RESOLVED CODE_FIX]
