Complete System Halt on HP Proliant 380e Gen8 since 5.4.0-40

Bug #1886057 reported by menschentier
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Confirmed
Undecided
Unassigned

Bug Description

I experienced two complete system freezes today after updating to new kernel 5.4.0-40

Additional Information: I bought a new (used) server about two months ago which replaced an older HP Workstation used as server. Before, I used to install ubuntu mainline kernel regularly, mainly out of my own interest.
Being lazy, I did not set up the server with a new OS but used the old Ubuntu HDD in the new server - it booted up just fine. But then out of a sudden, freezes started - complete system halts, no direct console access possible, not even a message on the screen what happened, simply a complete halt.

I can access the server via ilo - ilo tells me that there are no errors, after resetting, it boots up like normal, but before reset operating system is entirely stalled.

I solved the problem uninstalling any mainline kernels and sticking to kernel version shipped with ubuntu 20.04 LTS, which made the system rock solid (rebooting server for new kernel versions is about every x days, so we can argue what rock solid is, it did not have an uptime of 100 days) but while using 5.4. kernel branch shipped with focal, I had no single freeze at all - until today, twice, one about 10 hours after booting into 5.4.40, second one 1 hour after first one.

I now went back to 5.4.39 and until now - it is stable again without any freeze.

I have no real idea what this could be, my assumption while not using mainline kernel was, that it has something to do with migitations against cpu bugs. So if my theory is correct it could be something already implemented in more recent kernel versions but not implemented into 5.4 before 5.4.40.

I did not find any bug report similar to this - just hoping it is not an extremely special case concerning this server with this cpu...

If I can assist in any way - just drop me a note.

Thank you in advance
Simon

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: linux-image-5.4.0-39-generic 5.4.0-39.43
ProcVersionSignature: Ubuntu 5.4.0-39.43-generic 5.4.41
Uname: Linux 5.4.0-39-generic x86_64
ApportVersion: 2.20.11-0ubuntu27.3
Architecture: amd64
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CasperMD5CheckResult: skip
Date: Thu Jul 2 15:46:41 2020
HibernationDevice: RESUME=UUID=59222355-1327-4698-8eba-bcb35fa38b74
MachineType: HP ProLiant DL380e Gen8
ProcFB: 0 mgag200drmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.4.0-39-generic root=UUID=56e35c58-7c2f-4804-9fff-736a2d3ab35d ro
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-5.4.0-39-generic N/A
 linux-backports-modules-5.4.0-39-generic N/A
 linux-firmware 1.187.1
RfKill:

SourcePackage: linux
UpgradeStatus: Upgraded to focal on 2020-06-01 (31 days ago)
dmi.bios.date: 05/24/2019
dmi.bios.vendor: HP
dmi.bios.version: P73
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP73:bd05/24/2019:svnHP:pnProLiantDL380eGen8:pvr:cvnHP:ct23:cvr:
dmi.product.family: ProLiant
dmi.product.name: ProLiant DL380e Gen8
dmi.product.sku: D0E39A
dmi.sys.vendor: HP
modified.conffile..etc.logrotate.d.apport: [modified]
mtime.conffile..etc.logrotate.d.apport: 2016-01-02T14:22:51.291889

Revision history for this message
menschentier (srockelmann) wrote :
Revision history for this message
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
DaveTickem (dave-tickem) wrote :

Same symptoms here - on a [past history] stable system.

Experienced a system lockup with linux-image-5.4.0-40-generic yesterday, while working at an x terminal ssh'ing into a remote host. Have rolled back to linux-image-5.4.0-39-generic now.

Completely different hardware to your spec - AMD A10 here, unfortunately no ILO to retrieve any useful crash information.

No messages/syslog "interesting" events reported in post-analysis. At the time there was no network response, mouse, kbd, sysreq, disk activity, complete freeze. Hard reboot required.

Stable now back on linux-image-5.4.0-39-generic.

Also moved from bfq io scheduler back to mq-deadline as I found lockup issue reports here too. ymmv.

See:

https://bugzilla.kernel.org/show_bug.cgi?id=205447

https://bugzilla.redhat.com/show_bug.cgi?id=1693965

https://bugzilla.redhat.com/show_bug.cgi?id=1767539

[ which is a shame as I switched to BFQ naievely to balance hdd load to avoid io starvation to "quieter" tasks ]

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Possible dupe of lp: #1886277?

Please follow the instructions to bisect the kernel...

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.