CPU lockup on HP Proliant DL380 Gen9 servers

Bug #1500739 reported by Brad Marshall on 2015-09-29
This bug affects 10 people
Affects Status Importance Assigned to Milestone
linux (Ubuntu)

Bug Description

Over the past 3-ish weeks we've had 3 separate HP Proliant DL380 Gen9 servers lock up with a similar-looking CPU lockup bug. All 3 of these servers are nova-compute nodes in an OpenStack cluster, with a reasonable amount of load on them. The symptoms are that the load shoots up into the hundreds and ps stops returning.

I've attached lspci -vnvn and 3 sets of syslog message traces that we grabbed on each of the 3 times it has crashed.

$ cat /proc/version_signature
Ubuntu 3.16.0-49.65~14.04.1-generic 3.16.7-ckt15

Please let us know if you need any further information.
 total 0
 crw-rw---- 1 root audio 116, 1 Sep 29 06:47 seq
 crw-rw---- 1 root audio 116, 33 Sep 29 06:47 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.15
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=UUID=c2111986-9219-44f2-ad09-8d367ce8b30b
MachineType: HP ProLiant DL380 Gen9
Package: linux (not installed)

 PATH=(custom, no user)
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.16.0-49-generic root=UUID=038cefb6-3559-4104-9754-5308497961f5 ro console=tty0 console=ttyS1,115200 BOOTIF=01-3c:a8:2a:23:f5:00 quiet
ProcVersionSignature: Ubuntu 3.16.0-49.65~14.04.1-generic 3.16.7-ckt15
 linux-restricted-modules-3.16.0-49-generic N/A
 linux-backports-modules-3.16.0-49-generic N/A
 linux-firmware 1.127.15
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.16.0-49-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dialout libvirtd lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 05/06/2015
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd05/06/2015:svnHP:pnProLiantDL380Gen9:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL380 Gen9
dmi.sys.vendor: HP

Brad Marshall (brad-marshall) wrote :

The first of the lockups. These all required us to hard reset the servers via the iLO.

Brad Marshall (brad-marshall) wrote :

2nd of the lockups.

Brad Marshall (brad-marshall) wrote :

3rd of the lockups.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1500739

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.3 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3-rc3-unstable/

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key vivid
Brad Marshall (brad-marshall) wrote :

Unfortunately we're unable to test the latest upstream kernel in this situation. These servers are running a production system, and as they use bcache we need a kernel that supports it.

apport information

tags: added: apport-collected trusty uec-images
description: updated

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Junien Fridrick (axino) wrote :

FYI I just opened #1505564, which is very similar and probably a duplicate.

Chris J Arges (arges) wrote :

Initially this looked similar to bug 1413540.
The fix for that bug patched 3.13 with commit 9242b5b to _mitigate_ the issue, but that patch is already present in 3.16, so perhaps we're hitting another failure mode.
It would be good to know whether the smp_call_function_* path in the backtrace actually leads up to an IPI that gets lost, leaving us spinning in csd_lock_wait.

Are you running nested KVM instances? How often does this lockup occur? Can you get crashdumps of this issue?

--chris j arges
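
As a rough way to check the attached syslog traces for the pattern asked about above, the following sketch counts soft-lockup reports per CPU and the backtrace lines that mention the smp_call_function_* path. The function name lockup_summary is made up for illustration, and the "soft lockup - CPU#" pattern assumes the traces use the usual kernel watchdog wording; adjust it if the attached logs differ.

```shell
# Hypothetical helper: summarise soft-lockup reports in a kernel log file.
# Usage: lockup_summary /path/to/syslog
lockup_summary() {
    log="$1"
    # Count "soft lockup" reports per CPU (assumes the watchdog message
    # contains the text "soft lockup - CPU#<n>").
    grep -o 'soft lockup - CPU#[0-9]*' "$log" | sort | uniq -c
    # Count lines that mention the cross-CPU function call path.
    printf 'smp_call_function lines: %s\n' "$(grep -c 'smp_call_function' "$log")"
}
```

This only shows how often the path appears in the traces; it cannot tell whether an IPI was actually lost. For that a crashdump is still needed, and setting the kernel.softlockup_panic sysctl to 1 is one way to make a soft lockup panic the machine so that an already configured kdump can capture it.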

tags: added: kernel-key
removed: kernel-da-key
Junien Fridrick (axino) wrote :

I uploaded a crashdump of a very similar issue in LP#1505564, FYI.

tags: added: kernel-da-key
removed: kernel-key
Neale Pickett (neale) wrote :

Next time this happens, please check the iLO for a massive backlog of nbd kernel messages. I suspect nbd's insane logging rate (tens of thousands of lines per second) is endlessly growing the serial console backlog. "dmesg -D" appears to be a quick way to fix machines in this state, or to prevent it from ever happening. There is probably a more elegant solution (such as getting the nbd module to calm down).
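
A minimal sketch of the "dmesg -D" workaround mentioned above, assuming util-linux's dmesg (which provides -D to turn console logging off and -E to turn it back on) and root privileges. The function names are made up for illustration. The first field of /proc/sys/kernel/printk is the current console loglevel, so it can be read before and after to see whether kernel messages are still being sent to the console.

```shell
# Hypothetical helper: print the current console loglevel.
# The optional file argument exists only so the function can be tried
# on a sample file; by default it reads the real sysctl.
console_loglevel() {
    awk '{ print $1 }' "${1:-/proc/sys/kernel/printk}"
}

# Hypothetical helper: stop kernel messages from being printed to the
# console (including the serial console), then show the new loglevel.
# Needs root. Undo with `dmesg -E`.
quiet_console() {
    dmesg -D
    console_loglevel
}
```

Messages still reach the kernel ring buffer and syslog after this; only printing to the console is disabled.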
