CPU lockup on HP Proliant DL380 Gen9 servers

Bug #1500739 reported by Brad Marshall
This bug affects 10 people
Affects: linux (Ubuntu)
Status: Confirmed
Importance: High
Assigned to: Unassigned
Milestone: (none)

Bug Description

Over the past three weeks or so, three separate HP ProLiant DL380 Gen9 servers have locked up with similar-looking CPU lockup traces. All three are nova-compute nodes in an OpenStack cluster, with a reasonable amount of load on them. The symptoms are that the load average shoots up into the hundreds and ps stops returning.
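For anyone wanting to catch the next occurrence early, the load symptom can be polled from /proc/loadavg without going through ps. A minimal sketch, assuming a POSIX shell; the function name load_exceeds and the threshold of 100 are hypothetical, not something from this report:

```shell
# Hypothetical watchdog helper: succeed (exit 0) when the 1-minute load
# average in a loadavg-format file is at or above a threshold.
# Usage: load_exceeds THRESHOLD [FILE]   (FILE defaults to /proc/loadavg)
load_exceeds() {
    threshold=$1
    file=${2:-/proc/loadavg}
    # The first whitespace-separated field is the 1-minute load average.
    read -r one_min _rest < "$file" || return 2
    # awk does the floating-point comparison; exit status 0 means "exceeded".
    awk -v l="$one_min" -v t="$threshold" 'BEGIN { exit !(l >= t) }'
}

# Example (commented out so that sourcing this file has no side effects):
#   load_exceeds 100 && logger "load average above 100, possible lockup"
```

Run from cron or a loop, this would at least timestamp when the load starts climbing, which may help line up the syslog traces with the onset of the lockup.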

I've attached the output of lspci -vnvn and three sets of syslog traces, one captured from each of the three crashes.

$ cat /proc/version_signature
Ubuntu 3.16.0-49.65~14.04.1-generic 3.16.7-ckt15

Please let us know if you need any further information.
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Sep 29 06:47 seq
 crw-rw---- 1 root audio 116, 33 Sep 29 06:47 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.15
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=UUID=c2111986-9219-44f2-ad09-8d367ce8b30b
MachineType: HP ProLiant DL380 Gen9
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.16.0-49-generic root=UUID=038cefb6-3559-4104-9754-5308497961f5 ro console=tty0 console=ttyS1,115200 BOOTIF=01-3c:a8:2a:23:f5:00 quiet
ProcVersionSignature: Ubuntu 3.16.0-49.65~14.04.1-generic 3.16.7-ckt15
RelatedPackageVersions:
 linux-restricted-modules-3.16.0-49-generic N/A
 linux-backports-modules-3.16.0-49-generic N/A
 linux-firmware 1.127.15
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.16.0-49-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dialout libvirtd lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 05/06/2015
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd05/06/2015:svnHP:pnProLiantDL380Gen9:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL380 Gen9
dmi.sys.vendor: HP

Brad Marshall (brad-marshall) wrote :

The first of the lockups. Each one required us to hard-reset the server via the iLO.

Brad Marshall (brad-marshall) wrote :

2nd of the lockups.

Brad Marshall (brad-marshall) wrote :

3rd of the lockups.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1500739

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.3 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3-rc3-unstable/

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key vivid
Brad Marshall (brad-marshall) wrote :

Unfortunately we're unable to test the latest upstream kernel in this situation. These servers are running a production system, and as they use bcache we need a kernel that supports it.
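
One way to check whether a candidate kernel supports bcache before booting it is to look for CONFIG_BCACHE in its config file. A minimal sketch, assuming the kernel package installs its config as /boot/config-<version> as Ubuntu kernels do; the function name kernel_has_bcache is hypothetical:

```shell
# Succeed when a kernel config file enables bcache, either built in (=y)
# or as a module (=m).
# Usage: kernel_has_bcache CONFIG_FILE
kernel_has_bcache() {
    grep -q -E '^CONFIG_BCACHE=(y|m)$' "$1"
}

# Example (commented out so that sourcing this file has no side effects):
#   kernel_has_bcache "/boot/config-$(uname -r)" && echo "bcache available"
```

This only answers the build-configuration question; it says nothing about whether it is acceptable to run a test kernel on a production node.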

Brad Marshall (brad-marshall) wrote : BootDmesg.txt

apport information

tags: added: apport-collected trusty uec-images
description: updated
Brad Marshall (brad-marshall) wrote : CurrentDmesg.txt

apport information

Brad Marshall (brad-marshall) wrote : IwConfig.txt

apport information

Brad Marshall (brad-marshall) wrote : Lspci.txt

apport information

Brad Marshall (brad-marshall) wrote : Lsusb.txt

apport information

Brad Marshall (brad-marshall) wrote : ProcCpuinfo.txt

apport information

Brad Marshall (brad-marshall) wrote : ProcInterrupts.txt

apport information

Brad Marshall (brad-marshall) wrote : ProcModules.txt

apport information

Brad Marshall (brad-marshall) wrote : UdevDb.txt

apport information

Brad Marshall (brad-marshall) wrote : UdevLog.txt

apport information

Brad Marshall (brad-marshall) wrote : WifiSyslog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Junien F (axino) wrote :

FYI I just opened #1505564, which is very similar and probably a duplicate.

Chris J Arges (arges) wrote :

Initially this looked similar to bug 1413540.
That bug patched 3.13 with commit 9242b5b to _mitigate_ the issue, but the patch is already present in 3.16, so perhaps we're hitting another failure mode.
It would be good to know if the smp_call_function_* path in the backtrace is actually leading up to an IPI call that gets lost, and thus we spin in csd_lock_wait.
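
To see whether the attached traces contain this path at all, the saved syslog files can be scanned for the two symbol names mentioned above. A minimal sketch; count_ipi_frames is a hypothetical helper, and a non-zero count only shows that the frames are present, not that an IPI was actually lost:

```shell
# Count the lines in a saved trace file that mention the IPI wait path.
# Usage: count_ipi_frames FILE
count_ipi_frames() {
    [ -r "$1" ] || return 2
    # grep -c prints 0 but exits 1 when nothing matches; mask that so
    # callers running under `set -e` still receive the count.
    grep -c -E 'csd_lock_wait|smp_call_function' "$1" || true
}

# Example (commented out; the file name is a placeholder):
#   count_ipi_frames lockup-trace.txt
```

Confirming a lost IPI would still need a crashdump, as asked below.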

Are you running nested KVM instances? How often does this lockup occur? Can you get crashdumps of this issue?

Thanks,
--chris j arges

tags: added: kernel-key
removed: kernel-da-key
Junien F (axino) wrote :

I uploaded a crashdump of a very similar issue in LP#1505564, FYI.

tags: added: kernel-da-key
removed: kernel-key
Neale Pickett (neale) wrote :

Next time this happens, please check the iLO for a massive backlog of nbd kernel messages. I suspect nbd's insane logging rate (tens of thousands of lines per second) is endlessly growing the serial console backlog. "dmesg -D" appears to be a quick way to fix machines in this state, or to prevent it from ever happening. There is probably a more elegant solution, such as getting the nbd module to calm down.
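
The console workaround above can be sketched as follows. It assumes the util-linux dmesg (which provides -D/--console-off and -E/--console-on) and root access; console_loglevel is a hypothetical helper that reads the first field of /proc/sys/kernel/printk:

```shell
# Print the current console log level: the first field of a file in
# /proc/sys/kernel/printk format.
# Usage: console_loglevel [FILE]   (FILE defaults to /proc/sys/kernel/printk)
console_loglevel() {
    file=${1:-/proc/sys/kernel/printk}
    read -r level _rest < "$file" || return 2
    printf '%s\n' "$level"
}

# Example use on an affected host, as root. Left commented out because it
# changes system state:
#   console_loglevel     # note the level before
#   dmesg -D             # stop printing kernel messages to the console
#   console_loglevel     # the level should now be lower
#   dmesg -E             # re-enable console output later
```

Messages still reach the kernel ring buffer and syslog with console output off, so the nbd flood itself would remain visible there.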
