Ubuntu
linux package

CPU lockup on HP Proliant DL380 Gen9 servers

Bug #1500739 reported by Brad Marshall on 2015-09-29

This bug affects 10 people

Affects         Status     Importance  Assigned to  Milestone
linux (Ubuntu)  Confirmed  High        Unassigned

Bug Description

Over the past 3-ish weeks we've had 3 separate HP ProLiant DL380 Gen9 servers lock up with a similar-looking CPU lockup bug. All 3 of these servers are nova-compute nodes in an OpenStack cluster, with a reasonable amount of load on them. The symptoms are that the load shoots up into the hundreds and ps stops returning.

I've attached lspci -vnvn and 3 sets of syslog message traces that we grabbed on each of the 3 times it has crashed.

$ cat /proc/version_signature
Ubuntu 3.16.0-49.65~14.04.1-generic 3.16.7-ckt15

Please let us know if you need any further information.
---
AlsaDevices:
total 0
crw-rw---- 1 root audio 116, 1 Sep 29 06:47 seq
crw-rw---- 1 root audio 116, 33 Sep 29 06:47 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.14.1-0ubuntu3.15
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 14.04
HibernationDevice: RESUME=UUID=c2111986-9219-44f2-ad09-8d367ce8b30b
MachineType: HP ProLiant DL380 Gen9
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
TERM=xterm
PATH=(custom, no user)
XDG_RUNTIME_DIR=<set>
LANG=en_US.UTF-8
SHELL=/bin/bash
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.16.0-49-generic root=UUID=038cefb6-3559-4104-9754-5308497961f5 ro console=tty0 console=ttyS1,115200 BOOTIF=01-3c:a8:2a:23:f5:00 quiet
ProcVersionSignature: Ubuntu 3.16.0-49.65~14.04.1-generic 3.16.7-ckt15
RelatedPackageVersions:
linux-restricted-modules-3.16.0-49-generic N/A
linux-backports-modules-3.16.0-49-generic N/A
linux-firmware 1.127.15
RfKill: Error: [Errno 2] No such file or directory
Tags: trusty uec-images
Uname: Linux 3.16.0-49-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dialout libvirtd lpadmin plugdev sambashare sudo
_MarkForUpload: True
dmi.bios.date: 05/06/2015
dmi.bios.vendor: HP
dmi.bios.version: P89
dmi.chassis.type: 23
dmi.chassis.vendor: HP
dmi.modalias: dmi:bvnHP:bvrP89:bd05/06/2015:svnHP:pnProLiantDL380Gen9:pvr:cvnHP:ct23:cvr:
dmi.product.name: ProLiant DL380 Gen9
dmi.sys.vendor: HP

Brad Marshall (brad-marshall) wrote on 2015-09-29:

lspci-vnvn output (193.5 KiB, text/plain)

Brad Marshall (brad-marshall) wrote on 2015-09-29:

CPU Lockup 1 (20.2 KiB, text/plain)

The first of the lockups. Each of them required us to hard reset the server via the iLO.

Brad Marshall (brad-marshall) wrote on 2015-09-29:

CPU Lockup 2 (95.1 KiB, text/plain)

The second of the lockups.

Brad Marshall (brad-marshall) wrote on 2015-09-29:

CPU Lockup 3 (36.9 KiB, text/plain)

The third of the lockups.

Brad Figg (brad-figg) wrote on 2015-09-29: Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1500739

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Joseph Salisbury (jsalisbury) wrote on 2015-09-29:

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.3 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.3-rc3-unstable/

Changed in linux (Ubuntu):
importance:	Undecided → High
tags:	added: kernel-da-key vivid

Brad Marshall (brad-marshall) wrote on 2015-09-29:

Unfortunately we're unable to test the latest upstream kernel in this situation. These servers are running a production system, and as they use bcache we need a kernel that supports it.

Brad Marshall (brad-marshall) wrote on 2015-09-30: BootDmesg.txt

BootDmesg.txt (99.8 KiB, text/plain)

apport information

tags:	added: apport-collected trusty uec-images
description:	updated

Brad Marshall (brad-marshall) wrote on 2015-09-30: CurrentDmesg.txt

CurrentDmesg.txt (183.3 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: IwConfig.txt

#10

IwConfig.txt (7.7 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: Lspci.txt

#11

Lspci.txt (82.9 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: Lsusb.txt

#12

Lsusb.txt (468 bytes, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: ProcCpuinfo.txt

#13

ProcCpuinfo.txt (31.9 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: ProcInterrupts.txt

#14

ProcInterrupts.txt (50.8 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: ProcModules.txt

#15

ProcModules.txt (5.0 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: UdevDb.txt

#16

UdevDb.txt (274.5 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: UdevLog.txt

#17

UdevLog.txt (526.2 KiB, text/plain)

apport information

Brad Marshall (brad-marshall) wrote on 2015-09-30: WifiSyslog.txt

#18

WifiSyslog.txt (463.7 KiB, text/plain)

apport information

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Junien F (axino) wrote on 2015-10-13:

#19

FYI I just opened #1505564, which is very similar and probably a duplicate.

Chris J Arges (arges) wrote on 2015-10-14:

#20

Initially this looked similar to bug 1413540.
That bug was addressed by patching 3.13 with 9242b5b to _mitigate_ the issue, but this patch is already present in 3.16. So perhaps we're hitting another failure mode.
It would be good to know if the smp_call_function_* path in the backtrace is actually leading up to an IPI call that gets lost, and thus we spin in csd_lock_wait.

Are you running nested KVM instances? How often does this lockup occur? Can you get crashdumps of this issue?

Thanks,
--chris j arges
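
One rough way to approach the backtrace question above is to count how often the symbols named in that comment show up in a saved trace. The Python sketch below is an illustration only: the function name `count_ipi_symbols`, the regular expression (which assumes frames are printed as `symbol+0xoffset/0xlength`), and the sample lines are all assumptions, and the sample is made up rather than copied from the attachments on this bug.

```python
import re
from collections import Counter

# Symbol names come from the comment above. The "symbol+0xoff/0xlen" frame
# layout is an assumption about how the kernel printed the backtrace.
SYMBOL_RE = re.compile(
    r"\b(smp_call_function_\w+|csd_lock_wait)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def count_ipi_symbols(trace_text):
    """Count occurrences of each IPI-related symbol in a kernel trace."""
    counts = Counter()
    for line in trace_text.splitlines():
        for match in SYMBOL_RE.finditer(line):
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample frames, written for illustration only.
SAMPLE = """\
 [<ffffffff810f0000>] smp_call_function_many+0x1a0/0x270
 [<ffffffff810f0100>] smp_call_function_single+0x90/0x120
 [<ffffffff81060000>] native_flush_tlb_others+0x2e/0x30
 [<ffffffff810f0000>] ? smp_call_function_many+0x1a0/0x270
"""

if __name__ == "__main__":
    for symbol, hits in count_ipi_symbols(SAMPLE).most_common():
        print(symbol, hits)
```

A high count only shows that CPUs were sitting in that path when the trace was taken. It does not by itself show that an IPI was lost, which is why a crashdump is still the more useful artifact.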

Joseph Salisbury (jsalisbury) on 2015-10-15

tags:

added: kernel-key
removed: kernel-da-key

Junien F (axino) wrote on 2015-10-28:

#21

I uploaded a crashdump of a very similar issue in LP#1505564, FYI.

Joseph Salisbury (jsalisbury) on 2015-12-01

tags:

added: kernel-da-key
removed: kernel-key

Neale Pickett (neale) wrote on 2016-01-20:

#22

Next time this happens, please check the iLO for a massive backlog of nbd kernel messages. I suspect it's nbd's insane logging rate (tens of thousands of lines per second) endlessly growing the serial console backlog. "dmesg -D" appears to be a quick way to fix machines in this state, or to prevent it from ever happening. There is probably a more elegant solution (such as getting the nbd module to calm down).
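
To check whether console output is already restricted on a host, one option is to read /proc/sys/kernel/printk, which holds four integers: the current console log level, the default message log level, the minimum console log level, and the default console log level. The Python sketch below parses that file. The names `PrintkLevels`, `parse_printk`, `read_printk`, and `console_is_quiet` are made up for illustration, and treating "current level at or below the minimum" as the state left behind by "dmesg -D" is an assumption about kernel behaviour that should be verified on the affected kernel.

```python
from typing import NamedTuple


class PrintkLevels(NamedTuple):
    """The four fields of /proc/sys/kernel/printk, in file order."""
    console: int
    default_message: int
    minimum_console: int
    default_console: int


def parse_printk(text):
    """Parse the whitespace-separated contents of /proc/sys/kernel/printk."""
    fields = text.split()
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d" % len(fields))
    return PrintkLevels(*(int(field) for field in fields))


def read_printk(path="/proc/sys/kernel/printk"):
    """Read and parse the live value (Linux only)."""
    with open(path) as handle:
        return parse_printk(handle.read())


def console_is_quiet(levels):
    """Assumption: console printing is effectively off at the minimum level."""
    return levels.console <= levels.minimum_console


if __name__ == "__main__":
    levels = read_printk()
    print(levels, "quiet" if console_is_quiet(levels) else "printing")
```

Running this before and after "dmesg -D" on a test machine would show whether the first field actually changes there. Messages still reach the kernel ring buffer either way, so dmesg itself continues to work.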
