(docker/lxc) container restart causes kernel to lockup

Bug #1275809 reported by fish
This bug affects 2 people
Affects         Status   Importance  Assigned to  Milestone
linux (Ubuntu)  Triaged  High        Unassigned

Bug Description

After restarting some 'ghost' docker containers on precise with the raring-lts kernel, the kernel locks up and shows:

    [1095015.392057] BUG: soft lockup - CPU#0 stuck for 22s! [gunicorn:12804]
    ... (for each core)
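For anyone collecting these messages across machines, here is a minimal Python sketch that pulls the CPU, duration, process name and PID out of soft lockup lines. The regular expression is modelled only on the single line quoted above; the function name `find_soft_lockups` is made up for this example, and other kernel versions may word the message differently.

```python
import re

# Modelled on the one line quoted in this report, e.g.:
# [1095015.392057] BUG: soft lockup - CPU#0 stuck for 22s! [gunicorn:12804]
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def find_soft_lockups(log_text):
    """Return one dict per soft lockup line found in log_text."""
    events = []
    for match in SOFT_LOCKUP_RE.finditer(log_text):
        events.append({
            "timestamp": float(match.group("ts")),
            "cpu": int(match.group("cpu")),
            "stuck_seconds": int(match.group("secs")),
            "comm": match.group("comm"),
            "pid": int(match.group("pid")),
        })
    return events


if __name__ == "__main__":
    sample = (
        "[1095015.392057] BUG: soft lockup - CPU#0 stuck for 22s! "
        "[gunicorn:12804]"
    )
    for event in find_soft_lockups(sample):
        print(event)
```

Feeding it the saved output of `dmesg` or a serial console log should show whether every core reports the lockup and which processes were running at the time.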

Here is the original, more docker focused bug report: https://github.com/dotcloud/docker/issues/3873

I could reproduce this bug with various kernel versions. I've set the softlockup_panic=1 kernel parameter to get some stack traces. See this gist for stack traces from the 3.5 and 3.8 kernels (will add 3.11 any minute): https://gist.github.com/discordianfish/7886354d9a19b2084775
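To confirm that a machine actually booted with softlockup_panic=1, the kernel command line can be checked. The following is a rough Python sketch, not part of the gist; the helper names `parse_cmdline` and `softlockup_panic_enabled` are invented for this example, and the parser deliberately ignores quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into a {name: value} dict.

    Flags without '=' (such as 'ro') map to None. A repeated name
    (such as 'console') keeps the last value seen. Quoted values
    containing spaces are not handled in this simplified sketch.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def softlockup_panic_enabled(cmdline):
    """True if the command line contains softlockup_panic=1."""
    return parse_cmdline(cmdline).get("softlockup_panic") == "1"


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel.
    with open("/proc/cmdline") as f:
        print(softlockup_panic_enabled(f.read()))
```

Run against the ProcKernelCmdLine value recorded further down in this report, `softlockup_panic_enabled` returns True.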

It also contains a small script to reproduce this, although so far I could only reproduce it on our Dell R710 systems, not in a Vagrant VM.
---
AlsaDevices:
 total 0
 crw-rw---T 1 root audio 116, 1 Feb 3 09:45 seq
 crw-rw---T 1 root audio 116, 33 Feb 3 09:45 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu17.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: [Errno 2] No such file or directory
CRDA: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 12.04
HibernationDevice: RESUME=UUID=e8d3c6ec-f202-480c-ac43-52ffbbbcb393
IwConfig: Error: [Errno 2] No such file or directory
MachineType: Dell Inc. PowerEdge R710
MarkForUpload: True
Package: linux (not installed)
PciMultimedia:

ProcFB: 0 VESA VGA
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.8.0-35-generic root=/dev/mapper/ubuntu--vg-root ro console=tty0 console=ttyS1,115200 softlockup_panic=1
ProcVersionSignature: Ubuntu 3.8.0-35.52~precise1-generic 3.8.13.13
RelatedPackageVersions:
 linux-restricted-modules-3.8.0-35-generic N/A
 linux-backports-modules-3.8.0-35-generic N/A
 linux-firmware 1.79.9
RfKill: Error: [Errno 2] No such file or directory
Tags: precise
Uname: Linux 3.8.0-35-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

dmi.bios.date: 07/24/2012
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 6.3.0
dmi.board.name: 0YDJK3
dmi.board.vendor: Dell Inc.
dmi.board.version: A09
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr6.3.0:bd07/24/2012:svnDellInc.:pnPowerEdgeR710:pvr:rvnDellInc.:rn0YDJK3:rvrA09:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R710
dmi.sys.vendor: Dell Inc.

Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1275809

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key kernel-stable-key
Joseph Salisbury (jsalisbury) wrote :

Do you know if this is a regression? Was there a prior kernel version that did not exhibit this bug?

Also, it would be good to know if the latest mainline kernel also has the bug. It can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.14-rc1-trusty/

fish (discordianfish) wrote :

The systems are new, so I'm not aware of any state where this doesn't happen. I'll try the mainline kernel soon and will see if I can reproduce it there as well.

fish (discordianfish) wrote :

Looks like the 3.14 mainline kernel has no support for aufs, so I can't reproduce it there with those (aufs-based) containers.

fish (discordianfish) wrote :

I could *not* reproduce this issue on my laptop, so it might be specific to some aspect of our servers. Those are Dell PowerEdge R710 with Intel(R) Xeon(R) CPU L5520 @ 2.27GHz and 24GB RAM.

fish (discordianfish) wrote : AcpiTables.txt

apport information

tags: added: apport-collected
description: updated
fish (discordianfish) wrote : BootDmesg.txt

apport information

fish (discordianfish) wrote : CurrentDmesg.txt

apport information

fish (discordianfish) wrote : Lspci.txt

apport information

fish (discordianfish) wrote : Lsusb.txt

apport information

fish (discordianfish) wrote : ProcCpuinfo.txt

apport information

fish (discordianfish) wrote : ProcEnviron.txt

apport information

fish (discordianfish) wrote : ProcInterrupts.txt

apport information

fish (discordianfish) wrote : ProcModules.txt

apport information

fish (discordianfish) wrote : UdevDb.txt

apport information

fish (discordianfish) wrote : UdevLog.txt

apport information

fish (discordianfish) wrote : WifiSyslog.gz

apport information

Andy Whitcroft (apw)
Changed in linux (Ubuntu):
status: Incomplete → Triaged
fish (discordianfish) wrote :

Let me know if there is anything I can do to help.

fish (discordianfish) wrote :

I tried to reproduce it with the same containers using docker's btrfs driver instead of the default aufs driver, and I couldn't. So it might be an issue with aufs.

fish (discordianfish) wrote :

I just had the same issue when rebooting the system although aufs wasn't directly involved (it was loaded but not used). I'll blacklist it now and see if it happens again.
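
The "loaded but not used" state can be read from /proc/modules, where the third field of each line is the module's reference count. Below is a small Python sketch of that check; the helper name `module_status` is invented for this example, and the sample line is illustrative rather than taken from the attached ProcModules.txt.

```python
def module_status(proc_modules_text, name):
    """Return (loaded, refcount) for a module, given /proc/modules text.

    Each line has the form: name size refcount dependents state address.
    refcount is None when the kernel prints '-' instead of a number.
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == name:
            refcount = int(fields[2]) if fields[2].isdigit() else None
            return True, refcount
    return False, None


if __name__ == "__main__":
    # Illustrative sample only; on a real system read /proc/modules.
    sample = "aufs 202783 0 - Live 0xffffffffa0123000\n"
    print(module_status(sample, "aufs"))
```

A result of `(True, 0)` would correspond to the situation described here: the module is present in the kernel but nothing currently holds a reference to it.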

Jérôme Petazzoni (jerome-petazzoni) wrote :

For what it's worth, some bugs can be easier to reproduce on machines with lots of cores (that might explain why you couldn't reproduce it on your local laptop).

I recall that bug #1011792 never happened on our local 4-core VM, but the same workload would lock up an 8-core VM in a few hours.
