[Dell Latitude E6220] Crashes or exits with signal 9 when under heavy load

Bug #1016195 reported by Daniel Manrique
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   High         Unassigned
Precise          Fix Released   High         Unassigned

Bug Description

CID 201102-7188.

As part of certification testing, machines undergo a stress test to see how they perform under heavy load.

This particular system started being unstable with Ubuntu 12.04. On prior releases (I just retested with Oneiric and kernel 3.0.0) it's rock-solid and is able to complete the 2-hour stress test.

On 12.04, I've seen three possible failure scenarios (and so far, no success completing the test run):

1- The test runs for 2 hours but upon finishing it exits with signal 9.
2- The test runs for less than 2 hours, exiting because some of the workers got signal 9.
3- The system freezes before 2 hours have elapsed, requiring a reboot.

How to reproduce:
- Install Ubuntu 12.04
- apt-get install stress
- On a terminal, run:
  stress --cpu 4 --vm 4 --timeout 7200
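
The failure scenarios above differ mainly in how the `stress` process ends, so it can help to record the exit status explicitly. The following is a minimal sketch, not part of the original report: it wraps the same command line in Python's `subprocess` module and turns the return code into a verdict. The helper names `describe_exit` and `run_stress` are hypothetical. A negative return code means the process itself was killed by a signal (scenario 1); a failed worker (scenario 2) would more likely surface as a nonzero exit status, though that depends on how `stress` reports worker failures.

```python
import signal
import subprocess

# Same invocation as in the report; 7200 seconds is the 2-hour run.
STRESS_CMD = ["stress", "--cpu", "4", "--vm", "4", "--timeout", "7200"]


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable verdict."""
    if returncode == 0:
        return "completed normally"
    if returncode < 0:
        # subprocess reports death-by-signal as the negated signal number.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


def run_stress(cmd=None):
    """Run the stress command (hypothetical helper) and return (code, verdict)."""
    result = subprocess.run(cmd if cmd is not None else STRESS_CMD)
    return result.returncode, describe_exit(result.returncode)
```

Calling `run_stress()` on the affected machine and printing the verdict would give a log line to attach to the report; a full system freeze (scenario 3) would of course never reach that point.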

I'm not sure this bug belongs in the kernel, but I hope it's a good starting point.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.2.0-26-generic 3.2.0-26.41
ProcVersionSignature: Ubuntu 3.2.0-26.41-generic 3.2.19
Uname: Linux 3.2.0-26-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
ApportVersion: 2.0.1-0ubuntu10
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: PCH [HDA Intel PCH], device 0: STAC92xx Analog [STAC92xx Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 1658 F.... pulseaudio
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Card0.Amixer.info:
 Card hw:0 'PCH'/'HDA Intel PCH at 0x3ec60000 irq 44'
   Mixer name : 'Intel CougarPoint HDMI'
   Components : 'HDA:111d76e7,102804a9,00100102 HDA:80862805,80860101,00100000'
   Controls : 37
   Simple ctrls : 13
CurrentDmesg: [ 50.700559] eth0: no IPv6 routers present
Date: Thu Jun 21 14:52:49 2012
HibernationDevice: RESUME=UUID=8a63aac3-1a3a-4270-987b-94b087b5a02d
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425)
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
MachineType: Dell Inc. Latitude E6220
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.2.0-26-generic root=UUID=f870a30a-f09a-40ea-9334-7322d754e897 ro quiet splash initcall_debug vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.2.0-26-generic N/A
 linux-backports-modules-3.2.0-26-generic N/A
 linux-firmware 1.79
RfKill:
 0: dell-wifi: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
StagingDrivers: mei
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 03/28/2011
dmi.bios.vendor: Dell Inc.
dmi.bios.version: X29
dmi.board.name: 0862D8
dmi.board.vendor: Dell Inc.
dmi.board.version: X00
dmi.chassis.type: 9
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrX29:bd03/28/2011:svnDellInc.:pnLatitudeE6220:pvr01:rvnDellInc.:rn0862D8:rvrX00:cvnDellInc.:ct9:cvr:
dmi.product.name: Latitude E6220
dmi.product.version: 01
dmi.sys.vendor: Dell Inc.

Daniel Manrique (roadmr) wrote :
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Are there any messages written to the console or /var/log/syslog when the crash happens?

Joseph Salisbury (jsalisbury) wrote :

Also, as usual, can you check whether this happens with the mainline kernel as well:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc3-quantal/

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Daniel Manrique (roadmr) wrote :

Hi Joseph!

Sure thing, I just ran the test and it *does* happen on the mainline kernel (3.5-rc3). It crashed completely :(

I'm now doing another run to try to see if there are any messages in syslog, as there were none on the console.

Daniel Manrique (roadmr) wrote :

Hi,

On my retest the system crashed again. I put a marker in the syslog so I could identify any kernel messages triggered by the test. This is what I see:

Jun 22 11:46:04 201102-7188 ubuntu: --- ROADMR RAN STRESS TEST HERE ---
Jun 22 11:46:15 201102-7188 kernel: [ 9367.450012] CPU3: Package power limit notification (total events = 11)
Jun 22 11:46:15 201102-7188 kernel: [ 9367.450016] CPU1: Package power limit notification (total events = 11)
Jun 22 11:46:15 201102-7188 kernel: [ 9367.450018] CPU2: Package power limit notification (total events = 11)
Jun 22 11:46:15 201102-7188 kernel: [ 9367.450021] CPU0: Package power limit notification (total events = 11)
Jun 22 11:46:15 201102-7188 kernel: [ 9367.463064] CPU1: Package power limit normal
Jun 22 11:46:15 201102-7188 kernel: [ 9367.463066] CPU3: Package power limit normal
Jun 22 11:46:15 201102-7188 kernel: [ 9367.463067] CPU2: Package power limit normal
Jun 22 11:46:15 201102-7188 kernel: [ 9367.463069] CPU0: Package power limit normal
Jun 22 14:28:23 201102-7188 kernel: imklog 5.8.6, log source = /proc/kmsg started.

So a few seconds after starting the run I get the package power limit messages, and then the system crashed. The timestamp shows I reset it about 2:45 hours after the last power limit message, but I think it crashed long before I noticed (I was out having lunch).
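
The marker technique used above can be automated. Below is a minimal sketch, assuming the syslog has already been read into a list of lines; the function names `kernel_lines_after_marker` and `count_power_limit_events` are hypothetical and not part of any existing tool. It returns the kernel messages that follow the most recent marker line and counts the package power limit notifications among them.

```python
def kernel_lines_after_marker(syslog_lines, marker):
    """Return kernel messages logged after the last line containing marker."""
    start = None
    for index, line in enumerate(syslog_lines):
        if marker in line:
            start = index  # keep the most recent marker
    if start is None:
        return []
    return [line for line in syslog_lines[start + 1:] if " kernel: " in line]


def count_power_limit_events(lines):
    """Count 'Package power limit notification' messages in the given lines."""
    return sum(1 for line in lines if "Package power limit notification" in line)


# Shortened excerpt of the log quoted in this comment.
sample = [
    "Jun 22 11:46:04 201102-7188 ubuntu: --- ROADMR RAN STRESS TEST HERE ---",
    "Jun 22 11:46:15 201102-7188 kernel: [ 9367.450012] CPU3: Package power limit notification (total events = 11)",
    "Jun 22 11:46:15 201102-7188 kernel: [ 9367.463064] CPU1: Package power limit normal",
]
after = kernel_lines_after_marker(sample, "ROADMR RAN STRESS TEST HERE")
```

On a real system the input would come from `/var/log/syslog`, and the marker would be whatever string was written to the log before starting the run.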

tags: added: performing-bisect
Joseph Salisbury (jsalisbury) wrote :

Hi Daniel,

Can you test the Oneiric[0] kernel to confirm this is a regression?

Thanks in advance!

[0] https://launchpad.net/~canonical-kernel-team/+archive/ppa/+build/3570193

tags: added: needs-bisect
removed: performing-bisect
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Kate Stewart (kate.stewart) wrote :

Make it explicit that this is a top 3 blocking bug for hardware certification, and make sure it is targeted to the release/milestone properly.

Changed in linux (Ubuntu Precise):
status: New → Incomplete
importance: Undecided → High
milestone: none → ubuntu-12.04.1
Daniel Manrique (roadmr) wrote :

Hi Joseph,

I'm retesting with the kernel you pointed me to (3.0.0-22.36). However, as I mentioned in the original report, I did go back and test with Oneiric, and the system was able to complete a 2-hour stress test run without problems. I'm not sure this was with the latest 3.0.0-series kernel, so I'll repeat the test and report back in about 2 hours.

Daniel Manrique (roadmr) wrote :

Hi Joseph,

I tested both 3.0.0-12.20 and 3.0.0-22.36, and both seem solid: they are able to complete the 2-hour test run without crashing. I tried each kernel twice. So it does look like a regression :(

Setting back to Confirmed.

Thanks!

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Precise):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

We can perform a bisect to identify the commit that caused this regression. Can you test the following kernels:

v3.1-rc2: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.1-rc2-oneiric/
v3.2-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2-rc1-oneiric/
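
The idea behind testing intermediate kernels is a binary search over an ordered list of versions. The following is only an illustrative sketch: the function `first_bad` is hypothetical, and the pass/fail outcomes for v3.1-rc2 and v3.2-rc1 in the example are made up, since this thread never reports conclusive results for them. It assumes the first kernel in the list is known good, the last is known bad, and there is a single good-to-bad transition.

```python
def first_bad(kernels, is_bad):
    """Binary-search an ordered list for the first kernel where is_bad is true.

    Assumes kernels[0] is known good, kernels[-1] is known bad, and the
    results switch from good to bad exactly once along the list.
    """
    lo, hi = 0, len(kernels) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(kernels[mid]):
            hi = mid
        else:
            lo = mid
    return kernels[hi]


# Versions mentioned in this thread, oldest first.
kernels = ["3.0.0-22.36", "v3.1-rc2", "v3.2-rc1", "3.2.0-26.41"]

# HYPOTHETICAL outcomes, for illustration only; not results from this bug.
hypothetical_results = {"v3.1-rc2": False, "v3.2-rc1": True}

culprit = first_bad(kernels, lambda k: hypothetical_results[k])
```

In practice each `is_bad` call corresponds to booting that kernel and running the 2-hour stress test, which is why narrowing the range with a few release candidates first saves a lot of time before moving to a commit-level bisect.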

Joseph Salisbury (jsalisbury) wrote :

Also, since this is an upstream bug, would it be possible for you to open an upstream bug report at bugzilla.kernel.org [1]?

[1] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Changed in linux (Ubuntu Precise):
status: Confirmed → Incomplete
Daniel Manrique (roadmr) wrote :

Hi Joseph,

I need to redo a few of the tests on this machine. I just found out that, for instance, with the 3.1-rc2 kernel you asked me to test, the GUI becomes completely unresponsive, but I'm able to SSH into the system. From an end user's point of view this is immaterial, as the system becomes unusable and needs to be restarted (restarting just lightdm didn't work), but from our debugging perspective it may indicate a different cause.

I'll get back to you when I've done this confirmation with all the involved kernels: 3.0, 3.2 (precise), 3.5 mainline, 3.1-rc2 and 3.2-rc1.

Thanks! (still working!)

Changed in linux (Ubuntu Precise):
milestone: ubuntu-12.04.1 → ubuntu-12.04.2
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Ara Pulido (ara)
tags: added: regression-release
Colin Watson (cjwatson)
Changed in linux (Ubuntu Precise):
milestone: ubuntu-12.04.2 → ubuntu-12.04.3
Lukasz (lukaszek130388)
Changed in linux (Ubuntu Precise):
status: Incomplete → Confirmed
status: Confirmed → Fix Released
Changed in linux (Ubuntu):
status: Expired → Fix Released