[Lenovo Edge 11 AMD] system locks up completely running the "stress" tool

Bug #774947 reported by Jeff Lane 
This bug affects 1 person
Affects          Status         Importance   Assigned to      Milestone
linux (Ubuntu)   Fix Released   Medium       Colin Ian King
Natty            Fix Released   Medium       Colin Ian King
Oneiric          Fix Released   Medium       Colin Ian King

Bug Description

stress is a system test tool found in the Ubuntu repos. We use it during certification testing to load-test a system and make sure it will function under abnormally high usage.

On the Edge 11, the system simply locks up. There is no recovering it: it hard locks and requires a power cycle to get going again.
---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: SB [HDA ATI SB], device 0: CONEXANT Analog [CONEXANT Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 1381 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'SB'/'HDA ATI SB at 0xd0900000 irq 16'
   Mixer name : 'Conexant CX20582 (Pebble)'
   Components : 'HDA:14f15066,17aa21ca,00100302'
   Controls : 8
   Simple ctrls : 5
Card1.Amixer.info:
 Card hw:1 'HDMI'/'HDA ATI HDMI at 0xd0510000 irq 19'
   Mixer name : 'ATI RS690/780 HDMI'
   Components : 'HDA:1002791a,00791a00,00100000'
   Controls : 4
   Simple ctrls : 1
Card1.Amixer.values:
 Simple mixer control 'IEC958',0
   Capabilities: pswitch pswitch-joined penum
   Playback channels: Mono
   Mono: Playback [on]
Card29.Amixer.info:
 Card hw:29 'ThinkPadEC'/'ThinkPad Console Audio Control at EC reg 0x30, fw 87HT21WW-1.166000'
   Mixer name : 'ThinkPad EC 87HT21WW-1.166000'
   Components : ''
   Controls : 1
   Simple ctrls : 1
Card29.Amixer.values:
 Simple mixer control 'Console',0
   Capabilities: pswitch pswitch-joined penum
   Playback channels: Mono
   Mono: Playback [on]
DistroRelease: Ubuntu 11.04
HibernationDevice: RESUME=UUID=68903084-6df4-4d3c-9b27-4fc9d9db9f37
InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Release i386 (20110427.1)
MachineType: LENOVO 254522U
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=en_US:en
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-8-generic root=UUID=c1b22889-fe11-4c30-b4b6-6c53acb1922f ro quiet splash initcall_debug vt.handoff=7
ProcVersionSignature: Ubuntu 2.6.38-8.42-generic 2.6.38.2
RelatedPackageVersions:
 linux-restricted-modules-2.6.38-8-generic N/A
 linux-backports-modules-2.6.38-8-generic N/A
 linux-firmware 1.52
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
Tags: natty running-unity
Uname: Linux 2.6.38-8-generic i686
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm admin cdrom dialout lpadmin plugdev sambashare
dmi.bios.date: 01/10/2011
dmi.bios.vendor: LENOVO
dmi.bios.version: 87ET35WW (1.09 )
dmi.board.name: 254522U
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr87ET35WW(1.09):bd01/10/2011:svnLENOVO:pn254522U:pvrThinkPadEdge:rvnLENOVO:rn254522U:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 254522U
dmi.product.version: ThinkPad Edge
dmi.sys.vendor: LENOVO

Jeff Lane (bladernr) wrote : apport information

Attachments: AcpiTables.txt, AlsaDevices.txt, AplayDevices.txt, BootDmesg.txt, Card0.Amixer.values.txt, Card0.Codecs.codec.0.txt, Card1.Codecs.codec.0.txt, CurrentDmesg.txt, IwConfig.txt, Lspci.txt, Lsusb.txt, PciMultimedia.txt, ProcCpuinfo.txt, ProcCpuinfo_.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, UdevLog.txt, WifiSyslog.txt

tags: added: blocks-hwcert
tags: added: apport-collected natty running-unity
description: updated

Ara Pulido (ara) wrote :

Chris, could you please have a look at this one?

Changed in linux (Ubuntu):
assignee: nobody → Chris Van Hoof (vanhoof)
importance: Undecided → Medium
status: New → Triaged
Changed in linux (Ubuntu Natty):
status: New → Triaged
importance: Undecided → Medium
Chris Van Hoof (vanhoof) wrote :

@Colin -- Mind taking a peek at this one?

Changed in linux (Ubuntu):
assignee: Chris Van Hoof (vanhoof) → Colin King (colin-king)
Colin Ian King (colin-king) wrote :

Can I have some information on how stress is being run? Are there any specific test modes that cause the failure?

Changed in linux (Ubuntu):
status: Triaged → In Progress
Jeff Lane  (bladernr) wrote :

Colin:

The stress test uses the "stress" tool found in the package of the same name in universe.

so 'apt-get install stress' should get it for you.

As for the actual runtime parameters used:

stress --cpu `cpuinfo_resource | awk '/count:/ {print $2}'` --vm `awk '/MemTotal/ {num_vm = $2/262144; if (num_vm != int(num_vm)) num_vm = int(num_vm) + 1; print num_vm}' /proc/meminfo` --timeout 7200

From the man page for 'stress':
--cpu N spawn N workers spinning on sqrt()
--vm N spawn N workers spinning on malloc()/free()
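
The --vm count in the command above is MemTotal (reported in kB) divided by 262144 kB, i.e. 256 MiB, rounded up to the next whole number. A minimal Python sketch of the same arithmetic follows; the function name and the sample MemTotal value are hypothetical and were not taken from the Edge 11.

```python
import math

# 256 MiB expressed in kB: the divisor used in the awk expression above.
KB_PER_VM_WORKER = 262144


def vm_workers_from_meminfo(meminfo_text):
    """Return MemTotal / 262144 rounded up, mirroring the awk one-liner."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal"):
            total_kb = int(line.split()[1])
            return math.ceil(total_kb / KB_PER_VM_WORKER)
    raise ValueError("MemTotal not found")


# Hypothetical /proc/meminfo excerpt, for illustration only.
sample = "MemTotal:        1785012 kB\nMemFree:          123456 kB\n"
print(vm_workers_from_meminfo(sample))
```

On a real system the text would come from reading /proc/meminfo instead of the sample string.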

Colin Ian King (colin-king) wrote :

@Jeff, what does cpuinfo_resource do? Can you run:

cpuinfo_resource | awk '/count:/ {print $2}'
awk '/MemTotal/ {num_vm = $2/262144; if (num_vm != int(num_vm)) num_vm = int(num_vm) + 1; print num_vm}' /proc/meminfo

...and add the output to the bug so I can see what --cpu and --vm settings are being used. Thanks!

Colin Ian King (colin-king) wrote :

@Jeff, can you run the test in two phases, one with just a --cpu test and one with just a --vm stress test to see if it's CPU loading or memory traffic loading (hence overheating on the northbridge) that causes the failure?
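
The two-phase run requested above could be sketched as follows. This only builds the two command lines from the --cpu, --vm and --timeout options already quoted in this bug; the helper name and the worker counts passed to it are hypothetical.

```python
def stress_phases(cpu_workers, vm_workers, timeout_s):
    """Build one CPU-only and one VM-only stress command line."""
    return [
        ["stress", "--cpu", str(cpu_workers), "--timeout", str(timeout_s)],
        ["stress", "--vm", str(vm_workers), "--timeout", str(timeout_s)],
    ]


# Hypothetical worker counts; 7200 is the timeout used in the original run.
for argv in stress_phases(2, 7, 7200):
    print(" ".join(argv))
```

Each list could be passed to subprocess.run() one after the other, so that a lockup in either phase points at CPU load or at memory traffic respectively.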

Jeff Lane  (bladernr) wrote :

It just parses /proc/cpuinfo and generates some formatted output like this:

bladernr@klaatu:~/development/stats/checkbox/scripts$ /usr/share/checkbox/scripts/cpuinfo_resource
model_revision: 5
bogomips: 3192
model_version: 30
speed: 1597
count: 8
cache: 6291456
model_number: 6
platform: x86_64
other: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
model: Intel(R) Core(TM) i7 CPU Q 720 @ 1.60GHz
type: GenuineIntel
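
For reference, the awk filter in the stress command simply picks the value on the "count:" line of this output. A rough Python equivalent, assuming the "key: value" layout shown above (the function name is hypothetical and the sample is a shortened copy of that output):

```python
def parse_resource_output(text):
    """Split 'key: value' lines into a dict, skipping lines without a colon."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Shortened copy of the cpuinfo_resource output shown above.
sample = """model_revision: 5
speed: 1597
count: 8
platform: x86_64
"""

cpu_workers = int(parse_resource_output(sample)["count"])
print(cpu_workers)
```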

As for re-running, Marc Legris will have to do that, as the system is in Lexington. Otherwise, I'd be happy to.

Jeff Lane  (bladernr) wrote :

I should add that this means --cpu is being set to something like 2 or 4, whatever the core count is for that system (including hyperthreading).

Marc Legris (maaarc) wrote :

Rerunning the stress test. Interesting note: it seems to cause wifi to lose its connection every 5 minutes or so.

Colin Ian King (colin-king) wrote :

Hi there, could someone retry the CPU stress test with one of the kernels in:

http://zinc.canonical.com/~cking/lp-774947

Thanks.

Chris Van Hoof (vanhoof)
Changed in linux (Ubuntu Natty):
assignee: nobody → Marc Legris (maaarc)
status: Triaged → Incomplete
Colin Ian King (colin-king) wrote :

Anyone care to re-test with this new kernel? Thanks.

Marc Legris (maaarc) wrote :

Colin -- retested with the 2.6.38-10-generic kernel, no issues were observed. Tried stress runs with cpu, vm, and then both cpu and vm.

Changed in linux (Ubuntu Natty):
status: Incomplete → Confirmed
Colin Ian King (colin-king) wrote :

Seems like the machine was overheating and stopping because of a bug in the C state selection when stress loading the CPU.

Colin Ian King (colin-king) wrote :

FYI, the kernel that was tested used the upstream patch "cpuidle: menu: fixed wrapping timers at 4.294 seconds" (merge commit f310642123e0d32d919c60ca3fab5acd130c4ba3, also found as 2.6.38.y stable commit 4a1163dff6592dcee594b2bee597aafd749b93ee).

Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Natty):
status: Confirmed → Fix Committed
Tim Gardner (timg-tpi)
Changed in linux (Ubuntu Oneiric):
status: In Progress → Fix Released
Changed in linux (Ubuntu Natty):
assignee: Marc Legris (maaarc) → Chris Van Hoof (vanhoof)
Chris Van Hoof (vanhoof)
Changed in linux (Ubuntu Natty):
assignee: Chris Van Hoof (vanhoof) → Colin King (colin-king)
Chris Van Hoof (vanhoof) wrote :

Marking as fix released as 2.6.38-10 was released today

Changed in linux (Ubuntu Natty):
status: Fix Committed → Fix Released
Herton R. Krzesinski (herton) wrote :

This bug is missing an SRU Justification; please update it as explained at
https://wiki.ubuntu.com/KernelTeam/StableHandbook/StableProcess#Workflow_for_SRU_Patches

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-natty' to 'verification-done-natty'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-natty
Colin Ian King (colin-king) wrote :

SRU Request:

When running the 'stress' stress testing tool, a Lenovo Edge 11 locks
up because of CPU overheating. This is fixed using the upstream patch
"cpuidle: menu: fixed wrapping timers at 4.294 seconds" (merge
commit f310642123e0d32d919c60ca3fab5acd130c4ba3 and 2.6.38.y stable
commit 4a1163dff6592dcee594b2bee597aafd749b93ee).

This fixes errors in predicted sleep times which selected
incorrect C states, causing increased power consumption and on some
machines overheating and critical thermal shutdown. The bug also
broke cpuidle state residency statistics which made it hard to spot
this bug.

With this fix, the Lenovo runs the stress test without locking up.
The fix also reduces overall power consumption and hence extends
battery life.
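
The "4.294 seconds" in the patch title matches the point at which a nanosecond count overflows an unsigned 32-bit integer (2^32 ns is about 4.294967 s). The sketch below only illustrates that arithmetic. It assumes nanosecond values were being held in a 32-bit variable, which is inferred from the patch title rather than stated in this report, and none of the names come from the kernel source.

```python
U32_MASK = 0xFFFFFFFF          # largest value an unsigned 32-bit integer holds
NSEC_PER_SEC = 1_000_000_000


def wrapped_ns(seconds):
    """Nanoseconds for a whole number of seconds, truncated to 32 bits."""
    return (seconds * NSEC_PER_SEC) & U32_MASK


# Wrap point in seconds (2**32 nanoseconds).
print((U32_MASK + 1) / NSEC_PER_SEC)

# A 4 s interval still fits; a 5 s interval wraps to well under a second.
for seconds in (4, 5):
    print(seconds, wrapped_ns(seconds) / NSEC_PER_SEC)
```

Under that assumption, any interval longer than about 4.29 s would appear far shorter than it really is, which is consistent with the mispredicted sleep times described above.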

Chris Van Hoof (vanhoof)
Changed in linux (Ubuntu Natty):
status: Fix Released → Fix Committed
Herton R. Krzesinski (herton) wrote :

OK, it seems the same fix came in via the 2.6.38.8 stable update. Because we applied it by hand before the 2.6.38.8 update, the commit in the ubuntu-natty tree didn't have a BugLink for the stable update tracking bug, so I thought it needed verification.

Sorry, please disregard comment #35. Commits coming via stable updates are not subject to the standard bug verification process, so I am adding the verification-done-natty tag here.

If you still want to verify this and add a comment with testing results, no problem.

tags: added: verification-done-natty
removed: verification-needed-natty
Colin Ian King (colin-king) wrote :

@Herton, I'm happy with this.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.38-11.48

---------------
linux (2.6.38-11.48) natty-proposed; urgency=low

  [ Herton R. Krzesinski ]

  * Release Tracking Bug
    - LP: #818175

  [ Upstream Kernel Changes ]

  * Revert "HID: magicmouse: ignore 'ivalid report id' while switching
    modes"
    - LP: #814250

linux (2.6.38-11.47) natty-proposed; urgency=low

  [ Steve Conklin ]

  * Release Tracking Bug
    - LP: #811180

  [ Keng-Yu Lin ]

  * SAUCE: Revert: "dell-laptop: Toggle the unsupported hardware
    killswitch"
    - LP: #775281

  [ Ming Lei ]

  * SAUCE: fix yama_ptracer_del lockdep warning
    - LP: #791019

  [ Stefan Bader ]

  * SAUCE: Re-enable RODATA for i386 virtual
    - LP: #809838

  [ Tim Gardner ]

  * [Config] Add grub-efi as a recommended bootloader for server and
    generic
    - LP: #800910
  * SAUCE: rtl8192se: Force a build for a 2.6/3.0 kernel
    - LP: #805494

  [ Upstream Kernel Changes ]

  * Revert "bridge: Forward reserved group addresses if !STP"
    - LP: #793702
  * Fix up ABI directory
  * bonding: Incorrect TX queue offset, CVE-2011-1581
    - LP: #792312
    - CVE-2011-1581
  * fs/partitions/efi.c: corrupted GUID partition tables can cause kernel
    oops
    - LP: #795418
    - CVE-2011-1577
  * usbnet/cdc_ncm: add missing .reset_resume hook
    - LP: #793892
  * ath5k: Disable fast channel switching by default
    - LP: #767192
  * mm: vmscan: correctly check if reclaimer should schedule during
    shrink_slab
    - LP: #755066
  * mm: vmscan: correct use of pgdat_balanced in sleeping_prematurely
    - LP: #755066
  * ALSA: hda - Use LPIB for ATI/AMD chipsets as default
    - LP: #741825
  * ALSA: hda - Enable snoop bit for AMD controllers
    - LP: #741825
  * ALSA: hda - Enable sync_write workaround for AMD generically
    - LP: #741825
  * cpuidle: menu: fixed wrapping timers at 4.294 seconds
    - LP: #774947
  * drm/i915: Fix gen6 (SNB) missed BLT ring interrupts.
    - LP: #761065
  * USB: ehci: remove structure packing from ehci_def
    - LP: #791552
  * drm/i915: disable PCH ports if needed when disabling a CRTC
    - LP: #791752
  * kmemleak: Do not return a pointer to an object that kmemleak did not
    get
    - LP: #793702
  * kmemleak: Initialise kmemleak after debug_objects_mem_init()
    - LP: #793702
  * Fix _OSC UUID in pcc-cpufreq
    - LP: #793702
  * CPU hotplug, re-create sysfs directory and symlinks
    - LP: #793702
  * Fix memory leak in cpufreq_stat
    - LP: #793702
  * net: recvmmsg: Strip MSG_WAITFORONE when calling recvmsg
    - LP: #793702
  * ftrace: Only update the function code on write to filter files
    - LP: #793702
  * qla2xxx: Fix hang during driver unload when vport is active.
    - LP: #793702
  * qla2xxx: Fix virtual port failing to login after chip reset.
    - LP: #793702
  * qla2xxx: Fix vport delete hang when logins are outstanding.
    - LP: #793702
  * powerpc/kdump64: Don't reference freed memory as pacas
    - LP: #793702
  * powerpc/kexec: Fix memory corruption from unallocated slaves
    - LP: #793702
  * x86, cpufeature: Fix cpuid leaf 7 feature detection
    - LP: #793702
  * ath9k_hw: do noise floor calibration only on required chain...

Changed in linux (Ubuntu Natty):
status: Fix Committed → Fix Released