System sporadically freezes during suspend to RAM

Bug #746860 reported by Thilo-Alexander Ginkel on 2011-04-01
This bug affects 2 people
Affects         Status        Importance  Assigned to  Milestone
Linux           Fix Released  Medium
linux (Ubuntu)                Medium      Unassigned

Bug Description

After issuing a suspend to RAM using pm-suspend, the system sporadically hangs while initiating the suspend, with the following symptoms:
- HDDs are already spun down
- Fans are still running
- The power LED is constantly illuminated (instead of the typical blinking sequence in S3)
- MagicSysRq is no longer working

This bug is a regression from Maverick (kernel 2.6.35), where suspend to RAM worked reliably. The issue can also be reproduced with a vanilla 2.6.38 kernel, and even with a 2.6.36 kernel. Attempts to bisect the root cause have failed so far because of the sporadic nature of the issue.

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image (not installed)
Regression: Yes
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.38-7.39-generic 2.6.38
Uname: Linux 2.6.38-7-generic x86_64
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: amd64
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf9ff8000 irq 46'
   Mixer name : 'Realtek ALC1200'
   Components : 'HDA:10ec0888,104382fe,00100101'
   Controls : 40
   Simple ctrls : 22
Card1.Amixer.info:
 Card hw:1 'U0x46d0x809'/'USB Device 0x46d:0x809 at usb-0000:00:1d.7-1.4, high speed'
   Mixer name : 'USB Mixer'
   Components : 'USB046d:0809'
   Controls : 2
   Simple ctrls : 1
Card1.Amixer.values:
 Simple mixer control 'Mic',0
   Capabilities: cvolume cvolume-joined cswitch cswitch-joined penum
   Capture channels: Mono
   Limits: Capture 0 - 17
   Mono: Capture 0 [0%] [6.00dB] [off]
Card2.Amixer.info:
 Card hw:2 'Live'/'SB Live! 5.1 [SB0060] (rev.7, serial:0x80611102) at 0xe880, irq 18'
   Mixer name : 'SigmaTel STAC9708,11'
   Components : 'AC97a:83847608'
   Controls : 224
   Simple ctrls : 45
CurrentDmesg: [ 199.247673] ppdev: user-space parallel port driver
Date: Fri Apr 1 00:26:39 2011
IwConfig:
 lo no wireless extensions.

 eth0 no wireless extensions.
LiveMediaBuild: Ubuntu 11.04 "Natty Narwhal" - Beta amd64 (20110330)
MachineType: System manufacturer P5QL PRO
ProcEnviron:
 LANGUAGE=en_US:en
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: file=/cdrom/preseed/hostname.seed boot=casper initrd=/casper/initrd.lz quiet splash -- maybe-ubiquity
RelatedPackageVersions:
 linux-restricted-modules-2.6.38-7-generic N/A
 linux-backports-modules-2.6.38-7-generic N/A
 linux-firmware 1.49
RfKill:

SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/01/2009
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1004
dmi.board.asset.tag: To Be Filled By O.E.M.
dmi.board.name: P5QL PRO
dmi.board.vendor: ASUSTeK Computer INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Asset-1234567890
dmi.chassis.type: 3
dmi.chassis.vendor: Chassis Manufacture
dmi.chassis.version: Chassis Version
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1004:bd07/01/2009:svnSystemmanufacturer:pnP5QLPRO:pvrSystemVersion:rvnASUSTeKComputerINC.:rnP5QLPRO:rvrRev1.xx:cvnChassisManufacture:ct3:cvrChassisVersion:
dmi.product.name: P5QL PRO
dmi.product.version: System Version
dmi.sys.vendor: System manufacturer

The probability of hitting the bug on any given suspend attempt is probably > 0.1, maybe even > 0.33.
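
Because the hang is sporadic, a commit can only be marked good during a bisect after several clean suspend cycles in a row. A rough sketch of how many cycles that takes, assuming each suspend attempt hangs independently with a fixed probability (an assumption, not something measured in this report). The function name and the 1% error target are illustrative only:

```python
import math


def clean_cycles_needed(p_hang: float, max_false_good: float) -> int:
    """Smallest n such that (1 - p_hang) ** n <= max_false_good.

    p_hang:         assumed per-suspend probability of a hang on a bad kernel
    max_false_good: accepted risk of calling a bad kernel "good"
    """
    if not (0.0 < p_hang < 1.0) or not (0.0 < max_false_good < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    # A bad kernel survives n suspends with probability (1 - p_hang) ** n;
    # solve (1 - p_hang) ** n <= max_false_good for n.
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_hang))


if __name__ == "__main__":
    for p in (0.1, 0.33):
        print(p, clean_cycles_needed(p, 0.01))
```

With the estimates above, this works out to 44 clean cycles at p = 0.1 and 12 at p = 0.33 for a 1% risk of wrongly marking a bad commit as good.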

tags: removed: needs-upstream-testing

I identified the faulty commit (using a git bisect):

bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a is the first bad commit
commit bd25f4dd6972755579d0ea50d1a5ace2e9b00d1a
Author: Arnd Bergmann <email address hidden>
Date: Sun Jul 11 15:34:05 2010 +0200

    HID: hiddev: use usb_find_interface, get rid of BKL

    This removes the private hiddev_table in the usbhid
    driver and changes it to use usb_find_interface
    instead.

    The advantage is that we can avoid the race between
    usb_register_dev and usb_open and no longer need the
    big kernel lock.

    This doesn't introduce race condition -- the intf pointer could be
    invalidated only in hiddev_disconnect() through usb_deregister_dev(),
    but that will block on minor_rwsem and not actually remove the device
    until usb_open().

    Signed-off-by: Arnd Bergmann <email address hidden>
    Cc: Jiri Kosina <email address hidden>
    Cc: "Greg Kroah-Hartman" <email address hidden>
    Signed-off-by: Jiri Kosina <email address hidden>

:040000 040000 4ae14b3ba486373d7a354874e9ad334858f094e3 8041ffda20ca3020a6b60d64235ae179f8186bf0 M drivers

Further details in the referenced upstream bug report.

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed

All right. Seems that there were two bugs with similar symptoms. The one from comment #3 is already fixed, but a new bisect revealed the following root cause for the current regression:

dcd989cb73ab0f7b722d64ab6516f101d9f43f88 is the first bad commit
commit dcd989cb73ab0f7b722d64ab6516f101d9f43f88
Author: Tejun Heo <email address hidden>
Date: Tue Jun 29 10:07:14 2010 +0200

   workqueue: implement several utility APIs

   Implement the following utility APIs.

    workqueue_set_max_active() : adjust max_active of a wq
    workqueue_congested() : test whether a wq is contested
    work_cpu() : determine the last / current cpu of a work
    work_busy() : query whether a work is busy

   * Anton Blanchard fixed missing ret initialization in work_busy().

   Signed-off-by: Tejun Heo <email address hidden>
   Cc: Anton Blanchard <email address hidden>

:040000 040000 8b7443c650f0af36f1deba560586a91f6a88abcc 065589a95857a2fb73b94dc242c50ba558179a2a M include
:040000 040000 84ca2de78af16483fa60a423f4f2d6eee0279eed 27487850f11a1e7ee9e4eaac54fd88f16d420d47 M kernel

Brad Figg (brad-figg) on 2011-04-07
Changed in linux (Ubuntu):
status: New → Confirmed

BTW, there is still a bug in the above bisect result. I'll give it another try after the weekend.

A new (pretty tedious) bisect brought up this change:

| e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c is the first bad commit
| commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c
| Author: Tejun Heo <email address hidden>
| Date: Tue Jun 29 10:07:14 2010 +0200
|
| workqueue: implement concurrency managed dynamic worker pool

Reverting this commit reliably fixes the issue.

In the meantime, I have also been able to reproduce the issue in a KVM virtual machine, i.e., it is not hardware-specific, but most probably affects suspend on any SMP system (although the probability may depend on the number of CPUs).

Further details about my system config (which is also mimicked by my VM):
- lvm running on top of
- dmcrypt (luks) running on top of
- md raid1

In another attempt to further isolate the system parameters that trigger the issue, I set up a couple of minimal systems using various combinations of dmcrypt and raid1. As it turned out, having raid1 enabled is sufficient for triggering the issue, so commit e22bee782b3b00bd4534ae9b1c5fb2e8e6573c5c apparently broke suspend for all raid(1?) multi-core systems.
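
To check whether a given machine matches the triggering configuration, one could look for active raid1 arrays in /proc/mdstat. A minimal sketch, assuming the usual "mdX : active raid1 ..." line layout. The sample text and its device names are made up for illustration and are not taken from the affected system:

```python
def raid1_arrays(mdstat_text: str) -> list:
    """Return the names of active raid1 arrays found in /proc/mdstat text."""
    arrays = []
    for line in mdstat_text.splitlines():
        name, sep, rest = line.partition(" : ")
        # Array lines look like "md0 : active raid1 sdb1[1] sda1[0]";
        # skip "Personalities : ..." and the indented detail lines.
        if not sep or not name.startswith("md"):
            continue
        fields = rest.split()
        if fields and fields[0] == "active" and "raid1" in fields:
            arrays.append(name)
    return arrays


# Hypothetical sample, for illustration only.
SAMPLE = """Personalities : [raid0] [raid1]
md1 : active raid0 sdd1[1] sdc1[0]
      1953259520 blocks 64k chunks

md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    try:
        with open("/proc/mdstat") as fh:
            print(raid1_arrays(fh.read()))
    except OSError:
        # No md driver loaded (or not Linux): fall back to the sample.
        print(raid1_arrays(SAMPLE))
```

A non-empty result on a multi-core system would indicate a setup similar to the minimal systems described above.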

Changed in linux:
status: Confirmed → Incomplete

A fix is now available. Is there any chance that it makes it into a fixed linux-image package for Natty?

Details at: https://lkml.org/lkml/2011/4/29/198

The fix has now reached the stable kernel 2.6.38.6.
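
For vanilla 2.6.38.y kernels, whether a release string is at or past 2.6.38.6 can be checked numerically. The sketch below is deliberately narrow: it answers only for plain 2.6.38.y strings and returns None for anything else, since this report does not say which other series or distribution builds (such as 2.6.38-7-generic) carry the fix. The function name is hypothetical:

```python
import re

# The report only states that the fix reached stable kernel 2.6.38.6.
FIXED_SERIES = (2, 6, 38)
FIXED_PATCH = 6


def has_stable_fix(release: str):
    """Return True/False for vanilla 2.6.38.y strings, None if unknown."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)(?:\.(\d+))?(.*)$", release)
    if match is None:
        return None
    major, minor, sub, patch, suffix = match.groups()
    if suffix:
        # e.g. "-7-generic": a distribution build, not a vanilla release
        return None
    if (int(major), int(minor), int(sub)) != FIXED_SERIES:
        return None
    # A bare "2.6.38" has no stable patch level and counts as .0.
    return int(patch or 0) >= FIXED_PATCH


if __name__ == "__main__":
    for release in ("2.6.38", "2.6.38.5", "2.6.38.6", "2.6.38-7-generic"):
        print(release, has_stable_fix(release))
```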

Changed in linux:
status: Incomplete → Fix Released

Thilo-Alexander Ginkel, thank you for reporting this and helping make Ubuntu better. Natty reached EOL in October 2012.
Please see this document for currently supported Ubuntu releases:
https://wiki.ubuntu.com/Releases

We were wondering if this is still an issue in a supported release? If so, could you please test for this with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested and remove the tag:
needs-upstream-testing

This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the text:
needs-upstream-testing

If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested.

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested.

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested.

Please let us know your results. Thank you for your understanding.

Helpful Bug Reporting Tips:
https://help.ubuntu.com/community/ReportingBugs

tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete