kernel hangs in pm-hibernate on saving to disk

Bug #797922 reported by Kees van Vloten
This bug affects 3 people
Affects: linux (Ubuntu)
Status: Incomplete
Importance: Medium
Assigned to: Seth Forshee
Milestone:

Bug Description

Machine hangs with black screen during hibernation. Only hard powerdown / up helps, but no hibernation image is found on boot.
The problem occurs in kernels 2.6.38-8 and 3.0-0; it does NOT occur in 2.6.35-28.
I have tried maverick and natty with all 3 kernels. I also tried kernel hibernation, uswsusp and tuxonice (from the tuxonice ppa); none of them makes any difference in the behaviour.
I have tried unloading modules, but even with no modules loaded the machine still hangs consistently.

Hardware is a Toshiba Qosmio X500 laptop with
an Intel Core i7 and an Nvidia GeForce GTS 360M
---
Architecture: amd64
DistroRelease: Ubuntu 10.10
Package: linux (not installed)
ProcEnviron:
 LANGUAGE=nl_NL:
 LANG=nl_NL.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 3.0-0.1-generic 3.0.0-rc2
Tags: maverick
Uname: Linux 3.0-0-generic x86_64
UnreportableReason: The running kernel is not an Ubuntu kernel
UserGroups:
---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/by-path', '/dev/snd/controlC1', '/dev/snd/hwC1D0', '/dev/snd/hwC1D1', '/dev/snd/hwC1D2', '/dev/snd/hwC1D3', '/dev/snd/pcmC1D3p', '/dev/snd/pcmC1D7p', '/dev/snd/pcmC1D8p', '/dev/snd/pcmC1D9p', '/dev/snd/controlC0', '/dev/snd/hwC0D0', '/dev/snd/pcmC0D0c', '/dev/snd/pcmC0D0p', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info: Error: [Errno 2] No such file or directory
Card0.Amixer.values: Error: [Errno 2] No such file or directory
Card1.Amixer.info: Error: [Errno 2] No such file or directory
Card1.Amixer.values: Error: [Errno 2] No such file or directory
CurrentDmesg:
 [ 19.225994] eth0: no IPv6 routers present
 [ 153.092323] atl1c 0000:0b:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.
DistroRelease: Ubuntu 10.10
HibernationDevice: RESUME=UUID=c1ec3bcc-2bee-4ab8-a91e-536000964679
IwConfig: Error: [Errno 2] No such file or directory
MachineType: TOSHIBA QOSMIO X500
Package: linux (not installed)
ProcCmdLine: BOOT_IMAGE=/vmlinuz-2.6.38-8-generic root=/dev/mapper/hostname_vg-root_lv ro DEBCONF_DEBUG=5
ProcEnviron:
 LANGUAGE=nl_NL:
 LANG=nl_NL.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.38-8.42-generic 2.6.38.2
Regression: Yes
RelatedPackageVersions: linux-firmware 1.54
Reproducible: Yes
RfKill: Error: [Errno 2] No such file or directory
Tags: maverick kernel-power hibernate resume regression-release needs-upstream-testing
Uname: Linux 2.6.38-8-generic x86_64
UserGroups:

dmi.bios.date: 12/10/2010
dmi.bios.vendor: TOSHIBA
dmi.bios.version: V2.90
dmi.board.name: QOSMIO X500
dmi.board.vendor: TOSHIBA
dmi.board.version: Not Applicable
dmi.chassis.asset.tag: No Asset Tag
dmi.chassis.type: 10
dmi.chassis.vendor: TOSHIBA
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnTOSHIBA:bvrV2.90:bd12/10/2010:svnTOSHIBA:pnQOSMIOX500:pvrPQX33E-04J005DU:rvnTOSHIBA:rnQOSMIOX500:rvrNotApplicable:cvnTOSHIBA:ct10:cvrN/A:
dmi.product.name: QOSMIO X500
dmi.product.version: PQX33E-04J005DU
dmi.sys.vendor: TOSHIBA

Kees van Vloten (kvv)
tags: added: natty
tags: added: oneiric
Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Seth Forshee (sforshee) wrote :

Please run 'apport-collect 797922' from a terminal on the affected machine to attach the relevant information to this bug report. Thanks!

Kees van Vloten (kvv)
tags: added: apport-collected
description: updated
Kees van Vloten (kvv) wrote : AcpiTables.txt

apport information

description: updated
Kees van Vloten (kvv) wrote : AlsaDevices.txt

apport information

Kees van Vloten (kvv) wrote : BootDmesg.txt

apport information

Kees van Vloten (kvv) wrote : Card0.Codecs.codec.0.txt

apport information

Kees van Vloten (kvv) wrote : Card1.Codecs.codec.0.txt

apport information

Kees van Vloten (kvv) wrote : Card1.Codecs.codec.1.txt

apport information

Kees van Vloten (kvv) wrote : Card1.Codecs.codec.2.txt

apport information

Kees van Vloten (kvv) wrote : Card1.Codecs.codec.3.txt

apport information

Kees van Vloten (kvv) wrote : Lspci.txt

apport information

Kees van Vloten (kvv) wrote : Lsusb.txt

apport information

Kees van Vloten (kvv) wrote : PciMultimedia.txt

apport information

Kees van Vloten (kvv) wrote : ProcCpuinfo.txt

apport information

Kees van Vloten (kvv) wrote : ProcInterrupts.txt

apport information

Kees van Vloten (kvv) wrote : ProcModules.txt

apport information

Kees van Vloten (kvv) wrote : UdevDb.txt

apport information

Kees van Vloten (kvv) wrote : UdevLog.txt

apport information

Kees van Vloten (kvv) wrote : WifiSyslog.txt

apport information

Kees van Vloten (kvv) wrote :

At the moment of running the apport-collect I had Maverick installed with the kernels of Maverick, Natty and Oneiric.
I tried to run apport-collect with the Oneiric kernel but it is recognized as a vanilla kernel, therefore I reran the process with the Natty kernel. The results are posted above.

Seth Forshee (sforshee) wrote :

Please try the steps under the "Debugging Hibernate" section of the following link and report back your findings.

  https://wiki.ubuntu.com/DebuggingKernelSuspendHibernateResume

Also, have you tested to see if suspend to RAM works? We generally recommend using suspend instead of hibernate, as it tends to be more reliable, unless you have a specific need for using hibernate instead.

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Confirmed
assignee: nobody → Seth Forshee (sforshee)
status: Confirmed → Incomplete
Kees van Vloten (kvv) wrote :

I started off with suspend and switched to hibernate as it seemed to work a little better.
Anyway, I have tried suspend using the info from the wiki. It looks like the machine suspends correctly: it switches off the fans and in the end the screen too (although I have set the boot parameter 'no_console_suspend'), and the power LED starts blinking slowly.
However, when I wake the machine by pressing any key on the keyboard, it turns on the fans but the screen remains off and it seems to hang. I have tried to type 'reboot' blind (should be possible as I was logged in as root before suspend) but the machine does not react. The only way out is to hold the power button for 5 seconds.

For hibernate, the following hangs the machine (black screen, fans on, no reaction to the keyboard):
echo devices > /sys/power/pm_test
echo platform > /sys/power/disk
echo disk > /sys/power/state
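
The three writes above can also be wrapped in a small helper. This is only a sketch: the function name pm_test_hibernate and its optional directory argument are made up for illustration (the latter so the logic can be tried against a scratch directory), and it assumes the kernel lists its supported test levels in pm_test with the active one in brackets.

```shell
# Hypothetical helper (not from the bug report): run one pm_test level.
# $1 = test level (freezer, devices, platform, processors or core)
# $2 = sysfs power directory, defaults to /sys/power
pm_test_hibernate() {
    level=$1
    dir=${2:-/sys/power}
    # pm_test lists the supported levels; strip the [brackets] around the
    # active one and check that the requested level is among them.
    if ! tr -d '[]' < "$dir/pm_test" | tr ' ' '\n' | grep -qxF "$level"; then
        echo "pm_test level '$level' is not supported by this kernel" >&2
        return 1
    fi
    echo "$level" > "$dir/pm_test" || return 1
    echo platform > "$dir/disk" || return 1
    echo disk > "$dir/state"
}
```

Run as root, 'pm_test_hibernate devices' repeats the test above. Working through the levels from freezer towards core should narrow down the stage at which the hang first appears.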

Seth Forshee (sforshee) wrote :

Can you clarify your hibernation problem? Does the machine hang when hibernating, or does it hibernate okay and then hang when resuming from hibernation? If the hang is during resume then it looks likely from what you've described that the problem might be the same as the one you have resuming from suspend.

Kees van Vloten (kvv) wrote :

I do not know whether the behaviour is related.
With hibernation it hangs with a black screen during hibernation.
With suspend it hangs on resume.

The hibernation problem occurs in kernels 2.6.38-8 and 3.0-0; it does NOT occur in 2.6.35-28.
The suspend problem occurs in kernels 2.6.38-8 and 3.0-0; in 2.6.35-28 resume works but does not switch on the screen (when I type blind, commands are accepted).

Seth Forshee (sforshee) wrote :

Okay, they are probably different problems then.

Let's try this next. Boot with no_console_suspend, then switch to virtual terminal 1 (i.e. press Ctrl-Alt-F1). Then press Alt-SysRq-9, and you should see a message that says "Loglevel set to 9". Now run 'sudo pm-hibernate' and report back what appears on the screen when the machine hangs. You can just take a picture of the screen if you wish.
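
If the SysRq key combination is awkward on a laptop keyboard, the same log level change can usually be made from a root shell by writing the key to /proc/sysrq-trigger. A minimal sketch, assuming root privileges; the helper name and its optional procfs-root argument are hypothetical and exist only so the function can be tried against a scratch directory:

```shell
# Hypothetical helper: same effect as pressing Alt-SysRq-9.
# $1 = procfs root, defaults to /proc
set_console_loglevel_9() {
    proc=${1:-/proc}
    # Writing a digit to sysrq-trigger sets the console log level.
    echo 9 > "$proc/sysrq-trigger" || return 1
    # The first field of kernel/printk is the current console log level.
    awk '{ print $1 }' "$proc/sys/kernel/printk"
}
```

The printed value is the console log level the kernel reports after the write, which gives a quick check that the change took effect before running pm-hibernate.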

Kees van Vloten (kvv) wrote :

The boot parameter no_console_suspend is already there from earlier tests (I checked in the GRUB menu that it is actually passed at boot).
I pressed Alt-SysRq-9 and saw the loglevel message.
After pm-hibernate it writes 3 lines to the screen, clears the screen, then prints the following lines:

Looking for splash... system
s2disk: snapshotting system

And then the screen is blanked and the machine hangs.

Shall I file another bug for the suspend/resume problem?

Kees van Vloten (kvv)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Seth Forshee (sforshee) wrote :

Please try the steps under "Per sub-system hibernate testing" on the page below and report back your results.

https://wiki.ubuntu.com/DebuggingKernelSuspendHibernateResume

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Kees van Vloten (kvv) wrote :

Done that, see #21

Seth Forshee (sforshee) wrote :

So you have. And based on that it looks like a problem suspending devices.

Our debugging options start getting more difficult at this point. If you have a serial port (and another machine with a serial port) we can print out the console on it. If this is possible I can provide instructions to set it up. You can also try booting into recovery mode and hibernating from the command-line (again with no_console_suspend) and see whether that works and whether you get anything useful on the screen.

Otherwise we can try using a tool called systemtap to store some information in the RTC about how far we get. Getting that set up is somewhat involved, so please check on the other two, and if they don't work out I'll give you some instructions for getting systemtap set up.

Kees van Vloten (kvv) wrote :

Sorry, this laptop does not have a serial port.

pm-hibernate in recovery mode + no_console_suspend + Alt-SysRq-9 gives the same result as #25.

I have just installed systemtap but have never used it.

Kees van Vloten (kvv) wrote :

Repeated the test in #29 but added the 'nomodeset' parameter.
Now I get the crash on the screen. I have tried to set the console font smaller but it switches back to standard VGA right before the crash.
Attached are some pictures I managed to take.

Kees van Vloten (kvv) wrote :
Seth Forshee (sforshee) wrote :

That's something. There's probably a lot of scrollback that we can't see that would be really helpful there though.

What I can see is in firewire. You said you tried unloading modules before, but would you try unloading firewire_ohci and firewire_core before hibernating and see what you get on the screen?
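
A sketch of that test follows. The helper name loaded_of is hypothetical; it only reports which of the named modules appear in a /proc/modules-style listing, so that modprobe is run only for modules that are actually loaded.

```shell
# Hypothetical helper: print which of the named modules are loaded.
# $1 = modules listing (normally /proc/modules); remaining args = module names
loaded_of() {
    file=$1
    shift
    for m in "$@"; do
        # Each /proc/modules line starts with the module name and a space.
        grep -q "^$m " "$file" && echo "$m"
    done
    return 0
}
```

On the affected machine the test would then be something like 'for m in $(loaded_of /proc/modules firewire_ohci firewire_core); do sudo modprobe -r "$m"; done' followed by 'sudo pm-hibernate'. firewire_ohci is named first because it depends on firewire_core.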

For systemtap debugging, you'll need to start with the instructions at the link below, but those are for debugging S3 and we want to debug S4. There's some S4 debug stuff in the same repository, but I haven't used it so I don't know how well it works. It looks like the steps should be essentially the same, except that you'll have to use s4test instead of s3test. Let me know if you need any help.

Kees van Vloten (kvv) wrote :

The firewire driver is the problem!

Which is odd because, as I said, I did run tests with driver unloading until all of them were unloaded. But there are differences between this and the previous tests; one of them is a kernel change: it was 2.6.38-8 and is now 2.6.38-10.
With 'nomodeset' on, resume from hibernation fails to initialize the screen properly, which then blinks like a Christmas tree; fortunately it works without it.

Normally the laptop runs from a LUKS encrypted harddisk, but I reinstalled it with basic partition setup for the test. A test before the reinstall (i.e. with encrypted partition) also had problems with resume but it had the 'nomodeset' parameter on. I am reinstalling it (with encrypted partition) to repeat the test and report back later.

The link for system debugging you refer to is missing :-)

Kees van Vloten (kvv) wrote :

Full install: 2 harddisks are in raid0 with lvm and LUKS on top of it, kubuntu applications are installed, root password is set to enable root-login on the console.
The minimal install used earlier uses sda directly with lvm (no raid and no LUKS) and has a minimal set of applications (no X), i.e. commandline only.

In #33 there was a success with hibernate and resume when unloading firewire on the minimal installation. With full installation the symptoms are a little different.
On the full installation, on the console, unloaded firewire:
* with 'nomodeset' hibernate succeeds but resume hangs with blank screen
* without 'nomodeset' hibernate hangs with blank screen
This makes me suspect that it has something to do with nouveau and/or KMS

Seth Forshee (sforshee) wrote :

If you suspect nouveau you could try blacklisting it and rebooting. To blacklist it you just need to add a line that reads 'blacklist nouveau' to /etc/modprobe.d/blacklist.conf.
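
A minimal sketch of that edit, written so that running it twice does not add the line twice. The function name blacklist_module and the optional file argument are hypothetical, added only for illustration:

```shell
# Hypothetical helper: add 'blacklist <module>' to a modprobe config file once.
# $1 = module name
# $2 = config file, defaults to /etc/modprobe.d/blacklist.conf
blacklist_module() {
    mod=$1
    conf=${2:-/etc/modprobe.d/blacklist.conf}
    line="blacklist $mod"
    # Append only if the exact line is not already present.
    grep -qxF "$line" "$conf" 2>/dev/null || echo "$line" >> "$conf"
}
```

Run as root, e.g. 'blacklist_module nouveau'. On Ubuntu the initramfs appears to carry its own copy of the modprobe configuration, so 'sudo update-initramfs -u' is probably needed before the blacklist applies to modules loaded early in boot; that step is an assumption here, not something stated in this comment.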

Kees van Vloten (kvv) wrote :

Blacklisting nouveau does not prevent it from loading; even the boot parameter 'rdblacklist=nouveau' does not work. It is loaded and in use, i.e. there is no way to get it removed. According to http://kubuntuforums.net/forums/index.php?topic=3110471.10;wap2 the 'nomodeset' parameter prevents nouveau from loading. But even with 'nomodeset' nouveau loads; however, the use counter is zero, so it is not used and can be unloaded. Then again, the unload is not necessary because we already found that with 'nomodeset', i.e. unused nouveau, pm-hibernate works as expected.

The other thing I found is that resume after hibernate sometimes works (from encrypted swap) when uswsusp is uninstalled. On power-on after suspend, I enter the LUKS password and the machine starts loading the image. It turns out that resume works when nouveau was unloaded before suspend. It hangs after loading the image when nouveau was not unloaded.

Now the difficulty here is that I can do without firewire (I don't use it anyway) but a video driver is a pretty critical component.

Seth Forshee (sforshee) wrote :

You might check whether, with nomodeset, resume really hung or the display just didn't come on, e.g. ping the machine on the network, or install an ssh server and try to log in.
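
From a second machine on the same network, that check could look like the sketch below. The function name check_resumed is hypothetical and the host argument is a placeholder for the laptop's address; it assumes the iputils ping options -c (count) and -W (timeout in seconds).

```shell
# Hypothetical helper: report whether the resumed machine answers ping.
# $1 = host name or IP address of the laptop (placeholder)
check_resumed() {
    host=$1
    if ping -c 3 -W 2 "$host" >/dev/null 2>&1; then
        echo "$host answers ping: the kernel is alive, only the display may be off"
    else
        echo "$host does not answer ping: resume probably hung"
        return 1
    fi
}
```

If ping succeeds, logging in over ssh and collecting dmesg output would be the natural next step.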

Here's the link I forgot to include earlier for the systemtap debugging. Since it doesn't look like you have any other way to get output during these hangs, this is probably the next thing to try.

https://wiki.canonical.com/PlatformServices/HardwareEnablement/Documentation/PowerManagement/DebugTools

Kees van Vloten (kvv) wrote :

Test with nouveau loaded and nomodeset (i.e. nouveau use count=0): it suspends correctly. On resume it prints a one-line message:
[ 90.708043] ata1.01: failed to resume link (SControl 0)

I see this message right before suspend as well (also without nouveau loaded and resume working fine), so I think it is harmless.
While I am typing this, the screen got filled with this (10 times, with different numbers in place of 32 and 5599):

INFO: task kworker/u:32:5599 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

There is no reply on ping at any time after the resume.
The keyboard is responsive because I can scroll up with shift-pgup

Kees van Vloten (kvv) wrote :

The link you sent replies: "You are not allowed to view this page". If I click on login, it does authenticate me; however, the message "You are not allowed to view this page" persists :-(

Is it a page for Canonical internal use only, perhaps?

Seth Forshee (sforshee) wrote :

It is an internal wiki page (probably because it's still fairly error-prone); I didn't notice that earlier. Here are the instructions.

1. Download the pmdebug utilities. These are currently available in a git repository, so you need to have git installed (sudo apt-get install git). Then run 'git clone git://kernel.ubuntu.com/cking/pmdebug.git'.

2. Install kernel debug ddeb (this will take a while). First run 'echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list' to create ddebs.list and then run 'sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 428D7C01' to import the signing key. Next run 'sudo apt-get update' followed by 'sudo apt-get install linux-image-$(uname -r)-dbgsym' to install the ddeb package.

3. Install systemtap. I think you said you've already done this, but the command is 'sudo apt-get install systemtap'.

4. Build the locatehang program. Cd into the locatehang directory within the pmdebug directory you downloaded in step 1, then run 'make'.

5. Cd to the pmdebug/systemtap directory and run 'sudo ./s4test -h'. If all goes well this will automatically hibernate the machine.

6. After the machine hangs, reboot it. Immediately after rebooting run the locatehang program you built in step 4. This must be done within a few minutes of the machine hanging, or else the data that was stored in the RTC could be corrupted.

After you've done all of this, attach the various *.log files in the pmdebug/systemtap directory and the output from locatehang. Let me know if you encounter any problems.
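
Steps 1 to 4 above can be collected into a script. The sketch below is not part of the pmdebug tools: the run wrapper, the DRY_RUN variable and the function name pmdebug_setup are made up here, and the directory name pmdebug/locatehang is an assumption based on the program name. By default it only prints the commands so they can be reviewed first.

```shell
# run: print the command when DRY_RUN is 1 (the default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper around steps 1-4 of the instructions.
pmdebug_setup() {
    # Steps 1 and 3: tools needed to fetch pmdebug and to run the tests.
    run sudo apt-get install git systemtap
    run git clone git://kernel.ubuntu.com/cking/pmdebug.git
    # Step 2: enable the ddebs archive and install the kernel debug symbols.
    run sh -c 'echo "deb http://ddebs.ubuntu.com $(lsb_release -cs) main restricted universe multiverse" | sudo tee -a /etc/apt/sources.list.d/ddebs.list'
    run sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 428D7C01
    run sudo apt-get update
    run sudo apt-get install "linux-image-$(uname -r)-dbgsym"
    # Step 4: build locatehang (directory name assumed, see above).
    run make -C pmdebug/locatehang
}
```

'pmdebug_setup' prints the command list; 'DRY_RUN=0 pmdebug_setup' executes it. Steps 5 and 6 are left manual on purpose, since they hibernate the machine and depend on running locatehang promptly after the reboot.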

Thanks!

Kees van Vloten (kvv) wrote :

Mixed results so far. I have tested 2 situations:

1) without 'nomodeset', i.e. nouveau loaded and module use counter=2 (in this situation it hangs on hibernate).
s4test -h: prints on the screen that it hibernates, switches off the screen, and the machine is unresponsive with no disk activity. (I did not see the disk LED blinking, which would indicate the memory being saved to disk.)
After a hard power-off and power-on, running locatehang returns: "Hash did not match any known kernel functions" and all of the log files are empty.

2) with 'nomodeset' and nouveau loaded, but module use counter=0 (in this situation it hangs on resume).
s4test -h hibernates correctly: I see disk activity to save the image, and the machine powers down correctly.
On power-on the driver crash happens and, unlike other times, the screen is switched on and I am able to scroll back. The attached pictures are from this second situation.

Kees van Vloten (kvv) wrote :
Kees van Vloten (kvv) wrote :
Kees van Vloten (kvv) wrote :
Kees van Vloten (kvv) wrote :

In the kern.log I found some logging from situation 1. From the timestamps in the logging after reboot you can see that the RTC was used to store a value.

Seth Forshee (sforshee) wrote :

Unfortunately I'm not able to tell much from that crash dump. I might be able to tell a little more if you will attach /boot/System.map-$(uname -r) to the bug report so I can look up what some of the addresses I see correspond to.

Did you run locatehang after the hang on resume with nomodeset? If not I'd suggest you trigger the hang on suspend, power the machine off immediately, then reboot and run the locatehang program.

Seth Forshee (sforshee) wrote :

I don't understand what you mean in comment #45. If there's something in the log that indicates a value was written to the RTC, I'm overlooking it.

Note that the timestamps in the kernel log always start from 0 after boot and represent how many seconds the machine has been running.
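
To relate such a timestamp to wall-clock time, the seconds value can be added to the boot time. A small sketch; the helper names klog_seconds and klog_epoch are hypothetical, and the boot time (in seconds since the epoch) is supplied by the caller:

```shell
# Print the seconds-since-boot value from a kernel log line such as
# "[ 90.708043] ata1.01: failed to resume link (SControl 0)".
klog_seconds() {
    printf '%s\n' "$1" |
        sed -n 's/^[^[]*\[ *\([0-9][0-9]*\.[0-9][0-9]*\)].*$/\1/p'
}

# Print the epoch time of a log line.
# $1 = boot time in epoch seconds, $2 = kernel log line
klog_epoch() {
    secs=$(klog_seconds "$2")
    [ -n "$secs" ] || return 1
    # Drop the fractional part; shell arithmetic is integer only.
    echo $(( $1 + ${secs%.*} ))
}
```

The boot time for the current boot can be estimated as the current epoch time minus the first field of /proc/uptime. A jump in the date prefix that syslog adds to kern.log lines reflects the system clock, not this counter.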

Kees van Vloten (kvv) wrote :

Situation 1:
rtl8192, the wifi driver, is on the last log line. I re-ran the 's4test -h' test but unloaded the r8192se_pci module first. There is not much difference. The last line before the reboot in kern.log is now:

PM: Syncing filesystems ... called -> sys_sync()

Kees van Vloten (kvv) wrote :

What I mean in #45 is that the date/time flips from July 29th to March 11th; my guess is that that value comes from the RTC...
Attached is the System.map

Kees van Vloten (kvv) wrote :

I have run the 'hibernate hang' (aka situation 1) quite a lot of times now, but only one time I managed to catch something with locatehang:

Looking for function that matches hash from the Magic Number from the kernel log.
  Magic: 2:752:371 maps to hash: 5a7cf2
  Hash matches: __initcall_trace_init_flags_exit__recvfromearly() (address: ffffffff81b89340)

Usually the hash does not match any known kernel function.
Another phenomenon I noticed is that quite often, when I power the machine on after the hard power-off, it boots up and, just as it is about ready to show a login prompt, it switches off and reboots. It seems to continue the last bit of the hibernation process, which it could not finish earlier (because of the hang and hard power-off).

Seth Forshee (sforshee) wrote :

Does your machine come back from hibernation successfully if you use these commands?

  echo shutdown > /sys/power/disk
  echo disk > /sys/power/state
