[Dell Studio XPS 1340] Doesn't enter suspend mode

Reported by Daniel Manrique on 2011-04-08
This bug affects 9 people
Affects (status, importance, assignee):
- Linux: status Fix Released, importance Medium
- linux (Ubuntu): importance High, assigned to Seth Forshee
- Natty: importance High, unassigned
- Oneiric: importance High, unassigned
- Precise: importance High, unassigned

Bug Description

Steps to reproduce:
- try to enter suspend mode (I used sudo pm-suspend)

Expected result:
- The system enters sleep mode and upon pressing the power switch, resumes successfully

Actual result:
- The system fails to enter sleep mode, although it continues to be usable (desktop is OK, I can keep typing and using the system). It just doesn't enter sleep mode.

The sequence of drivers preparing to enter sleep mode, and then resuming immediately afterwards, can be seen in the attached dmesg.

This constitutes a regression, as this system is able to suspend/resume without issues with the 2.6.35-series kernel as shipped with Maverick (and was tested with the latest SRU kernel 2 days ago).

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: linux-image-2.6.38-8-generic 2.6.38-8.41
Regression: Yes
Reproducible: Yes
ProcVersionSignature: Ubuntu 2.6.38-8.41-generic 2.6.38.2
Uname: Linux 2.6.38-8-generic i686
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.23.
Architecture: i386
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: NVidia [HDA NVidia], device 0: STAC92xx Analog [STAC92xx Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: ubuntu 1335 F.... pulseaudio
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'NVidia'/'HDA NVidia at 0xf0880000 irq 17'
   Mixer name : 'Nvidia MCP79/7A HDMI'
   Components : 'HDA:111d7675,10280271,00100103 HDA:10de0007,10280271,00100100'
   Controls : 20
   Simple ctrls : 11
Date: Fri Apr 8 10:43:11 2011
HibernationDevice: RESUME=UUID=6e005f8d-e2d5-474b-b466-c55ffc400d24
InstallationMedia: Ubuntu 11.04 "Natty Narwhal" - Beta i386 (20110407.1)
Lsusb:
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 002: ID 05ca:18a0 Ricoh Co., Ltd
 Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Dell Inc. Studio XPS 1340
ProcEnviron:
 LANGUAGE=en_US:en
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.38-8-generic root=UUID=36d4feb1-e938-4a2a-b049-c593e85d4793 ro quiet splash vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-2.6.38-8-generic N/A
 linux-backports-modules-2.6.38-8-generic N/A
 linux-firmware 1.50
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 09/08/2009
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A11
dmi.board.name: 0K183D
dmi.board.vendor: Dell Inc.
dmi.board.version: A11
dmi.chassis.asset.tag: 1234567890
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: A11
dmi.modalias: dmi:bvnDellInc.:bvrA11:bd09/08/2009:svnDellInc.:pnStudioXPS1340:pvrA11:rvnDellInc.:rn0K183D:rvrA11:cvnDellInc.:ct8:cvrA11:
dmi.product.name: Studio XPS 1340
dmi.product.version: A11
dmi.sys.vendor: Dell Inc.

Daniel Manrique (roadmr) wrote :
description: updated
Ara Pulido (apulido) wrote :

Can you guys have a look at this regression, please?

Changed in linux (Ubuntu):
assignee: nobody → Canonical Platform QA Team (canonical-platform-qa)
importance: Undecided → High
Daniel Manrique (roadmr) wrote :

Tested this mainline kernel:

http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/linux-image-2.6.39-999-generic_2.6.39-999.201104080911_i386.deb

With this one, the system enters suspend mode, but upon trying to resume, the system is unresponsive: backlight doesn't come on, there is no display, keyboard is unresponsive, and system doesn't respond to pings on the network.

Brian Murray (brian-murray) wrote :

From dmesg:

[ 74.522347] PM: suspend of drv:scsi dev:host0 complete after 1597.165 msecs
[ 74.522499] PM: suspend of drv:pci dev:0000:00:09.0 complete after 1270.188 msecs
[ 74.532679] vmap allocation for size 1052672 failed: use vmalloc=<size> to increase size.
[ 74.576773] [drm] nouveau 0000:02:00.0: ... failed: -12
[ 74.576776] [drm] nouveau 0000:02:00.0: Re-enabling acceleration..
[ 74.576794] pci_legacy_suspend(): nouveau_pci_suspend+0x0/0x360 [nouveau] returns -12
[ 74.576800] pm_op(): pci_pm_suspend+0x0/0x100 returns -12
[ 74.576805] PM: suspend of drv:nouveau dev:0000:02:00.0 complete after 1602.740 msecs
[ 74.576809] PM: Device 0000:02:00.0 failed to suspend async: error -12
[ 74.576859] PM: suspend of drv:ahci dev:0000:00:0b.0 complete after 1346.055 msecs
[ 74.960080] HDA Intel 0000:00:08.0: PCI INT A disabled
[ 74.976069] HDA Intel 0000:00:08.0: power state changed by ACPI to D3
[ 74.976076] PM: suspend of drv:HDA Intel dev:0000:00:08.0 complete after 453.879 msecs
[ 75.453244] [drm] nouveau 0000:03:00.0: And we're gone!
[ 75.453271] nouveau 0000:03:00.0: PCI INT A disabled
[ 75.468020] PM: suspend of drv:nouveau dev:0000:03:00.0 complete after 2493.957 msecs
[ 75.468056] PM: Some devices failed to suspend
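
To pick the failing device out of a longer log, the PM lines can be filtered mechanically. The following is a minimal sketch, assuming the message format seen in the excerpt above; the `failed_devices` helper and its regular expression are illustrative and not part of any kernel tooling.

```python
import re

# Sample lines copied from the dmesg excerpt above.
DMESG = """\
[ 74.576794] pci_legacy_suspend(): nouveau_pci_suspend+0x0/0x360 [nouveau] returns -12
[ 74.576805] PM: suspend of drv:nouveau dev:0000:02:00.0 complete after 1602.740 msecs
[ 74.576809] PM: Device 0000:02:00.0 failed to suspend async: error -12
[ 75.468020] PM: suspend of drv:nouveau dev:0000:03:00.0 complete after 2493.957 msecs
[ 75.468056] PM: Some devices failed to suspend
"""

# Assumed message format: "PM: Device <id> failed to suspend[ async]: error <n>"
FAILED = re.compile(r"PM: Device (\S+) failed to suspend(?: async)?: error (-?\d+)")

def failed_devices(text):
    """Return (device, error code) pairs for devices that failed to suspend."""
    return [(m.group(1), int(m.group(2))) for m in FAILED.finditer(text)]

print(failed_devices(DMESG))  # [('0000:02:00.0', -12)]
```

Only 0000:02:00.0 is reported as failing here; the second nouveau device (0000:03:00.0) completes its suspend.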

Changed in linux (Ubuntu):
status: New → Triaged
Changed in linux (Ubuntu):
assignee: Canonical Platform QA Team (canonical-platform-qa) → Canonical Kernel Team (canonical-kernel-team)
Seth Forshee (sforshee) wrote :

This failure is due to the lack of a sufficiently large virtual address range in the vmalloc area to satisfy what the driver requests at suspend to store GPU objects. A quick Google search shows a lot of reports of nouveau being a vmalloc hog in normal operation, never mind suspend, so it may be doing itself in here. Or there could be other drivers contributing to heavy vmalloc usage. I don't know if there are any tools to analyze how the kernel vmalloc space is being used; I'll look to see if I can find any.

Realistically the only options are probably to either reduce vmalloc usage or increase vmalloc size. The simple solution is to follow the advice of the kernel and pass vmalloc=<size> on the command-line (maybe start with 128M and go from there). Another (much more complicated) potential solution would be to see if it's possible to unmap the driver mmio space at suspend and remap it at resume.
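
To confirm whether a given boot actually carries the option, the kernel command line can be inspected. This is a rough sketch, assuming only the K/M/G size suffixes matter; `vmalloc_bytes` is a hypothetical helper, and the sample command line is the one from the original report.

```python
import re

# Binary multipliers for the K/M/G suffixes; other suffixes are not handled here.
SUFFIXES = {"": 1, "K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}

def vmalloc_bytes(cmdline):
    """Return the vmalloc= size in bytes, or None if the option is absent."""
    for token in cmdline.split():
        m = re.fullmatch(r"vmalloc=(\d+)([KMG]?)", token, re.IGNORECASE)
        if m:
            return int(m.group(1)) * SUFFIXES[m.group(2).upper()]
    return None

# Command line from the original report, without and with the suggested option.
base = ("BOOT_IMAGE=/boot/vmlinuz-2.6.38-8-generic "
        "root=UUID=36d4feb1-e938-4a2a-b049-c593e85d4793 ro quiet splash vt.handoff=7")
print(vmalloc_bytes(base))                    # None
print(vmalloc_bytes(base + " vmalloc=128M"))  # 134217728
```

On a running system the same function could be fed the contents of /proc/cmdline.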

Running a 64-bit kernel should also take care of the problem.

Daniel Manrique (roadmr) wrote :

Thanks for the suggestions Seth, I tested and here's what I got:

1- I tried increasing vmalloc (went as far up as 256M). What I see then is the system freezing when I do pm-suspend. It doesn't suspend, but the keyboard and mouse stop responding; the only remaining option is to reboot.

2- Same behavior when using the 64-bit kernel (in fact, a whole new 64-bit installation), it freezes, becomes unresponsive and I have to reboot.

3- I then installed the proprietary nvidia drivers:

[ 18.486] (II) NVIDIA(0): Creating default Display subsection in Screen section
[ 19.144] (II) NVIDIA(0): NVIDIA GPU GeForce 9400M G (C79) at PCI:3:0:0 (GPU-0)
[ 19.168] (II) NVIDIA(0): Assigned Display Device: DFP-0
[ 19.168] (II) NVIDIA(0): Validated modes:
[ 19.168] (II) NVIDIA(0): "nvidia-auto-select"

With these proprietary drivers, the system suspends successfully and resumes with some garbling on the screen. Maximizing the terminal (F11) basically "sweeps" the display and makes it usable, although the background itself turns white. So it's better and possibly usable, though it might need some work. More importantly, it confirms your diagnosis that nouveau is keeping the system from suspending successfully.

Let me know if more testing is needed.

Thanks again,
- Daniel

Seth Forshee (sforshee) wrote :

/proc/vmallocinfo shows all the vmalloc mappings. We could take a look at that to get an idea of what's consuming the address space. The VmallocTotal, VmallocUsed, and VmallocChunk fields in /proc/meminfo would also be useful to look at.

Seth Forshee (sforshee) wrote :

Do you get any kind of panic message or anything else when the system freezes after you increase vmalloc? Try running pm-suspend from within vt1 to see if anything shows up on the screen. You might also try using magic sysrq when it's frozen, try alt-sysrq-p to get a dump of the current task's state.

The freeze may well be a different problem.

Daniel Manrique (roadmr) wrote :

Hi, I'm attaching vmallocinfo and meminfo.

Seth Forshee (sforshee) wrote :

Ah, you're using a 64-bit kernel now.

Some of the biggest vmalloc areas are coming from nouveau.

0xffffc90002480000-0xffffc90002c81000 8392704 nouveau_load+0xa7/0x550 [nouveau] phys=ae000000 ioremap
0xffffc90002d00000-0xffffc90004d01000 33558528 nouveau_load+0x2f1/0x550 [nouveau] phys=ac000000 ioremap
0xffffc90006100000-0xffffc90006901000 8392704 nouveau_load+0xa7/0x550 [nouveau] phys=aa000000 ioremap
0xffffc90006980000-0xffffc90008981000 33558528 nouveau_load+0x2f1/0x550 [nouveau] phys=cc000000 ioremap

That's over 80 MB from nouveau (and I omitted several others of a couple of pages each). Some other areas are using sizable amounts as well, the most notable of which is audio. meminfo shows the amount of vmalloc used as 138284 kB, and if it's anywhere near that on a 32-bit install it's not hard to see why it might start having problems.
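
The "over 80 MB" figure can be reproduced by summing the size column per module. A small sketch follows, assuming the /proc/vmallocinfo line layout shown in the four lines above; `bytes_per_module` is an illustrative name, not an existing tool.

```python
import re
from collections import defaultdict

# The four nouveau lines quoted above from /proc/vmallocinfo (64-bit kernel).
VMALLOCINFO = """\
0xffffc90002480000-0xffffc90002c81000 8392704 nouveau_load+0xa7/0x550 [nouveau] phys=ae000000 ioremap
0xffffc90002d00000-0xffffc90004d01000 33558528 nouveau_load+0x2f1/0x550 [nouveau] phys=ac000000 ioremap
0xffffc90006100000-0xffffc90006901000 8392704 nouveau_load+0xa7/0x550 [nouveau] phys=aa000000 ioremap
0xffffc90006980000-0xffffc90008981000 33558528 nouveau_load+0x2f1/0x550 [nouveau] phys=cc000000 ioremap
"""

# Assumed layout: <start>-<end> <size in bytes> <caller> [<module>] <flags...>
LINE = re.compile(r"^0x[0-9a-f]+-0x[0-9a-f]+\s+(\d+)\s+\S+(?:\s+\[(\w+)\])?")

def bytes_per_module(text):
    """Sum mapping sizes per module; callers without a module go to one bucket."""
    totals = defaultdict(int)
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            totals[m.group(2) or "(core kernel)"] += int(m.group(1))
    return dict(totals)

totals = bytes_per_module(VMALLOCINFO)
print(totals["nouveau"])          # 83902464 bytes
print(totals["nouveau"] / 2**20)  # 80.015625 MiB
```

Run against a complete dump, the same function would also show how much the audio and other mappings contribute.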

Daniel Manrique (roadmr) wrote :

OK, so I reinstalled the 32-bit kernel. However, I'm unable to go to a vt to see if it gives a panic message while suspending. When I press, say, ctrl+alt+f1, the graphical cursor disappears but the rest of the desktop stays visible. It looks like the display didn't get reset to text mode, so I can't see what I'm typing (and certainly no debugging messages). If I press alt+f7 I "go back" to graphical mode: the cursor reappears and the screen is responsive again.

The only way I found to get to a console was to use xforcevesa nomodeset, though of course that probably changes things in other ways (like not using the nouveau driver). The system enters suspend and of course, upon resuming, the screen is blank (with no backlight). However other than that, the system appears to have recovered, as I was able to ssh in and recover a dmesg file I'm attaching.

meminfo says the following about vmalloc:
VmallocTotal: 122880 kB
VmallocUsed: 22356 kB
VmallocChunk: 93500 kB

The top entries in vmalloc (sorted by size) are:

0xf84e5000-0xf8508000 143360 kvmalloc+0x3f/0x50 pages=34 vmalloc
0xf85d6000-0xf85f9000 143360 kvmalloc+0x3f/0x50 pages=34 vmalloc
0xf82ec000-0xf8314000 163840 module_alloc_update_bounds+0x19/0x70 pages=39 vmalloc
0xf84a9000-0xf84d6000 184320 module_alloc_update_bounds+0x19/0x70 pages=44 vmalloc
0xf842a000-0xf846a000 262144 module_alloc_update_bounds+0x19/0x70 pages=63 vmalloc
0xf83d6000-0xf8420000 303104 module_alloc_update_bounds+0x19/0x70 pages=73 vmalloc
0xf85fa000-0xf8693000 626688 module_alloc_update_bounds+0x19/0x70 pages=152 vmalloc
0xf8201000-0xf82b2000 724992 sys_swapon+0x428/0x8a0 pages=176 vmalloc
0xf9380000-0xf94b1000 1249280 0xf80361a9 phys=cd000000 ioremap
0xff000000-0xff400000 4194304 pcpu_get_vm_areas+0x0/0x4c0 vmalloc
0xf8711000-0xf8b12000 4198400 snd_malloc_sgbuf_pages+0x1a5/0x202 [snd_page_alloc] vmap
0xf8b13000-0xf8f14000 4198400 snd_malloc_sgbuf_pages+0x1a5/0x202 [snd_page_alloc] vmap
0xf8f15000-0xf9316000 4198400 snd_malloc_sgbuf_pages+0x1a5/0x202 [snd_page_alloc] vmap

Seth Forshee (sforshee) wrote :

That's really strange that you can't switch vts. We may be looking at more than one bug here. Does the same thing happen if you run 'sudo chvt 1' in a console?

Are the meminfo and vmalloc dumps from a boot with xforcevesa nomodeset? Because it really isn't using much of the vmalloc area, so I'd be surprised to see the vmap failures from the original "won't suspend" problem in that situation.

It might be best to try to attack the various issues one at a time. Starting with the vmalloc problem, I guess the first thing is for you to tell me whether the meminfo/vmalloc information you just supplied is from a "xforcevesa nomodeset" or not. If it is, I'd be interested to see the same information with the default kernel command-line.

The next step is probably to test the 2.6.38.3 mainline build since that's closest to natty's kernel and will tell us which direction we need to start looking to track down the regression. You can grab this build at:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.38.3-natty/

Daniel Manrique (roadmr) wrote :

Hi Seth,

Yes, I apologize, I guess I panicked about being unable to switch consoles so I started doing nonsense.

So let's go one step at a time:

here's vmalloc (top offenders) and meminfo from a boot without xforcevesa and nomodeset (i.e. using the nouveau driver as during the observed failures).

0xf845e000-0xf8481000 143360 kvmalloc+0x3f/0x50 pages=34 vmalloc
0xf8482000-0xf84a5000 143360 kvmalloc+0x3f/0x50 pages=34 vmalloc
0xf8303000-0xf832b000 163840 module_alloc_update_bounds+0x19/0x70 pages=39 vmalloc
0xf84da000-0xf8507000 184320 module_alloc_update_bounds+0x19/0x70 pages=44 vmalloc
0xf83ef000-0xf842f000 262144 module_alloc_update_bounds+0x19/0x70 pages=63 vmalloc
0xf839b000-0xf83e5000 303104 module_alloc_update_bounds+0x19/0x70 pages=73 vmalloc
0xf86a4000-0xf873d000 626688 module_alloc_update_bounds+0x19/0x70 pages=152 vmalloc
0xf8201000-0xf82b2000 724992 sys_swapon+0x428/0x8a0 pages=176 vmalloc
0xf87fc000-0xf89fd000 2101248 drm_ht_create+0x54/0xd0 [drm] pages=512 vmalloc
0xfb602000-0xfb803000 2101248 drm_ht_create+0x54/0xd0 [drm] pages=512 vmalloc
0xfed76000-0xfef77000 2101248 snd_malloc_sgbuf_pages+0x1a5/0x202 [snd_page_alloc] vmap
0xfb300000-0xfb601000 3149824 ttm_bo_kmap+0xff/0x120 [ttm] phys=d000c000 ioremap
0xfe180000-0xfe571000 4132864 ttm_bo_kmap+0xff/0x120 [ttm] phys=b000c000 ioremap
0xff000000-0xff400000 4194304 pcpu_get_vm_areas+0x0/0x4c0 vmalloc
0xfe572000-0xfe973000 4198400 snd_malloc_sgbuf_pages+0x1a5/0x202 [snd_page_alloc] vmap
0xfe974000-0xfed75000 4198400 snd_malloc_sgbuf_pages+0x1a5/0x202 [snd_page_alloc] vmap
0xf8a00000-0xf9201000 8392704 nouveau_load+0x9c/0x4f0 [nouveau] phys=ae000000 ioremap
0xfb880000-0xfc081000 8392704 nouveau_load+0x9c/0x4f0 [nouveau] phys=aa000000 ioremap
0xf9280000-0xfb281000 33558528 nouveau_load+0x2a9/0x4f0 [nouveau] phys=ac000000 ioremap
0xfc100000-0xfe101000 33558528 nouveau_load+0x2a9/0x4f0 [nouveau] phys=cc000000 ioremap

VmallocTotal: 122880 kB
VmallocUsed: 112688 kB
VmallocChunk: 4088 kB
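
For a quick read of how much headroom these numbers leave, the three fields can be reduced to a free amount and a usage percentage. A minimal sketch, using the values just above; `vmalloc_fields` is a hypothetical helper that could equally be fed the full text of /proc/meminfo.

```python
import re

# Vmalloc fields from the meminfo output above (32-bit natty kernel, nouveau loaded).
MEMINFO = """\
VmallocTotal: 122880 kB
VmallocUsed: 112688 kB
VmallocChunk: 4088 kB
"""

def vmalloc_fields(text):
    """Return the Vmalloc* fields of /proc/meminfo as a dict of kB values."""
    pairs = re.findall(r"^(Vmalloc\w+):\s+(\d+) kB", text, re.MULTILINE)
    return {name: int(value) for name, value in pairs}

f = vmalloc_fields(MEMINFO)
free_kb = f["VmallocTotal"] - f["VmallocUsed"]
used_pct = round(100 * f["VmallocUsed"] / f["VmallocTotal"], 1)
print(free_kb)            # 10192 kB not in use
print(f["VmallocChunk"])  # 4088 kB in the largest contiguous free chunk
print(used_pct)           # 91.7 percent of the vmalloc area in use
```
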

I will test the mainline build you suggested and report back as soon as I have something.

Finally, for the vt switching problem, I ran the command you suggested and I had the same behavior (i.e. can't switch to the vt). However, I've seen this on one other system (a Dell Vostro 3400 with dual graphics which is using the intel driver), so I think I'll do some more testing about that and report it as a different bug.

Thanks so much for your help!

Daniel Manrique (roadmr) wrote :

I tested the v2.6.38.3-natty mainline kernel. The system boots and is usable, and it enters suspend mode, but upon attempting to resume it is unresponsive: the backlight doesn't come up, there is no display, the keyboard is unresponsive, and the system doesn't respond to pings on the network. At some point I saw capslock flashing, but it stopped after a bit.

I also tested v2.6.35.12-maverick. With this kernel the system boots, but when trying to enter graphical mode it becomes unresponsive: no keyboard, network ping or display (screen is black). Maybe some Ubuntu-specific modifications are what enabled the actual shipped Maverick kernel to work?

Seth Forshee (sforshee) on 2011-04-15
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Seth Forshee (sforshee)
status: Triaged → In Progress
Seth Forshee (sforshee) wrote :

Okay, that vmalloc information is more like what I expected. Space is pretty tight.

It's interesting that the 2.6.38.3 mainline build doesn't have the same problems with suspend. So to start we can try to figure out what's different there that's making suspend fail in natty. Can you grab the vmalloc and meminfo dumps with the mainline kernel so we can check if anything there is significantly different? And I'll scan our patches on top of mainline to see if anything jumps out as potentially related.

Thanks!

Daniel Manrique (roadmr) wrote :

Hi,
Seeing as how the kernel shipped with Maverick worked, whereas the mainline 2.6.35 one doesn't, maybe some Ubuntu-specific patch enables things to work correctly on that one.

In any case, I'm currently with this kernel:

Linux 200912-4906 2.6.38-02063803-generic #201104150912 SMP Fri Apr 15 10:37:38 UTC 2011 i686 i686 i386 GNU/Linux

meminfo is manageable so I'm posting that here:

MemTotal: 2314600 kB
MemFree: 1905148 kB
Buffers: 27732 kB
Cached: 204272 kB
SwapCached: 0 kB
Active: 156192 kB
Inactive: 188916 kB
Active(anon): 113824 kB
Inactive(anon): 1876 kB
Active(file): 42368 kB
Inactive(file): 187040 kB
Unevictable: 0 kB
Mlocked: 0 kB
HighTotal: 1448712 kB
HighFree: 1102132 kB
LowTotal: 865888 kB
LowFree: 803016 kB
SwapTotal: 2880508 kB
SwapFree: 2880508 kB
Dirty: 60 kB
Writeback: 0 kB
AnonPages: 113140 kB
Mapped: 42952 kB
Shmem: 2600 kB
Slab: 24348 kB
SReclaimable: 13192 kB
SUnreclaim: 11156 kB
KernelStack: 2344 kB
PageTables: 2992 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 4037808 kB
Committed_AS: 989152 kB
VmallocTotal: 122880 kB
VmallocUsed: 102720 kB
VmallocChunk: 10812 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 4096 kB
DirectMap4k: 12280 kB
DirectMap4M: 897024 kB

I'm also attaching vmalloc-mainline.txt which is a bit larger, but I thought it would be good to have the complete file here.

Let me know when any more information or testing is needed, remember this machine lives for testing so we can be as invasive with it as necessary.

Thanks again for all your help!

Changed in linux (Ubuntu Natty):
milestone: none → ubuntu-11.04
milestone: ubuntu-11.04 → natty-updates
Changed in linux (Ubuntu Oneiric):
status: New → In Progress
importance: Undecided → High
Seth Forshee (sforshee) wrote :

@Daniel, you don't happen to have the full /proc/vmallocinfo dump for the 32-bit natty kernel, do you? That would help with determining what's using more vmalloc space in natty versus mainline.

Daniel Manrique (roadmr) wrote :

Hi Seth,

Here it is, produced with this kernel:

Linux 200912-4906 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:50 UTC 2011 i686 i686 i386 GNU/Linux

Seth Forshee (sforshee) wrote :

The vmap differences can be almost completely attributed to a patch we're carrying to the snd-hda-intel driver to increase the audio buffer size to "improve the audio experience." I'll look into it but I'd guess there's a good reason for the patch.

Moving on to the next problem. I'd like to see what's going on with the suspend hang when you increase the vmalloc size, but having working vt's might be helpful. Did you file a bug for the vt problem?

Since you don't have vt's, you could try booting in recovery mode (boot with no_console_suspend) and run pm-suspend to see if it works there and if you get any interesting output. You can also try some of the steps in the following wiki pages to see if they yield anything useful.

https://wiki.ubuntu.com/DebuggingKernelSuspend
https://wiki.ubuntu.com/DebuggingKernelSuspendHibernateResume

Seth Forshee (sforshee) on 2011-04-19
Changed in linux (Ubuntu Natty):
status: In Progress → Incomplete
Daniel Manrique (roadmr) wrote :

hi Seth,

I did the DebuggingKernelSuspend procedure and this is what popped out:

[ 1.100693] Magic number: 0:846:402
[ 1.100695] hash matches /build/buildd/linux-2.6.38/drivers/base/power/main.c:535
[ 1.100720] pci 0000:03:00.0: hash matches

I'm attaching the entire dmesg from that run to this comment.

I also tried no_console_suspend in combination with vmalloc=256M, when I issue pm-suspend from a terminal (still can't switch to a vt) the system "freezes" (same behavior as in #6). I see no useful messages :(

Changed in linux (Ubuntu Natty):
status: Incomplete → In Progress
Seth Forshee (sforshee) wrote :

Daniel,

Thanks for testing. If that PCI id is accurate then it seems we're still looking for some kind of problem with the nouveau driver. When you see the hang, is your caps lock LED blinking?

I think my wording for one of my suggested test cases was confusing. I think it would be useful if you could boot into recovery mode by holding left-shift when booting to get the grub menu and selecting the "recovery mode" boot option. Also modify the kernel command-line when you boot into recovery mode to include "vmalloc=256M no_console_suspend". This should boot you to a text-mode terminal where you can run pm-suspend without the graphical UI in the way. Chances are that you'll either get working suspend-resume or won't see any useful output, but it's worth a try.

Daniel Manrique (roadmr) wrote :

Hi Seth,

I repeated the suspend to get the system to hang again, there's no flashing caps lock :(

Also, I tried your suggestion to boot in single-user text mode. There seems to be some problem initializing the display, I'm attaching a picture of what I see, and nothing I type shows up. What's odd is that the system is "alive", again, I can blind-type and I can even issue commands to reboot the system, bring up X (which comes up just fine) or make it suspend in that state. Of course, this is not useful as I still can't see actual text on the console :( I tried using the nosplash kernel parameter but there was no change in the display behavior.

Seth Forshee (sforshee) wrote :

Boy, that machine just has all sorts of problems with graphics. Kind of makes it hard to decide what to pound on first.

I checked, and we aren't carrying any patches to nouveau in the natty kernel. So I'm a bit puzzled why you see different behavior with natty (with vmalloc size increased) versus mainline unless it's due to something external to nouveau. Looking closer at the pm_trace it seems that it really only traces device resume and not suspend, so it's a bit interesting that pm_trace showed anything at all. Makes me wonder if something went wrong elsewhere in suspend and then the machine hung in the nouveau code while trying to back out of the failed suspend.

Probably the most effective thing to do at this point is to open upstream bug reports against the issues you see when running a mainline build. If you don't mind filing the bugs yourself it might be easier since you actually have the hardware, otherwise I can do it. The appropriate location for the upstream bug reports is:

https://bugs.freedesktop.org/

This seems similar to:
suspend hibernation not working on dell 1749
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/748994

Ara Pulido (apulido) wrote :

Seth, any updates on this bug?

Seth Forshee (sforshee) wrote :

No updates. My suggestion in comment #24 was to start working with upstream on the issue. I offered to help with filing the bug upstream if needed but haven't seen any response from Daniel. If possible I think it's more efficient for him to interface directly since I don't have the hardware, but I'm willing to be an intermediary or at least file the initial bug report.

Seth Forshee (sforshee) wrote :

Daniel, if you do file an upstream bug also be sure to link to it here. I certainly intend to follow the progress and provide support as needed.

This problem was originally reported here:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/754711

The system works correctly with the Ubuntu-patched 2.6.35 Kernel as shipped with Ubuntu 10.10, but fails with the 2.6.38 kernel from Ubuntu 11.04.

While tracking down this problem, I tested several mainline kernels, all of which also fail to suspend/resume on this system, as I describe below.

Steps to reproduce:
- try to enter suspend mode (I used sudo pm-suspend)

Expected result:
- The system enters sleep mode and upon pressing the power switch, resumes successfully

Actual result:
I tried this with the following kernels (all mainline):

2.6.39-999.201104080911 - The system enters suspend mode, but upon trying to resume, the system is unresponsive: backlight doesn't come on, there is no display, keyboard is unresponsive, and system doesn't respond to pings on the network.

v2.6.38.3-natty mainline kernel - The system boots and is usable, and it enters suspend mode, but upon attempting to resume it is unresponsive: the backlight doesn't come up, there is no display, the keyboard is unresponsive, and the system doesn't respond to pings on the network. At some point I saw capslock flashing, but it stopped after a bit.

v2.6.35.12-maverick mainline kernel, with this kernel the system boots but when trying to enter graphical mode becomes unresponsive, no keyboard, network ping or display (screen is black).

Here is some relevant information on the system, please let me know if any more tests are needed.

dmi.bios.date: 09/08/2009
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A11
dmi.board.name: 0K183D
dmi.board.vendor: Dell Inc.
dmi.board.version: A11
dmi.chassis.asset.tag: 1234567890
dmi.chassis.type: 8
dmi.chassis.vendor: Dell Inc.
dmi.chassis.version: A11
dmi.modalias: dmi:bvnDellInc.:bvrA11:bd09/08/2009:svnDellInc.:pnStudioXPS1340:pvrA11:rvnDellInc.:rn0K183D:rvrA11:cvnDellInc.:ct8:cvrA11:
dmi.product.name: Studio XPS 1340
dmi.product.version: A11
dmi.sys.vendor: Dell Inc.

GraphicsCard:
02:00.0 VGA compatible controller [0300]: nVidia Corporation G98 [GeForce 9200M GS] [10de:06e8] (rev a1) (prog-if 00 [VGA controller])
 Subsystem: Dell Device [1028:0271]
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 23
 Region 0: Memory at ae000000 (32-bit, non-prefetchable) [size=16M]
 Region 1: Memory at d0000000 (64-bit, prefetchable) [size=256M]
 Region 3: Memory at ac000000 (64-bit, non-prefetchable) [size=32M]
 Region 5: I/O ports at 4000 [size=128]
 Capabilities: <access denied>
 Kernel driver in use: nouveau
 Kernel modules: nouveau, nvidiafb

03:00.0 VGA compatible controller [0300]: nVidia Corporation C79 [GeForce 9400M G] [10de:0866] (rev b1) (prog-if 00 [VGA controller])
 Subsystem: Dell Device [1028:0271]
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 ...

Seth Forshee (sforshee) wrote :

Daniel, I think maybe you linked to the wrong upstream bug report.

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Daniel Manrique (roadmr) wrote :

Linked to the wrong bug upstream - fixing

Changed in linux:
importance: Medium → Unknown
status: Confirmed → Unknown
Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Daniel Manrique (roadmr) wrote :

By the way, I tested with the latest mainline kernel as of today:

http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/linux-image-2.6.39-999-generic_2.6.39-999.201105240905_i386.deb

With this kernel I'm still seeing faulty resume behavior as described for 2.6.38.3 in comment #15 (it enters suspend but fails to resume). However, I am able to switch to a VT when the system is working (i.e. before suspend). So it's probably not worth filing an upstream bug for the VT issue, as I'll likely be told that it's been solved.

I could file an Ubuntu bug for the VT problem, as it's still present on the latest proposed Natty kernel (2.6.38-9) but NOT, as I mentioned, on mainline.

Let me know if it makes sense for me to do this to also keep track of the VT switching problem.

Seth Forshee (sforshee) wrote :

I'd suggest testing the latest .38 stable mainline build (2.6.38.7) to see if it's been fixed there. If it has it will filter into natty eventually. If it's not fixed go ahead and file an Ubuntu bug and we'll try to get the fix into natty. Thanks!

http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.38.7-natty/

Daniel Manrique (roadmr) wrote :

Seth,

I tested the 2.6.38.7-natty mainline kernel, and on that one, vt switching works correctly.

So I won't be filing a bug about that one, guess it's about waiting for the nouveau resume upstream bug to get looked at (linked to the correct one a while ago, sorry about that).

Thanks,

- Daniel

Seth Forshee (sforshee) wrote :

Cool. Might also be worth trying the recovery mode boot with that kernel to see if that problem got fixed as well.

Just curious if any testing has been done with the Oneiric 3.0 kernel? 3.0.0-5.6 is the most recent upload and was based on the v3.0-rc7 upstream kernel.

Changed in linux (Ubuntu Oneiric):
status: In Progress → Incomplete
Daniel Manrique (roadmr) wrote :

Hi Leann,

I installed Oneiric from the 2011-07-15 image, and retried the sleep test on this machine.

It now enters sleep, and the power light starts blinking. After a while I press the power button to wake it up, but then the system is unresponsive: no backlight, no display, no response to the keyboard (caps lock doesn't react), and no active network connection.

So behavior has changed but is still bad :(

I could try to install the proprietary nvidia drivers, though I'm not sure if those are working with Oneiric yet.

I can also try the power management debug tools from cking, though as of two days ago, they were known to not work on Oneiric.

Changed in linux (Ubuntu Oneiric):
status: Incomplete → In Progress

I've got the same graphics card on a Dell Studio XPS 1330, and exactly the same freeze after waking up from suspend-to-ram (no problem with suspend-to-disk).
I tried several older kernels (Debian sid) and hit the same bug every time.

The /var/log/pm-suspend.log does not contain any line corresponding to the wake-up.

Ara Pulido (apulido) on 2011-08-15
tags: added: oneiric
Brad Figg (brad-figg) on 2011-09-02
tags: added: rls-mgr-o-tracking
Seth Forshee (sforshee) on 2011-09-16
Changed in linux (Ubuntu Natty):
assignee: Seth Forshee (sforshee) → nobody
Changed in linux (Ubuntu Oneiric):
milestone: none → oneiric-updates
Changed in linux (Ubuntu Precise):
status: New → In Progress
importance: Undecided → High
tags: added: rls-mgr-p-tracking
removed: rls-mgr-o-tracking
James M. Leddy (jm-leddy) wrote :

Is this bug still on the radar? There hasn't been much activity on the fdo.org bug, is there anything else that we can do?

Daniel Manrique (roadmr) wrote :

Hi,

Here's an update on this bug. I finally got around to trying systemtap on kernel 3.0 with Oneiric. If I run the systemtap diagnostics as referenced here:

https://wiki.ubuntu.com/Kernel/Reference/S3SystemTapDebug

the system begins suspend preparations, flickers the screen as if trying to suspend, and then *doesn't* suspend, returning me to the OS with this message on the terminal:

Suspending machine..
PM-INFO: Ready to run S3 test
PM-TEST: 255 tasks frozen successfully.
PM-TEST: __device_suspend(): device platform () failed to suspend.
PM-TEST: __device_suspend(): device platform () failed to suspend.
PM-TEST: dpm_suspend(): failed to suspend all devices.
PM-TEST: 2 of 554 devices failed to suspend.
PM-INFO: Devices that failed to suspend: 0000:02:00.0.
PM-TEST: dpm_suspend_start(): dpm_prepare() or dom_suspend() failed, cannot prepare devices for PM transistion and suspend.
PM-TEST: 544 devices resumed correctly.
PM-TEST: suspend_devices_and_enter(): failed to suspend devices and enter the desired system sleep state.
PM-TEST: enter_state(): failed because either suspend_prepare() or suspend_devices_and_enter() failed.
PM-TEST: state_store(): Expecting a return value of 3, got -12 instead.
PM-INFO: 32 functions entered, 30 functions returned.
PM-INFO: 255 tasks frozen and 255 tasks thawed.
PM-INFO: S3 test completed.

The output reports that 2 devices failed to suspend. The one it names (02:00.0) is the "VGA compatible controller [0300]: nVidia Corporation G98 [GeForce 9200M GS] [10de:06e8] (rev a1)".
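For anyone repeating this test, the failing device can be pulled out of the systemtap output mechanically. The following is only a minimal Python sketch: the helper names are made up for illustration, and the line formats are assumed to match the excerpt quoted in this comment.

```python
import re

# Lines copied from the S3 test output quoted in this comment.
LOG = """\
PM-TEST: 2 of 554 devices failed to suspend.
PM-INFO: Devices that failed to suspend: 0000:02:00.0.
PM-TEST: state_store(): Expecting a return value of 3, got -12 instead.
"""

# PCI address in domain:bus:device.function form, e.g. 0000:02:00.0
PCI_ADDR = re.compile(r"\b[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\b")


def failed_devices(log_text):
    """Return the PCI addresses listed on the 'failed to suspend' summary line."""
    devices = []
    for line in log_text.splitlines():
        if line.startswith("PM-INFO: Devices that failed to suspend:"):
            devices.extend(PCI_ADDR.findall(line))
    return devices


def failure_count(log_text):
    """Return (failed, total) from the 'N of M devices failed' line, or None."""
    m = re.search(r"PM-TEST: (\d+) of (\d+) devices failed to suspend", log_text)
    return (int(m.group(1)), int(m.group(2))) if m else None


print(failed_devices(LOG), failure_count(LOG))
```

The extracted address can then be matched against lspci output (for example `lspci -s 02:00.0`) to identify the device.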

If I run just pm-suspend, without systemtap debugging, the system behaves as originally reported, apparently entering suspend mode, but when I press the power button it fails to resume and just becomes unresponsive.

I also tried the latest mainline, 3.2.0-rc2 kernel from Ubuntu's mainline repository. Behavior is the same as originally reported. So this bug is still an issue.

I'm attaching all the systemtap-produced logfiles.

Seth Forshee (sforshee) wrote :

It seems the nouveau driver suspend returned -ENOMEM, so I'd suspect the same vmalloc failure that we identified previously. dmesg might provide more clues as to whether or not that's the case.
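As a quick cross-check of the -12 in the state_store() line, the value can be decoded with Python's standard errno module. This is just a sketch: describe_errno is a hypothetical helper name, and it assumes Linux errno numbering.

```python
import errno
import os


def describe_errno(ret):
    """Map a negative kernel-style return value to (symbolic name, message)."""
    code = abs(ret)
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


# -12 is the value reported by state_store() in the systemtap output.
print(describe_errno(-12))
```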

Please note on the freedesktop bug that this is still a problem in 3.2-rc2.

Just to update on this, I tested with a 3.2-rc2 kernel, where this problem is still present as described originally.

Thanks!

Daniel Manrique (roadmr) wrote :

Hi Seth!

I'm attaching dmesg. It was a bit difficult to obtain because the system is still unresponsive after a resume, so I'm unable to fetch the file.

However, if I run the suspend through systemtap, it detects the failure to suspend and doesn't die. This dmesg was obtained after running the systemtap s3test script. Hopefully it'll contain something useful.

I also updated the freedesktop bug with our latest tests on kernel 3.2.

Seth Forshee (sforshee) wrote :

Daniel: Sorry, I should have been more specific. I was intending to ask for the trace from the systemtap case, since I was looking at the messages from the systemtap logs.

It does look like the same or an extremely similar failure -- no sufficient address range in the vmalloc area to satisfy the allocations nouveau is making during suspend.
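One way to look at vmalloc headroom before suspending is to read the Vmalloc* fields from /proc/meminfo. The sketch below is only illustrative: the sample numbers are invented rather than taken from this machine, the function names are hypothetical, and some kernel versions report VmallocUsed and VmallocChunk as 0, in which case the check says nothing.

```python
# Invented sample in /proc/meminfo format; on a real system use
# open("/proc/meminfo").read() instead.
SAMPLE = """\
MemTotal:        3087372 kB
VmallocTotal:     122880 kB
VmallocUsed:       98304 kB
VmallocChunk:       4096 kB
"""


def vmalloc_stats(meminfo_text):
    """Return the Vmalloc* fields of /proc/meminfo as a dict of kB values."""
    stats = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("Vmalloc"):
            stats[key] = int(rest.split()[0])
    return stats


def can_fit(stats, request_kb):
    """True if the largest free contiguous vmalloc chunk covers request_kb."""
    return stats.get("VmallocChunk", 0) >= request_kb


stats = vmalloc_stats(SAMPLE)
print(stats, can_fit(stats, 8192))
```

VmallocChunk is the relevant field here because the failure is about finding one sufficiently large contiguous range, not about total free space.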

But I don't think what you see with systemtap is the same bug as what you're seeing without it. The bug you see with systemtap is recoverable (you've seen it recover in fact) and happens during suspend. The bug without it is on the resume side and isn't recovering. As a result I don't think systemtap is adding any useful information for this bug.

Have you tried a serial console to see if you can get more data during resume? I assume you don't have a serial port on the machine, but sometimes you can still get data with console on a USB serial adapter. It may or may not work, but if you have access to an adapter it might be worth trying.

Ara Pulido (apulido) wrote :

This one does not block certification, as it works fine with the proprietary drivers.

tags: removed: blocks-hwcert

Can you provide dmesg from failed suspend on 64-bit kernel? I bet it will fail differently than 32-bit kernel.

2 256MB cards + 32-bit kernel without CONFIG_HIGHMEM4G (only 876 MB of RAM directly accessible) = fail. Maybe it's fixable, but I wouldn't count on it.

(In reply to comment #3)
> Can you provide dmesg from failed suspend on 64-bit kernel? I bet it will fail
> differently than 32-bit kernel.
>
> 2 256MB cards + 32-bit kernel without CONFIG_HIGHMEM4G (only 876 MB of RAM
> directly accessible) = fail. Maybe it's fixable, but I wouldn't count on it.

Marcin: I take it you are referring to failure to suspend due to vmalloc failures in the nouveau driver? Indeed, Daniel has verified that this problem can be avoided by using a 64-bit kernel or by passing vmalloc=128M to the kernel (see the reference bug in Launchpad for details).

When that issue is avoided the system does appear to suspend successfully, but then hangs when resuming.
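To confirm that a vmalloc= setting like the one mentioned above actually reached the running kernel, the boot command line can be inspected. This is a rough sketch, not a tested tool: the command line string is a made-up example (on a real system it would come from /proc/cmdline), the helper names are hypothetical, and only the K, M and G size suffixes are handled.

```python
# Made-up example; on a real system use open("/proc/cmdline").read().
CMDLINE = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet splash vmalloc=128M"


def vmalloc_param(cmdline):
    """Return the raw value of vmalloc= from a kernel command line, or None."""
    for token in cmdline.split():
        if token.startswith("vmalloc="):
            return token.split("=", 1)[1]
    return None


def size_to_bytes(size):
    """Convert a size such as '128M' to bytes (K, M and G suffixes only)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    suffix = size[-1].upper()
    if suffix in units:
        return int(size[:-1]) * units[suffix]
    return int(size)


value = vmalloc_param(CMDLINE)
print(value, size_to_bytes(value) if value else None)
```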

According to the latest dmesg (in comment #47), nouveau fails to "Suspend the GPU objects" with ENOMEM (the -12 reported in the log).

That is, nouveau fails to vmalloc(gpuobj->size), which it needs in order to store the current GPU objects so that they can be restored after resume.

jtheuer (mail-jtheuer) wrote :

I cannot enter suspend either. I get:

# sudo pm-suspend
flock: 3: Bad file descriptor

Is this issue related, or should I file a new bug?

dino99 (9d9) wrote :
Changed in linux (Ubuntu Natty):
status: In Progress → Invalid
Changed in linux (Ubuntu Oneiric):
status: In Progress → Invalid
Changed in linux (Ubuntu Precise):
status: In Progress → Invalid
Changed in linux (Ubuntu):
status: In Progress → Invalid
status: Invalid → Confirmed
status: Confirmed → In Progress

With the 3.9.x kernel, this bug is fixed on my computer.

Marking as fixed per the last comment. If this is still an issue, feel free to reopen with fresh logs/etc from a new kernel.

Changed in linux:
status: Confirmed → Fix Released
dino99 (9d9) on 2013-08-29
Changed in linux (Ubuntu):
status: In Progress → Fix Released