[i915] Crash after suspending (NULL pointer dereference in intel_crt_detect())

Bug #553176 reported by Milan Bouchet-Valat
This bug affects 18 people
Affects           Status         Importance   Assigned to   Milestone
X.Org X server    Fix Released   High
linux (Ubuntu)    Invalid        High         Manoj Iyer
Nominated for Lucid by Milan Bouchet-Valat

Bug Description

This happens about one hour after returning from suspend, and makes X crash. See bug 553174 for the X crash report, where I have put the details and a link to the upstream report.

ProblemType: KernelOops
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-19-generic
Regression: No
Reproducible: Yes (just suspend and resume)
TestedUpstream: Yes (drm-intel-next branch, see upstream report)
ProcVersionSignature: Ubuntu 2.6.32-19.28-generic 2.6.32.10+drm33.1
Uname: Linux 2.6.32-19-generic i686
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Annotation: Your system might become unstable now and might need to be restarted.
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: milan 1445 F.... pulseaudio
 /dev/snd/seq: timidity 1186 F.... timidity
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'ICH6'/'Intel ICH6 with ALC250 at irq 17'
   Mixer name : 'Realtek ALC250 rev 2'
   Components : 'AC97a:414c4752'
   Controls : 33
   Simple ctrls : 21
CurrentDmesg:
 [ 59.755859] lib80211_crypt: registered algorithm 'TKIP'
 [ 66.656439] apm: BIOS not found.
 [ 67.324827] ppdev: user-space parallel port driver
 [ 69.416020] eth1: no IPv6 routers present
 [ 99.487128] padlock: VIA PadLock not detected.
Date: Thu Apr 1 11:53:06 2010
Failure: oops
Frequency: This has only happened once.
HibernationDevice: RESUME=UUID=dd37abed-09dd-499f-bbbe-c2c3866cb9cf
Lsusb:
 Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: TOSHIBA Satellite A80
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdLine: BOOT_IMAGE=/vmlinuz-2.6.32-19-generic root=UUID=2dcc00f7-0fad-4823-9469-9e4e8dd841d0 ro crashkernel=384M-2G:64M,2G-:128M quiet splash
RelatedPackageVersions: linux-firmware 1.33
RfKill:

SourcePackage: linux
Title: BUG: unable to handle kernel NULL pointer dereference at 00000108
WpaSupplicantLog:

dmi.bios.date: 02/23/2005
dmi.bios.vendor: TOSHIBA
dmi.bios.version: V1.40
dmi.board.name: EAT10/EAT20
dmi.board.vendor: TOSHIBA
dmi.board.version: Null
dmi.chassis.asset.tag: *
dmi.chassis.type: 10
dmi.chassis.vendor: TOSHIBA
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnTOSHIBA:bvrV1.40:bd02/23/2005:svnTOSHIBA:pnSatelliteA80:pvrPSA80E-02V024FR:rvnTOSHIBA:rnEAT10/EAT20:rvrNull:cvnTOSHIBA:ct10:cvrN/A:
dmi.product.name: Satellite A80
dmi.product.version: PSA80E-02V024FR
dmi.sys.vendor: TOSHIBA

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Created an attachment (id=33884)
:0.log

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Created an attachment (id=33887)
Xorg.0.log.old

To be clear, I must add that those log files come from the failsafe X session that Ubuntu started after the crash. So they may not contain information about the crash itself, just about the fact that X is not able to restart correctly after the crash has occurred (I couldn't even switch to the consoles).

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Still happening with xserver-video-intel 2.10.902+git20100317.31d5f84b. Anything I can do to help debugging?

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Created an attachment (id=34354)
gdb trace of the SIGPIPE

Here's a gdb trace I could get of the crash, with kernel vmlinuz-2.6.33-997-generic (drm-intel-next), X.Org server 1.6.4, Intel driver 2.9.0. (Note this happens with more recent versions too, it's just that I'm using these as they don't suffer from other small issues.)

A funny thing is that while attached to gdb, X doesn't actually crash: I only get a SIGPIPE signal, and everything just works if I type 'continue'. When I detached Xorg, the signal killed the server (I guess that's intended).

Note that at the top of the trace, the call
__libc_writev (fd=-1218895884, vector=0xbfd61438, count=1)
is always exactly the same across different SIGPIPEs. The parameter values were the same the first and the second time I received that signal (without restarting X).

Hope this helps, please, please ask if you need more details!

Revision history for this message
In , Chris Wilson (ickle) wrote :

This is a gpu hang, so the most interesting information would be i915_error_state and register dumps [as suspend and resume is complicit we need to ensure we are restoring the gpu state correctly].

In terms of driver packages, the most important one to make sure is up-to-date is perhaps libdrm, preferably from drm.git but 2.4.19 at a minimum.

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Ah, thanks for the feedback! When should I get the GPU dump? Right after the crash occurred, while gdb is blocking Xorg?

Revision history for this message
In , Chris Wilson (ickle) wrote :

If you grab a intel-gpu-tools/tools/intel_reg_dump before suspending and after resume, and if xorg-edgers is recent enough, then the gpu dump will be in /sys/kernel/debug/dri/0/i915_error_state following a hang.
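
For anyone collecting the same data, here is a minimal Python sketch of saving the error state at each stage (before suspend, after resume, after a hang). It assumes debugfs is mounted at /sys/kernel/debug and that the script runs as root. The function name snapshot_error_state and the output file naming are hypothetical; only the debugfs path comes from the comment above.

```python
import time
from pathlib import Path

# Path quoted in the comment above; reading it normally requires root
# and a mounted debugfs.
ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")


def snapshot_error_state(label, source=ERROR_STATE, out_dir=Path(".")):
    """Copy the current error state to a timestamped file and return its path.

    `label` is a free-form tag such as "before-suspend" or "after-hang".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = out_dir / f"i915_error_state-{label}-{stamp}.txt"
    # Read as bytes so unusual characters in the dump cannot raise
    # decoding errors.
    target.write_bytes(Path(source).read_bytes())
    return target
```

Calling snapshot_error_state("before-suspend") and then snapshot_error_state("after-hang") leaves two files that can be attached to the report side by side.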

Revision history for this message
In , Julien Cristau (jcristau) wrote :

> --- Comment #4 from Milan Bouchet-Valat <email address hidden> 2010-03-23 04:03:29 PST ---
> A funny thing is that while attached to gdb, X doesn't actually crash: I only
> get a SIGPIPE signal, and everything just works if I type 'continue'. When I
> detached Xorg, the signal killed the server (I guess that's intended).
>
X ignores SIGPIPE, you need to do the same in gdb. 'handle SIGPIPE
noprint nostop' at the gdb prompt should do the trick.

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Created an attachment (id=34403)
second gdb trace of the crash, SIGPIPE handling disabled

So here's a new gdb trace with SIGPIPE handling disabled, as asked above.

The screen turned uniformly orange-pink, and typing 'continue' did not change it. Hitting Ctrl+Alt+F[1-8] caused another interruption in gdb, and continuing did not trigger anything new. Hitting Ctrl+Alt+F[1-8] again had no effect, but going back to Ctrl+Alt+F7 caused another interruption in gdb without changing the screen state.

Software versions:
xserver-xorg-core 1.6.5+git20091107+server-1.6-branch.2dbcb06a
xserver-xorg-video-intel 2.10.902+git20100317.31d5f84b
libdrm-intel1 2.4.19+git20100318.56712821
kernel drm-intel-next 2.6.33-997

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Created an attachment (id=34404)
output of intel_reg_dumper before suspending

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Created an attachment (id=34405)
output of intel_reg_dumper after returning from suspend

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Created an attachment (id=34406)
output of intel_reg_dumper after lock (during gdb interruption)

Here are the GPU dumps. I hope this is what you need; I didn't completely understand your comment about grabbing "a intel-gpu-tools/tools/intel_reg_dump".

/sys/kernel/debug/dri/0/i915_error_state always reported that there was no error, at each of the three stages where I checked it.

Does that help debugging?

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

I've just found out that a report in Ubuntu's Launchpad has 165 people marked as affected, with about 30 duplicate reports. I think that deserves a higher priority: suspend has been broken on i915 chips for more than a year!

See https://bugs.launchpad.net/ubuntu/lucid/+source/xserver-xorg-video-intel/+bug/447159, where more similar stacktraces are available.

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Thanks to the new report mechanism in Ubuntu 10.04, I've been able to get traces for the kernel oops that occurs before the X crash, and for the X crash itself at the same time. Do you think I should open a bug in bugzilla.kernel.org instead?

See
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/553176
https://bugs.launchpad.net/ubuntu/+source/xorg-server/+bug/553174

Particularly interesting is:
http://launchpadlibrarian.net/42769271/OopsText.txt

In which we can see the trace leading to the oops:
 [<f857cac9>] ? intel_crt_detect+0x69/0xe0 [i915]
 [<f80ceeee>] ? drm_helper_probe_single_connector_modes+0x26e/0x300 [drm_kms_helper]
 [<f8368d5e>] ? drm_mode_object_find+0x4e/0x70 [drm]
 [<f8369b7f>] ? drm_mode_getconnector+0x2df/0x380 [drm]
 [<c0589b59>] ? mutex_lock+0x19/0x40
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<f835e7cd>] ? drm_ioctl+0x25d/0x3e0 [drm]
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<f83698a0>] ? drm_mode_getconnector+0x0/0x380 [drm]
 [<f835e570>] ? drm_ioctl+0x0/0x3e0 [drm]
 [<c0215f71>] ? vfs_ioctl+0x21/0x90
 [<c0216259>] ? do_vfs_ioctl+0x79/0x310
 [<c058d210>] ? do_page_fault+0x160/0x3a0
 [<c0216557>] ? sys_ioctl+0x67/0x80
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<c01033ec>] ? syscall_call+0x7/0xb
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140

Revision history for this message
In , Chris Wilson (ickle) wrote :

Ah, an OOPS! That makes a little more sense.

The userspace stacktraces are irrelevant, as any GPU hang or OOPS may trigger such a trace -- that one identical symptom may imply any number of bugs, i.e. all the duplicates are not necessarily duplicate bugs.

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote : [i915] Crash after suspending (unable to handle kernel NULL pointer dereference at 00000108)

This happens about one hour after returning from suspend, and makes X crash. I'm posting a link to the X report, where I put the details and a link to the upstream report.

ProblemType: KernelOops
DistroRelease: Ubuntu 10.04
Package: linux-image-2.6.32-19-generic
Regression: Yes
Reproducible: No
TestedUpstream: No
ProcVersionSignature: Ubuntu 2.6.32-19.28-generic 2.6.32.10+drm33.1
Uname: Linux 2.6.32-19-generic i686
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.21.
Annotation: Your system might become unstable now and might need to be restarted.
Architecture: i386
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: milan 1445 F.... pulseaudio
 /dev/snd/seq: timidity 1186 F.... timidity
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'ICH6'/'Intel ICH6 with ALC250 at irq 17'
   Mixer name : 'Realtek ALC250 rev 2'
   Components : 'AC97a:414c4752'
   Controls : 33
   Simple ctrls : 21
CurrentDmesg:
 [ 59.755859] lib80211_crypt: registered algorithm 'TKIP'
 [ 66.656439] apm: BIOS not found.
 [ 67.324827] ppdev: user-space parallel port driver
 [ 69.416020] eth1: no IPv6 routers present
 [ 99.487128] padlock: VIA PadLock not detected.
Date: Thu Apr 1 11:53:06 2010
Failure: oops
Frequency: This has only happened once.
HibernationDevice: RESUME=UUID=dd37abed-09dd-499f-bbbe-c2c3866cb9cf
Lsusb:
 Bus 005 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: TOSHIBA Satellite A80
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
ProcCmdLine: BOOT_IMAGE=/vmlinuz-2.6.32-19-generic root=UUID=2dcc00f7-0fad-4823-9469-9e4e8dd841d0 ro crashkernel=384M-2G:64M,2G-:128M quiet splash
RelatedPackageVersions: linux-firmware 1.33
RfKill:

SourcePackage: linux
Title: BUG: unable to handle kernel NULL pointer dereference at 00000108
WpaSupplicantLog:

dmi.bios.date: 02/23/2005
dmi.bios.vendor: TOSHIBA
dmi.bios.version: V1.40
dmi.board.name: EAT10/EAT20
dmi.board.vendor: TOSHIBA
dmi.board.version: Null
dmi.chassis.asset.tag: *
dmi.chassis.type: 10
dmi.chassis.vendor: TOSHIBA
dmi.chassis.version: N/A
dmi.modalias: dmi:bvnTOSHIBA:bvrV1.40:bd02/23/2005:svnTOSHIBA:pnSatelliteA80:pvrPSA80E-02V024FR:rvnTOSHIBA:rnEAT10/EAT20:rvrNull:cvnTOSHIBA:ct10:cvrN/A:
dmi.product.name: Satellite A80
dmi.product.version: PSA80E-02V024FR
dmi.sys.vendor: TOSHIBA

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :

See bug 553174 for X.org side of the crash.

description: updated
description: updated
summary: - [i915] Crash after suspending (unable to handle kernel NULL pointer
- dereference at 00000108)
+ [i915] Crash after suspending (NULL pointer dereference in
+ intel_crt_detect())
tags: added: lucid
Revision history for this message
S. Christian Collins (s-chriscollins) wrote :

I am also affected by this bug.

** My System **
PC: HP Pavilion dv1550se laptop
CPU: Intel(R) Pentium(R) M processor 1.60GHz
RAM: 1GB DDR400
Video: Mobile 915GM/GMS/910GML Express Graphics Controller
Sound: 82801FB/FBM/FR/FW/FRW (ICH6 Family) AC'97 Audio Controller
OS: Ubuntu 10.04 w/ all updates as of 4/2/10 (including 2.6.32-19 kernel)

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :

Glad to know I'm not alone then! ;-) Marking as Triaged.

Changed in linux (Ubuntu):
status: New → Triaged
importance: Undecided → High
affects: linux → xorg-server
tags: added: metabug
Revision history for this message
Gabe Gorelick (gabegorelick) wrote :

A lot of people seem to be affected by this bug (see all the duplicates) and it happens in a lot more places than just suspend and resume. The call traces are very similar in a lot of the duplicates, except they have intel_tv_detect on top, which I assume is a related function to intel_crt_detect.

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :

I don't think these are all duplicates. I've kept all the duplicates where the oops happens in intel_crt_detect(), and created bug 525801 for the ones in intel_tv_detect(). It seems Apport only considers the NULL pointer dereference as the common part, but I fail to see how this can ensure the bugs are identical. So better handle them separately than get tons of contradictory feedback from reporters.

Revision history for this message
Gabe Gorelick (gabegorelick) wrote :

OK, I will break the bugs apart into this one for intel_crt_detect() and bug 525801 for intel_tv_detect(). However, from the upstream bug report: "The userspace stacktraces are irrelevant, as any GPU hang or OOPS may trigger such a trace -- that one identical symptom may imply any number of bugs, i.e. all the duplicates are not necessarily duplicate bugs." For now we can keep them together for efficiency's sake, but it may well be that they are different.

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :

I think upstream were talking about the X trace, not the Oops, which is the only way to get the real cause of the X crash.

Revision history for this message
Gabe Gorelick (gabegorelick) wrote :

Oh yes, I see that now. But if the oops traces do point to the same bug, then couldn't the same bug be causing the intel_tv_detect() NULL pointer dereference? They have very similar stacktraces:

[<f857cac9>] ? intel_crt_detect+0x69/0xe0 [i915]
 [<f80ceeee>] ? drm_helper_probe_single_connector_modes+0x26e/0x300 [drm_kms_helper]
 [<f8368d5e>] ? drm_mode_object_find+0x4e/0x70 [drm]
 [<f8369b7f>] ? drm_mode_getconnector+0x2df/0x380 [drm]
 [<c0589b59>] ? mutex_lock+0x19/0x40
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<f835e7cd>] ? drm_ioctl+0x25d/0x3e0 [drm]
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<f83698a0>] ? drm_mode_getconnector+0x0/0x380 [drm]
 [<f835e570>] ? drm_ioctl+0x0/0x3e0 [drm]
 [<c0215f71>] ? vfs_ioctl+0x21/0x90
 [<c0216259>] ? do_vfs_ioctl+0x79/0x310
 [<c058d210>] ? do_page_fault+0x160/0x3a0
 [<c0216557>] ? sys_ioctl+0x67/0x80
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<c01033ec>] ? syscall_call+0x7/0xb
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140
 [<c04c64a7>] ? ethtool_get_drvinfo+0x137/0x140

 vs

 [<f868540f>] ? intel_tv_detect+0x8f/0x1c0 [i915]
 [<f8322c46>] ? drm_helper_probe_single_connector_modes+0x296/0x300 [drm_kms_helper]
 [<f84c6e8e>] ? drm_mode_object_find+0x4e/0x70 [drm]
 [<f84c83bf>] ? drm_mode_getconnector+0x2df/0x380 [drm]
 [<f84bd815>] ? drm_ioctl+0x185/0x370 [drm]
 [<c04c64a7>] ? hidinput_hid_event+0x1d7/0x3a0
 [<f84c80e0>] ? drm_mode_getconnector+0x0/0x380 [drm]
 [<c04c64a7>] ? hidinput_hid_event+0x1d7/0x3a0
 [<c02f1c84>] ? security_file_permission+0x14/0x20
 [<c0213e5b>] ? vfs_ioctl+0x7b/0x90
 [<c04c64a7>] ? hidinput_hid_event+0x1d7/0x3a0
 [<c0214159>] ? do_vfs_ioctl+0x79/0x310
 [<c0205750>] ? do_sync_write+0x0/0x100
 [<c0214457>] ? sys_ioctl+0x67/0x80
 [<c04c64a7>] ? hidinput_hid_event+0x1d7/0x3a0
 [<c010344c>] ? syscall_call+0x7/0xb
 [<c04c64a7>] ? hidinput_hid_event+0x1d7/0x3a0
 [<c04c64a7>] ? hidinput_hid_event+0x1d7/0x3a0

The bottom of the traces doesn't really matter, but both have the same 3 functions leading up to the crash (besides the last one). That seems to indicate that at some common point, e.g. drm_mode_getconnector, NULL is passed as an argument to the next function in the call stack when it shouldn't be, causing the NULL pointer dereference further down the call chain.
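
The comparison above can be done mechanically. The following is a rough Python sketch, not an established tool: the regular expression and the helper names frame_names and shared_callers are assumptions made for illustration, and only the first four frames of each trace are reproduced. Note that frames prefixed with "?" are addresses the kernel found on the stack without verifying them, so a match is a hint rather than proof of a common cause.

```python
import re

# Matches oops frames such as:
#   [<f857cac9>] ? intel_crt_detect+0x69/0xe0 [i915]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def frame_names(trace_text):
    """Return the function names in a call trace, top frame first."""
    return FRAME_RE.findall(trace_text)


def shared_callers(trace_a, trace_b, depth=4):
    """Compare the first `depth` frames, skipping the top (faulting) frame.

    Returns the names that appear at the same position in both traces.
    """
    a = frame_names(trace_a)[1:depth]
    b = frame_names(trace_b)[1:depth]
    return [x for x, y in zip(a, b) if x == y]


crt = """
 [<f857cac9>] ? intel_crt_detect+0x69/0xe0 [i915]
 [<f80ceeee>] ? drm_helper_probe_single_connector_modes+0x26e/0x300 [drm_kms_helper]
 [<f8368d5e>] ? drm_mode_object_find+0x4e/0x70 [drm]
 [<f8369b7f>] ? drm_mode_getconnector+0x2df/0x380 [drm]
"""
tv = """
 [<f868540f>] ? intel_tv_detect+0x8f/0x1c0 [i915]
 [<f8322c46>] ? drm_helper_probe_single_connector_modes+0x296/0x300 [drm_kms_helper]
 [<f84c6e8e>] ? drm_mode_object_find+0x4e/0x70 [drm]
 [<f84c83bf>] ? drm_mode_getconnector+0x2df/0x380 [drm]
"""
# Lists the three drm helper functions common to both traces.
print(shared_callers(crt, tv))
```

Only the function names are compared, because the addresses and offsets differ between kernel builds even when the call path is the same.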

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :

Yeah, you're right, that could well be the same bug. But since we don't really know what leads to one or another trace... Could it be different hardware? In this case we could consider the traces as the same.

Revision history for this message
Gabe Gorelick (gabegorelick) wrote :

> Could it be different hardware?
All the bug reports are on i915.

My guess would be that intel_crt_detect gets called when a CRT monitor is detected, while intel_tv_detect is for LCDs. Can anyone who knows the driver comment on this?

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Can I do anything to ease debugging of this bug? I'd really like to help get this fixed. It is quite annoying, and it seems to affect many users, judging by the number of duplicates in Ubuntu alone.

Revision history for this message
S. Christian Collins (s-chriscollins) wrote :

I'm still experiencing the problem as of 10.04 RC1 (with updates as of 4/28/10). My problem is exactly as described by the OP. No CRT monitor is involved.

** My System **
PC: HP Pavilion dv1550se laptop
CPU: Intel(R) Pentium(R) M processor 1.60GHz
RAM: 1GB DDR400
Video: Mobile 915GM/GMS/910GML Express Graphics Controller
Sound: 82801FB/FBM/FR/FW/FRW (ICH6 Family) AC'97 Audio Controller
OS: Ubuntu 10.04 i386 w/ all updates as of 4/28/10 (including 2.6.32-21 kernel)

Revision history for this message
roger64 (rogqip-suse) wrote :

7th of May 2010

Toshiba A 80 - Intel Pentium M 1.73 GHz
RAM: 1.5 GB
Video: Mobile 915GM/GMS/910GML Express Graphics Controller
Sound: 82801FB/FBM/FR/FW/FRW (ICH6 Family) AC'97 Audio Controller

This bug (X crashes within 10 to 60 minutes at most) existed during the whole lifespan of Karmic.
It's still present in Lucid, even with kernel 2.6.33.3 and driver 2.11.

It is of course also present with the current Lucid kernel 2.6.32-22.23 and driver 2.9 (xserver-xorg-video-intel).

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Still happening with 2.6.34:
[ 2329.012081] PM: resume of devices complete after 1945.462 msecs
[ 2329.012241] PM: resume devices took 1.948 seconds
[ 2329.012273] PM: Finishing wakeup.
[ 2329.012276] Restarting tasks ... done.
[ 2329.050531] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[wait one hour or so]
[ 3529.748173] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 3529.748409] render error detected, EIR: 0x00000000
[ 3529.748455] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 527114 at 527113)
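
To put a number on the delay between resume and the hang, the kernel timestamps can be subtracted. This is a small Python sketch under assumptions: the helper name resume_to_hang_seconds is made up, and it matches on the "PM: Finishing wakeup" and "GPU hung" strings exactly as they appear in the log above.

```python
import re

# Kernel log lines start with a "[ seconds.microseconds]" timestamp.
TS_RE = re.compile(r"^\s*\[\s*(\d+\.\d+)\]\s+(.*)$")


def resume_to_hang_seconds(log_text):
    """Seconds from the most recent resume to the first GPU hang after it.

    Returns None if the log holds no resume followed by a hang.
    """
    resumed_at = None
    for line in log_text.splitlines():
        m = TS_RE.match(line)
        if not m:
            continue  # skip notes and lines without a timestamp
        stamp, message = float(m.group(1)), m.group(2)
        if "PM: Finishing wakeup" in message:
            resumed_at = stamp
        elif "GPU hung" in message and resumed_at is not None:
            return stamp - resumed_at
    return None


log = """\
[ 2329.012273] PM: Finishing wakeup.
[ 2329.012276] Restarting tasks ... done.
[ 3529.748173] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
"""
delay = resume_to_hang_seconds(log)
print(f"{delay:.1f} s")  # prints "1200.7 s"
```

For the timestamps quoted above this gives about 1200 seconds, i.e. roughly 20 minutes between the end of resume and the hangcheck message.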

Manoj Iyer (manjo)
tags: added: kernel-graphics kernel-reviewed
Revision history for this message
Manoj Iyer (manjo) wrote :

Upstream has made some changes in this area. Here is the mainline kernel build for Ubuntu:

http://kernel.ubuntu.com/~kernel-ppa/mainline/daily/current/

Can you please try the mainline kernel and see if this problem still exists?

Revision history for this message
Milan Bouchet-Valat (nalimilan) wrote :

Confirmed with 2.6.34 here, no hope, it's been like that for more than a year...

Revision history for this message
Manoj Iyer (manjo) wrote :

Since this is also a bug in upstream kernel, I opened a report upstream.

https://bugzilla.kernel.org/show_bug.cgi?id=16123

Changed in linux (Ubuntu):
status: Triaged → Incomplete
assignee: nobody → Manoj Iyer (manjo)
Revision history for this message
In , Chris Wilson (ickle) wrote :

Massive memory corruption following hibernation should be fixed with:

commit 985b823b919273fe1327d56d2196b4f92e5d0fae
Author: Linus Torvalds <email address hidden>
Date: Fri Jul 2 10:04:42 2010 +1000

    drm/i915: fix hibernation since i915 self-reclaim fixes

    Since commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 ("drm/i915:
    Selectively enable self-reclaim"), we've been passing GFP_MOVABLE to the
    i915 page allocator where we weren't before due to some over-eager
    removal of the page mapping gfp_flags games the code used to play.

    This caused hibernate on Intel hardware to result in a lot of memory
    corruptions on resume. See for example

      http://bugzilla.kernel.org/show_bug.cgi?id=13811

I suspect that it is the memory corruption that is the root cause here.

Revision history for this message
In , Chris Wilson (ickle) wrote :

2.6.35-rc6 has a further fix for corruption on hibernation which nobody has been able to break (so far).

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Sorry, it's still here, though in a different form (apparently no oops):
$ uname -r
2.6.35-020635rc6-generic

/var/log/kern.log:
[ 1467.408347] PM: Finishing wakeup.
[ 1467.408350] Restarting tasks ... done.
[ 1467.434616] [drm:drm_mode_getfb] *ERROR* invalid framebuffer id
[ 1467.747233] sky2 0000:02:00.0: eth0: enabling interface [...]
[ 1512.204160] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1512.205452] [drm:i915_do_wait_request] *ERROR* i915_do_wait_request returns -5 (awaiting 11072 at 11071)

At this point, the X server is killed, and won't restart:
Fatal server error:
Failed to submit batchbuffer: Input/output error
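
As a side note, the "-5" in the i915_do_wait_request line and the "Input/output error" reported by X look like the same error code seen from both sides, since kernel functions conventionally return negative errno values. That link is an inference rather than something confirmed in this report. A quick way to look the number up, using only the Python standard library:

```python
import errno
import os

code = 5  # absolute value of the "-5" in the kernel log line

print(errno.errorcode[code])  # symbolic name of errno 5 (EIO on Linux)
print(os.strerror(code))      # human-readable message for that errno
```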

Should I try with a more recent X version? Seems to me that the bug is still in the kernel itself, so it may not change anything.

As always, please just ask if you need more testing. This bug is really a bitch...

Revision history for this message
In , Chris Wilson (ickle) wrote :

That doesn't appear to be the same bug. And I should have pointed that out in comment 17...

The original bug with the OOPs could only be the result of memory corruption. The invalid framebuffer id could have been a symptom of the same memory corruption but now appears to be a more subtle issue.

Milan, please open a fresh bug report that focuses on the framebuffer id error. This is simply to try and keep the report coherent and so easier to review. [Otherwise when developers read the first few comments to familiarise themselves with the bug, then skip to the end to catch the new updates, the report no longer makes any sense.]

Revision history for this message
In , Milan Bouchet-Valat (nalimilan) wrote :

Filed as https://bugzilla.kernel.org/show_bug.cgi?id=16488

Sorry, I know mixing problems on a single report is messy, but I've already tracked this bug using about five different reports, which were closed, and I'm losing track myself... ;-)

Revision history for this message
In , Chris Wilson (ickle) wrote :

> --- Comment #22 from Milan Bouchet-Valat <email address hidden> 2010-08-01 01:57:36 PDT ---
> Filed as https://bugzilla.kernel.org/show_bug.cgi?id=16488
>
> Sorry, I know mixing problems on a single report is messy, but I've already
> tracked this bug using about five different reports, which were closed, and I'm
> losing track myself... ;-)

Thanks. We are getting closer, it looks like there may be a few related
bugs across the components that are complicating the issue,
e.g. bug 29320.

Changed in xorg-server:
importance: Unknown → High
status: Unknown → Fix Released
Changed in xorg-server:
importance: High → Unknown
Changed in xorg-server:
importance: Unknown → High
Revision history for this message
dino99 (9d9) wrote :

This version has expired

Changed in linux (Ubuntu):
status: Incomplete → Invalid