[regression] 3.5.0-26-generic and 3.2.0-39-generic GPU hangs on Sandybridge

Bug #1140716 reported by luca on 2013-03-02
This bug affects 395 people
Affects                       Status       Importance  Assigned to  Milestone
DRI                           In Progress  Medium      -            -
Mesa                          Unknown      Unknown     -            -
linux (Debian)                New          Unknown     -            -
linux (Fedora)                Unknown      Unknown     -            -
linux (Ubuntu)                -            Critical    Unassigned   -
  Precise                     -            Critical    Unassigned   -
  Quantal                     -            Critical    Unassigned   -
  Raring                      -            Critical    Unassigned   -
linux-lts-quantal (Ubuntu)    -            Critical    Unassigned   -
  Precise                     -            Critical    Unassigned   -
  Quantal                     -            Critical    Unassigned   -
  Raring                      -            Critical    Unassigned   -
linux-lts-raring (Ubuntu)     -            Critical    Unassigned   -
  Precise                     -            Critical    Unassigned   -
mesa (Ubuntu)                 -            Critical    Unassigned   -
  Precise                     -            Critical    Unassigned   -

Bug Description

I'm getting errors about GPU hangs every minute or so (usually only when using FF and scrolling a webpage or something). I also get an annoying ubuntu dialog saying there is a "system error".

This didn't happen with 3.5.0-24-generic.

Here is the dmesg:
[15169.033709] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[15169.034517] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[15628.480216] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[15628.480570] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[15844.231372] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[15844.231773] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[20173.232593] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[20173.233211] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[26285.650393] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[26285.650980] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[26285.658405] ------------[ cut here ]------------
[26285.658472] WARNING: at /build/buildd/linux-3.5.0/drivers/gpu/drm/i915/intel_pm.c:2505 gen6_enable_rps+0x706/0x710 [i915]()
[26285.658474] Hardware name: SATELLITE Z830
[26285.658476] Modules linked in: sdhci_pci sdhci btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs reiserfs ext2 snd_hda_codec_hdmi snd_hda_codec_realtek joydev btusb coretemp kvm_intel kvm arc4 ghash_clmulni_intel aesni_intel cryptd aes_x86_64 snd_hda_intel snd_hda_codec snd_hwdep uvcvideo snd_pcm videobuf2_core microcode videodev bnep iwlwifi videobuf2_vmalloc snd_seq_midi psmouse videobuf2_memops snd_rawmidi rfcomm pcspkr snd_seq_midi_event serio_raw snd_seq bluetooth mac80211 snd_timer snd_seq_device i915 drm_kms_helper cfg80211 drm toshiba_acpi snd sparse_keymap soundcore wmi i2c_algo_bit toshiba_bluetooth snd_page_alloc parport_pc mei video mac_hid lpc_ich ppdev nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc lp parport e1000e ahci libahci [last unloaded: sdhci]
[26285.658537] Pid: 23433, comm: kworker/u:0 Not tainted 3.5.0-26-generic #40-Ubuntu
[26285.658539] Call Trace:
[26285.658549] [<ffffffff81051bef>] warn_slowpath_common+0x7f/0xc0
[26285.658553] [<ffffffff81051c4a>] warn_slowpath_null+0x1a/0x20
[26285.658569] [<ffffffffa02d32e6>] gen6_enable_rps+0x706/0x710 [i915]
[26285.658584] [<ffffffffa02bf3f6>] intel_modeset_init_hw+0x66/0xa0 [i915]
[26285.658595] [<ffffffffa02954b4>] i915_reset+0x1a4/0x6e0 [i915]
[26285.658601] [<ffffffff8101257b>] ? __switch_to+0x12b/0x420
[26285.658612] [<ffffffffa029a943>] i915_error_work_func+0xc3/0x110 [i915]
[26285.658618] [<ffffffff8107097a>] process_one_work+0x12a/0x420
[26285.658629] [<ffffffffa029a880>] ? gen6_pm_rps_work+0xe0/0xe0 [i915]
[26285.658632] [<ffffffff8107152e>] worker_thread+0x12e/0x2f0
[26285.658636] [<ffffffff81071400>] ? manage_workers.isra.26+0x200/0x200
[26285.658640] [<ffffffff81076023>] kthread+0x93/0xa0
[26285.658644] [<ffffffff8168a3e4>] kernel_thread_helper+0x4/0x10
[26285.658649] [<ffffffff81075f90>] ? kthread_freezable_should_stop+0x70/0x70
[26285.658652] [<ffffffff8168a3e0>] ? gs_change+0x13/0x13
[26285.658654] ---[ end trace 59c6162fdfcbffee ]---
[26756.021167] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[26756.021426] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[26766.014093] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[26766.014397] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[26932.376233] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[26932.376544] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[26932.384285] ------------[ cut here ]------------
[26932.384354] WARNING: at /build/buildd/linux-3.5.0/drivers/gpu/drm/i915/intel_pm.c:2505 gen6_enable_rps+0x706/0x710 [i915]()
[26932.384356] Hardware name: SATELLITE Z830
[26932.384358] Modules linked in: sdhci_pci sdhci btrfs zlib_deflate libcrc32c ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs reiserfs ext2 snd_hda_codec_hdmi snd_hda_codec_realtek joydev btusb coretemp kvm_intel kvm arc4 ghash_clmulni_intel aesni_intel cryptd aes_x86_64 snd_hda_intel snd_hda_codec snd_hwdep uvcvideo snd_pcm videobuf2_core microcode videodev bnep iwlwifi videobuf2_vmalloc snd_seq_midi psmouse videobuf2_memops snd_rawmidi rfcomm pcspkr snd_seq_midi_event serio_raw snd_seq bluetooth mac80211 snd_timer snd_seq_device i915 drm_kms_helper cfg80211 drm toshiba_acpi snd sparse_keymap soundcore wmi i2c_algo_bit toshiba_bluetooth snd_page_alloc parport_pc mei video mac_hid lpc_ich ppdev nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc lp parport e1000e ahci libahci [last unloaded: sdhci]
[26932.384421] Pid: 24262, comm: kworker/u:2 Tainted: G W 3.5.0-26-generic #40-Ubuntu
[26932.384422] Call Trace:
[26932.384431] [<ffffffff81051bef>] warn_slowpath_common+0x7f/0xc0
[26932.384436] [<ffffffff81051c4a>] warn_slowpath_null+0x1a/0x20
[26932.384451] [<ffffffffa02d32e6>] gen6_enable_rps+0x706/0x710 [i915]
[26932.384466] [<ffffffffa02bf3f6>] intel_modeset_init_hw+0x66/0xa0 [i915]
[26932.384476] [<ffffffffa02954b4>] i915_reset+0x1a4/0x6e0 [i915]
[26932.384482] [<ffffffff8101257b>] ? __switch_to+0x12b/0x420
[26932.384493] [<ffffffffa029a943>] i915_error_work_func+0xc3/0x110 [i915]
[26932.384500] [<ffffffff8107097a>] process_one_work+0x12a/0x420
[26932.384511] [<ffffffffa029a880>] ? gen6_pm_rps_work+0xe0/0xe0 [i915]
[26932.384514] [<ffffffff8107152e>] worker_thread+0x12e/0x2f0
[26932.384517] [<ffffffff81071400>] ? manage_workers.isra.26+0x200/0x200
[26932.384521] [<ffffffff81076023>] kthread+0x93/0xa0
[26932.384526] [<ffffffff8168a3e4>] kernel_thread_helper+0x4/0x10
[26932.384531] [<ffffffff81075f90>] ? kthread_freezable_should_stop+0x70/0x70
[26932.384534] [<ffffffff8168a3e0>] ? gs_change+0x13/0x13
[26932.384536] ---[ end trace 59c6162fdfcbffef ]---

ProblemType: Bug
DistroRelease: Ubuntu 12.10
Package: linux-image-3.5.0-26-generic 3.5.0-26.40
ProcVersionSignature: Ubuntu 3.5.0-26.40-generic 3.5.7.6
Uname: Linux 3.5.0-26-generic x86_64
ApportVersion: 2.6.1-0ubuntu10
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: luca 2084 F.... pulseaudio
CheckboxSubmission: f8b82cd9bc23fe075e5068a9824afda5
CheckboxSystem: b1865df84255b8716d3bcc269ff410d1
Date: Sat Mar 2 22:25:14 2013
HibernationDevice: RESUME=UUID=20fe6da8-7d68-4660-953f-6e4ae1d348a7
InstallationDate: Installed on 2012-04-26 (310 days ago)
InstallationMedia: Ubuntu 12.04 LTS "Precise Pangolin" - Release amd64 (20120425)
MachineType: TOSHIBA SATELLITE Z830
MarkForUpload: True
ProcFB: 0 inteldrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.5.0-26-generic root=UUID=36929bf3-a158-44d9-a80d-3adac2840fa8 ro quiet splash acpi_backlight=vendor i915.i915_enable_rc6=1 i915.lvds_downclock=1 vt.handoff=7
RelatedPackageVersions:
 linux-restricted-modules-3.5.0-26-generic N/A
 linux-backports-modules-3.5.0-26-generic N/A
 linux-firmware 1.95
SourcePackage: linux
UpgradeStatus: Upgraded to quantal on 2012-10-28 (125 days ago)
dmi.bios.date: 07/31/2012
dmi.bios.vendor: TOSHIBA
dmi.bios.version: Version 1.70
dmi.board.asset.tag: 0000000000
dmi.board.name: Portable PC
dmi.board.vendor: TOSHIBA
dmi.board.version: Version A0
dmi.chassis.asset.tag: 0000000000
dmi.chassis.type: 10
dmi.chassis.vendor: TOSHIBA
dmi.chassis.version: Version 1.0
dmi.modalias: dmi:bvnTOSHIBA:bvrVersion1.70:bd07/31/2012:svnTOSHIBA:pnSATELLITEZ830:pvrPT22LE-00300GGR:rvnTOSHIBA:rnPortablePC:rvrVersionA0:cvnTOSHIBA:ct10:cvrVersion1.0:
dmi.product.name: SATELLITE Z830
dmi.product.version: PT22LE-00300GGR
dmi.sys.vendor: TOSHIBA

luca (llucax) wrote :

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed

Another note: when these hangs happen, I get graphics corruption (usually in the fonts/text).

luca (llucax) wrote :

kernel 3.5.0-25-generic also seems to work fine.

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: regression-update
luca (llucax) wrote :

After resuming from suspend with kernel 3.5.0-25-generic, I again got the annoying dialogs saying that a GPU hang was detected and asking me to report a bug (I have no idea where it goes), but I can't see anything strange in dmesg [1]. How can I find out why those dialogs are being opened, to see whether something is wrong?

I had the annoying dialog several times in a very short period of time, like 10 times in about 5 minutes and then it stopped. After that I suspended and resumed my laptop a couple of times and it didn't happen again so far.

[1] Except for messages like the following, but I have been getting these since I bought this computer about a year ago and never had those annoying dialogs about a GPU hang:
[52682.020386] CPU1: Package power limit notification (total events = 5770)
[52682.020389] CPU3: Package power limit notification (total events = 5769)
[52682.020391] CPU2: Package power limit notification (total events = 5761)
[52682.020393] CPU0: Package power limit notification (total events = 5746)
[52682.021517] CPU3: Package power limit normal
[52682.021520] CPU1: Package power limit normal
[52682.021521] CPU2: Package power limit normal
[52682.021526] CPU0: Package power limit normal

luca (llucax) wrote :

Also, I couldn't see any graphic corruption this last time with kernel 3.5.0-25

Craig McQueen (cmcqueen1975) wrote :

This affects me, but in my case I'm running Ubuntu 12.04, and the problem seems to be with kernel 3.2.0-39. Booting to kernel 3.2.0-38 seems to have fixed it.

Hans (old-man999) wrote :
luca (llucax) wrote :

Seems to be fixed in linux-image-3.5.0-26-generic 3.5.0-26.42

luca (llucax) wrote :

Nope, still getting it with linux-image-3.5.0-26-generic 3.5.0-26.42

[32861.907463] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[32861.907470] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[32861.911988] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
...
[39199.903510] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[39199.903846] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

Same here: the previous kernel, 3.5.0-25-generic, works without problems, but 3.5.0-26.42 hung just now:

$ dmesg|grep i915
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.5.0-26-generic root=/dev/mapper/System-root ro quiet splash i915.i915_enable_rc6=1 vt.handoff=7
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.5.0-26-generic root=/dev/mapper/System-root ro quiet splash i915.i915_enable_rc6=1 vt.handoff=7
[ 1.667363] i915 0000:00:02.0: power state changed by ACPI to D0
[ 1.667367] i915 0000:00:02.0: power state changed by ACPI to D0
[ 1.667652] i915 0000:00:02.0: setting latency timer to 64
[ 1.687950] i915 0000:00:02.0: irq 44 for MSI/MSI-X
[ 2.429882] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[ 330.684154] i915 0000:00:02.0: power state changed by ACPI to D3
[ 331.826825] i915 0000:00:02.0: power state changed by ACPI to D0
[ 331.826829] i915 0000:00:02.0: power state changed by ACPI to D0
[ 331.826830] i915 0000:00:02.0: setting latency timer to 64
[ 1677.075872] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1677.075876] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state

Content of i915_error_state attached.

Will disable rc6 and test what happens then.

With rc6 off the hangup happened 2 minutes after booting:

$ dmesg|grep i915
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.5.0-26-generic root=/dev/mapper/System-root ro quiet splash i915.i915_enable_rc6=0 vt.handoff=7
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.5.0-26-generic root=/dev/mapper/System-root ro quiet splash i915.i915_enable_rc6=0 vt.handoff=7
[ 0.857239] i915 0000:00:02.0: power state changed by ACPI to D0
[ 0.857242] i915 0000:00:02.0: power state changed by ACPI to D0
[ 0.857458] i915 0000:00:02.0: setting latency timer to 64
[ 0.877771] i915 0000:00:02.0: irq 44 for MSI/MSI-X
[ 1.619983] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[ 128.787009] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 128.787013] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 254.699283] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

Seems it's time to return to 3.5.0-25-generic.
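
For anyone comparing kernels, the hang frequency in dmesg excerpts like the ones above can be counted with a small script. This is only a minimal Python sketch, assuming lines in the format quoted in this report (a bracketed seconds-since-boot timestamp followed by the i915_hangcheck_hung message); the function names are illustrative and not part of any tool mentioned here.

```python
import re

# Matches dmesg lines such as:
# [ 1677.075872] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
# search() is used so that a syslog prefix before the timestamp is tolerated.
HANG_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+\[drm:i915_hangcheck_hung\].*GPU hung")


def hang_timestamps(dmesg_text):
    """Return the timestamps (seconds since boot) of every hangcheck report."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            stamps.append(float(match.group(1)))
    return stamps


def hang_intervals(stamps):
    """Return the gaps in seconds between consecutive hang reports."""
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]
```

Saving the output of dmesg to a file and passing its contents to hang_timestamps() gives one number per hang, which makes it easier to compare how quickly different kernels or boot options run into the problem.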

Laurent (l-perlat) wrote :

Same problem here :

Visual corruptions + "GPU hang" error when scrolling in Firefox with 3.5.0-26.

Everything back to normal on 3.5.0-25 (Linux 3.5.0-25-generic #39-Ubuntu SMP Mon Feb 25 18:26:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux)

tobyS (tobias-schlitt) wrote :

Besides visual corruption and hangs, I also experience complete system hang-ups (no reaction until a hard reboot) and occasional kernel panics. I therefore wonder why this report does not receive higher priority.

czigor (czigor) on 2013-03-20
summary: - [regression] 3.5.0-26-generic CPU hangs
+ [regression] 3.5.0-26-generic GPU hangs

Same issue here. Happens very often during Firefox usage, but also on other occasions.

[51278.392895] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[51278.392901] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[51278.397785] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

shuerhaaken (shkn) wrote :

This is really getting annoying, is anybody taking care of this?

czigor (czigor) wrote :

@shkn:
Using 3.5.0-25-generic made my PC usable again. I get an error message only at login.

Timo Aaltonen (tjaalton) wrote :

it's one of these commits (from the quantal kernel), likely the top one since it's happening on sandybridge:

817e8fdee14b05d drm/i915: Implement WaDisableHiZPlanesWhenMSAAEnabled
4c443ec9afe7f6f drm/i915: GFX_MODE Flush TLB Invalidate Mode must be '1' for scanline waits
f534135423c7028 drm/i915: Disable AsyncFlip performance optimisations
c0c1fd8a18479f0 drm/i915: Invalidate the relocation presumed_offsets along the slow path

Changed in linux (Ubuntu):
assignee: nobody → Ubuntu Kernel Team (ubuntu-kernel-team)
importance: Medium → Critical
Changed in linux (Ubuntu Quantal):
importance: Undecided → Critical
status: New → Confirmed
Changed in linux (Ubuntu Precise):
importance: Undecided → Critical
status: New → Confirmed
Timo Aaltonen (tjaalton) wrote :

note that I'm not sure it's affecting raring, maybe not.

Adam Conrad (adconrad) on 2013-03-22
Changed in linux (Ubuntu Precise):
status: Confirmed → Invalid
Changed in linux-lts-quantal (Ubuntu Precise):
status: New → Confirmed
Changed in linux-lts-quantal (Ubuntu Quantal):
status: New → Invalid
Changed in linux-lts-quantal (Ubuntu Raring):
status: New → Invalid
Changed in linux-lts-quantal (Ubuntu Precise):
importance: Undecided → Critical
Changed in linux (Ubuntu Precise):
importance: Critical → Undecided
tags: added: performing-bisect
Joseph Salisbury (jsalisbury) wrote :

I'd like to perform a kernel bisect to identify the exact commit that introduced this regression. However, it would be good to test the latest mainline and a test kernel with commit 817e8fdee14b05d reverted.

The latest mainline kernel can be downloaded from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc3-raring/

Can folks affected by this bug test the v3.9-rc3 kernel?

One thing to note, you will need to install both the linux-image and linux-image-extra .deb packages.

I will also build a Quantal test kernel with commit 817e8fdee14b05d reverted and post a link shortly.

Thanks in advance!

Joseph Salisbury (jsalisbury) wrote :

I built a Quantal test kernel with commit 817e8fdee14b05d reverted. The kernel can be downloaded from:
http://people.canonical.com/~jsalisbury/lp1140716/

Can folks affected by this bug test this kernel and report back if it fixes the issue?

Gard Spreemann (gspreemann) wrote :

@jsalisbury: I could not successfully test the kernel you linked to in comment #22, as it rendered my system unusable. X started at 640x480, there was no working keyboard/mouse, and I could not SSH in.

franglais.125 (franglais.125) wrote :

@jsalisbury: Thanks for pointing to this kernel version. I have been able to successfully test kernel v3.9-rc3 on Precise 12.04.2 (I am running with quantal-lts xorg stack).
I have been running on it for ~ 3 hours so far with success. It usually took some time for me to hit this bug on my Dell V131, so some more testing might be required.
I will report back if I hit the bug again. So far so good.

luca (llucax) wrote :

Also initial success for now. Still getting the annoying dialog at startup, though (but no signs of GPU hangs in dmesg).

@jsalisbury: I tested your kernel 3.5.0-27-generic #45~lp1140716v1 (from comment #22); it was no improvement for my system. I got two hangups within the first hour (one S3 cycle at 1985); the second one forced me to turn off the system:

[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.5.0-27-generic root=/dev/mapper/System-root ro quiet splash i915.i915_enable_rc6=1 vt.handoff=7
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.5.0-27-generic root=/dev/mapper/System-root ro quiet splash i915.i915_enable_rc6=1 vt.handoff=7
[ 0.804805] i915 0000:00:02.0: power state changed by ACPI to D0
[ 0.804809] i915 0000:00:02.0: power state changed by ACPI to D0
[ 0.805030] i915 0000:00:02.0: setting latency timer to 64
[ 0.824988] i915 0000:00:02.0: irq 43 for MSI/MSI-X
[ 1.563280] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
[ 1894.853449] i915 0000:00:02.0: power state changed by ACPI to D3
[ 1896.202702] i915 0000:00:02.0: power state changed by ACPI to D0
[ 1896.202708] i915 0000:00:02.0: power state changed by ACPI to D0
[ 1896.202720] i915 0000:00:02.0: setting latency timer to 64
[ 1984.429241] i915 0000:00:02.0: power state changed by ACPI to D3
[ 1985.767157] i915 0000:00:02.0: power state changed by ACPI to D0
[ 1985.767160] i915 0000:00:02.0: power state changed by ACPI to D0
[ 1985.767168] i915 0000:00:02.0: setting latency timer to 64
[ 2132.278551] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 2132.278555] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 3504.895781] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

luca (llucax) wrote :

torsten, maybe you are having a different issue; note that your hang doesn't look related to the rc6 state.

 [51278.397785] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

BTW, my system is still surviving without hangs with the patched 3.5 kernel.

luca (llucax) wrote :

I just had a burst of dialogs reporting non-existent GPU hangs (with the patched 3.5 kernel). The GPU hangs are not reported in dmesg, though, so I don't know where the dialogs are getting them from. There is also no corruption or anything. It seems the dialog madness starts when an unrelated program crashes. Maybe it is just an apport bug? How should I proceed to see what's really going on?

@jsalisbury: I've been running your 3.5.0-27-generic #45~lp1140716v1 for 5 hours and I've already had 3 hangs. No improvement here.

[ 5733.121323] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 5733.121330] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 5733.124957] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

luca (llucax) wrote :

OK, it took a while, but I finally got the GPU hang with kernel 3.5.0-27-generic #45~lp1140716v1:

[22344.085044] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[22344.085051] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[22344.090106] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

luca (llucax) wrote :

Always happens with Firefox, and only with certain sites (consistently).

luca (llucax) wrote :

I got this with a second hang:

[22344.085044] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[22344.085051] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[22344.090106] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[23652.138382] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[23652.138898] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[23652.146420] ------------[ cut here ]------------
[23652.146491] WARNING: at /home/jsalisbury/bugs/lp1140716/ubuntu-quantal/drivers/gpu/drm/i915/intel_pm.c:2505 gen6_enable_rps+0x706/0x710 [i915]()
[23652.146495] Hardware name: SATELLITE Z830
[23652.146497] Modules linked in: sdhci_pci sdhci snd_hda_codec_hdmi snd_hda_codec_realtek joydev btusb coretemp kvm_intel kvm ghash_clmulni_intel aesni_intel arc4 cryptd aes_x86_64 snd_hda_intel snd_hda_codec snd_hwdep snd_pcm uvcvideo videobuf2_core videodev snd_seq_midi videobuf2_vmalloc videobuf2_memops snd_rawmidi microcode snd_seq_midi_event iwlwifi snd_seq snd_timer snd_seq_device i915 bnep rfcomm mac80211 toshiba_acpi sparse_keymap drm_kms_helper wmi toshiba_bluetooth snd pcspkr bluetooth drm i2c_algo_bit cfg80211 soundcore psmouse mac_hid snd_page_alloc video serio_raw mei lpc_ich parport_pc ppdev nfsd nfs lockd fscache auth_rpcgss nfs_acl lp sunrpc parport ahci libahci e1000e [last unloaded: sdhci]
[23652.146578] Pid: 3451, comm: kworker/u:0 Not tainted 3.5.0-27-generic #45~lp1140716v1
[23652.146581] Call Trace:
[23652.146592] [<ffffffff81051bef>] warn_slowpath_common+0x7f/0xc0
[23652.146599] [<ffffffff81051c4a>] warn_slowpath_null+0x1a/0x20
[23652.146621] [<ffffffffa03f6316>] gen6_enable_rps+0x706/0x710 [i915]
[23652.146640] [<ffffffffa03e2446>] intel_modeset_init_hw+0x66/0xa0 [i915]
[23652.146655] [<ffffffffa03b84b4>] i915_reset+0x1a4/0x6e0 [i915]
[23652.146663] [<ffffffff8101257b>] ? __switch_to+0x12b/0x420
[23652.146679] [<ffffffffa03bd943>] i915_error_work_func+0xc3/0x110 [i915]
[23652.146688] [<ffffffff8107098a>] process_one_work+0x12a/0x420
[23652.146701] [<ffffffffa03bd880>] ? gen6_pm_rps_work+0xe0/0xe0 [i915]
[23652.146707] [<ffffffff8107153e>] worker_thread+0x12e/0x2f0
[23652.146712] [<ffffffff81071410>] ? manage_workers.isra.26+0x200/0x200
[23652.146719] [<ffffffff81076033>] kthread+0x93/0xa0
[23652.146726] [<ffffffff8168ab24>] kernel_thread_helper+0x4/0x10
[23652.146732] [<ffffffff81075fa0>] ? kthread_freezable_should_stop+0x70/0x70
[23652.146737] [<ffffffff8168ab20>] ? gs_change+0x13/0x13
[23652.146740] ---[ end trace 2153106cc632835c ]---

Gard Spreemann (gspreemann) wrote :

I'm confused as to where the commits referenced by tjaalton in comment #19 live, but for what it's worth, I seem to have a stable system after applying reverse diffs of the following commits from the linux-3.5.y branch of git://kernel.ubuntu.com/ubuntu/linux.git to the 3.5.0-27.45 sources:

2964148 - drm/i915: Implement WaDisableHiZPlanesWhenMSAAEnabled
899b550 - drm/i915: GFX_MODE Flush TLB Invalidate Mode must be '1' for scanline waits

Just reverting the first, or using jsalisbury's kernel from comment #22 (ignore my comment #23, I was being an idiot and forgot the modules) gives me a GPU hang and/or graphics corruption within minutes, especially quickly if opening Firefox. After reverting both of the above, I haven't been able to hang the system yet.

Kernel 3.9.0-030900rc3-generic from comment #21 is much more stable for me, no problems so far after 4h of operation.

franglais.125 (franglais.125) wrote :

@jsalisbury: After a few days of use and many suspend-resume cycles, I am yet to encounter a problem with kernel 3.9-rc3 (as indicated in comment #21). No problems whatsoever on my Dell v131 (i5 Sandybridge)...

Peter Saunderson (peteasa) wrote :

I got this a lot with kernel 3.5.0-26-generic and used a quick workaround to avoid the problem: http://askubuntu.com/questions/225356/how-can-i-enable-the-sna-acceleration-method-for-intel-cards-under-ubuntu-12-04

SNA does not seem to have the same issue, only UXA. If I have time I can try a new kernel, but I have already spent so much time on this that it may be a few days before I get to it.

Max Rameau (afrimax-e) wrote :

I had the problem right after updating (not upgrading) on Saturday using 12.04.

I was able to control it by logging into 2D, immediately opening the System Monitor and shutting down the three instances of Ubuntu One (login, synch and launch), because the machine would freeze upon login to Ubuntu One. I then had to shut down zeitgeist-fts, because that would start eating up resources (up to 300 MB of memory at one point).

At that point, I just decided to reinstall with 12.10. I did that and it worked fine for an hour, so I started transferring my backed-up files and logged into Ubuntu One while running the updates. The problems began immediately, including resource use going up to 100% for long periods of time, mainly through the multiplication of the gkts (?) service. It only used 3.8 MB at a time, but at one point there were 20 instances of it open. I concluded it was Ubuntu One causing the problem, so I reinstalled again, this time not logging into Ubuntu One. No problems for 4 hours, even as I installed software. Then I ran the automatic software update, and the problems began again immediately.

Constant crashing, crazy graphics corruption and other issues. I checked the system log and found a similar error:

kernel [224.243459] [drm: Enable RC6 States: RC6 off, RC6p off, RC6pp off]
kernel [246.465377] [drm: i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed GPU hung

etc., etc.

This is a nightmare. Need a fix.

Matthew Eaton (powder) wrote :

Test kernel did not fix the issue for me.

Linux matt-work 3.5.0-27-generic #45~lp1140716v1 SMP Fri Mar 22 15:50:00 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Mar 25 08:12:15 matt-work kernel: [ 158.302349] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 25 08:12:15 matt-work kernel: [ 158.302353] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Mar 25 08:12:15 matt-work kernel: [ 158.305230] [drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off
Mar 25 08:12:36 matt-work kernel: [ 179.663557] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 25 08:12:36 matt-work kernel: [ 179.663780] [drm] Enabling RC6 states: RC6 off, RC6p off, RC6pp off

Joseph Salisbury (jsalisbury) wrote :

Thanks, everyone for testing. So it sounds like my test kernel did not fix this bug. However, it sounds like this bug is fixed in the v3.9 mainline kernel, at least in rc3.

I can perform a "Reverse" kernel bisect to identify the commit that fixes this bug. It will first require us to identify the first v3.9 release candidate that does not exhibit this bug.

We know that it is fixed in rc3, so it would be good to test rc1 and rc2. Can folks affected by this bug test those two release candidates:

v3.9-rc1: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc1-raring/
v3.9-rc2: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc2-raring/
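
The search described above can be pictured as a binary search over an ordered list of kernel builds, assuming the bug is present in every build before the fix and absent from every build after it. The following is only an illustration of that idea in Python, not the procedure the kernel team actually follows; is_fixed is a hypothetical callback standing in for booting a build and checking whether the hangs still occur.

```python
def first_fixed(builds, is_fixed):
    """Return the earliest build for which is_fixed(build) is True.

    Assumes builds are ordered oldest to newest and that, once the fix
    lands, every later build is also fixed. Returns None if no build
    in the list is fixed.
    """
    lo, hi = 0, len(builds)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_fixed(builds[mid]):
            hi = mid        # the fix landed at mid or earlier
        else:
            lo = mid + 1    # the fix landed after mid
    return builds[lo] if lo < len(builds) else None
```

Each call to is_fixed corresponds to one manual test run, so halving the range every time keeps the number of kernels that testers have to install small.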

Matthew Eaton (powder) wrote :

I've been on the rc1 kernel for about 3 hours with no problem.

Linux matt-work 3.9.0-030900rc1-generic #201303060659 SMP Wed Mar 6 12:00:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

tags: added: kernel-key
Robert Hooker (sarvatt) on 2013-04-02
Changed in linux (Ubuntu Precise):
status: Invalid → Confirmed
importance: Undecided → Critical
Changed in linux (Ubuntu Raring):
assignee: Ubuntu Kernel Team (ubuntu-kernel-team) → nobody
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Robert Hooker (sarvatt) on 2013-04-02
summary: - [regression] 3.5.0-26-generic GPU hangs
+ [regression] 3.5.0-26-generic and 3.2.0-39-generic GPU hangs
summary: - [regression] 3.5.0-26-generic and 3.2.0-39-generic GPU hangs
+ [regression] 3.5.0-26-generic and 3.2.0-39-generic GPU hangs on
+ Sandybridge
Robert Hooker (sarvatt) on 2013-04-02
Changed in linux (Ubuntu Raring):
status: Confirmed → Invalid
Changed in linux (Ubuntu Quantal):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu Precise):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu Raring):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
tags: added: apport-collected
Aymeric PETIT (mulx) on 2013-04-09
tags: added: precise
tags: removed: kernel-key
tags: removed: performing-bisect
Tim Gardner (timg-tpi) on 2013-04-09
Changed in linux (Ubuntu Quantal):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Precise):
status: Confirmed → Fix Committed
Steve Conklin (sconklin) on 2013-04-15
tags: added: verification-needed-precise
tags: added: verification-needed-quantal
tags: added: verification-done-precise
removed: verification-needed-precise
tags: added: verification-done-quantal
removed: verification-needed-quantal
tags: added: verification-failed-precise
removed: verification-done-precise
tags: added: verification-done-precise
removed: verification-failed-precise
Sergio (sergio-otero) on 2013-04-23
Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
Brad Figg (brad-figg) on 2013-04-23
Changed in linux (Ubuntu Precise):
status: Fix Released → Fix Committed
Changed in linux (Ubuntu Raring):
status: Invalid → New
Changed in linux (Ubuntu Raring):
status: New → Confirmed
tags: added: kernel-da-key
Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
Changed in linux-lts-quantal (Ubuntu Precise):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Quantal):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu Raring):
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
tags: added: kernel-stable-key
Changed in linux (Debian):
status: Unknown → New
Changed in linux:
importance: Unknown → Medium
status: Unknown → Invalid
Changed in dri:
importance: Unknown → Medium
status: Unknown → Confirmed
no longer affects: linux-lts-raring (Ubuntu Quantal)
no longer affects: linux-lts-raring (Ubuntu Raring)
tags: added: patch
Changed in linux-lts-raring (Ubuntu Precise):
status: New → Confirmed
Changed in linux-lts-raring (Ubuntu):
status: New → Confirmed
gokul (gokulnathonline) on 2013-10-08
information type: Public → Public Security
information type: Public Security → Public
tags: added: saucy
Changed in linux-lts-quantal (Ubuntu):
importance: Undecided → Critical
Changed in linux-lts-quantal (Ubuntu Quantal):
importance: Undecided → Critical
Changed in linux-lts-quantal (Ubuntu Raring):
importance: Undecided → Critical
Changed in linux-lts-raring (Ubuntu):
importance: Undecided → Critical
Changed in linux-lts-raring (Ubuntu Precise):
importance: Undecided → Critical
Changed in linux-lts-quantal (Ubuntu Precise):
status: Fix Released → Invalid
Changed in linux (Ubuntu Raring):
status: Confirmed → Invalid
Changed in linux (Ubuntu Quantal):
status: Fix Released → Invalid
Changed in linux (Ubuntu Precise):
status: Fix Released → Invalid
Changed in linux-lts-raring (Ubuntu):
status: Confirmed → Triaged
Changed in linux-lts-raring (Ubuntu):
status: Triaged → Invalid
Changed in linux-lts-raring (Ubuntu Precise):
status: Confirmed → Invalid
Changed in mesa (Ubuntu):
importance: Undecided → Critical
status: New → Triaged
Changed in linux (Ubuntu Precise):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
Changed in linux (Ubuntu Quantal):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
Changed in linux (Ubuntu Raring):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
Changed in mesa (Ubuntu Precise):
status: New → Triaged
importance: Undecided → Critical
Changed in dri:
status: Confirmed → In Progress

*** Bug 82666 has been marked as a duplicate of this bug. ***

*** Bug 82691 has been marked as a duplicate of this bug. ***

*** Bug 82901 has been marked as a duplicate of this bug. ***

*** Bug 83098 has been marked as a duplicate of this bug. ***

*** Bug 83156 has been marked as a duplicate of this bug. ***

*** Bug 83326 has been marked as a duplicate of this bug. ***

*** Bug 83473 has been marked as a duplicate of this bug. ***

*** Bug 83661 has been marked as a duplicate of this bug. ***

Is there any ongoing development to fix this bug? I still see it with
Linux <hostname> 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

And the latest intel drivers as provided by intel linux graphics installer from
https://01.org/linuxgraphics/

Many times my system freezes a few minutes after I start watching a movie with vlc. My screen is connected to the Linux system through a receiver (HDMI for audio + video). A freeze is more likely when the HDMI receiver was powered off for some time before playing the movie than when I reboot with HDMI always on.

I'm happy to help with crashdumps as far as I'm able to collect them.

(In reply to comment #183)

I recommend configuring i915.semaphores=0. I did it and it doesn't freeze anymore.
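
For anyone trying this workaround, here is a minimal sketch for checking which value the running kernel actually picked up. It assumes the driver exposes the option as /sys/module/i915/parameters/semaphores, which is where module parameters normally appear; the file may be missing on some kernels. The helper name read_module_param is made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumed location: module parameters are normally exposed under
# /sys/module/<module>/parameters/. The file may be absent on some kernels.
DEFAULT_PARAM = Path("/sys/module/i915/parameters/semaphores")


def read_module_param(path: Path = DEFAULT_PARAM) -> Optional[str]:
    """Return the parameter value as text, or None if it cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param()
    if value is None:
        print("i915 semaphores parameter not readable on this system")
    else:
        print(f"i915.semaphores = {value}")
```

If the value printed does not match what was put on the kernel command line, the option was probably not applied at boot.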

*** Bug 83721 has been marked as a duplicate of this bug. ***

*** Bug 83783 has been marked as a duplicate of this bug. ***

Hi Chris,

My current kernel is now 3.16.1-46.1.g90bc0f1.
I'm surprised that, after a reinstall, the semaphore bug hasn't occurred yet, although it did occur before (after a fresh install).

This leads me to four possible reasons:

1. The named kernel revision somehow contains a fix for it. Looking at the changes, I couldn't find confirmation of that assumption.
2. cgroup_memory=disabled is related to it. (That's why I removed it for now.)
3. The BIOS settings (which could be different now) might have something to do with it.
4. I haven't installed KVM support yet.

I'll post again if I find a reproducible explanation.
Frank

2. of course I meant cgroup_disable=memory

Hi Chris,

OK, none of the above was the reason. In my case it's simply this:

/etc/X11/xorg.conf.d/20-intel.conf

Section "Device"
   Identifier "Intel Graphics"
   Driver "intel"
   Option "TearFree" "true"
EndSection

I added it when the tearing while scrolling through large webpages annoyed me.
As soon as I added it, the problems quickly started.

Self-made problem.

Frank
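
To check quickly whether a config snippet like the one above turns TearFree on, something along these lines might do. It is only a sketch: it assumes the usual xorg.conf boolean spellings (1/on/true/yes) and, if the option appears more than once, simply takes the last occurrence, which may not be how the X server resolves it. The function name tearfree_enabled is hypothetical.

```python
import re

# Values that xorg.conf treats as boolean true.
TRUE_VALUES = {"1", "on", "true", "yes"}

# Matches an uncommented Option "TearFree" line, with or without a value.
OPTION_RE = re.compile(
    r'^\s*Option\s+"TearFree"(?:\s+"([^"]*)")?', re.IGNORECASE | re.MULTILINE
)


def tearfree_enabled(conf_text: str) -> bool:
    """Return True if the config text turns the TearFree option on."""
    enabled = False
    for match in OPTION_RE.finditer(conf_text):
        value = match.group(1)
        # An Option given without a value is treated as true by the X server.
        enabled = value is None or value.lower() in TRUE_VALUES
    return enabled
```

Pass it the contents of each file under /etc/X11/xorg.conf.d/ to see which one, if any, enables the option.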

(In reply to comment #189)
> Hi Chris,
>
> OK, nothing of the above was the reason. In my case it's simply this:
>
> /etc/X11/xorg.conf.d/20-intel.conf
>
> Section "Device"
> Identifier "Intel Graphics"
> Driver "intel"
> Option "TearFree" "true"
> EndSection
>
>
> I added it when the tearing scrolling through large webpages annoyed me.
> As soon as I added it, the problems quickly started.
>
> Selfmade problem.

Not really, https://bugs.freedesktop.org/show_bug.cgi?id=70764 tracks that this hang is more likely with TearFree (fundamentally the hang is still the same hardware issue, but it is interesting that TearFree has a higher chance of hitting it).

If you want to experiment:

 http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=requests

should have an interesting fix, at least for trying to prevent TearFree from leading to the semaphore hang.

What information is most useful for these recurring issues? It just happened again:

 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139690] [drm] stuck on render ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139699] [drm] stuck on blitter ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.140239] [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [26353], reason: Ring hung, action: reset
 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.140750] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm] stuck on render ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm] stuck on blitter ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [26353], reason: Ring hung, action: reset
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
 Sep 16 08:33:01 arrowsmithlap1 kernel: [1182244.142445] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
 Sep 16 08:33:01 arrowsmithlap1 kernel: [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

The only thing under my /etc/X11/xorg.conf.d/ is 00-keyboard.conf (system generated).

Do you want a copy of /sys/class/drm/card0/error every time?
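
To compare hangs across occurrences, the "GPU HANG" lines can be pulled out of the log with a short script. This is a rough sketch modelled only on the message format quoted above; other kernel versions may word the message differently. Because syslog here repeats each message with and without the uptime stamp, identical records are merged, which also means two genuinely separate hangs with the same fields would be counted once. The name parse_gpu_hangs is made up.

```python
import re
from typing import Dict, List

# Pattern modelled on the "GPU HANG" lines quoted above; other kernel
# versions may word the message differently.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>\S+), in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_gpu_hangs(log_text: str) -> List[Dict[str, str]]:
    """Extract one record per distinct GPU HANG message in the log text."""
    seen = set()
    hangs = []
    for line in log_text.splitlines():
        match = HANG_RE.search(line)
        if not match:
            continue
        record = match.groupdict()
        key = tuple(record.values())
        # Merge repeats of the same message (caveat: this also merges
        # separate hangs whose fields happen to be identical).
        if key not in seen:
            seen.add(key)
            hangs.append(record)
    return hangs
```

Comparing the ecode and reason fields between occurrences gives a first hint whether it is the same event; the error state file remains the authoritative source.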

(In reply to comment #191)
> What information is most useful for these repeating issues, as it just
> happened again:
>
> Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139690] [drm] stuck on
> render ring
> Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139699] [drm] stuck on
> blitter ring

So long as it is the same event, there is no more information we need other than testing feedback for an eventual workaround.

(In reply to comment #184)
> (In reply to comment #183)
>
> I recommend configuring i915.semaphores=0. I did it and it doesn't freeze
> anymore.

Meanwhile I tested both i915.semaphores=0 and i915.semaphores=1, and neither helped in my case. With i915.semaphores=0 my system became much more unstable and even crashed on its own after some days without stress on graphics (I just ran some desktop apps like thunar, or vlc for music only - no movies). With i915.semaphores=1 the system is at least stable (for some weeks) as long as I don't use desktop applications heavily.

*** Bug 85194 has been marked as a duplicate of this bug. ***

*** Bug 85333 has been marked as a duplicate of this bug. ***

*** Bug 85609 has been marked as a duplicate of this bug. ***

I am also experiencing this, on a Gentoo system running on a ThinkPad T440s. I'm not doing anything related to XBMC, simply using xrandr for multihead. The interesting thing is that DRI works fine on my laptop screen (glxgears reports 60fps, which is the refresh rate of my screen), but breaks when I move a window trying to use DRI (e.g. Chrome, glxgears) to the external monitor connected to the mini Display Port output.

I see this stuff in dmesg:

[ 3561.424762] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring
[ 3561.424770] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 3561.424772] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 3561.424774] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 3561.424776] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3561.424778] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 3566.422957] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring
[ 3571.425143] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring
[ 3575.423680] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring

Seems like the same issue. I'm trying to downgrade X, mesa, et al., to try and get the system back in working order.
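
To see which ring keeps getting stuck, the dmesg output can be tallied with a few lines of Python. This is a sketch that only knows the two message styles quoted in this thread ("stuck on ... ring" and "Kicking stuck wait on ... ring"); the function name count_stuck_rings is made up.

```python
import re
from collections import Counter

# Matches both message styles quoted in this thread:
#   "[drm] stuck on render ring"
#   "[drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring"
STUCK_RE = re.compile(r"stuck (?:wait )?on (\w+) ring")


def count_stuck_rings(log_text: str) -> Counter:
    """Count stuck-ring messages per ring name."""
    return Counter(m.group(1) for m in STUCK_RE.finditer(log_text))
```

Feeding it the output of dmesg gives a per-ring count that can be compared before and after a downgrade.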

*** Bug 79675 has been marked as a duplicate of this bug. ***

*** Bug 85972 has been marked as a duplicate of this bug. ***

*** Bug 86058 has been marked as a duplicate of this bug. ***

For those running Ubuntu, here is a build of a kernel based on 3.17.1 with the patches Chris Wilson wants you to test:

- Those patches have other regressions (so be careful to only test your specific issue).

https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.17.1simonickle_3.17.1simonickle-10.00.Custom_amd64.deb
https://dl.dropboxusercontent.com/u/55728161/linux-image-3.17.1simonickle_3.17.1simonickle-10.00.Custom_amd64.deb

Those kernels are based on: https://bugs.freedesktop.org/show_bug.cgi?id=83677#c35

Beware, don't switch VTs.

I've tried the mentioned kernel on my Fedora 21 Beta and it still hangs, for example after NetBeans opens its main window filling the whole screen.

*** Bug 86437 has been marked as a duplicate of this bug. ***

*** Bug 86765 has been marked as a duplicate of this bug. ***

*** Bug 86836 has been marked as a duplicate of this bug. ***

*** Bug 86925 has been marked as a duplicate of this bug. ***

*** Bug 87710 has been marked as a duplicate of this bug. ***

*** Bug 87776 has been marked as a duplicate of this bug. ***

*** Bug 88541 has been marked as a duplicate of this bug. ***

*** Bug 88626 has been marked as a duplicate of this bug. ***

*** Bug 88723 has been marked as a duplicate of this bug. ***

*** Bug 88789 has been marked as a duplicate of this bug. ***

*** Bug 89078 has been marked as a duplicate of this bug. ***

Changed in linux:
importance: Medium → Unknown
status: Invalid → Unknown
affects: linux → mesa
tags: removed: saucy
tags: added: metabug

*** Bug 89299 has been marked as a duplicate of this bug. ***
