MASTER: [i855] GPU lockup (apport-crash)

Reported by Geir Ove Myhr on 2010-03-18
This bug affects 321 people
Affects (status, importance, assigned to):
- Release Notes for Ubuntu: importance Undecided, Unassigned
- xf86-video-intel: Fix Released, importance Medium
- linux (Ubuntu): importance Undecided, Unassigned
  - Lucid: importance Undecided, Unassigned
- xserver-xorg-video-intel (Ubuntu): importance Wishlist, Unassigned
  - Lucid: importance High, Unassigned

Bug Description

Binary package hint: xserver-xorg-video-intel

This is a MASTER bug report, i.e. not a real bug report, but a tool to help manage other bug reports.

Most bug reports on i855 are probably due to the CPU/GPU incoherency problem that is now consolidated upstream at http://bugs.freedesktop.org/show_bug.cgi?id=27187 (which was split off from a bug report for i845). For now, we mark all automatically reported GPU lockups on i855 as duplicates of this unless there is a reason not to.

A kernel with the proposed fix is available at https://launchpad.net/~brian-rogers/+archive/graphics-fixes

To use this fixed kernel, run the following commands:

sudo apt-add-repository ppa:brian-rogers/graphics-fixes
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install linux-image-2.6.35-ppa21+v9patch-generic

There is a similar master bug report for i845 at bug 541492.

This is just a bug report so that I can keep track of my own trials at fixing this issue. I'll upload a cleaned-up version of my patch soon.

Created an attachment (id=34223)
cleanup up version of my gtt cache coherency patch

Forget about the patch for the moment; I've just noticed that it doesn't work on my i855GM either.

Geir Ove Myhr (gomyhr) wrote :

Binary package hint: xserver-xorg-video-intel

This is a MASTER bug report, i.e. not a real bug report, but a tool to help manage other bug reports.

Most bug reports on i855 are probably due to the CPU/GPU incoherency problem that is now consolidated upstream at http://bugs.freedesktop.org/show_bug.cgi?id=26345 (which was originally reported for i845, but applies to i855 as well). For now, we mark all automatically reported GPU lockups on i855 as duplicates of this unless there is a reason not to. There are some tests you may do to help upstream with this issue, and I will come back with instructions here. For those of you who know how to patch and compile a kernel you may look at comment #61 in the upstream bug report.

There is a similar master bug report for i845 at bug 541492.

Geir Ove Myhr (gomyhr) on 2010-03-18
Changed in xserver-xorg-video-intel (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in xserver-xorg-video-intel:
status: Unknown → Confirmed

> --- Comment #4 from legolas558 <email address hidden> 2010-03-19 05:01:23 PST ---
> I think that your patch lets other bugs come out and become apparent, like the
> hangcheck timer bug.

Thanks for the list (especially the downstream bugs). btw the hangcheck
timer is the same bug, this is the kernel code noticing that the gpu just
died. Of course, if the dying gpu takes along the complete system, you're
not going to see this.

Created an attachment (id=34247)
new gtt cache coherency patch

New version of my patch. Changes:
- improved cache coherency checker. Instead of always using the same address it now constantly changes. Should increase the chance of catching an inconsistency. But still, this won't catch everything and there's still the chance that the gpu reads crap and dies before we detect that the chipset flush doesn't work.

- increased the size of the magic gtt write. I don't like this, but it seems to help. Since testing of the last patch showed that I had only implemented a fancy delay, I tried different techniques. Unfortunately, none worked.

Testing feedback highly welcome. If you report results, please add some details about your machine: Processor (model and frequency), ram (type, speed & size). Perhaps there's a pattern there.

Oh, and legolas created a nice small script that calculates the cache failure rate of the running kernel:

https://bugs.freedesktop.org/attachment.cgi?id=34240

Geir Ove Myhr (gomyhr) on 2010-03-20
Changed in xserver-xorg-video-intel:
status: Confirmed → Unknown
description: updated

Created an attachment (id=34262)
i855/Acer TM66x: Test results with patch in attachment #34194 from bug #26345

Changed in xserver-xorg-video-intel:
status: Unknown → Confirmed

> --- Comment #7 from Bruno <email address hidden> 2010-03-20 10:57:18 PST ---
> Created an attachment (id=34262)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34262)
> i855/Acer TM66x: Test results with patch in attachment #34194 from bug #26345

Thanks for testing. Unfortunately the patch you tested is outdated and
known not to work. At least the failure rates you're getting on your
Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
please retest with my latest patch (attached to this bug)?

Created an attachment (id=34270)
BUG in i915_gem.c - obj_priv->pages_refcount is zero in i915_gem_object_put_pages()

> Thanks for testing. Unfortunately the patch you tested is outdated and
> known not to work. At least the failure rates you're getting on your
> Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
> please retest with my latest patch (attached to this bug)?

It seems to be very much a matter of mood: on Friday evening I saw not a single bad message, while today there were a lot.

Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for sd.
Both with previous patch and current one I'm hitting a BUG() in i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No idea if it's related or not but it kills i915 driver!

Attached is kernel log of system including BUG(), but no bad chipset flush.

> --- Comment #9 from Bruno <email address hidden> 2010-03-20 16:20:41 PST ---
> Created an attachment (id=34270)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34270)
> BUG in i915_gem.c - obj_priv->pages_refcount is zero in
> i915_gem_object_put_pages()
>
> > Thanks for testing. Unfortunately the patch you tested is outdated and
> > known not to work. At least the failure rates you're getting on your
> > Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
> > please retest with my latest patch (attached to this bug)?
>
> It seems to be very much a matter of mood, on friday evening I saw no single
> bad message while today there were a lot.

Just to clarify: You're talking about the BUG in i915_gem_object_put_pages
here, not the "chipset flush failed" warning? And this is still on the old
version of the patch, right?

> Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for
> sd.
> Both with previous patch and current one I'm hitting a BUG() in
> i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No idea
> if it's related or not but it kills i915 driver!

Known issue, I get these here from time to time, too. Looks like my
unmap-inactive hack, intended to stress gtt cache flushing is uncovering
another problem somewhere else. I'll look into this more seriously now.

> > It seems to be very much a matter of mood, on friday evening I saw no single
> > bad message while today there were a lot.
>
> Just to clarify: You're talking about the BUG in i915_gem_object_put_pages
> here, not the "chipset flush failed" warning? And this is still on the old
> version of the patch, right?

The matter of mood was regarding the chipset flush failed warnings (as for GPU wedges before testing with your patch). Some days everything runs smoothly, other days I have to reboot every hour (no noticeable usage difference on my side).

> > Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for
> > sd.
> > Both with previous patch and current one I'm hitting a BUG() in
> > i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No
> > idea if it's related or not but it kills i915 driver!
>
> Known issue, I get these here from time to time, too. Looks like my
> unmap-inactive hack, intended to stress gtt cache flushing is uncovering
> another problem somewhere else. I'll look into this more seriously now.

I got it with both versions of your patch, let's see what today will bring, flush warnings or BUGs (or both).

Bryce Harrington (bryce) on 2010-03-21
tags: added: lucid
Bryce Harrington (bryce) on 2010-03-22
tags: added: freeze
tags: added: crash

Created an attachment (id=34353)
new patch with totally reworked chipset flush

I've combined a few of my previous, non-working ideas into this new patch. Hopefully this adapts better to different cpu/chipset combinations (i.e. a better chance that it works on more than just my machine). The new chipset flush is a magic dance involving a flock of canaries ;)

The magic values (look at include/drm/intel-gtt.h, but not at the comments, they're stale) probably need some tuning. But this new chipset flush is adaptive (fyi it reports the maximum number of retries in the regular chipset flush no. reporting), so I hope it works out of the box.

Unfortunately I haven't yet fixed the BUG_ON in put_pages. For some odd reason I can't reproduce it here anymore and reviewing the code hasn't revealed anything yet.

As usual, testing reports highly welcome.

Created an attachment (id=34362)
BUG again, with flush retries before

Daniel, here is the result of one run with your latest patch (attachment #34353) at the time it BUGed but also having a few flush retries listed before.
I had the previous version running yesterday without any visible issue up to 2M flushes (but I might not have run the right actions)

It happened while surfing with firefox, some site doing fancy things with javascript and transparency for fade-in/out of images and the like.

(In reply to comment #12)
> Unfortunately I haven't yet fixed the BUG_ON in put_pages. For some odd reason
> I can't reproduce it here anymore and reviewing the code hasn't revealed
> anything yet.

Seems I can reproduce it damn easily...

In Firefox on page http://habiter.luxweb.com/NR58686_Mersch_Vente_Maison.html
it's sufficient to click on the image thumbnails to hit the BUG_ON (running Firefox 3.6-r2 + Flash plugin 10.0.45.2 (Gentoo)).
Hovering over the thumbnail images also seems to help trigger flush retries.

> --- Comment #13 from Bruno <email address hidden> 2010-03-23 05:22:10 PST ---
> Created an attachment (id=34362)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34362)
> BUG again, with flush retries before
>
> Daniel, here is the result of one run with your latest patch (attachment
> #34353) at the time it BUGed but also having a few flush retries listed before.
> I had the previous version running yesterday without any visible issue up to 2M
> flushes (but I might not have run the right actions)

Thanks for testing. Don't worry about the retries, that's an integral part
of the new chipset flush. As long as it's a small number, everything's
fine (I've gotten higher numbers than you while testing).

The important thing is whether you still get backtraces about failed
chipset flushes (there's none in the dmesg you attached). iirc the
previous patch still had problems there for you.

If the BUG is causing you too much grief while testing, undo the two
changes to drivers/gpu/drm/i915/i915_gem.c the patch makes. But that will
greatly reduce the number of chipset flushes, massively hampering
testing. Otherwise just check your dmesg for any "chipset flush failed"
messages and hit the reset knob ;)
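
The dmesg check suggested above could be scripted roughly as follows. This is only a sketch: it assumes the warning contains the literal text "chipset flush failed" as quoted in this thread (the exact kernel message format is not shown here), and the function name is made up for illustration.

```python
# Assumption: the patched kernel logs this literal substring; the exact
# message format is not shown in this thread.
FAILED_MARKER = "chipset flush failed"


def count_failed_flushes(lines):
    """Return how many log lines mention a failed chipset flush."""
    return sum(1 for line in lines if FAILED_MARKER in line)
```

To use it on a live system, the output of dmesg can be passed in line by line, for example by piping dmesg into a small wrapper that calls count_failed_flushes(sys.stdin).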

Geir Ove Myhr (gomyhr) on 2010-03-23
description: updated

Created an attachment (id=34375)
dmesg with gtt flush v4 patch

dmesg summary: Complaints about chipset flush timeout, subsequent log entries show that the maximum number of retries has been reached. Eventual freeze.

(In reply to comment #16)
> dmesg summary: Complaints about chipset flush timeout

While scanning over relevant code, I also noticed this: It should probably be canary_gtt_read, canary_cpu_read in drivers/char/agp/intel-gtt.c:978 instead of canary_cpu_read, canary_cpu_read.

> --- Comment #17 from <email address hidden> 2010-03-23 12:18:05 PST ---
> (In reply to comment #16)
> > dmesg summary: Complaints about chipset flush timeout
>
> While scanning over relevant code, I also noticed this: It should probably be
> canary_gtt_read, canary_cpu_read in drivers/char/agp/intel-gtt.c:978 instead of
> canary_cpu_read, canary_cpu_read.

Yep, you're right. Fixed in my local version.

I've looked at your dmesg and the chipset flush clearly doesn't work for
your hw. Can you please try to hang the gpu again (with my latest patch)
and then capture i915_error_state from <debugfs>/dri/0. Just to make sure
that your gpu is crashing due to a cache coherency bug and not due to
something else (rather unlikely). Meanwhile I try to come up with a new
idea to fix your problem.
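
Capturing the error state could look roughly like the sketch below. It assumes debugfs is mounted at /sys/kernel/debug (the usual location, but an assumption here, since the comment only says <debugfs>). The helper names are made up for illustration, and reading debugfs normally requires root.

```python
from pathlib import Path


def error_state_path(debugfs_root="/sys/kernel/debug", card=0):
    """Build the path to i915_error_state under <debugfs>/dri/<card>."""
    return Path(debugfs_root) / "dri" / str(card) / "i915_error_state"


def save_error_state(dest, debugfs_root="/sys/kernel/debug", card=0):
    """Copy the current error state to dest so it can be attached to the bug."""
    dest = Path(dest)
    dest.write_bytes(error_state_path(debugfs_root, card).read_bytes())
    return dest
```

Run it right after the GPU hangs and before rebooting, since the error state is not kept across a reboot.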

Created an attachment (id=34376)
error state after freeze

Here you go.

Chris Wilson suggested elsewhere that this particular discrepancy between patch results of the same hardware might be related to chipset/hardware revision and corresponding quirks, so this may or may not explain oddities. (lspci claims rev 02 for my GPU)

Created an attachment (id=34377)
new patch, improved retry flushing

New patch to (hopefully) tackle the problems uncovered by 2points. Please retest.

Bruno, I haven't yet found the problem with the BUG in put_pages. I'm hitting it about once every few days (with the same backtrace as you), but I haven't got a clue yet what's causing it.

Daniel, is this patch worth testing on i845 as well?

> --- Comment #21 from Brian Rogers <email address hidden> 2010-03-24 08:21:43 PST ---
> Daniel, is this patch worth testing on i845 as well?

Since Chris Wilson last tested this patch on his i845 (didn't work), the
patch has changed quite a bit. So it might be worth retesting, if you
have the time to spare. I certainly appreciate any testing feedback I can
get, because every machine (even seemingly similar 855 boxes) seems to act
differently.

Created an attachment (id=34416)
dmesg with gtt flush v5 patch

I've made these observations so far: The number of reported failed chipset flushes has gone down drastically. In four test runs, I've spotted a chipset flush failure warning in only one so far. However, the GPU will still hang after a while, often without any flush-related warnings in dmesg.

I'd also like to add to the speculation that the failure rate might be related to CPU or possibly IO load. X runs for a lot longer if I just leave glxgears running by itself, but as soon as I start doing some work (Kontact, Opera, Flash plugin etc.) a GPU freeze seems more likely.

Created an attachment (id=34417)
error state after freeze, v5

Created an attachment (id=34418)
Clip solid fills

So I am working on the assumption that the residual hangs I am seeing after enabling/disabling the GMCH to force the GTT flush are real batch buffer bugs...

For instance http://bugs.freedesktop.org/attachment.cgi?id=34417 looks like we attempt to write well beyond the end of the buffer. In which case the attached should work around the issue.

Created an attachment (id=34419)
dmesg with gtt flush v5 patch

(In reply to comment #25)
> Created an attachment (id=34418) [details]
> Clip solid fills
>
> So I am working on the assumption that the residual hangs I am seeing after
> enabling/disabling the GMCH to force the GTT flush are real batch buffer bugs...

Thanks, doing some tests with git HEAD and this patch. As far as chipset flushes go, I'm at slightly over 4.5M flushes right now. Since this is about four times as long as the machine usually lasts without freezing, I'd conclude that the fix is pretty effective.

As for failed chipset flushes, dmesg records three occurrences now after about one hour of glxgears and various other tasks.

(In reply to comment #25)
> For instance http://bugs.freedesktop.org/attachment.cgi?id=34417 looks like we
> attempt to write well beyond the end of the buffer.

Could you explain how to see this from the file or intel_error_decode output? I'm trying to make some documentation for downstream on how to interpret the dumps.

Is it related to this? I'm not sure what those numbers mean, except that the first is obviously the start address of the batch buffer.
Buffers [13]:
...
  02821000 16384 00000048 00000000 000e354b dirty purgeable

I forgot to add myself to CC list, will now try latest patch!

Sorry, I have reported my findings on bug 26345:

- debugfs DRI dumps (see attachment 34436)
- dmesg with GTT failures (attachment 34437)

http://bugs.freedesktop.org/show_bug.cgi?id=26345#c76

Geir Ove Myhr (gomyhr) on 2010-03-25
description: updated

Created an attachment (id=34469)
gtt chipset flush v6

Changes since v5:
- tuned magic values, hopefully fixing problems seen by 2points and legolas
- some debug checks trying to catch the put_pages BUG_ON problem while it's happening.

As usual, testing feedback highly welcome.

Created an attachment (id=34473)
dmesg of 2 failures with v6 patch

Created an attachment (id=34474)
debugfs dumps after 2 failures with v6 patch

I have just tested the v6 patch with 3 glxgears and I got 2 flush failures; I can't tell whether anything changed noticeably compared with the previous patch.

> --- Comment #32 from legolas558 <email address hidden> 2010-03-26 03:19:48 PST ---
> Created an attachment (id=34474)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34474)
> debugfs dumps after 2 failures with v6 patch
>
> I have just tested v6 patch with 3 glxgears and I got 2 flush failures; it is
> not possible to state if anything sensibly changed when compared with previous
> patch

Indeed, not much changed at all. The real problem is reported in the first
backtrace (grep for "chipset flush timed out" in your dmesg). It
essentially means that the code has given up. All further failed flushes
are most likely just a result of this.

The problem seems to be writes to the gtt just don't show up at the cpu
side on your box. There's a tunable in my patch in the file
drivers/char/agp/intel-gtt.c

#define I830_GTT_MAX_RETRIES 100

Can you try out whether increasing this to a ridiculous number (like 1000)
helps? This might cause your machine to stall sometimes. As soon as you
get the chipset flush timed out message (and the max retries hits the
value you've defined), give up.
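
To make the role of the retry limit concrete, here is a toy model in Python. It is not the patch's code: read_gtt and read_cpu are hypothetical callables standing in for the two canary reads, delayed_view is an invented helper that simulates a view that stays stale for a while, and only the name I830_GTT_MAX_RETRIES and its value of 100 come from the comment above.

```python
def flush_with_retries(read_gtt, read_cpu, max_retries=100):
    """Poll until both views of the canary agree, or give up.

    max_retries plays the role of I830_GTT_MAX_RETRIES.
    Returns (coherent, retries_used).
    """
    for retries in range(max_retries + 1):
        if read_gtt() == read_cpu():
            return True, retries
    # Corresponds to the "chipset flush timed out" case: the code gives up.
    return False, max_retries


def delayed_view(value, stale_reads):
    """Return a reader that yields a stale value for the first stale_reads calls."""
    state = {"calls": 0}

    def read():
        state["calls"] += 1
        return value if state["calls"] > stale_reads else None

    return read
```

In this model, raising the limit only helps when the stale view catches up within the limit; if it never does, the flush still times out, just later.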

Thanks a lot for testing this stuff.

btw, has video stability increased further with v6?

(In reply to comment #33)
> Indeed, not much changed at all. The real problem is reported in the first
> backtrace (grep for "chipset flush timed out" in your dmesg). It
> essentially means that the code has given up. All further failed flushes
> are most likely just a result of this.
>
I have an uptime of 50 minutes (it's the same session of reported dump/dmesg files) and my ratio is still 2 / 229376 (using gttqual script in attachment 34435), so it is indeed as you said.
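
For reference, the ratio quoted above works out as shown in this small sketch. The function name is made up, and this is not the gttqual script itself, only the arithmetic on the two numbers given here.

```python
from fractions import Fraction


def failure_rate(failed, total):
    """Return failed/total as an exact fraction."""
    if total <= 0:
        raise ValueError("total number of flushes must be positive")
    return Fraction(failed, total)


rate = failure_rate(2, 229376)
print(rate)         # exact ratio, reduced
print(float(rate))  # same value as a decimal
```

The reduced fraction is 1/114688, i.e. one failed flush per 114688 flushes, or roughly 8.7 failures per million.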

> The problem seems to be writes to the gtt just don't show up at the cpu
> side on your box. There's a tunable in my patch in the file
> drivers/char/agp/intel-gtt.c
>
> #define I830_GTT_MAX_RETRIES 100
>
> Can you try out whether increasing this to a ridiculous number (like 1000)
> helps? This might cause your machine to stall sometimes. As soon as you
> get the chipset flush timed out message (and the max retries hits the
> value you've defined), give up.
>
> Thanks a lot for testing this stuff.
>
Yes I am gonna make this test on next reboot and see what happens.

> btw, has video stability increased further with v6?
>
I would say definitely yes. I have tried with all videos which triggered the bug and no crash yet.

Other important notes:
* I am using only the v6 patch on drm-intel from git, and nothing else, not even the patched xf86-video-intel
* my kernel command line parameters are: lapic=yes hpet=force clocksource=hpet i8042.nomux=1

This hardware (Fujitsu Amilo clone) has a known problem with ACPI/i8042 controller; the i8042.nomux=1 is used to prevent touchpad glitches (http://bugzilla.kernel.org/show_bug.cgi?id=8740), while the battery, thermal and ac modules are always unloaded because otherwise this kernel bug would be triggered: http://bugzilla.kernel.org/show_bug.cgi?id=9147

Apart from these facts, I don't know anything else which could be relevant

Created an attachment (id=34476)
dmesg of early flush failure with 2000 retries

Bug triggered instantly, no noticeable (additional) slowdown. This looks like a dead end...

Shall I also attach the debugfs dri data? Should I run any special test?

what if we "give up" on coherency of the first n flushes? something like the screen test of arcade machines, except that we just don't check if test is OK, pretending that coherency becomes consistent after that; I have often found that the last graphics contents of the LVDS pipe are persistent between a reboot (when using a liveCD, for example) and you can actually see the last screenshot a while before screen gets properly cleared and initialized.

Sorry but I am shooting in the dark here

> --- Comment #36 from legolas558 <email address hidden> 2010-03-26 04:51:08 PST ---
> what if we "give up" on coherency of the first n flushes? something like the
> screen test of arcade machines, except that we just don't check if test is OK,
> pretending that coherency becomes consistent after that; I have often found
> that the last graphics contents of the LVDS pipe are persistent between a
> reboot (when using a liveCD, for example) and you can actually see the last
> screenshot a while before screen gets properly cleared and initialized.

That's exactly what my patch currently does - it simply gives up after too
many retries. Now if this corrupts a pixmap/texture, it just yields visual
corruptions on the screen. But this can also corrupt the gpu command
buffer (and some other vital things). And if the gpu reads crap from
these, it usually just hangs itself.

You've increased max_retries to 2000, which equals about 1 ms of delay.
And it hasn't helped at all, i.e. the chipset takes probably even longer
to reach a coherent state again. And 1 ms is an eternity for computer hw,
so this will crash your box - sooner or later.
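
As a back-of-the-envelope sketch, the 2000 retries / 1 ms figure above implies about 0.5 µs per retry. The snippet below extrapolates that to other limits, assuming the delay grows linearly with the retry count. That linearity is an assumption, not something measured in this thread.

```python
# From the comment above: 2000 retries take roughly 1 ms.
REFERENCE_RETRIES = 2000
REFERENCE_DELAY_US = 1000.0  # 1 ms expressed in microseconds


def worst_case_delay_us(max_retries):
    """Estimate the worst-case stall in microseconds (linear extrapolation)."""
    return max_retries * REFERENCE_DELAY_US / REFERENCE_RETRIES


for limit in (100, 1000, 2000):
    print(limit, worst_case_delay_us(limit))
```

Under that assumption the default limit of 100 corresponds to about 50 µs, and 1000 to about 500 µs.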

> Sorry but I am shooting in the dark here

We all are ;) But there are some more constants to tune, this time in
include/drm/intel-gtt.h

#define I830_CC_GTT_WHACK_PAGES 16

Try to increase this (doubling it each step is sensible, the algo only
uses as much as required, this is just an upper bound). But don't go above
128, that'd be crazy (and I would have to figure out a new trick).
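
The doubling schedule described above, sketched in Python. The function name is hypothetical; the starting value 16 and the ceiling 128 are the ones given in this comment.

```python
def whack_page_steps(start=16, limit=128):
    """List the I830_CC_GTT_WHACK_PAGES values to try, doubling each step."""
    steps = []
    value = start * 2
    while value <= limit:
        steps.append(value)
        value *= 2
    return steps


print(whack_page_steps())  # [32, 64, 128]
```

That gives three test runs before the upper bound is reached.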

btw, dmesg is usually enough - I'll ask if I need anything else.

(In reply to comment #37)
> > --- Comment #36 from legolas558 <email address hidden> 2010-03-26 04:51:08 PST ---
> > what if we "give up" on coherency of the first n flushes? something like the
> > screen test of arcade machines, except that we just don't check if test is OK,
> > pretending that coherency becomes consistent after that; I have often found
> > that the last graphics contents of the LVDS pipe are persistent between a
> > reboot (when using a liveCD, for example) and you can actually see the last
> > screenshot a while before screen gets properly cleared and initialized.
>
> That's exactly what my patch currently does - it simply gives up after too
> many retries. Now if this corrupts a pixmap/texture, it just yields visual
> corruptions on the screen. But this can also corrupt the gpu command
> buffer (and some other vital things). And if the gpu reads crap from
> these, it usually just hangs itself.
>
Yes I can understand this; there's always a state machine behind the scenes.

> You've increased max_retries to 2000, which equals about 1 ms of delay.
> And it hasn't helped at all, i.e. the chipset takes probably even longer
> to reach a coherent state again. And 1 ms is an eternity for computer hw,
> so this will crash your box - sooner or later.
>
I have done some other tests and it seems that 1000 or 2000 is high enough to *never* cause a failure with mild usage, while if I make 2 glxgears windows have a clipping rectangle, the failure happens immediately. Might this help? Perhaps OpenGL is altering the GPU in some way that we cannot forecast?

> > Sorry but I am shooting in the dark here
>
> We all are ;) But there are some more constants to tune, this time in
> include/drm/intel-gtt.h
>
> #define I830_CC_GTT_WHACK_PAGES 16
>
> Try to increase this (doubling it each step is sensible, the algo only
> uses as much as required, this is just an upper bound). But don't go above
> 128, that'd be crazy (and I would have to figure out a new trick).
>
> btw, dmesg is usually enough - I'll ask if I need anything else.
>
Ok, thank you Daniel, I will revert the retries to 1000 and make the next possible 3 tests with the whack pages constant.

Can we state that the pre-KMS driver was working well enough because the bug was harder to trigger in those conditions? I can say I experienced lockups when watching videos or when shutting down even with the pre-KMS/Xorg1.6 combo, but it was rare.

> --- Comment #38 from legolas558 <email address hidden> 2010-03-26 07:08:31 PST ---
> Can we state that the pre-KMS driver was working good enough because bug was
> harder to trigger in those conditions? I can say I experienced lockups when
> watching videos or when shutting down even with the pre-KMS/Xorg1.6 combo, but
> it was rare.

Yep, the cache coherency bug was most likely always there. kms code tends
to be faster in certain circumstances and therefore tends to hit cache
coherency problems with a higher probability. That's also why Eric's patch
from half a year ago managed to break a few i855 chipsets by slightly
changing the timings. But that's just coincidental because that piece of
code helps cache coherency on i865 chipsets.

btw, I've tried your two-glxgears-with-clipping test. No ill effects, here
...

Geir Ove Myhr (gomyhr) on 2010-03-26
description: updated
Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Triaged → Incomplete
Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Incomplete → Fix Released
Bryce Harrington (bryce) on 2010-04-14
Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Fix Released → Triaged
tags: added: patch
Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Triaged → Fix Released
Geir Ove Myhr (gomyhr) on 2010-04-26
Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Fix Released → Triaged
Steve Langasek (vorlon) on 2010-04-28
Changed in ubuntu-release-notes:
status: New → Incomplete
Bryce Harrington (bryce) on 2010-04-28
Changed in ubuntu-release-notes:
status: Incomplete → Fix Released
Filippo neri (fneri23) on 2010-05-01
Changed in ubuntu-release-notes:
status: Fix Released → In Progress
status: In Progress → Fix Committed
tags: removed: patch
tags: added: patch
Steve Langasek (vorlon) on 2010-05-08
Changed in ubuntu-release-notes:
status: Fix Committed → Fix Released
description: updated
description: updated
Changed in xserver-xorg-video-intel (Ubuntu Lucid):
assignee: nobody → Chris Halse Rogers (raof)
description: updated
description: updated
description: updated
Brian Rogers (brian-rogers) wrote :

David, I just uploaded a new kernel (currently building for both Lucid and Maverick) that reverts the commit that caused the invisible cursor regression. That way you or anyone else experiencing this bug isn't stuck on an RC kernel until there's a proper fix.

The new kernel is linux-image-2.6.35-ppa20+v9+cursorfix-generic.

fossfreedom (fossfreedom) wrote :

Thank you Brian - indeed, this has fixed the cursor issue. It has also fixed a few other issues such as permanent screen dimming and panel update issues. From the short time I've been playing, the kernel looks rock solid. Many thanks again.

fossfreedom (fossfreedom) wrote :

Further update Brian - I've been running the kernel continuously today. Whilst the cursor is now visible, there are occasional freezes which make navigation annoying. For the moment, I'll stick to the rc6 kernel. Hopefully there will be good news on this front from upstream soon (here's hoping).

Brian Rogers (brian-rogers) wrote :

You mean that it freezes temporarily, then resumes, like a stuttering behavior? In that case, do new messages appear in dmesg after a freeze?

Brian, I've been using your latest ppa20+v9+cursorfix-generic kernel, and it has made my system usable again, unlike the previous attempts. The cursor freeze has disappeared, and I can once again watch videos in Dragon Player and VLC, neither of which could be used at all before. Oddly, Flash video in a browser has always worked well, and the new kernel has not degraded performance there. The mouse cursor does still occasionally stutter, but there are no error messages that I can discover. Similarly, video sometimes stutters, but this seems to happen only on higher-resolution sources (like HD video), and I notice it only when playing fullscreen. Lower-resolution video plays very well, including fullscreen.

The system still sometimes freezes solid, but this behavior is greatly reduced from previous kernels, and is now at an almost tolerable level. Unfortunately, a system freeze means just that, so I have no way to determine if new error messages appeared immediately before the freeze. Freezing doesn't seem to be triggered by any particular operation, such as certain mouse movements, and it can happen just as easily when typing in a Konsole or resizing a window as when watching a fullscreen video.

fossfreedom (fossfreedom) wrote :

Brian, I can concur with the above. I had tail -f /var/log/messages in one window and glxgears in another. The reason for glxgears was that I just needed something with continuous movement to be displayed. When the cursor appeared to momentarily freeze, glxgears also stopped. No additional messages were logged. I haven't seen any solid freezes like those seen by scott, though.

A ThinkPad R51 was freezing at boot after upgrading to Lucid. I used Stefan Glasenhardt's Ubuntu PPA as described here:

https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes#GTT%20Incoherency%20Patch

and it solved the problem.

(In reply to comment #238)
> A ThinkPad R51 was freezing at boot after upgrading to Lucid. I used Stefan
> Glasenhardt's Ubuntu PPA as described here:
>
> https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes#GTT%20Incoherency%20Patch
>
> and it solved the problem.

Stefan uses the same patch that is posted here, which is not effective for me. I am using Linus's git tree; I think the patch should be sent upstream, as it only improves things for all use cases I have seen so far.

@Chris Wilson: is there a cumulative patch for your UMS "legacy" work?

Hello, I tried the latest patch with 2.6.36-rc3, and it basically works, but every ~9-10s the mouse pointer hangs, and dmesg is flooded with a lot of these messages:

[ 214.024020] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane, expect flickering: entries required = 43, available = 42.
[ 214.024029] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane, expect flickering: entries required = 43, available = 42.

my GPU:
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02) (prog-if 00 [VGA controller])
 Subsystem: Acer Incorporated [ALI] Device 0064
 Flags: bus master, fast devsel, latency 0, IRQ 6
 Memory at e8000000 (32-bit, prefetchable) [size=128M]
 Memory at e0000000 (32-bit, non-prefetchable) [size=512K]
 I/O ports at 1800 [size=8]
 Expansion ROM at <unassigned> [disabled]
 Capabilities: <access denied>
 Kernel driver in use: i915

> --- Comment #240 from Kristijan Vrban <email address hidden> 2010-09-12 12:57:37 PDT ---
> Hello, I tried the latest patch with 2.6.36-rc3, and it basically works,
> but every ~9-10s the mouse pointer hangs, and dmesg is flooded with a lot of
> these messages:
>
> [ 214.024020] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane,
> expect flickering: entries required = 43, available = 42.
> [ 214.024029] [drm:intel_calculate_wm] *ERROR* Insufficient FIFO for plane,
> expect flickering: entries required = 43, available = 42.

Both known problems: The hang every 10s is hotplug code wasting too much
time (and hence stalling mouse updates). The warning is harmless, the code
wasn't changed at all, it just started reporting possible causes for
flicker. Both problems have patches in drm-intel/drm-intel-fixes that
should land in -stable sooner or later.

-Daniel

Brian Rogers (brian-rogers) wrote :

A real fix was posted for the invisible cursor issue, so I incorporated it into linux-image-2.6.35-ppa21+v9patch-generic (building now).

As for the periodic freezing issue, that is covered by this bug:
https://bugs.freedesktop.org/show_bug.cgi?id=29536

The patches there don't apply cleanly to 2.6.35, so I'll have to look at them before I include them in a graphics-fixes kernel.

description: updated
fossfreedom (fossfreedom) wrote :

Thank you, Brian - I've tested your latest kernel and it has indeed fixed the cursor issue. On my machine, the periodic freezing issue from your last kernel is not observed. I've decided to remove the rc6 kernel and run full time on this kernel.

Question - what are the chances of the fixes in this new kernel finding their way into Maverick? It's a real shame for those i855 users of Maverick who are stuck with vesa graphics.

Brian Rogers (brian-rogers) wrote :

The invisible cursor fix will be sent to stable and make it into Maverick that way. I don't know if it will make it in before release, though.

As for the stability fix, Daniel Vetter has said the following:

"I haven't upstreamed the patch for a few reasons:
- It's an extremely ugly approach, involving way too much duct-tape. Now if it would actually reliably work, but that's not the case.
- It has (under certain circumstances) rather severe performance implications (mostly because the eviction code is not clever enough).
Hence why I'm not satisfied and of the opinion that upstreaming might cause more harm than good. Different story for distros, though.

I have a few ideas as how to amend this, but that requires a complete rewrite of the gtt code. I've finally found time to start hacking on this [...]"

Complete comment here: https://bugs.freedesktop.org/show_bug.cgi?id=27187#c233

So it's not planned to go into 2.6.35.x, but it could be picked up individually by distros. A better patch will eventually be made for future kernels and sent upstream.

On Tue, 2010-09-14 at 00:44 +0000, Brian Rogers wrote:
> The invisible cursor fix will be sent to stable and make it into
> Maverick that way. I don't know if it will make it in before release,
> though.

Brian, do you have any idea why Fedora 13 (current kernel 2.6.34.6-54)
is not affected by this bug?

I have been wondering for some time now why i855 notebooks were
incompatible with 9.10, 10.04, 10.10, unless doing some workaround, but
installed fine on Fedora 13 (I haven't tried any previous release).

Isn't the kernel common to all the distributions?

Brian Rogers (brian-rogers) wrote :

In Fedora's kernel package:
http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=tree;h=refs/heads/f13/master;hb=f13/master

I see drm-intel-big-hammer.patch. That's a patch that improved stability somewhat, but didn't quite solve the problem. My testcase could still kill the system. It also causes slowdowns, which can be extreme in some cases.

nomnex (nomnex) wrote :

On Tue, 2010-09-14 at 03:56 +0000, Brian Rogers wrote:
> In Fedora's kernel package:
> http://pkgs.fedoraproject.org/gitweb/?p=kernel.git;a=tree;h=refs/heads/f13/master;hb=f13/master
>
> I see drm-intel-big-hammer.patch. That's a patch that improved stability
> somewhat, but didn't quite solve the problem. My testcase could still
> kill the system. It also causes slowdowns, which can be extreme in some
> cases.
>

Thank you for the explanation and the link.
nomnex

Changed in xserver-xorg-video-intel:
importance: Unknown → Medium

Daniel, are you interested in people testing your gtt rework branch on non-i8xx hardware right now, or is it in an early enough state that the feedback would just be noise?

I could test that branch on my i965 laptop if you'd find that helpful.

Workaround A from https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes alone worked for a Dell Latitude D505 with this chipset (output of $ lspci | grep Intel):
00:02.1 Display controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
Workaround A is:
To turn KMS back on, run these commands in a Terminal window and reboot:
echo options i915 modeset=1 | sudo tee /etc/modprobe.d/i915-kms.conf
sudo update-initramfs -u
Previously, Lucid with kernel 2.6.31-22 generic and all previous releases (Karmic and earlier) worked fine, but Lucid with kernel 2.6.32-24 generic showed some graphics at startup and then went to a blank screen.
The problem was first seen after upgrading from Karmic to Lucid Xubuntu. 22Sep2010
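
To confirm that the option written by the workaround is actually in place, the config file can be read back. A minimal sketch, assuming the file path used in the commands above; i915_modeset_from_conf is a hypothetical helper name, not something the wiki provides.

```shell
# Hypothetical helper: print the i915 modeset value set in a modprobe.d
# file, e.g. /etc/modprobe.d/i915-kms.conf from the workaround above.
# Prints nothing if the file has no "options i915 ... modeset=N" line.
i915_modeset_from_conf() {
    conf="$1"
    # Keep only the value after "modeset="; if the option appears more
    # than once, report the last occurrence.
    sed -n 's/^[[:space:]]*options[[:space:]]\{1,\}i915[[:space:]].*modeset=\([0-9-]*\).*/\1/p' "$conf" | tail -n 1
}
```

Running `i915_modeset_from_conf /etc/modprobe.d/i915-kms.conf` should print 1 after applying the workaround. After the reboot, the value the running module uses can usually be read from /sys/module/i915/parameters/modeset (this may need root).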

Hello,
I have tested your fix and your driver for the i855GM chipset on Lucid.
I can turn on compiz, but if I watch a video in VLC, compiz crashes. If I don't activate hardware acceleration (sorry, I'm French, so I don't know if that is the right term), it works. But if I activate hardware acceleration, VLC crashes after 30 minutes.

So I tested Ubuntu 10.10, because I had read that it works better there. But hardware acceleration isn't activated, so when I watch a video I get some lag. I activated it manually, but it isn't stable, so I tested your driver and your fix, but I couldn't install the fix because of the 2.6.35 kernel.

Can you adapt your fix to the 2.6.35 kernel for Maverick?
Or can you think of another solution?

Chris Halse Rogers (raof) wrote :

I'm unassigning myself from this bug; I've got the needed feedback for Maverick, and we've gone with the safe option of fbdev.

I'll leave this bug open; there's still a reasonable chance we can get a proper fix, and apparently some i8xx documentation has just been released.

Changed in xserver-xorg-video-intel (Ubuntu Lucid):
assignee: Chris Halse Rogers (raof) → nobody
ilanrab (ilanrab) wrote :

This issue has been floating around, in Ubuntu, for too long. Why wasn't it escalated?

When my upgrade from 8.04LTS to 10.04.1LTS failed so miserably (Wireless problems and locking GPU problem) I moved myself to Puppy Linux. The Puppy distro does not lock up on me when I use graphics and video.

Version 5.1 of Puppy Linux is based on the Ubuntu Lucid binaries (called Lupu). I am having difficulty understanding why Ubuntu Lucid is failing so miserably. What do PuppyLinux's drivers have that Lucid does not? I get a freeze, on average, every 15 minutes, every single day that I try to use Ubuntu Lucid, nowadays.

Maybe someone should contact Barry Kauler (of Puppy Linux) and ask some questions so that this issue GOES AWAY FAST.

Is this a Kernel version issue? What's going on?

 - Ilan -

gmud (gmud) wrote :

>When my upgrade from 8.04LTS to 10.04.1LTS failed so miserably (Wireless problems and locking GPU problem)
>I moved myself to Puppy Linux. The Puppy distro does not lock up on me when I use graphics and video.

Oh, that's interesting. Maybe it has something to do with the upgrade process (e.g. old xorg.conf, upstart or something). Would you mind testing a fresh install of Lucid (in case you did a fresh install of Puppy Linux)?

ilanrab (ilanrab) wrote :

Thank you for the suggestion, gmud. That's a good idea. Unfortunately I do not have the resources to install a fresh copy of Ubuntu on top of the existing Ubuntu partition that I am using. The Puppy Linux distro is running fully off an 8GB pendrive.

Earlier today I installed the Glasen 0.7.7 driver fix for the i855 on my Dell Latitude D400. Both the stable version of the driver and the experimental (exp) version acted the same way:
1. I can now play videos (e.g. with VLC) with no problem. Until this fix, my GPU would lock up the second I started any video.
2. The GPU now locks up only when I run certain graphics. This is an improvement.
I can make it fail consistently when I run Google Chrome with the Facebook game "Mafia Wars". The page that causes the lockup is the "NewYork" stage > "properties" page. After less than a minute of pressing various buttons on this page, the GPU locks up. My xorg.conf is the standard one.

Good work, Stefan. Thank you for the fix. Good progress.

denisb (denis-bonnenfant) wrote :

The latest version almost works on my Dell i855 notebook, but there are still occasional freezes when using SolidWorks under wine 1.3.5 (a big OpenGL app). If it helps, I can post logs.
On the other hand, under WinXP, SolidWorks uses software OpenGL on this hardware...

Good work! It's the first time I have seen such a big app working under wine on an Intel platform! 3D is really faster than expected, so solving the remaining stability issues would be a real step forward.

Geir Ove Myhr (gomyhr) wrote :

There is yet another new kernel patch in the upstream bug report that has the potential to fix this bug once and for all:
https://bugs.freedesktop.org/show_bug.cgi?id=27187#c281

I don't have the resources (full hard drive and other limitations) to build an Ubuntu kernel package with this patch. If anybody is able to put a test kernel in a PPA (or some other place), that would be nice.

@Geir Ove Myhr
How difficult is it to build an Ubuntu kernel package?
Because I still have the machine that had this problem, but I am not really using it anymore. So it is free for testing. But I fear I already upgraded it to Maverick. (Is there also a way to build the older kernel in Maverick?)

Geir Ove Myhr (gomyhr) wrote :

I have previously used the instructions at https://wiki.ubuntu.com/KernelTeam/GitKernelBuild . Instead of getting the vanilla upstream kernel in step 2, I would use the drm-intel-next head from git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git. Then you would need to apply the patch before going on. If you have a faster, newer computer, I would recommend building on that one, because compiling a kernel on computers with an 855GM chipset tends to be slow. You will need to use the 32-bit version of Ubuntu on that machine, though.

It is probably better to use Maverick than Lucid for testing this. Natty even better. With Natty, you should be able to steal most of the kernel configuration from the drm-intel-next mainline build in step 4 (see https://wiki.ubuntu.com/Kernel/MainlineBuilds).

I haven't done this in quite a while, so I don't remember all there is to it.

Brian Rogers (brian-rogers) wrote :

I've made a kernel with the latest fix available in this PPA:
https://launchpad.net/~brian-rogers/+archive/graphics-fixes-testing

If it gets good feedback, I'll copy it to my regular graphics-fixes PPA.

This kernel is based on Natty's 2.6.37 kernel, has the changes in drm-intel-next applied, and the patch at https://bugs.freedesktop.org/show_bug.cgi?id=27187#c291 is added. I've produced builds for Lucid, Maverick, and Natty. It has a cache coherency checker, so you'll see periodic messages in dmesg reporting on the number of flushes and whether any problems are detected.

From Daniel Vetter's patch description:
> Poke HIC bit + wbinv + cache coherency checker
>
> Chris Wilson's latest patch with my cache coherency checker added. Spills the
> number of chipset flushes regurlarly into the dmesg and bails loudly if one
> fails.
>
> Tested-by lines (like for the previous patch attempts by me) highly welcome.

Feedback about the patch can be sent directly to the upstream bug report at
https://bugs.freedesktop.org/show_bug.cgi?id=27187

If you have issues relating to installing/booting this kernel, report them here.

fossfreedom (fossfreedom) wrote :

Brian - I've been running this new kernel on lucid all this weekend - not a single lockup. Seriously impressed. It even fixed a few niggles such as screen dimming. Note - I'm running vanilla lucid with your kernel - no other fixes or other graphic updates in my system.

Changed in xserver-xorg-video-intel:
importance: Medium → Unknown
status: Confirmed → Fix Released
tags: added: kj-triage
Changed in xserver-xorg-video-intel:
importance: Unknown → Medium
Bryce Harrington (bryce) on 2011-04-28
Changed in xserver-xorg-video-intel (Ubuntu):
importance: High → Wishlist

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 541511

and then change the status of the bug back to 'New'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Lucid):
status: New → Incomplete
Geir Ove Myhr (gomyhr) wrote :

This is a master bug report, and no more logs are required despite the automated message. Besides, the upstream bug report has been closed as FIXED RESOLVED.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Lucid):
status: Incomplete → Confirmed
Christiansen (happylinux) wrote :

Since this bug, as Geir Ove Myhr points out, is solved upstream, all we need now is for the Intel driver to be enabled by default in Xorg in Oneiric (11.10) again, instead of the current default: the simple, featureless and faulty framebuffer driver.

I've successfully been using Oneiric Alpha 1 & 2 and all updates in between and after, with the intel xorg driver on i855 boxes - enabled by the Maverick workaround "Manually enable Intel driver" found at https://wiki.ubuntu.com/X/Bugs/Mavericki8xxStatus.

As the upstream bugfixes are apparently only applied to the latest kernels (3.0) and require xserver-xorg-video-intel drivers newer than 2.14 (2.15 is in Oneiric right now), re-enabling the intel driver will only work for Oneiric (11.10) at present.

Right now the bug seems to be "solved" only by not using the Intel driver in Lucid, Maverick and Natty, which may be the best way for those releases if the patches can't be applied to those kernels and Xorg drivers.

But for Oneiric the patches are already included, and the fixed Intel driver only needs to be enabled by the Xorg packages. Wouldn't the Alpha period be best suited for that purpose?

Stenten (stenten) wrote :

On Thu, Jul 28, 2011 at 6:39 PM, Christiansen <email address hidden>wrote:

> As the upstream bugfixes is apparently only applied to the latest
> kernels (3.0) and demands drivers from xserver-xorg-video-intel never
> than 2.14 (2.15 is in Oneiric right now), reenabling the intel driver
> will only work for Oneiric (11.10) at present time.

This is not true. The fix has been in the mainline kernel since 2.6.38-rc7.
So if you're running Maverick (which uses the .38 kernel), you already have
the patch. You just need to load the -intel driver instead of fbdev like you
mentioned. This is what I am doing right now in fact.
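
To see whether X actually picked up -intel rather than fbdev, the X server log can be checked. A rough sketch only: xorg_video_drivers is a hypothetical helper, and it assumes the log contains the usual `(II) LoadModule: "..."` lines. X may load more than one video driver while probing, so several names can appear.

```shell
# Hypothetical helper: list which of the intel/fbdev/vesa video drivers
# appear in LoadModule lines of an Xorg log read from stdin.
xorg_video_drivers() {
    # -E enables alternation, -o prints only the matching part;
    # cut then pulls the driver name out from between the quotes.
    grep -Eo 'LoadModule: "(intel|fbdev|vesa)"' | cut -d'"' -f2 | sort -u
}
```

For example: `xorg_video_drivers < /var/log/Xorg.0.log`. If only fbdev shows up, the -intel driver has not been enabled.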

So to clarify for everyone who had not been paying attention: 11.10 will
work out of the box for this chipset? And everyone below will need to
enable the intel driver manually? Is that correct?

Ben Straton

On Thu, 2011-07-28 at 23:12 +0000, Stenten wrote:
> On Thu, Jul 28, 2011 at 6:39 PM, Christiansen
> <email address hidden>wrote:
>
> > As the upstream bugfixes is apparently only applied to the latest
> > kernels (3.0) and demands drivers from xserver-xorg-video-intel never
> > than 2.14 (2.15 is in Oneiric right now), reenabling the intel driver
> > will only work for Oneiric (11.10) at present time.
>
>
> This is not true. The fix has been in the mainline kernel since 2.6.38-rc7.
> So if you're running Maverick (which uses the .38 kernel), you already have
> the patch. You just need to load the -intel driver instead of fbdev like you
> mentioned. This is what I am doing right now in fact.
>

Stenten (stenten) wrote :

Sorry, I misspoke. Natty is the first release to contain the fix, not Maverick (Natty is on the .38 kernel; Maverick is on .35).

So Lucid will not work OOTB (out-of-the-box). You need to disable KMS as indicated at [1]. Maverick ships with a workaround, so it works OOTB. Natty contains both the workaround and the upstream fix, so it works OOTB. You can disable the workaround by enabling the -intel driver as indicated at [2]; it will work normally because it also ships with the upstream fix in the kernel. Christiansen is just mentioning that since Oneiric contains the fix too, it should not also ship with the workaround.
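
The kernel side of this can be checked by comparing the running kernel version against 2.6.38, the first release said above to contain the fix. A minimal sketch under those assumptions: has_i8xx_fix is a hypothetical helper, it relies on GNU sort's -V (version sort) option, and it ignores everything after the first "-", so it cannot tell 2.6.38-rc6 from 2.6.38-rc7. Distro kernels with backported patches (such as the PPA kernels in this thread) will also be misjudged.

```shell
# Hypothetical helper: succeed if the given kernel version is 2.6.38 or
# newer, which per the comment above is where the upstream fix landed.
has_i8xx_fix() {
    # Drop the package suffix: "2.6.38-10-generic" becomes "2.6.38".
    ver="${1%%-*}"
    # Version-sort the threshold and the given version; the fix is
    # present when the threshold sorts first (or the two are equal).
    lowest=$(printf '%s\n%s\n' "2.6.38" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "2.6.38" ]
}
```

Usage: `has_i8xx_fix "$(uname -r)" && echo "kernel should contain the upstream fix"`.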

I've opened Bug 817814 to track this and will give a heads up to some of the devs. Those with i8xx chipsets on Oneiric can help by adding a comment to that bug whether or not Oneiric works using the -intel driver on i8xx chipsets. (Play a movie file in Totem or VLC and see if it freezes; also run it for a couple days and see if it freezes.)

[1]: https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes
[2]: https://wiki.ubuntu.com/X/Bugs/Mavericki8xxStatus

Stenten (stenten) wrote :

Nevermind. I'd forgotten, but found it again searching through my mailing lists:

From [1]:
"In natty, we stopped Ubuntu development support for the 8xx chipset;
   these may still work ok with the -intel driver but we no longer
   field bug reports or backport fixes for them. This policy will
   continue in oneiric, but we'll reevaluate for oneiric+1. Meanwhile,
   please work with upstream directly for 8xx problems."

[1]: https://lists.ubuntu.com/archives/ubuntu-devel/2011-June/033414.html

Randall Smith (randall-tnr) wrote :

For those like myself wishing to stick with LTS, I have the 855GM working beautifully with desktop effects and xvideo, with no known problems or side effects. Using https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes as a reference, I enabled KMS and enabled the glasen PPA to install the latest intel driver (currently 2.15.0) and related packages like libdrm2 and libdrm-intel1.

I'm also running kernel 2.6.38-10-generic #46~lucid1-Ubuntu from lucid-updates. I was running backported kernels before, but this one came up a few days ago and has run very nicely. I thought Lucid was sticking to 2.6.32, but this seems to be a supported kernel and even says in the notes that Canonical will support it until October 2011.

Brad Figg (brad-figg) wrote :

For those that have been impacted by this bug and wish to stick with the Lucid LTS, I suggest you investigate installing an lts-backport kernel which contains the necessary support. It is unlikely (not going to happen) that we will actually apply the indicated patches to the Lucid kernel.

Bryce Harrington (bryce) wrote :

As per Brad's comments, the patches that address this issue would not be candidates for SRU in Lucid. Closing out the Lucid tasks.

Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Triaged → Won't Fix
Changed in linux (Ubuntu Lucid):
status: Confirmed → Won't Fix
Bryce Harrington (bryce) wrote :

The GPU lockups typically require kernel patches, as it appears in this case, rather than changes on the X side.
Sometimes we leave the X task open to help track the issue to solution, but as mentioned in recent comments, we're no longer providing engineering support for this old hardware so don't need to track it. I'll leave the linux task open for now though.

Changed in xserver-xorg-video-intel (Ubuntu):
status: Triaged → Invalid
Tim Gardner (timg-tpi) on 2012-10-02
Changed in linux (Ubuntu):
status: Confirmed → Invalid
Displaying first 40 and last 40 comments. View all 439 comments or add a comment.