MASTER: [i855] GPU lockup (apport-crash)

Bug #541511 reported by Geir Ove Myhr
This bug affects 322 people
Affects                              Status         Importance   Assigned to   Milestone
Release Notes for Ubuntu             -              Undecided    Unassigned    -
xf86-video-intel                     Fix Released   Medium       -             -
linux (Ubuntu)                       -              Undecided    Unassigned    -
  Lucid                              -              Undecided    Unassigned    -
xserver-xorg-video-intel (Ubuntu)    -              Wishlist     Unassigned    -
  Lucid                              -              High         Unassigned    -

Bug Description

Binary package hint: xserver-xorg-video-intel

This is a MASTER bug report, i.e. not a real bug report, but a tool to help manage other bug reports.

Most bug reports on i855 are probably due to the CPU/GPU incoherency problem that is now consolidated upstream at http://bugs.freedesktop.org/show_bug.cgi?id=27187 (which was split off from a bug report for i845). For now, we mark all automatically reported GPU lockups on i855 as duplicates of this unless there is a reason not to.

A kernel with the proposed fix is available at https://launchpad.net/~brian-rogers/+archive/graphics-fixes

To use this fixed kernel, run the following commands:

sudo apt-add-repository ppa:brian-rogers/graphics-fixes
sudo apt-get update
sudo apt-get dist-upgrade
sudo apt-get install linux-image-2.6.35-ppa21+v9patch-generic

There is a similar master bug report for i845 at bug 541492.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

This is just a bug report so that I can keep track of my own attempts at fixing this issue. I'll upload a cleaned-up version of my patch soon.

In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34223)
cleaned-up version of my gtt cache coherency patch

In , Daniel-ffwll (daniel-ffwll) wrote :

Forget about the patch for the moment; I've just noticed that it doesn't work on my i855GM either.

Geir Ove Myhr (gomyhr) wrote :

Binary package hint: xserver-xorg-video-intel

This is a MASTER bug report, i.e. not a real bug report, but a tool to help manage other bug reports.

Most bug reports on i855 are probably due to the CPU/GPU incoherency problem that is now consolidated upstream at http://bugs.freedesktop.org/show_bug.cgi?id=26345 (which was originally reported for i845, but applies to i855 as well). For now, we mark all automatically reported GPU lockups on i855 as duplicates of this unless there is a reason not to. There are some tests you can do to help upstream with this issue, and I will come back with instructions here. Those of you who know how to patch and compile a kernel may want to look at comment #61 in the upstream bug report.

There is a similar master bug report for i845 at bug 541492.

Geir Ove Myhr (gomyhr)
Changed in xserver-xorg-video-intel (Ubuntu):
status: New → Triaged
importance: Undecided → High
Changed in xserver-xorg-video-intel:
status: Unknown → Confirmed
In , legolas558 (legolas558) wrote :
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #4 from legolas558 <email address hidden> 2010-03-19 05:01:23 PST ---
> I think that your patch lets other bugs come out and become apparent, like the
> hangcheck timer bug.

Thanks for the list (especially the downstream bugs). btw the hangcheck
timer is the same bug, this is the kernel code noticing that the gpu just
died. Of course, if the dying gpu takes along the complete system, you're
not going to see this.

In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34247)
new gtt cache coherency patch

New version of my patch. Changes:
- improved cache coherency checker. Instead of always using the same address it now constantly changes. Should increase the chance of catching an inconsistency. But still, this won't catch everything and there's still the chance that the gpu reads crap and dies before we detect that the chipset flush doesn't work.

- increased the size of the magic gtt write. I don't like this, but it seems to help. Given that testing of the last patch showed that I've only implemented a fancy delay I've tried different techniques. Unfortunately none worked.

Testing feedback highly welcome. If you report results, please add some details about your machine: processor (model and frequency), RAM (type, speed & size). Perhaps there's a pattern there.

Oh, and legolas created a nice small script that calculates the cache failure rate of the running kernel:

https://bugs.freedesktop.org/attachment.cgi?id=34240
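
The linked script is not reproduced in this report. As a rough illustration of the idea only (this is not legolas558's script), the sketch below counts failure lines in a dmesg dump and divides by a flush total. It assumes that failure lines contain the text "chipset flush failed", as the warnings quoted elsewhere in this report do, and it takes the total number of flushes as a caller-supplied value because the format of the patch's flush counter output is not shown here. The function names are hypothetical, and the third sample line is invented purely to show that unrelated lines are ignored.

```python
def count_flush_failures(dmesg_text: str) -> int:
    """Count dmesg lines that report a failed chipset flush."""
    return sum(
        1 for line in dmesg_text.splitlines()
        if "chipset flush failed" in line
    )


def flush_failure_rate(dmesg_text: str, total_flushes: int) -> float:
    """Return failed flushes divided by a caller-supplied flush total."""
    if total_flushes <= 0:
        raise ValueError("total_flushes must be positive")
    return count_flush_failures(dmesg_text) / total_flushes


if __name__ == "__main__":
    # Two failure lines in the format quoted in this report, plus one
    # made-up unrelated line that must not be counted.
    sample = (
        "[ 79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806\n"
        "[ 79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806\n"
        "[ 80.000000] unrelated kernel message\n"
    )
    print(f"failure rate: {flush_failure_rate(sample, 229376):.2e}")
```

On a real machine the input would be the saved output of `dmesg`, and the total would have to come from whatever counter the patched kernel reports.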

Geir Ove Myhr (gomyhr)
Changed in xserver-xorg-video-intel:
status: Confirmed → Unknown
description: updated
In , Bonbons67 (bonbons67) wrote :

Created an attachment (id=34262)
i855/Acer TM66x: Test results with patch in attachment #34194 from bug #26345

Changed in xserver-xorg-video-intel:
status: Unknown → Confirmed
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #7 from Bruno <email address hidden> 2010-03-20 10:57:18 PST ---
> Created an attachment (id=34262)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34262)
> i855/Acer TM66x: Test results with patch in attachment #34194 from bug #26345

Thanks for testing. Unfortunately the patch you tested is outdated and
known not to work. At least the failure rates you're getting on your
Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
please retest with my latest patch (attached to this bug)?

In , Bonbons67 (bonbons67) wrote :

Created an attachment (id=34270)
BUG in i915_gem.c - obj_priv->pages_refcount is zero in i915_gem_object_put_pages()

> Thanks for testing. Unfortunately the patch you tested is outdated and
> known not to work. At least the failure rates you're getting on your
> Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
> please retest with my latest patch (attached to this bug)?

It seems to be very much a matter of mood: on Friday evening I saw not a single bad message, while today there were a lot.

Since a crash I'm now running 2.6.34-rc2 + your patch and one of mine for sd.
With both the previous patch and the current one I'm hitting a BUG() in i915_gem_object_put_pages(): obj_priv->pages_refcount is zero in there. No idea if it's related or not, but it kills the i915 driver!

Attached is the kernel log of the system, including the BUG() but no bad chipset flush.

In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #9 from Bruno <email address hidden> 2010-03-20 16:20:41 PST ---
> Created an attachment (id=34270)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34270)
> BUG in i915_gem.c - obj_priv->pages_refcount is zero in
> i915_gem_object_put_pages()
>
> > Thanks for testing. Unfortunately the patch you tested is outdated and
> > known not to work. At least the failure rates you're getting on your
> > Pentium M 1.5 GHz are about the same as on my Pentium M 1.2 GHz. Can you
> > please retest with my latest patch (attached to this bug)?
>
> It seems to be very much a matter of mood, on friday evening I saw no single
> bad message while today there were a lot.

Just to clarify: You're talking about the BUG in i915_gem_object_put_pages
here, not the "chipset flush failed" warning? And this is still on the old
version of the patch, right?

> Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for
> sd.
> Both with previous patch and current one I'm hitting a BUG() in
> i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No idea
> if it's related or not but it kills i915 driver!

Known issue, I get these here from time to time, too. Looks like my
unmap-inactive hack, intended to stress gtt cache flushing is uncovering
another problem somewhere else. I'll look into this more seriously now.

In , Bonbons67 (bonbons67) wrote :

> > It seems to be very much a matter of mood, on friday evening I saw no single
> > bad message while today there were a lot.
>
> Just to clarify: You're talking about the BUG in i915_gem_object_put_pages
> here, not the "chipset flush failed" warning? And this is still on the old
> version of the patch, right?

The matter of mood was about the "chipset flush failed" warnings (as with the GPU wedges before testing with your patch). Some days everything runs smoothly; other days I have to reboot every hour (with no noticeable difference in usage on my side).

> > Since a crash I'm running now with 2.6.34-rc2 + your patch and one of mine for
> > sd.
> > Both with previous patch and current one I'm hitting a BUG() in
> > i915_gem_object_put_pages(). obj_priv->pages_refcount is zero in there. No
> > idea if it's related or not but it kills i915 driver!
>
> Known issue, I get these here from time to time, too. Looks like my
> unmap-inactive hack, intended to stress gtt cache flushing is uncovering
> another problem somewhere else. I'll look into this more seriously now.

I got it with both versions of your patch, let's see what today will bring, flush warnings or BUGs (or both).

Bryce Harrington (bryce)
tags: added: lucid
Bryce Harrington (bryce)
tags: added: freeze
tags: added: crash
In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34353)
new patch with totally reworked chipset flush

I've combined a few of my previous, non-working ideas into this new patch. Hopefully this adapts better to different cpu/chipset combinations (i.e. there's a better chance that it works on more than just my machine). The new chipset flush is a magic dance involving a flock of canaries ;)

The magic values (look at include/drm/intel-gtt.h, but not at the comments, they're stale) probably need some tuning. But this new chipset flush is adaptive (fyi it reports the maximum number of retries in the regular chipset flush number reporting), so I hope it works out of the box.

Unfortunately I haven't yet fixed the BUG_ON in put_pages. For some odd reason I can't reproduce it here anymore and reviewing the code hasn't revealed anything yet.

As usual, testing reports highly welcome.

In , Bonbons67 (bonbons67) wrote :

Created an attachment (id=34362)
BUG again, with flush retries before

Daniel, here is the result of one run with your latest patch (attachment #34353) at the time it BUGed, with a few flush retries listed before that.
I had the previous version running yesterday without any visible issue up to 2M flushes (but I might not have run the right actions).

It happened while surfing with firefox, some site doing fancy things with javascript and transparency for fade-in/out of images and the like.

In , Bonbons67 (bonbons67) wrote :

(In reply to comment #12)
> Unfortunately I haven't yet fixed the BUG_ON in put_pages. For some odd reason
> I can't reproduce it here anymore and reviewing the code hasn't revealed
> anything yet.

Seems I can reproduce it damn easily...

In Firefox on page http://habiter.luxweb.com/NR58686_Mersch_Vente_Maison.html
it's sufficient to click on the image thumbnails to hit the BUG_ON (running Firefox 3.6-r2 + Flash plugin 10.0.45.2 on Gentoo).
Hovering over the thumbnail images also seems to help trigger flush retries.

In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #13 from Bruno <email address hidden> 2010-03-23 05:22:10 PST ---
> Created an attachment (id=34362)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34362)
> BUG again, with flush retries before
>
> Daniel, here is the result of one run with your latest patch (attachment
> #34353) at the time it BUGed but also having a few flush retries listed before.
> I had the previous version running yesterday without any visible issue up to 2M
> flushes (but I might not have run the right actions)

Thanks for testing. Don't worry about the retries, that's an integral part
of the new chipset flush. As long as it's a small number, everything's
fine (I've gotten higher numbers than you while testing).

The important thing is whether you still get backtraces about failed
chipset flushes (there's none in the dmesg you attached). iirc the
previous patch still had problems there for you.

If the BUG is causing you too much grief while testing, undo the two
changes to drivers/gpu/drm/i915/i915_gem.c the patch makes. But that will
greatly reduce the number of chipset flushes, massively hampering
testing. Otherwise just check your dmesg for any "chipset flush failed"
messages and hit the reset knob ;)

Geir Ove Myhr (gomyhr)
description: updated
In , 2points (2points) wrote :

Created an attachment (id=34375)
dmesg with gtt flush v4 patch

dmesg summary: Complaints about chipset flush timeout, subsequent log entries show that the maximum number of retries has been reached. Eventual freeze.

In , 2points (2points) wrote :

(In reply to comment #16)
> dmesg summary: Complaints about chipset flush timeout

While scanning over relevant code, I also noticed this: It should probably be canary_gtt_read, canary_cpu_read in drivers/char/agp/intel-gtt.c:978 instead of canary_cpu_read, canary_cpu_read.

In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #17 from <email address hidden> 2010-03-23 12:18:05 PST ---
> (In reply to comment #16)
> > dmesg summary: Complaints about chipset flush timeout
>
> While scanning over relevant code, I also noticed this: It should probably be
> canary_gtt_read, canary_cpu_read in drivers/char/agp/intel-gtt.c:978 instead of
> canary_cpu_read, canary_cpu_read.

Yep, you're right. Fixed in my local version.

I've looked at your dmesg and the chipset flush clearly doesn't work for
your hw. Can you please try to hang the gpu again (with my latest patch)
and then capture i915_error_state from <debugfs>/dri/0. Just to make sure
that your gpu is crashing due to a cache coherency bug and not due to
something else (rather unlikely). Meanwhile I'll try to come up with a new
idea to fix your problem.
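
For testers who want to save that file right after a hang, here is a minimal sketch. It assumes debugfs is mounted at /sys/kernel/debug (the usual location, but check your own system), that the script runs with enough privileges to read the file, and that the hung machine is still reachable, e.g. over ssh. `save_error_state` is a hypothetical helper name, not part of any tool mentioned in this report.

```python
import time
from pathlib import Path

# Assumption: debugfs is mounted at /sys/kernel/debug; adjust if it is not.
DEFAULT_ERROR_STATE = Path("/sys/kernel/debug/dri/0/i915_error_state")


def save_error_state(dest_dir, source=DEFAULT_ERROR_STATE):
    """Copy the error state dump into dest_dir and return the new path."""
    source = Path(source)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest_dir / f"i915_error_state-{stamp}.txt"
    # Read the content explicitly instead of relying on the reported file
    # size, which is not meaningful for debugfs entries.
    target.write_text(source.read_text(errors="replace"))
    return target


if __name__ == "__main__":
    if DEFAULT_ERROR_STATE.exists():
        print("saved", save_error_state("."))
    else:
        print("no error state file found; is debugfs mounted?")
```

The timestamp in the file name keeps dumps from several hangs apart when more than one is attached to a report.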

In , 2points (2points) wrote :

Created an attachment (id=34376)
error state after freeze

Here you go.

Chris Wilson suggested elsewhere that this discrepancy between patch results on the same hardware might be related to chipset/hardware revision and corresponding quirks, so this may or may not explain the oddities. (lspci reports rev 02 for my GPU.)

In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34377)
new patch, improved retry flushing

New patch to (hopefully) tackle the problems uncovered by 2points. Please retest.

Bruno, I haven't yet found the problem with the BUG in put_pages. I'm hitting it about once every few days (with the same backtrace as you), but I haven't got a clue yet what's causing it.

In , Brian Rogers (brian-rogers) wrote :

Daniel, is this patch worth testing on i845 as well?

In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #21 from Brian Rogers <email address hidden> 2010-03-24 08:21:43 PST ---
> Daniel, is this patch worth testing on i845 as well?

Since Chris Wilson last tested this patch on his i845 (didn't work), the
patch has changed quite a bit. So it might be worth retesting it, if you
have the time to spare. I certainly appreciate any testing feedback I can
get, because every machine (even seemingly similar 855 boxes) seems to act
differently.

In , 2points (2points) wrote :

Created an attachment (id=34416)
dmesg with gtt flush v5 patch

I've made these observations so far: The number of reported failed chipset flushes has gone down drastically. In four test runs so far, I've spotted a chipset flush failure warning in only one. However, the GPU will still hang after a while, often without any flush-related warnings in dmesg.

I'd also like to add to the speculation that the failure rate might be related to CPU or possibly IO load. X runs a lot longer if I just leave glxgears running by itself, but as soon as I start doing some work (Kontact, Opera, Flash plugin etc.) a GPU freeze seems more likely.

In , 2points (2points) wrote :

Created an attachment (id=34417)
error state after freeze, v5

In , Chris Wilson (ickle) wrote :

Created an attachment (id=34418)
Clip solid fills

So I am working on the assumption that the residual hangs I am seeing after enabling/disabling the GMCH to force the GTT flush are real batch buffer bugs...

For instance http://bugs.freedesktop.org/attachment.cgi?id=34417 looks like we attempt to write well beyond the end of the buffer. In which case the attached should work around the issue.

In , 2points (2points) wrote :

Created an attachment (id=34419)
dmesg with gtt flush v5 patch

(In reply to comment #25)
> Created an attachment (id=34418) [details]
> Clip solid fills
>
> So I am working on the assumption that the residual hangs I am seeing after
> enabling/disabling the GMCH to force the GTT flush are real batch buffer bugs...

Thanks, doing some tests with git HEAD and this patch. As far as chipset flushes go, I'm at slightly over 4.5M flushes right now. Since this is about four times as long as the machine usually lasts without freezing, I'd conclude that the fix is pretty effective.

As for failed chipset flushes, dmesg records three occurrences now after about one hour of glxgears and various other tasks.

In , Geir Ove Myhr (gomyhr) wrote :

(In reply to comment #25)
> For instance http://bugs.freedesktop.org/attachment.cgi?id=34417 looks like we
> attempt to write well beyond the end of the buffer.

Could you explain how to see this from the file or intel_error_decode output? I'm trying to make some documentation for downstream on how to interpret the dumps.

Is it related to this? I'm not sure what those numbers mean, except that the first is obviously the start address of the batch buffer.
Buffers [13]:
...
  02821000 16384 00000048 00000000 000e354b dirty purgeable

In , legolas558 (legolas558) wrote :

I forgot to add myself to CC list, will now try latest patch!

In , legolas558 (legolas558) wrote :

Sorry, I have reported my findings on bug 26345:

- debugfs DRI dumps (see attachment 34436)
- dmesg with GTT failures (attachment 34437)

http://bugs.freedesktop.org/show_bug.cgi?id=26345#c76

Geir Ove Myhr (gomyhr)
description: updated
In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34469)
gtt chipset flush v6

Changes since v5:
- tuned magic values, hopefully fixing problems seen by 2points and legolas
- some debug checks trying to catch the put_pages BUG_ON problem while it's happening.

As usual, testing feedback highly welcome.

In , legolas558 (legolas558) wrote :

Created an attachment (id=34473)
dmesg of 2 failures with v6 patch

In , legolas558 (legolas558) wrote :

Created an attachment (id=34474)
debugfs dumps after 2 failures with v6 patch

I have just tested the v6 patch with 3 glxgears instances and got 2 flush failures; it is not possible to say whether anything changed noticeably compared with the previous patch.

In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #32 from legolas558 <email address hidden> 2010-03-26 03:19:48 PST ---
> Created an attachment (id=34474)
> --> (http://bugs.freedesktop.org/attachment.cgi?id=34474)
> debugfs dumps after 2 failures with v6 patch
>
> I have just tested v6 patch with 3 glxgears and I got 2 flush failures; it is
> not possible to state if anything sensibly changed when compared with previous
> patch

Indeed, not much changed at all. The real problem is reported in the first
backtrace (grep for "chipset flush timed out" in your dmesg). It
essentially means that the code has given up. All further failed flushes
are most likely just a result of this.

The problem seems to be writes to the gtt just don't show up at the cpu
side on your box. There's a tunable in my patch in the file
drivers/char/agp/intel-gtt.c

#define I830_GTT_MAX_RETRIES 100

Can you try out whether increasing this to a ridiculous number (like 1000)
helps? This might cause your machine to stall sometimes. As soon as you
get the chipset flush timed out message (and the max retries hits the
value you've defined), give up.
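
The patch itself is attached upstream and not reproduced in this report, so the following is only a toy model of what a bounded retry limit like this does, not the driver code. It polls a "canary" value until it matches the expected value or the retry budget runs out. `flush_with_retries` and `make_lagging_reader` are hypothetical names, and the lagging reader is a stand-in for hardware that becomes coherent only after some number of polls.

```python
MAX_RETRIES = 100  # same value as the I830_GTT_MAX_RETRIES define quoted above


def flush_with_retries(read_canary, expected, max_retries=MAX_RETRIES):
    """Poll read_canary() until it returns `expected` or retries run out.

    Returns (ok, retries_used); ok is False when the flush "timed out".
    """
    for attempt in range(max_retries + 1):  # first poll plus max_retries retries
        if read_canary() == expected:
            return True, attempt
    return False, max_retries


def make_lagging_reader(stale, fresh, polls_until_fresh):
    """Toy hardware model: return `stale` for the first N polls, then `fresh`."""
    state = {"polls": 0}

    def read():
        state["polls"] += 1
        return fresh if state["polls"] > polls_until_fresh else stale

    return read


if __name__ == "__main__":
    # A reader that needs 150 polls to catch up times out with 100 retries
    # but succeeds once the limit is raised to 1000.
    print(flush_with_retries(make_lagging_reader(5806, 5807, 150), 5807))
    print(flush_with_retries(make_lagging_reader(5806, 5807, 150), 5807,
                             max_retries=1000))
```

In this model raising the limit helps only when the hardware does eventually catch up; if it never does, a larger limit just means a longer stall before giving up.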

Thanks a lot for testing this stuff.

btw, has video stability increased further with v6?

In , legolas558 (legolas558) wrote :

(In reply to comment #33)
> Indeed, not much changed at all. The real problem is reported in the first
> backtrace (grep for "chipset flush timed out" in your dmesg). It
> essentially means that the code has given up. All further failed flushes
> are most likely just a result of this.
>
I have an uptime of 50 minutes (it's the same session of reported dump/dmesg files) and my ratio is still 2 / 229376 (using gttqual script in attachment 34435), so it is indeed as you said.

> The problem seems to be writes to the gtt just don't show up at the cpu
> side on your box. There's a tunable in my patch in the file
> drivers/char/agp/intel-gtt.c
>
> #define I830_GTT_MAX_RETRIES 100
>
> Can you try out whether increasing this to a ridiculous number (like 1000)
> helps? This might cause your machine to stall sometimes. As soon as you
> get the chipset flush timed out message (and the max retries hits the
> value you've defined), give up.
>
> Thanks a lot for testing this stuff.
>
Yes I am gonna make this test on next reboot and see what happens.

> btw, has video stability increased further with v6?
>
I would say definitely yes. I have tried all the videos which triggered the bug and had no crash yet.

Other important notes:
* I am using only the v6 patch on drm-intel from git and nothing else, not even the patched xf86-video-intel
* my kernel command line parameters are: lapic=yes hpet=force clocksource=hpet i8042.nomux=1

This hardware (Fujitsu Amilo clone) has a known problem with ACPI/i8042 controller; the i8042.nomux=1 is used to prevent touchpad glitches (http://bugzilla.kernel.org/show_bug.cgi?id=8740), while the battery, thermal and ac modules are always unloaded because otherwise this kernel bug would be triggered: http://bugzilla.kernel.org/show_bug.cgi?id=9147

Apart from these facts, I don't know anything else which could be relevant

In , legolas558 (legolas558) wrote :

Created an attachment (id=34476)
dmesg of early flush failure with 2000 retries

Bug triggered instantly, no noticeable (additional) slowdown. This looks like a dead end...

Shall I also attach the debugfs dri data? Shall I run any special test?

In , legolas558 (legolas558) wrote :

What if we "give up" on coherency of the first n flushes? Something like the screen test of arcade machines, except that we don't check whether the test is OK and just pretend that coherency becomes consistent after that. I have often found that the last graphics contents of the LVDS pipe persist across a reboot (when using a liveCD, for example), and you can actually see the last screen contents for a while before the screen gets properly cleared and initialized.

Sorry but I am shooting in the dark here

In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #36 from legolas558 <email address hidden> 2010-03-26 04:51:08 PST ---
> what if we "give up" on coherency of the first n flushes? something like the
> screen test of arcade machines, except that we just don't check if test is OK,
> pretending that coherency becomes consistent after that; I have often found
> that the last graphics contents of the LVDS pipe are persistent between a
> reboot (when using a liveCD, for example) and you can actually see the last
> screenshot a while before screen gets properly cleared and initialized.

That's exactly what my patch currently does - it simply gives up after too
many retries. Now if this corrupts a pixmap/texture, it just yields visual
corruptions on the screen. But this can also corrupt the gpu command
buffer (and some other vital things). And if the gpu reads crap from
these, it usually just hangs itself.

You've increased max_retries to 2000, which equals about 1 ms of delay.
And it hasn't helped at all, i.e. the chipset takes probably even longer
to reach a coherent state again. And 1 ms is an eternity for computer hw,
so this will crash your box - sooner or later.
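
The arithmetic behind that estimate can be spelled out. Taking the figure above (2000 retries is roughly 1 ms) at face value gives about 0.5 µs per retry; the sketch below derives that and applies it to the other retry limits mentioned in this thread. The per-retry cost is only implied by the estimate, not measured, and the function names are hypothetical.

```python
def per_retry_delay_us(retries: int, total_delay_ms: float) -> float:
    """Delay per retry in microseconds, given a total delay in milliseconds."""
    return total_delay_ms * 1000.0 / retries


def worst_case_delay_ms(retries: int, per_retry_us: float) -> float:
    """Worst-case stall in milliseconds when every retry is used up."""
    return retries * per_retry_us / 1000.0


if __name__ == "__main__":
    per_retry = per_retry_delay_us(2000, 1.0)  # from "2000 retries ~ 1 ms"
    for limit in (100, 1000, 2000):
        print(limit, "retries:", worst_case_delay_ms(limit, per_retry), "ms")
```

Under that assumption the default limit of 100 retries corresponds to a worst-case stall of about 0.05 ms per flush, and 1000 retries to about 0.5 ms.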

> Sorry but I am shooting in the dark here

We all are ;) But there are some more constants to tune, this time in
include/drm/intel-gtt.h

#define I830_CC_GTT_WHACK_PAGES 16

Try to increase this (doubling it each step is sensible, the algo only
uses as much as required, this is just an upper bound). But don't go above
128, that'd be crazy (and I would have to figure out a new trick).
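
As a small illustration of that test plan, the sketch below lists the values to try when doubling from the default of 16 up to the suggested ceiling of 128. The memory figure assumes 4 KiB pages, which is an assumption about what a "whack page" is and is not stated in this thread; `doubling_steps` is a hypothetical helper name.

```python
ASSUMED_PAGE_SIZE = 4096  # bytes; assumption, not taken from the patch


def doubling_steps(start=16, limit=128):
    """Return the values reached by doubling `start` without exceeding `limit`."""
    values = []
    value = start * 2
    while value <= limit:
        values.append(value)
        value *= 2
    return values


if __name__ == "__main__":
    for pages in doubling_steps():
        kib = pages * ASSUMED_PAGE_SIZE // 1024
        print(f"I830_CC_GTT_WHACK_PAGES = {pages}  (~{kib} KiB if pages are 4 KiB)")
```

With the defaults this yields three kernel builds to test: 32, 64 and 128.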

btw, dmesg is usually enough - I'll ask if I need anything else.

In , legolas558 (legolas558) wrote :

(In reply to comment #37)
> > --- Comment #36 from legolas558 <email address hidden> 2010-03-26 04:51:08 PST ---
> > what if we "give up" on coherency of the first n flushes? something like the
> > screen test of arcade machines, except that we just don't check if test is OK,
> > pretending that coherency becomes consistent after that; I have often found
> > that the last graphics contents of the LVDS pipe are persistent between a
> > reboot (when using a liveCD, for example) and you can actually see the last
> > screenshot a while before screen gets properly cleared and initialized.
>
> That's exactly what my patch currently does - it simply gives up after too
> many retries. Now if this corrupts a pixmap/texture, it just yields visual
> corruptions on the screen. But this can also corrupt the gpu command
> buffer (and some other vital things). And if the gpu reads crap from
> these, it usually just hangs itself.
>
Yes I can understand this; there's always a state machine behind the scenes.

> You've increased max_retries to 2000, which equals to about 1ms of delay.
> And it hasn't helped at all, i.e. the chipset takes probably even longer
> to reach a coherent state again. And 1 ms is an eternity for computer hw,
> so this will crash your box - sooner or later.
>
I have done some other tests and it seems that 1000 or 2000 is high enough to *never* cause a failure with mild usage, while if I make 2 glxgears windows have a clipping rectangle, the failure happens immediately. Might this help? Perhaps OpenGL is altering the GPU in some way that we cannot foresee?

> > Sorry but I am shooting in the dark here
>
> We all are ;) But there are some more constants to tune, this time in
> include/drm/intel-gtt.h
>
> #define I830_CC_GTT_WHACK_PAGES 16
>
> Try to increase this (doubling it each step is sensible, the algo only
> uses as much as required, this is just an upper bound). But don't go above
> 128, that'd be crazy (and I would have to figure out a new trick).
>
> btw, dmesg is usually enough - I'll ask if I need anything else.
>
Ok, thank you Daniel, I will revert the retries to 1000 and run the next 3 possible tests with the whack pages constant.

Can we say that the pre-KMS driver worked well enough because the bug was harder to trigger in those conditions? I can say I experienced lockups when watching videos or when shutting down even with the pre-KMS/Xorg1.6 combo, but it was rare.

In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #38 from legolas558 <email address hidden> 2010-03-26 07:08:31 PST ---
> Can we state that the pre-KMS driver was working good enough because bug was
> harder to trigger in those conditions? I can say I experienced lockups when
> watching videos or when shutting down even with the pre-KMS/Xorg1.6 combo, but
> it was rare.

Yep, the cache coherency bug was most likely always there. kms code tends
to be faster in certain circumstances and therefore tends to hit cache
coherency problems with a higher probability. That's also why Eric's patch
from half a year ago managed to break a few i855 chipsets by slightly
changing the timings. But that's just coincidental because that piece of
code helps cache coherency on i865 chipsets.

btw, I've tried your two-glxgears-with-clipping test. No ill effects, here
...

In , legolas558 (legolas558) wrote :

Created an attachment (id=34479)
dmesg of early failure with 1000 retries, 128 whack pages

(In reply to comment #39)
> > --- Comment #38 from legolas558 <email address hidden> 2010-03-26 07:08:31 PST ---
> btw, I've tried your two-glxgears-with-clipping test. No ill effects, here
> ...
>

I have tested with 32, 64, 128 whack pages and I have found the following:

- with 32/64 whack pages, moving around one glxgears window was enough to trigger the flush failure
- with 128 whack pages, moving around didn't seem to work but a resize of the glxgears window triggered the bug
- the bug becomes progressively harder to trigger as whack pages increase (qualitative impression)

In all cases (even 16 whack pages and/or 1000/2000 retries), no more than 2 failures are found in dmesg (because as you said it gives up after that).

I am worried by the fact that our hardware, apparently the same, is not showing the same behaviour... my .config is here:

http://www.iragan.com/linux/i855GM/legolas558.config

Geir Ove Myhr (gomyhr)
description: updated
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #40 from legolas558 <email address hidden> 2010-03-26 07:49:31 PST ---
> In all cases (even 16 whack pages and/or 1000/2000 retries), no more than 2
> failures are found in dmesg (because as you said it gives up after that).

I've overlooked this, but now that I've checked, this is _very_ curious.
With v6 you only ever see 2 chipset flush failures, no matter how hard you
abuse your machine?

With the three dmesgs you've posted, these two failures are always in the
same chipset flush, just opposite directions (gtt->cpu and cpu->gtt
transfers). They'll also coincide with the chipset flush timed out
message. Can you please check that this is indeed the case (with the other
dmesgs you've got lying around) with the other test runs, too? Just
compare the "expected: xxx" value on each of the three backtraces.

This is strange because my code only gives up on the _current_ chipset
flush and doesn't bother to report any further timeouts. It still executes
all chipset flushes and still reports about failed ones. So if your hw
only ever reports one failure where everything fails (timeout+paranoia
check failures in both directions) and never fails again, this would be
_very_ strange indeed.

> I am worried about this fact that our hardware, apparently the same, is not
> showing same behaviour...my .config is here:

I've compared our configs and tried changing a few relevant ones to your
setting. Still can't reproduce your failures.

In , legolas558 (legolas558) wrote :

(In reply to comment #41)
> > --- Comment #40 from legolas558 <email address hidden> 2010-03-26 07:49:31 PST ---
> > In all cases (even 16 whack pages and/or 1000/2000 retries), no more than 2
> > failures are found in dmesg (because as you said it gives up after that).
>
> I've overlooked this, but now that I've checked, this is _very_ curious.
> With v6 you only ever see 2 chipset flush failures, no matter how hard you
> abuse your machine?
>
Yes. I have never seen more than 2 since I started using the v6 patch, but I might be wrong because I never did more than 300k flushes in a session with a v6-patched kernel.

> With the three dmesgs you've posted, these two failures are always in the
> same chipset flush, just opposite directions (gtt->cpu and cpu->gtt
> transfers). They'll also coincide with the chipset flush timed out
> message. Can you please check that this is indeed the case (with the other
> dmesgs you've got lying around) with the other test runs, too? Just
> compare the "expected: xxx" value on each of the three backtraces.
>
Yes, you can also see it with v5 patch dmesg in attachment 34233

From my dmesg logs:
~~ session1 - v6 patch
[ 79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
[ 79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806
~~ session2 - v6 patch
[ 101.807650] i8xx chipset flush failed, expected: 14194, cpu_read: 14193
[ 101.807844] i8xx chipset flush failed, expected: 14194, gtt_read: 14193
~~ session3 - v5 patch
[ 2832.905107] i8xx chipset flush failed, expected: 113457, cpu_read: 113456
[ 2832.905315] i8xx chipset flush failed, expected: 113457, gtt_read: 113456
[ 2910.626579] i8xx chipset flush failed, expected: 215361, cpu_read: 215360
[ 2910.626872] i8xx chipset flush failed, expected: 215361, gtt_read: 215360
[ 2977.424469] i8xx chipset flush failed, expected: 308976, cpu_read: 308975
[ 2977.424746] i8xx chipset flush failed, expected: 308976, gtt_read: 308975
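The pairing can also be checked mechanically. A minimal sketch, assuming the failure lines always have the format quoted above; the helper name `group_failures` is made up for illustration and is not part of any patch:

```python
import re
from collections import defaultdict

# Matches lines such as:
# [ 79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
FAILED = re.compile(
    r"i8xx chipset flush failed, expected: (\d+), (cpu_read|gtt_read): (\d+)"
)

def group_failures(dmesg_text):
    """Map each expected value to the set of directions that failed for it."""
    groups = defaultdict(set)
    for match in FAILED.finditer(dmesg_text):
        expected, direction, _read = match.groups()
        groups[int(expected)].add(direction)
    return dict(groups)

# The session3 lines from the excerpt above.
sample = """\
[ 2832.905107] i8xx chipset flush failed, expected: 113457, cpu_read: 113456
[ 2832.905315] i8xx chipset flush failed, expected: 113457, gtt_read: 113456
[ 2910.626579] i8xx chipset flush failed, expected: 215361, cpu_read: 215360
[ 2910.626872] i8xx chipset flush failed, expected: 215361, gtt_read: 215360
[ 2977.424469] i8xx chipset flush failed, expected: 308976, cpu_read: 308975
[ 2977.424746] i8xx chipset flush failed, expected: 308976, gtt_read: 308975
"""

for expected, directions in sorted(group_failures(sample).items()):
    print(expected, sorted(directions))
```

Every expected value that shows up with both `cpu_read` and `gtt_read` is one chipset flush failing in both directions.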

I am gonna make more intensive tests later.

> This is strange because my code only gives up on the _current_ chipset
> flush and doesn't bother to report any further timeouts. It still executes
> all chipset flushes and still reports about failed ones. So if your hw
> only ever reports one failure where everything fails (timeout+paranoia
> check failures in both directions) and never fails again, this would be
> _very_ strange indeed.
>
Occam would say: perhaps it didn't fail at all and we are just not being informed correctly.

My raw guess is that some buddy between us and the GPU is touching something that shouldn't, and I am inclined to always blame the i8042 controller since I am already experiencing keyboard ports corruption when the battery ACPI is being used. But it is hard to link i8042 and the GPU (and the modules which cause the i8042 glitch for keyboard are never loaded), so I am still out of bullets.

> > I am worried about this fact that our hardware, apparently the same, is not
> > showing same behaviour...my .config is here:
>
> I've compared our configs and tried changing a few relevant ones to your
> setting. Still can't reproduce your failures.
>
As already stated, I am not using "clip sol...


Revision history for this message
In , 2points (2points) wrote :

Created an attachment (id=34497)
dmesg with gtt flush v6 patch

Intermediate results for v6: No reported failures after 3.5M flushes, the number of maximum retries seems to have gone back slightly since v5 (now at 7, down from 9)

Meanwhile, I upgraded Xorg from 1.6.5 to 1.7.4 and noticed that the GPU hangs almost instantly once X is started up. Could attach relevant error state files, but it's probably not directly related to this bug, since no messages concerning failed flushes turn up in dmesg. Gone back to 1.6.5, and the problem disappeared (with the clip solid fills patch, that is).

> As already stated, I am not using "clip solid fills" patch, if that might be
relevant, but I doubt.
Maybe you should. It was already merged in git, too.

Revision history for this message
In , legolas558 (legolas558) wrote :

Created an attachment (id=34498)
dmesg of multiple flush failure with v6 patch

I was wrong. Playing with 4-5 glxgears windows finally triggered the bug, so the situation is unchanged compared to the v5 patch (except that failures happen less frequently).

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #43)
> Created an attachment (id=34497) [details]
> dmesg with gtt flush v6 patch
>
> Intermediate results for v6: No reported failures after 3.5M flushes, the
> number of maximum retries seems to have gone back slightly since v5 (now at 7,
> down from 9)
>
> Meanwhile, I upgraded Xorg from 1.6.5 to 1.7.4 and noticed that the GPU hangs
> almost instantly once X is started up. Could attach relevant error state files,
> but it's probably not directly related to this bug, since no messages
> concerning failed flushes turn up in dmesg. Gone back to 1.6.5, and the problem
> disappeared (with the clip solid fills patch, that is).
>
Wait, are you using Xorg 1.6.5 with KMS and the most recent intel driver? Is that possible?

I have always had the X hangup issue with 1.7.x series of Xorg. Right now I am using Xorg 1.7.5.902 and it crashes only when playing videos or using intensive graphics applications (wine).

> > As already stated, I am not using "clip solid fills" patch, if that might be relevant, but I doubt.
> Maybe you should. It was already merged in git, too.
>
Problem is that my freedesktop git stack doesn't work, so I don't know how to get a patched xf86-video-intel

Revision history for this message
In , Bonbons67 (bonbons67) wrote :

On my side, nothing interesting to report, except maybe that (though I've not done aggressive tests, mostly just usual desktop use) for one-day-long sessions I've either hit the BUG_ON (not yet with the v6 patch) or just seen the number of retried flushes increase occasionally (for today, v6):
...
[ 384.945981] chipset flush no. 16384, max retries 0
[ 554.507090] chipset flush no. 32768, max retries 0
[ 686.938325] chipset flush no. 49152, max retries 0
[ 946.820082] chipset flush no. 65536, max retries 0
[ 1531.186977] chipset flush no. 81920, max retries 0
[ 2157.786320] chipset flush no. 98304, max retries 0
...
[21448.487379] chipset flush no. 1097728, max retries 0
[21910.290869] chipset flush no. 1114112, max retries 0
[22345.259438] chipset flush no. 1130496, max retries 0
[22858.071707] chipset flush no. 1146880, max retries 1
[23118.534467] chipset flush no. 1163264, max retries 1
[23299.123857] chipset flush no. 1179648, max retries 1
...

with 4 being the highest number seen.
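To pull the points where the retry count goes up out of a long dmesg, something like the following could be used. This is only a sketch, assuming the progress lines keep the "chipset flush no. N, max retries M" format shown above; `retry_changes` is a hypothetical helper name:

```python
import re

# Matches lines such as:
# [22858.071707] chipset flush no. 1146880, max retries 1
PROGRESS = re.compile(r"chipset flush no\. (\d+), max retries (\d+)")

def retry_changes(dmesg_text):
    """Return (flush_no, max_retries) for each line where max retries rose."""
    changes = []
    highest = -1  # lower than any real retry count, so the first line is kept
    for match in PROGRESS.finditer(dmesg_text):
        flush_no, retries = int(match.group(1)), int(match.group(2))
        if retries > highest:
            changes.append((flush_no, retries))
            highest = retries
    return changes

sample = """\
[22345.259438] chipset flush no. 1130496, max retries 0
[22858.071707] chipset flush no. 1146880, max retries 1
[23118.534467] chipset flush no. 1163264, max retries 1
"""

print(retry_changes(sample))
```

The first entry is simply the first progress line seen; every later entry marks a flush count at which the reported maximum increased.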

Revision history for this message
In , legolas558 (legolas558) wrote :

If I don't open any glxgears and use the laptop for browsing and editing files, my max retries only reaches 6 at max.

When I open glxgears and resize it till flush failure, max retries count is updated to 1000 in next "chipset flush no." line.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #42 from legolas558 <email address hidden> 2010-03-26 14:49:10 PST ---
> From my dmesg logs:
> ~~ session1 - v6 patch
> [ 79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
> [ 79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806
> ~~ session2 - v6 patch
> [ 101.807650] i8xx chipset flush failed, expected: 14194, cpu_read: 14193
> [ 101.807844] i8xx chipset flush failed, expected: 14194, gtt_read: 14193
> ~~ session3 - v5 patch
> [ 2832.905107] i8xx chipset flush failed, expected: 113457, cpu_read: 113456
> [ 2832.905315] i8xx chipset flush failed, expected: 113457, gtt_read: 113456
> [ 2910.626579] i8xx chipset flush failed, expected: 215361, cpu_read: 215360
> [ 2910.626872] i8xx chipset flush failed, expected: 215361, gtt_read: 215360
> [ 2977.424469] i8xx chipset flush failed, expected: 308976, cpu_read: 308975
> [ 2977.424746] i8xx chipset flush failed, expected: 308976, gtt_read: 308975

Yet again I was totally blind. All your failed flushes report an actual
value that's only one off from the expected one. But since v2 I'm moving
around the check value on the check page, so each position is only used
every 1024th cache flush. Which means that if the flush doesn't work and
the old value is still there, it should be "expected_value - 1024".
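The "expected_value - 1024" reasoning can be reproduced with a toy model. This is a sketch only: it assumes the check counter goes up by one per flush and that the slot index is simply the counter modulo 1024, which may not be how the patch actually picks the position:

```python
SLOTS = 1024  # positions on the check page, per the description above

def read_after_missed_flush(expected, slots=SLOTS):
    """Value found in the slot for `expected` if that one write never landed.

    Hypothetical model: value v is written to slot v % slots, and every
    write before `expected` reached memory.
    """
    page = [0] * slots
    for value in range(1, expected):
        page[value % slots] = value
    # The write of `expected` itself is lost, so its slot still holds
    # whatever was stored there one full rotation earlier.
    return page[expected % slots]

for expected, seen in [(5807, 5806), (14194, 14193)]:
    stale = read_after_missed_flush(expected)
    print(f"expected {expected}: stale slot would hold {stale}, log shows {seen}")
```

Under this model a flush that merely did not land leaves a value 1024 behind, not 1 behind as in the logs quoted above.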

Furthermore your system seems to be the only one where chipset flushes
fail in pairs (always both directions in the same chipset flush). I
haven't seen this on any other dmesg neither by me nor by any other
tester.

In other words I highly suspect that something is (very rarely) corrupting
the last two bits of a 4 byte block. This would also explain why the
correct value never shows up, even after extensive gtt whacking.

Please test your box with memtest86+. If that doesn't turn anything up
I'll write a testpatch (memtest86+ doesn't check the gtt, wherein the
problem might be, too).

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #48)
> > --- Comment #42 from legolas558 <email address hidden> 2010-03-26 14:49:10 PST ---
> > From my dmesg logs:
> > ~~ session1 - v6 patch
> > [ 79.983513] i8xx chipset flush failed, expected: 5807, cpu_read: 5806
> > [ 79.983771] i8xx chipset flush failed, expected: 5807, gtt_read: 5806
> > ~~ session2 - v6 patch
> > [ 101.807650] i8xx chipset flush failed, expected: 14194, cpu_read: 14193
> > [ 101.807844] i8xx chipset flush failed, expected: 14194, gtt_read: 14193
> > ~~ session3 - v5 patch
> > [ 2832.905107] i8xx chipset flush failed, expected: 113457, cpu_read: 113456
> > [ 2832.905315] i8xx chipset flush failed, expected: 113457, gtt_read: 113456
> > [ 2910.626579] i8xx chipset flush failed, expected: 215361, cpu_read: 215360
> > [ 2910.626872] i8xx chipset flush failed, expected: 215361, gtt_read: 215360
> > [ 2977.424469] i8xx chipset flush failed, expected: 308976, cpu_read: 308975
> > [ 2977.424746] i8xx chipset flush failed, expected: 308976, gtt_read: 308975
>
In the session3, v5 might be v2 actually.

> Yet again I was totally blind. All your failed flushes report an actual
> value that's only one off from the expected one. But since v2 I'm moving
> around the check value on the check page, so each position is only used
> every 1024th cache flush. Which means that if the flush doesn't work and
> the old value is still there, it should be "expected_value - 1024".
>
Well I hope this will be useful to improve the patch.

> Furthermore your system seems to be the only one where chipset flushes
> fail in pairs (always both directions in the same chipset flush). I
> haven't seen this on any other dmesg neither by me nor by any other
> tester.
>
Yes, I admit I feel lonely recently...it would be nice to find another guy with my exact hardware.

My only custom option for intel driver in xorg.conf is:
Option "XvMC" "true"

but I doubt this could be relevant.

> In other words I highly suspect that something is (very rarely) corrupting
> the last two bits of a 4 byte block. This would also explain why the
> correct value never shows up, even after extensive gtt whacking.
>
I have tried booting with acpi=off, but seems that KMS depends on ACPI. I can only think that some ACPI or "gone wild" IRQ is causing the corruption, or that there is a broken GTT memory cell as you hypothesized.

> Please test your box with memtest86+. If that doesn't turn anything up
> I'll write a testpatch (memtest86+ doesn't check the gtt, wherein the
> problem might be, too).
>
I completed 2 passes (ECC off) with memtest86+ v4.00 and no errors were found in my 503M (I suppose the missing memory is shadowed). So the corruption might lie in the GTT area, but I don't know how to test that...and if I have understood correctly, the i855GM does not make this kind of consistency check easy. I am waiting for your testpatch because unfortunately I am far from being able to write such a GTT testpatch myself.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34505)
memory check patch for legolas

legolas, please apply this patch on top of v6. When a flush fails, this will read back the check values written to system mem (cpu) and gtt via the same path as they have been written to and print them out. If the readback value equals the expected value (look at the chipset fail message right before), everything is fine. If they equal the values as read on the other side (i.e. gtt readback = cpu read), something is corrupting memory when (ab)using the gtt.

To test just stress your system until you get a cache flush failure.
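The two cases described above can be written down as a small decision helper. This is a sketch of the interpretation only, not code from the patch; the function name and the example numbers are made up for illustration:

```python
def classify_readback(expected, other_side_read, readback):
    """Interpret one readback value printed after a failed flush.

    expected        -- value from the "chipset flush failed" line
    other_side_read -- value read through the opposite path (e.g. gtt_read
                       for a value that was written via the cpu)
    readback        -- value read back via the same path it was written to
    """
    if readback == expected:
        # The write is fine; only the other side saw stale data.
        return "write landed, other side read stale data"
    if readback == other_side_read:
        # Both paths agree on the wrong value, so memory itself is off.
        return "memory holds the wrong value"
    return "inconclusive"

# Hypothetical numbers, not taken from any attached dmesg.
print(classify_readback(expected=100, other_side_read=99, readback=100))
print(classify_readback(expected=100, other_side_read=99, readback=99))
```

The first call corresponds to the "everything is fine" case, the second to the case where something is corrupting memory.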

Revision history for this message
In , legolas558 (legolas558) wrote :

Created an attachment (id=34506)
dmesg of failure with v6 patch + gtt/cpu readback info

mumble mumble...

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #51)
> Created an attachment (id=34506) [details]
> dmesg of failure with v6 patch + gtt/cpu readback info
>
An excerpt from the above file:

[ 85.216591] i8xx chipset flush failed, expected: 5031, gtt_read: 5029
[ 85.216771] gtt readback: 5029, cpu reaback: 5029
[ 85.231559] chipset flush timed out, gtt_read: 5031, cpu_read: 5031, expected: 5032, gtt_pos :2421, cpu_pos: 0
[ 85.231854] gtt readback: 5031, cpu reaback: 5031
[ 113.151457] gtt readback: 20845, cpu reaback: 20845
[ 116.806192] gtt readback: 23135, cpu reaback: 23135

I'd say it's the second case, and narrowing down the cause seems scary. Some notes regarding patch v6:

1) there are some lags (even mouse cursor hiccups), this might be normal
2) the ability to switch to VT when system fatally crashes (e.g. video failure or other program failure, with hangcheck error or Xorg I/O errors) is gone
3) if I don't open glxgears and just use thunderbird, firefox and XFCE normally, no flush failures are ever reported in dmesg. Otherwise if I open glxgears and play with it a bit, I get the 2 failures (getting more is more difficult but indeed possible as shown before)

Regarding (2): drm and i915 modules are loaded automatically by Xorg and not loaded during kernel boot-up. I might try compiling drm+i915 as built-in, and perhaps this will fix the weird GTT corruption I am experiencing.

So it happens that after quite a while of normal usage I look at dmesg and I find no failure; in this status I have to expect a sudden (unrecoverable) crash from time to time, which requires hard shutdown, and that has almost become my usual way of closing the session (50% crashes, 50% normal I'd say).

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #52)
> Regarding (2): drm and i915 modules are loaded automatically by Xorg and not
> loaded during kernel boot-up. I might try compiling drm+i915 as built-in, and
> perhaps this will fix the weird GTT corruption I am experiencing.
>
Bah, it didn't work. I have just compiled drm and i915 as built-in and it creates almost the same output:

[ 71.198038] chipset flush timed out, gtt_read: 0, cpu_read: 3932, expected: 3933, gtt_pos :1, cpu_pos: 1
[ 71.198389] i8xx chipset flush failed, expected: 3933, gtt_read: 3932
[ 71.198533] gtt readback: 3932, cpu reaback: 3932
[ 71.521156] gtt readback: 4138, cpu reaback: 4138
[ 74.132310] gtt readback: 5803, cpu reaback: 5803
[ 75.552605] gtt readback: 6813, cpu reaback: 6813
[ 89.246701] gtt readback: 15633, cpu reaback: 15633
[ 95.692300] gtt readback: 20097, cpu reaback: 20097

Revision history for this message
In , Tony W (tonywhite100) wrote :

Hi guys,
Really appreciate the work you guys have been doing to try to fix this issue. And I truly mean that. This is a horrible issue.

If you haven't read it, the intel data sheet for this graphics card is here :
http://www.intel.com/Assets/PDF/datasheet/252615.pdf

And it does make for a reasonably interesting read when you consider it from the aspect of trying to hunt down what's causing this problem, although needle - haystack, much?

The SMM space restrictions look like a place of interest to me and also the very liberal way in which they have allowed bios manufacturers choose certain things related to the address registers.
Because it's Centrino technology the 855 is like a three in one deal, bios, 855gm chip and processor, all linked into together to render the graphics to screen and it looks like bios manufacturers have done whatever they thought was the best way to make that combo work, so any number of 855gm cards can work any number of different weird and wonderful ways using each possibly unique bios implementation. I've certainly seen evidence of that on my 855GM.
I have used Linux on the machine with its original bios and then with the latest bios from the Asus website, and the two bios versions behave differently.
The first one only required the nolapic parameter to boot, while the latest one requires mem=1001M (but the memory is supposed to be 1024M and memtest says 1016M).
Without the mem parameter, the kernel boots but very slowly. With it, it is fine.
My wild stab in the dark here is that there is an undetected memory hole and that's causing the problem. The actual memory modules are fine.

As far as what you guys have been testing, I have experienced the same thing in regards to the symptoms. It will work, I can use a browser, watch flash video and it all seems fine but after an hour, it will lock up and need a manual power down to restore a working system.

If it is the case that it is the memory and more specifically the memory buffer which is causing the problem because the buffer is filling up and it is not being flushed in time, does the card not use any sort of compression to compress any parts of the buffer which would require a different type of flush to empty? (Multiple overlay?)
Could it be that because bios manufacturers have had such liberal choices on their bios implementations with this card that the memory addresses to flush are being detected incorrectly or could it be that the flush is trying to flush a part of the memory which it is not allowed to (Maybe because it's detected the addresses incorrectly) And that in turn triggers some kind of stop in the hardware, which prevents any further flushes.

I am of course clutching at straws, my knowledge is limited and it would be a tall order for me to learn C, fork the code, go through the data sheet and write a proper driver for the Linux kernel for this card. Although I would dearly love to do that.

I wish you guys luck on fixing this problem. You have made some impressive improvements so far compared to the way it was and in its current form, the driver using your patches is so very nearly close to being sui...


Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #54 from Tony White <email address hidden> 2010-03-29 14:12:51 PST ---
> Hi guys,
> Really appreciate the work you guys have been doing to try to fix this issue.
> And I truly mean that. This is a horrible issue.
>
> If you haven't read it, the intel data sheet for this graphics card is here :
> http://www.intel.com/Assets/PDF/datasheet/252615.pdf

I know about this specsheet and I've read through it already a few weeks
ago. It only contains detailed feature descriptions of the gmch plus a few
non-graphics-core related register definitions. There's nothing in there
that could give us a hint as to how gtt<->cpu caches work and how to fix
it.

If you want to help, please test my latest patch (v6) and report how it
fares on your machine.

To everyone else who has already tested this: Thanks a lot. Small summary
on the state of i855 cache coherency (please correct me if I'm wrong):

- Latest version works on three boxes (from Bruno, 2points and mine).
- It also seems to work on legolas' box, but that machine has some other
  issue. Pardon for being the messenger, but looks like your machine is
  toast :(
- The most likely unrelated BUG in put_pages hasn't surfaced again.

I won't submit this as is anytime soon for two reasons:
- The code is still rather ugly atm. I need to clean it up.
- This patch is a way too horrible hack to submit it for inclusion with a
  straight face.

To fix the latter, I want some more test reports. Given how many bug
reports downstream gathered, it shouldn't be a problem to find more
testers (distro maintainers: hint, hint). Please only test on i855
chipsets, i845 still seems to have some problems. If it works (check the
dmesg for chipset flush related backtraces), please add your tested-by
line with a small blurb about your machine to this bug report, like this:

Tested-by: Daniel Vetter <email address hidden> (IBM Thinkpad X40)

If it doesn't work, hit me with your dmesg and i915_error_state output ;)

Bruno, legolas, 2points, please add your tested-by, too.

Revision history for this message
In , Geir Ove Myhr (gomyhr) wrote :

(In reply to comment #55)
> - This patch is a way too horrible hack to submit it for inclusion with a
> straight face.
>
> To fix the latter, I want some more test reports. Given how many bug
> reports downstream gathered, it shouldn't be a problem to find more
> testers (distro maintainers: hint, hint).

I have been meaning to build test kernels for Ubuntu users for a while, but the Ubuntu wiki documentation for building kernel packages [1] that I have used before seems to not work that well anymore. Btw, which kernel version is it best to patch? A recent drm-intel-next or 2.6.34-rc2? On -rc2 I get problems with intel-agp.c:

gomyhr@storhaugen:~/src/linux-2.6.34-rc2$ patch -p1 --dry-run <../fix-i8xx-gtt-cache-coherency-v6.patch
patching file drivers/char/agp/Makefile
patching file drivers/char/agp/agp.h
patching file drivers/char/agp/efficeon-agp.c
patching file drivers/char/agp/intel-agp-gart.c
patching file drivers/char/agp/intel-agp.c
Reversed (or previously applied) patch detected! Assume -R? [n]
Apply anyway? [n]
Skipping patch.
1 out of 1 hunk ignored -- saving rejects to file drivers/char/agp/intel-agp.c.rej
patching file drivers/char/agp/intel-agp.h
patching file drivers/char/agp/intel-gtt.c
patching file drivers/gpu/drm/i915/i915_dma.c
patching file drivers/gpu/drm/i915/i915_gem.c
patching file include/drm/intel-gtt.h

and I get the same with drm-intel-next. Should I just delete this file, since that is what the patch seems to do? (I tried this, and the build seemed to work, but then some of the Ubuntu-packaging failed).

[1]: https://wiki.ubuntu.com/KernelTeam/GitKernelBuild

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #56 from Geir Ove Myhr <email address hidden> 2010-03-30 03:59:57 PST ---
> I have been meaning to build test kernels for Ubuntu users for a while, but
> the Ubuntu wiki documentation for building kernel packages [1] that I have used
> before seems to not work that well anymore. Btw, which kernel version is it
> best to patch? A recent drm-intel-next or 2.6.34-rc2? On -rc2 I get problems
> with intel-agp.c:

Sorry for the confusion. Patch is actually based upon rc1, but that one's
missing some recent intel bugfixes. I'll rebase that thing on something
more current later today.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #55)
> To everyone else who has already tested this: Thanks a lot. Small summary
> on the state of i855 cache coherency (please correct me if I'm wrong):
>
> - Latest version works on three boxes (from Bruno, 2points and mine).
> - It also seems to work on legolas' box, but that machine has some other
> issue. Pardon for being the messenger, but looks like your machine is
> toast :(

I'd be more than happy to mark it as "toasted" and go on, but then I wouldn't be able to use it with WindowsXP (never had a crash there) nor with Xorg 1.6 (which works, but only because, as you said, the driver hits the cache less frequently).

Has Intel ever released the WindowsXP driver sources? Yes, I know..just dreaming..

Is there any other way that could explain the GTT<->CPU bug?

I'll add my Tested-By with next patch, since this one doesn't seem to be in-sync with latest drm-intel

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #58 from legolas558 <email address hidden> 2010-03-30 05:47:52 PST ---
> I'd be more than happy to mark it as "toasted" and go on, but then I wouldn't
> be able to use it with WindowsXP (never had a crash there) and neither with
> Xorg 1.6 (that works but because as you said the driver hits less frequently
> the cache).

The patch I intend to submit hopefully works better, too. By killing all
the gtt stress-testing hacks I've added your box is probably on par with
the other solutions.

> Has Intel ever released the WindowsXP driver sources? Yes, I know..just
> dreaming..

Likely won't help. The i8xx chipsets were designed without a kernel memory
manager in mind (Windows only gained that with Vista). So the XP driver
probably just implements a static gtt (that doesn't need any chipset
flushes) and copies textures back and forth. That works, but performance
will suck, especially with kernel-managed graphics memory allocation,
where spills happen rather often.

In other words, we're coaxing these chipsets into a framework they're not
designed for (but which is the only sane thing to do considering modern
graphics apis), trying to paper over any hw deficiencies with horrible
hacks like mine.

> Is there any other way that could explain the GTT<->CPU bug?

As long as there's no other report of the same problem, hw flakiness is
the only likely option. After all it only happens when hitting the gtt
really hard, something XP (and the old ums driver) are not likely to do.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #59)
> > --- Comment #58 from legolas558 <email address hidden> 2010-03-30 05:47:52 PST ---
> > I'd be more than happy to mark it as "toasted" and go on, but then I wouldn't
> > be able to use it with WindowsXP (never had a crash there) and neither with
> > Xorg 1.6 (that works but because as you said the driver hits less frequently
> > the cache).
>
> The patch I intend to submit hopefully works better, too. By killing all
> the gtt stress-testing hacks I've added your box is probably on par with
> the other solutions.
>
Yes, because I am led to think that the sudden crash happening later in time (even after 1 hour of uptime) is not related to these rarely happening cache failures (easily verified with glxgears, but not otherwise during normal usage). Fact is that the Xorg total crash will happen even if no cache failures have happened yet...so perhaps that's another bug, but this patch is still necessary even as-is.

Are you planning to submit the patch here before sending it upstream?

Anyway my signature is:

Tested-by: Daniele Castellitto <email address hidden> (Maxdata Pro 7000X)

> > Has Intel ever released the WindowsXP driver sources? Yes, I know..just
> > dreaming..
>
> Likely won't help. The i8xx chipsets were designed without a kernel memory
> manager in mind (Windows only gained that with Vista). So the XP driver
> probably just implements a static gtt (that doesn't need any chipset
> flushes) and copies textures back and forth. That works, but performance
> will suck, especially with kernel-managed graphics memory allocation,
> where spills happen rather often.
>
> In other words, we're coaxing these chipsets into a framework they're not
> designed for (but which is the only sane thing to do considering modern
> graphics apis), trying to paper over any hw deficiencies with horrible
> hacks like mine.
>
Thank you Daniel for telling us so much - very appreciated. I now see the bigger picture.

> > Is there any other way that could explain the GTT<->CPU bug?
>
> As long as there's no other report of the same problem, hw flakiness is
> the only likely option. After all it only happens when hitting the gtt
> really hard, something XP (and the old ums driver) are not likely to do.
>
I see. I have also tried adding delays before reading back the values, but that does not help. It must be indeed some hardware glitch. I'd like to try KMS without ACPI, to see if it still happens, but that doesn't seem possible either.

So for now I'll keep this patch; and perhaps we can focus on the sudden crash bug later.

Revision history for this message
In , 2points (2points) wrote :

Thanks for all your work on this. Even if inclusion in mainline is still pending, patches here finally fix problems I've had for two years now. Looks like I can finally upgrade from 2.6.27 without fear of random failures interrupting my work every now and then.

Tested-by: Moritz Brunner <email address hidden> (Asus M2400N)

Revision history for this message
In , Thorsten-thvo (thorsten-thvo) wrote :

Daniel, I tested your patch (v6) on an 852GME. Being very similar to the 855, the 852 also suffers from this bug. The GPU hangs frequently with all recent kernels. With your patch applied, I have not seen any hangs, nor any messages about failed flushes. The graphics performance is noticeably reduced though.

Tested-by: Thorsten Vollmer <email address hidden> (DFI-ACP G5M150-N w/ 852GME)

BTW: I was surprised to read that you are using an X40, because I would never have discovered this bug on my X40, my second machine. I have been using it for weeks with unpatched kernels and never saw any hangs. At first I hoped that my X40 was not affected and we could compare register settings. But with your patch applied, the kernel reports some flush retries. The frequency of failed flushes must be very low though.

I appreciate your work on this issue. Thanks.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #60 from legolas558 <email address hidden> 2010-03-31 02:01:45 PST ---
> Are you planning to submit the patch here before sending it upstream?

As soon as I post the patch for inclusion, I'll add a patch with all the
debugging stuff removed to this bug report.

> --- Comment #62 from Thorsten Vollmer <email address hidden> 2010-03-31 11:24:00 PST ---
> BTW: I was surprised to read that you are using an X40, because I would never
> have discovered this bug on my X40, my second machine. I have been using it for
> weeks with unpatched kernels and never saw any hangs. At first I hoped that my
> X40 was not affected and we could compare register settings. But with your
> patch applied, the kernel reports some flush retries. The frequency of failed
> flushes must be very low though.

Yep, my X40 is very stable with stock kernels. So I could never understand
all these bug reports about "intel drivers totally suck on i855GM"
because, hey, it works here! But a discussion with Chris Wilson about a
very strange bug report got me thinking. A few debug hacks later (to
stress the gtt) I've reduced the lifetime expectancy of my X40 to half a
minute :( With this, I could then start hacking on solutions.

btw, these hacks are included in the patches posted here to really make
sure it works now. I'll drop them for the final rev.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34595)
gtt chipset flush v7

Rebased against 2.6.34-rc3 (_not_ drm-intel-next, that one doesn't have all the latest fixes). No other changes from v6.

Revision history for this message
In , Renegabriels (renegabriels) wrote :

Created an attachment (id=34597)
NEC P520: dmesg output

Revision history for this message
In , Renegabriels (renegabriels) wrote :

The v7 patch appears to be working on my NEC versa P520: no crashes or screen corruptions yet! I've been running glxgears for 40 minutes to stress test it. The dmesg output (see previous post) reports a number of inconsistencies though. No idea if this is a problem, I haven't been following this discussion very closely.

PS: Performance is also way better than with previous patch (https://bugs.freedesktop.org/attachment.cgi?id=33593), which I have been running for a couple of weeks now without any problems (except performance).

Revision history for this message
In , Bonbons67 (bonbons67) wrote :

(In reply to comment #55)
> Tested-by: Daniel Vetter <email address hidden> (IBM Thinkpad X40)
>
> If it doesn't work, hit me with your dmesg and i915_error_state output ;)
>
> Bruno, legolas, 2points, please add your tested-by, too.
>

Here it is, also had v7 running this evening with 3x glxgears and dmesg lists same kind of results as in comment #46 (reached about 5M-flushes with 1 retry until now)

Tested-by: Bruno Prémont <email address hidden> (Acer TM66x)

Revision history for this message
In , Remi (remi) wrote :

Created an attachment (id=34602)
dmesg output (HP Pavilion dv1000)

Here's my dmesg after running 2 glxgears at the same time... Doesn't look too good.

I'll see if the various corruptions I've seen so far reappear less often or not.

Thanks

Revision history for this message
In , legolas558 (legolas558) wrote :

I am now using patch v7, it is much more performant, max retries never seen above 5.

The 2 flush failures still happen (although it's harder to trigger them) when resizing glxgears window. Also if you insist a bit you will get a total system crash.

Always better than the vanilla kernel; please send upstream.

Tested-by: Daniele Castellitto <email address hidden> (Maxdata Pro 7000X)

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #68)
> Created an attachment (id=34602) [details]
> dmesg output (HP Pavilion dv1000)
>
@Daniel Vetter: maybe I and Rémi have the same issue? Would it be possible to store somewhere the instructions executed just before the GTT failure or much more importantly before the total system crash? Perhaps there are opcodes which do not fully reset the internal state machine, and adding some null operation in the flow will fix it (this would also explain why Xorg 1.6 crashes once in a year...)

Revision history for this message
In , Indan (indan) wrote :

(In reply to comment #70)
> If you want to help, please test my latest patch (v6) and report how it
> fares on your machine.

I've got a Thinkpad X40 with a rev 2 855GM chipset. With stock kernels my system is mostly stable, but X crashes now and then (depending on what I do, a couple times a week). Since KMS it's just the screen that freezes, except for the mouse pointer, and I can switch to console.

With your v7 patch applied on current git I get a full system hang fairly soon and easily. Because the system is stuck I can't get any error messages, but I didn't get any in dmesg before the hang (max retries was always 0, printed twice).

I do have the debugfs output, dmesg and Xorg.log after a hang with a plain current git kernel (2.6.34-rc3, HEAD at 0fdf86). Start of i915_error_state says:

Time: 1270672918 s 546381 us
PCI ID: 0x3582
EIR: 0x00000000
  PGTBL_ER: 0x00000000
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x41100000
  INSTDONE: 0x037fefc1
  ACTHD: 0x07c2a814
seqno: 0x000298f1

xorg-server 1.7.5.902, intel driver 2.10.0,
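
For anyone who wants to compare several of these dumps side by side, here is a minimal Python sketch that pulls the register values out of the header. It only assumes the "NAME: 0x..." layout shown in the dump above; the function name is made up for illustration and is not part of any tool mentioned in this report.

```python
import re

# Matches lines such as "  IPEHR: 0x41100000" or "PCI ID: 0x3582".
# Lines without a hex value (e.g. the "Time:" line) are skipped.
_FIELD_RE = re.compile(r"\s*([A-Za-z_ ]+):\s*(0x[0-9a-fA-F]+)\s*$")


def parse_error_state(text):
    """Return a dict mapping register names to integer values."""
    fields = {}
    for line in text.splitlines():
        match = _FIELD_RE.match(line)
        if match:
            fields[match.group(1).strip()] = int(match.group(2), 16)
    return fields


# Header copied from the dump in this comment.
SAMPLE = """\
Time: 1270672918 s 546381 us
PCI ID: 0x3582
EIR: 0x00000000
  PGTBL_ER: 0x00000000
  INSTPM: 0x00000000
  IPEIR: 0x00000000
  IPEHR: 0x41100000
  INSTDONE: 0x037fefc1
  ACTHD: 0x07c2a814
seqno: 0x000298f1
"""

if __name__ == "__main__":
    state = parse_error_state(SAMPLE)
    for name in ("PCI ID", "IPEHR", "INSTDONE", "ACTHD", "seqno"):
        print(f"{name:9s} 0x{state[name]:08x}")
```

With the values in a dict, two dumps can be diffed register by register instead of by eye.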

If there's anything I can do to help, please ask.

As I'm not interested in 3D, I'm starting to wonder why I shouldn't switch to the VGA driver. Surely that one is stable and not that much slower for 2D?

Revision history for this message
Chris Halse Rogers (raof) wrote :

It looks likely that we'll be blacklisting KMS for this chip in Lucid to work around other bugs. It would be helpful if users experiencing this bug could try disabling KMS (see https://wiki.ubuntu.com/X/KernelModeSetting if you are unsure how to do this) and see if it fixes the problem for you.

If there seems to be a consensus favoring blacklisting KMS, then we'll reassign this bug to the kernel and have andy blacklist for us.

If anyone finds that their i845 system works *worse* with KMS turned off for some reason, please shout; I wouldn't expect turning KMS off to cause regressions so this would be a surprise, but with i8xx weirder things have been known to happen!

Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Triaged → Incomplete
Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #71 from Indan Zupancic <email address hidden> 2010-04-07 16:37:09 PDT ---
> If there's anything I can do to help, please ask.

Thanks for testing. Please update to xf86-video-intel 2.11 (just released
a few days ago). Also update to libdrm 2.4.20. These contain a few fixes
for gpu hangs on i8xx hw. If your gpu still hangs, please attach the
output of i915_error_state, that's usually sufficient to get a clue about
what's going on.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #70 from legolas558 <email address hidden> 2010-04-07 02:24:10 PDT ---
> @Daniel Vetter: maybe I and Rémi have the same issue?

Yep, it looks like Rémi, René and you all suffer from the same. In other
words, this can't be explained by broken hw anymore. I have a few ideas
about what's going on (that would also explain the put_pages BUG seen by
Bruno). But don't hold your breath waiting for a fix, for this bug really
is a specialist in evasive maneuvers ;(

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #73)
> > --- Comment #70 from legolas558 <email address hidden> 2010-04-07 02:24:10 PDT ---
> > @Daniel Vetter: maybe I and Rémi have the same issue?
>
> Yep, it looks like Rémi, René and you all suffer from the same. In other
> words, this can't be explained by broken hw anymore. I have a few ideas
> about what's going on (that would also explain the put_pages BUG seen by
> Bruno). But don't hold your breath waiting for a fix for this bug really
> is a specialist in evasive maneuvers ;(

Oh sure it's broken, but its brokenness is consistent within the same model at least. How mad/stupid would it be to analyze the flow of GPU instructions to detect differences between a normal non-crashing flow (pre-KMS) and a crashing flow (KMS)? Also it would be nice to be able to "pack" these GPU flows in runnable batchsets so that one can eventually find the glitching sequence.
I have done this kind of sorcery with JTAG so I thought it might be a "tool" for us too.

Revision history for this message
In , Bonbons67 (bonbons67) wrote :

Created an attachment (id=34823)
Kernel logs with 3 BUGs in i915_gem.c:1456 / i915_gem_object_put_pages()

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=34824)
fix locking around chipset flushing

legolas, Rémi, René, this patch should fix the problems you've encountered with timed-out chipset flushes. It was a bug in my code. Please test extensively.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #76)
> Created an attachment (id=34824) [details]
> fix locking around chipset flushing
>
> legolas, Rémi, René, this patch should fix the problems you've encountered
> with timed-out chipset flushes. It was a bug in my code. Please test
> extensively.

Everything nominal with the new bugfix, as expected :-)

failures / flushes:
0 / 475136
max retries:
38.512510 0
160.138951 6
232.302424 8
402.559573 11

No flush failures with 6 glxgears windows (my 1.6 GHz hardware was almost hung up by running them).

The 8,11 max retries happened when opening Firefox, not when running glxgears; now it really seems stable, my raw feeling is that the bug is totally fixed.

Thanks for finding it :)

Now I'll try if playing videos still triggers a hangup, or if it hangs after hours of uptime.

Revision history for this message
Julian Lam (julian-lam) wrote :

KMS works fine in my experience, and I haven't experienced this apport crash since perhaps... 4 weeks ago?

If that changes anything.

Revision history for this message
In , Indan (indan) wrote :

(In reply to comment #72)
> Thanks for testing. Please update to xf86-video-intel 2.11 (just released
> a few days ago). Also update to libdrm 2.4.20. These contain a few fixes
> for gpu hangs on i8xx hw. If your gpu still hangs, please attach the
> output of i915_error_state, that's usually sufficient to get a clue about
> what's going on.

Okay, running 2.11 and 2.4.20 now. I'll report if I get any hangs, but that can take a while.

Thank you for all your work.

(As a side note, I still get slightly corrupted text in Firefox sometimes. If that has anything to do with this then the issue isn't totally solved yet.)

Revision history for this message
Geir Ove Myhr (gomyhr) wrote :

Julian, we turned off the automatic reporting of these bugs a while ago (4 weeks?). The apport-crash bug report would come up when the computer was rebooted after a freeze, but possibly in other situations as well. Did you see those freezes before or did you only see the apport-crash bug reporting dialog?

Revision history for this message
In , legolas558 (legolas558) wrote :

4h uptime, more than 2M flushes (2031616 exactly) without any failure, no crash after watching several videos, max retries is still 11. No graphics corruption anywhere. Possibly never experienced a more stable Xorg (not even with UMS).

All i855GM bugs are fixed for me; I am using xf86-video-intel v2.10.0 and an old libdrm compiled from git

The code which prints chipset flush stats can be taken out; please push patch upstream.

Many thanks for all the hard work!

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #79 from legolas558 <email address hidden> 2010-04-08 18:37:55 PDT ---
> 4h uptime, more than 2M flushes (2031616 exactly) without any failure, no crash
> after watching several videos, max retries is still 11. No graphics corruption
> anywhere. Possibly never experienced a more stable Xorg (neither with UMS).

Great! Rémi & René, can you please retest v7 with this fix applied and add
your tested-by line here?

Revision history for this message
In , legolas558 (legolas558) wrote :

Created an attachment (id=34836)
i915_gem_object_put_pages crash happening with libdrm 2.4.18

Now I am also affected by the i915_gem_object_put_pages bug; it obviously has a different cause.

Note: this total system crash happens with libdrm-2.4.18 and with libdrm-git (pulled today), however I could only retrieve the syslog message for the older libdrm (with the new one I only got nul bytes printed to syslog)

The bug triggers very quickly with libdrm-2.4.18 while it becomes much harder to trigger with the most recent libdrm, but it is indeed there.

Revision history for this message
Scott Testerman (scott-testerman) wrote :
Revision history for this message
In , Scott (firecat53) wrote :

Created an attachment (id=34849)
Everything.log for i845 freezes

Ok, for this hardware: 00:02.0 VGA compatible controller: Intel Corporation 82845G/GL[Brookdale-G]/GE Chipset Integrated Graphics Device (rev 01)

Running this kernel: 2.6.34-rc2-59809-g22f2d3a-dirty #1 SMP PREEMPT Thu Apr 8 21:41:20 PDT 2010 i686 Intel(R) Pentium(R) 4 CPU 1.80GHz GenuineIntel GNU/Linux

from drm-intel-next with Daniel's patch from yesterday, running xorg-server 1.7.6, libdrm-newest 2.4.19, and xf86-video-intel-git 20100408,

it appears that the freezing up is gone -- almost. I have been running for several hours with a combination of glxgears, playing .mp4 and .wmv movies from my hard drive, surfing with chromium and firefox, flash movie playback, and switching virtual terminals between dwm and xfce and have no freezes yet

EXCEPT -- running Tuxpaint under xfce, I got a freeze (see attached log for details -- just a single error message). Also, trying to play a dvd (using mplayer or gnome-mplayer) I got lots of graphics errors and then the same freeze (see log for details.) Not sure if these freezes are from this bug or an unrelated bug with the 845 chips.

Otherwise, nice work! Thank you :)

Scott

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #82 from Scott Hansen <email address hidden> 2010-04-09 10:04:27 PDT ---
> Otherwise, nice work! Thank you :)

Sorry to disappoint you, but that's just placebo. It looks like you've
only applied the small kernel patch from yesterday. But that's just an
incremental fix, i.e. you need the v7 patch _plus_ this small fix.

Can you please retest? If it still reports hangs, please add
i915_error_state in addition to the dmesg to this bug report.

Revision history for this message
In , Scott (firecat53) wrote :

Created an attachment (id=34857)
dmesg, i915_error_state and intel_gpu_dump

Agh, sorry! Well, with both the v7 and lock patches, my machine locked up instantly on starting X. I've attached (hopefully) the logs you requested. Let me know if you need more.

Thanks,
Scott

Revision history for this message
In , Renegabriels (renegabriels) wrote :

(In reply to comment #80)
> > --- Comment #79 from legolas558 <email address hidden> 2010-04-08 18:37:55 PDT ---
> > 4h uptime, more than 2M flushes (2031616 exactly) without any failure, no crash
> > after watching several videos, max retries is still 11. No graphics corruption
> > anywhere. Possibly never experienced a more stable Xorg (neither with UMS).
>
> Great! Rémi & Réne, can you please retest v7 with this fix applied and add
> your tested-by line here?

OK the following setup is still working after 45 min of stress-test (3x glxgears, 1x x11perf, 1x youtube):

- kernel 2.6.34-rc3 + v7 patch + mutex patch
- libdrm 2.4.20
- xorg-server 1.7.6
- xf86-video-intel 2.11.0
- mesa 7.8.1

No errors in dmesg whatsoever. The last entry reads "chipset flush no. 3637248, max retries 3".
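
If you want to pull these counters out of a long dmesg automatically, something like the following Python sketch might do. It assumes the debug patch prints lines in exactly the form quoted above ("chipset flush no. N, max retries M"); the timestamps in the sample and the function name are made up for illustration.

```python
import re

# Line format taken from the dmesg entry quoted in this comment.
FLUSH_RE = re.compile(r"chipset flush no\. (\d+), max retries (\d+)")


def summarize_flushes(dmesg_text):
    """Return (highest flush count, highest max-retries value), or None."""
    flushes = None
    retries = None
    for match in FLUSH_RE.finditer(dmesg_text):
        count = int(match.group(1))
        retry = int(match.group(2))
        flushes = count if flushes is None else max(flushes, count)
        retries = retry if retries is None else max(retries, retry)
    if flushes is None:
        return None
    return flushes, retries


# Hypothetical excerpt; only the text after the timestamp follows the thread.
SAMPLE = """\
[ 2601.100000] chipset flush no. 3604480, max retries 2
[ 2655.200000] some unrelated kernel message
[ 2712.300000] chipset flush no. 3637248, max retries 3
"""

if __name__ == "__main__":
    print(summarize_flushes(SAMPLE))
```

Feeding it the output of dmesg after a stress test gives the two numbers people have been posting here by hand.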

There's just 1 thing wrong right now: the GNOME panel refuses to draw text after the stress test, but that's probably a panel issue.

Revision history for this message
In , Remi (remi) wrote :

Like I told Daniel yesterday on IRC, this works brilliantly (v7+locking patch). No more corruption and no more messages/crashes in dmesg.

Thanks again Daniel!

Tested-by: Rémi Cardona <email address hidden> (HP Pavilion dv1000)

Cheers

Revision history for this message
In , Arthur Spitzer (arthapex) wrote :

(In reply to comment #86)
> Like I told Daniel yesterday on IRC, this works brilliantly (v7+locking patch).
> No more corruption and no more messages/crashes in dmesg.
>
> Thanks again Daniel!
>
> Tested-by: Rémi Cardona <email address hidden> (HP Pavilion dv1000)
>
> Cheers
@Rémi: Could you please create an ebuild with the working patchset and put it in the x11-overlay or portage? I would love to test it here on my Acer Travelmate 663

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #81 from legolas558 <email address hidden> 2010-04-09 00:53:49 PDT ---
> Created an attachment (id=34836)
> --> (https://bugs.freedesktop.org/attachment.cgi?id=34836)
> i915_gem_object_put_pages crash happening with libdrm 2.4.18
>
> Now I am also affected by the i915_gem_object_put_pages bug; it obviously has a
> different cause.
>
> Note: this total system crash happens with libdrm-2.4.18 and with libdrm-git
> (pulled today), however I could only retrieve the syslog message for the older
> libdrm (with the new one I only got nul bytes printed to syslog)
>
> The bug triggers very quickly with libdrm-2.4.18 while it becomes much harder
> to trigger with the most recent libdrm, but it is indeed there.

I've tried to again reproduce this problem on my box by downgrading to
libdrm 2.4.18 (and a few other hacks). But that bug simply refuses to show
up again, here. Can you and Bruno please gather a few backtraces (as many
as you have lying around in your logs) and upload them to this bug?

Perhaps I can see a pattern and get a clue what's going on - at least that
way I've managed to fix the other problem with the stuck chipset flush.

Revision history for this message
In , Bonbons67 (bonbons67) wrote :

Created an attachment (id=34917)
All BUGs I've seen in i915_gem.c since February

Revision history for this message
In , Renegabriels (renegabriels) wrote :

After a number of days testing this patch, I haven't seen any crashes or (render) errors. Thanks for the hard work Daniel!

Tested-by: René Gabriëls <email address hidden> (NEC Versa P520)

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #88)
> > --- Comment #81 from legolas558 <email address hidden> 2010-04-09 00:53:49 PDT ---
> > Created an attachment (id=34836) [details]
> > --> (https://bugs.freedesktop.org/attachment.cgi?id=34836)
> > i915_gem_object_put_pages crash happening with libdrm 2.4.18
> >
> > Now I am also affected by the i915_gem_object_put_pages bug; it obviously has a
> > different cause.
> >
> > Note: this total system crash happens with libdrm-2.4.18 and with libdrm-git
> > (pulled today), however I could only retrieve the syslog message for the older
> > libdrm (with the new one I only got nul bytes printed to syslog)
> >
> > The bug triggers very quickly with libdrm-2.4.18 while it becomes much harder
> > to trigger with the most recent libdrm, but it is indeed there.
>
> I've tried to again reproduce this problem on my box by downgrading to
> libdrm 2.4.18 (and a few other hacks). But that bug simply refuses to show
> up again, here. Can you and Bruno please gather a few backtraces (as many
> as you have lying around in your logs) and upload them to this bug?
>
> Perhaps I can see a pattern and get a clue what's going on - at least that
> way I've managed to fix the other problem with the stuck chipset flush.
Yes, I confirm that the chipset flushes are all OK because I never got a GTT failure again.

I can't be sure that the crash with libdrm-2.4.20 is also due to i915_gem_object_put_pages, because no log lines are stored on /var/log/messages (only nul bytes).

I can trigger this total system crash only with a windows application running inside wine, with all other normal linux usage (even 3D) there is no crash.

I'll try catching the i915 debugfs data right after the crash, but I am not sure that the init process is still alive after that.

Unfortunately I don't have other dump files lying around, however I am sure that this is a new bug on this hardware, i.e. it is not the crash happening when watching videos.

Revision history for this message
In , legolas558 (legolas558) wrote :

Created an attachment (id=34922)
dri debugfs snapshots taken every second + generator script used

Revision history for this message
In , legolas558 (legolas558) wrote :

Created an attachment (id=34923)
excerpt from dmesg containing crash dump for i915_gem_object_put_pages+0x10b/0x110

A few updates. I have used a script running in background to gather DRI debugfs dumps.

1) the system is not totally hung up, because background scripts still run. Only keyboard/mouse die
2) it is the same crash happening with libdrm-2.4.18 and libdrm-git, so it's not libdrm-dependent
3) the nul bytes were due to write buffers not being flushed before hard shutdown, and with the 'sync' call in the daemon script (available in tbz archive) it correctly puts the crash dump in /var/log/messages
4) it does not only depend on wine but also on some other application, because I had to run wine and firefox to trigger it; anyway it looks deterministic and not totally random
5) the crash must have happened within the last 10 snapshots (i.e. seconds), sorry but I can't be more precise, I hope you can guess where it crashed from the DRI debugfs dumps
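
For readers who do not want to unpack the archive, the snapshot-and-sync idea from the points above can be sketched roughly as follows in Python. This is not the attached generator script, just a guess at the same approach: copy every readable file from the DRI debugfs directory once per interval, then call sync so the copies survive a hard power-off. The function names and the destination directory are assumptions.

```python
import os
import time
from pathlib import Path


def snapshot_once(src_dir, dst_root, index):
    """Copy all readable regular files from src_dir into a numbered folder."""
    dst = Path(dst_root) / f"snap-{index:05d}"
    dst.mkdir(parents=True, exist_ok=True)
    for entry in sorted(Path(src_dir).iterdir()):
        if not entry.is_file():
            continue
        try:
            data = entry.read_bytes()
        except OSError:
            # Some debugfs files may refuse to be read; skip them.
            continue
        (dst / entry.name).write_bytes(data)
    # Flush write buffers so the snapshot is on disk before a hard shutdown.
    os.sync()
    return dst


def run(src_dir, dst_root, interval=1.0, count=None):
    """Take snapshots every `interval` seconds; loop forever if count is None."""
    index = 0
    while count is None or index < count:
        snapshot_once(src_dir, dst_root, index)
        index += 1
        time.sleep(interval)
```

Run as root, a call such as run("/sys/kernel/debug/dri/0", "/var/tmp/dri-snapshots") would take one snapshot per second; the destination path is only an example.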

If I can be of some help I'd be glad to make other tests/reports; looks like I am able to reproduce this bug at will, so I can actually make manual tests.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #91 from legolas558 <email address hidden> 2010-04-12 10:13:19 PDT ---
> Unfortunately I don't have other dump files lying around, however I am sure
> that this is a new bug on this hardware e.g. it is not the crash happening when
> watching videos.

Concerning your overlay problem: Can you open a new bug report for that
and put me on the cc: (I'm the overlay guy)? I have another report from a
i965G hanging when using the overlay, perhaps there's some pattern. Please
add the usual amount of information so that I (or anyone else) doesn't
have to hunt around in various bug reports. Thanks.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #94)
> > --- Comment #91 from legolas558 <email address hidden> 2010-04-12 10:13:19 PDT ---
> > Unfortunately I don't have other dump files lying around, however I am sure
> > that this is a new bug on this hardware e.g. it is not the crash happening when
> > watching videos.
>
> Concerning your overlay problem: Can you open a new bug report for that
> and put me on the cc: (I'm the overlay guy)? I have another report from a
> i965G hanging when using the overlay, perhaps there's some pattern. Please
> add the usual amount of information so that I (or anyone else) doesn't
> have to hunt around in various bug reports. Thanks.

Sorry, I ought to have used the past tense: the crash *that was* happening when watching videos. I am no longer experiencing the overlay bug when watching videos with the v7+locking patch.

If you want I can go back to an older patch which still exhibits the crash with videos and make the report as I did for the i915_gem_object_put_pages issue.

I am confident that it is fixed now because I am no longer seeing a psychedelic fuchsia/rainbow overlay fill, which was interleaved in frames from time to time and preceded the final crash by some minutes.

The only bug remaining for me is i915_gem_object_put_pages

Revision history for this message
In , Indan (indan) wrote :

(In reply to comment #72)
> > --- Comment #71 from Indan Zupancic <email address hidden> 2010-04-07 16:37:09 PDT ---
> > If there's anything I can do to help, please ask.
>
> Thanks for testing. Please update to xf86-video-intel 2.11 (just released
> a few days ago). Also update to libdrm 2.4.20. These contain a few fixes
> for gpu hangs on i8xx hw. If your gpu still hangs, please attach the
> output of i915_error_state, that's usually sufficient to get a clue about
> what's going on.

Okay, I have been running this combination for five days now without any hangs, it's looking pretty stable, I think my problems are fixed now.

Unpatched kernel 2.6.34-rc3
xf86-video-intel 2.11.0
libdrm 2.4.20
xorg-server 1.7.5.902
intel-dri 7.7.1

Do you want me to test your v7 patch + locking fixes to make sure it causes no regressions? It appeared to make things worse for Scott, and v7 on its own didn't work for me before either. Or are the good bits already upstream?

I think Scott hit the same bug as I did, so it might be fixed for him too now with the new libdrm. (No idea what actually caused my problems, nor what fixed it. Was it the EINTR versus EAGAIN bugfix? Or all the intel driver fixes?)

Thanks,

Indan

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #96 from Indan Zupancic <email address hidden> 2010-04-13 14:14:05 PDT ---
> Do you want me to test your v7 patch + locking fixes to make sure it causes no
> regressions? It appeared to make things worse for Scott, and v7 on its own
> didn't work for me before either. Or are the good bits already upstream?

Nope, nothing upstream yet (but the first patch series should hit
drm-intel-next in a few days). It looks like X40s are not really affected
by these gtt inconsistencies in day-to-day use. But the problem exists
there, too. So yes, please test v7+locking fix and beat it up for a few
days. If it works and doesn't report any failed chipset flushes, please
add your tested-by line, too.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package xserver-xorg-video-intel - 2:2.9.1-3ubuntu3

---------------
xserver-xorg-video-intel (2:2.9.1-3ubuntu3) lucid; urgency=low

  [Christopher James Halse Rogers]
  * debian/patches/107_disable_dri_on_845_855.patch:
    + Disable DRI on i845 and i855 chips. Works around the stability problems
      these chips have in Lucid (LP: #541492, LP: #541511).
 -- Bryce Harrington <email address hidden> Tue, 13 Apr 2010 15:25:23 -0700

Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Incomplete → Fix Released
Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

(In reply to comment #76)
> Created an attachment (id=34824) [details]
> fix locking around chipset flushing
>
> legolas, Rémi, René, this patch should fix the problems you've encountered
> with timed-out chipset flushes. It was a bug in my code. Please test
> extensively.

Hi,
The v7 + locking patches work fine here on a JVC MP-XP731 (with an Intel 855GM rev 02), more than 3mb flushes and no hangs. Before (normal debian squeeze stack) X would crash shortly after login.

However, it feels slightly more sluggish than with the old intel 2.3 driver and Xorg 7.3, but that's probably because the patch does some extra debug checking?

My working setup right now is:

Linux 2.6.34-rc2 from drm-intel-next + v7 + locking patch
libdrm 2.4.18
intel driver 2.9.1
Xserver 1.7.6
Mesa 7.7.1-devel

So the only exchanged component is the kernel. I've made a package which is available for others to test here: http://www2.informatik.hu-berlin.de/~beier/tmp/linux-image-2.6.34-rc2_2.6.34-rc2-10.00.Custom_i386.deb

Thumbs up for the hard work!

Christian

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

Created an attachment (id=34996)
dmesg output after suspend to disk

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

(From update of attachment 34996)
Oops, just after thinking everything's fine. X got stuck shortly after a wakeup from suspend to disk. dmesg output attached. Dunno if this is related at all...

Cheers,
   Christian

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

Thanks a lot for testing (this goes to everyone, not just Christian)!

> --- Comment #98 from Christian Beier <email address hidden> 2010-04-14 03:27:27 PDT ---
> However, it feels slightly more sluggish than with the old intel 2.3 driver and
> Xorg 7.3, but that's probably because the patch does some extra debug checking?

Yep, that's to be expected. My patch currently completely trashes the gtt
(to really exercise the chipset flush - no way to get to a few mm flushes
within just a few hours of testing without this). But that also kills
performance. Final version should be about on par with older drivers.

btw, is the following tested-by line correct?

Tested-by: Christian Beier <email address hidden> (JVC MP-XP731)

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

> btw, is the following tested-by line correct?
>
> Tested-by: Christian Beier <email address hidden> (JVC MP-XP731)

Oops, forgot that. Yeah, correct!

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #100 from Christian Beier <email address hidden> 2010-04-14 03:39:37 PDT ---
> (From update of attachment 34996)
> Oops, just after thinking everything's fine. X got stuck shortly after a wakeup
> from suspend to disk. dmesg output attached. Dunno if this is related at all...

Great, everyone is stuck on the dev->struct_mutex lock. Sigh. One more
hint that the locking is fishy. Can you please enable lockdep (Kernel
hacking -> Lock debugging: prove locking correctness) in your kernel
config and retest? Lockdep should shed some light on what the heck is
going on here.
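
A quick way to check whether a given kernel config already has lockdep enabled: the "prove locking correctness" menu entry corresponds to CONFIG_PROVE_LOCKING. The Python sketch below just scans a config file for that symbol. The /boot/config-<release> location is an assumption (it is where Debian and Ubuntu kernels usually put the config), and the function name is made up for illustration.

```python
import platform
from pathlib import Path


def config_enabled(config_text, option):
    """Return True if `option` is built in (=y) in the given kernel config."""
    wanted = f"{option}=y"
    for line in config_text.splitlines():
        # Disabled options appear as "# CONFIG_FOO is not set" and never match.
        if line.strip() == wanted:
            return True
    return False


if __name__ == "__main__":
    # Assumed location of the running kernel's config; adjust as needed.
    path = Path("/boot") / f"config-{platform.release()}"
    if path.exists():
        enabled = config_enabled(path.read_text(), "CONFIG_PROVE_LOCKING")
        print(f"{path}: lockdep {'enabled' if enabled else 'not enabled'}")
    else:
        print(f"no kernel config found at {path}")
```

If the option is not set, the kernel has to be rebuilt with it before lockdep reports anything.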

Revision history for this message
Geir Ove Myhr (gomyhr) wrote :

Bug reporters, it would be nice to have some confirmation that the workaround works as intended. The expected behaviour is that with 2:2.9.1-3ubuntu3, the freezes will stop and the performance (at least 3D and video) will be horrible.

While there is a fix in the works upstream for 855GM it looks like it is intended for the drm-intel-next series of kernels, which I think means that it will end up in the 2.6.35 kernel. It should probably also be possible to apply it to 2.6.34, but I'm not sure about 2.6.33 which is effectively what we have in Lucid when it comes to these drivers.

Revision history for this message
Geir Ove Myhr (gomyhr) wrote :

If someone wants to help upstream verify the potential fix on your 855GM hardware, Christian Beier has built a kernel-package with the patches (for Debian, but hopefully it will also work in Lucid). See https://bugs.freedesktop.org/show_bug.cgi?id=27187#c98 . The feedback they want is dmesg output from running this for a few hours while exercising the system (glxgears, 3D-apps, video, screensavers, x11perf, and whatever you may think of). If the system hangs, the file /sys/kernel/debug/dri/0/i915_error_state is interesting (will have to be retrieved by ssh while the system is hung). It is a bit slower than usual due to all the debugging code that is intended to catch any errors.

Of course, this will not work with xserver-xorg-video-intel 2:2.9.1-3ubuntu3, since it disables DRI. The one in my standard PPA should be equivalent to the current Lucid one, except that DRI is not disabled. https://launchpad.net/~gomyhr/+archive/standard
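
Since the error state has to be fetched over ssh while the machine is hung, it can help to have the command ready on a second computer beforehand. A rough Python sketch, assuming an ssh client is installed locally, sshd is running on the affected machine, and the remote user is allowed to read debugfs; the host name in the usage note and the function names are hypothetical.

```python
import subprocess

# Path as given in the comment above.
ERROR_STATE = "/sys/kernel/debug/dri/0/i915_error_state"


def ssh_cat_command(host, path=ERROR_STATE):
    """Build the ssh command line that prints the remote file."""
    return ["ssh", host, "cat", path]


def fetch_error_state(host, out_file, timeout=30):
    """Save the remote i915_error_state to out_file; return bytes written."""
    result = subprocess.run(
        ssh_cat_command(host),
        capture_output=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if ssh or cat fails
    )
    with open(out_file, "wb") as handle:
        handle.write(result.stdout)
    return len(result.stdout)
```

For example, fetch_error_state("root@hung-laptop", "i915_error_state.txt") would save the dump locally, ready to attach to the upstream report.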

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

Created an attachment (id=35042)
dmesg v7 + locking patch, lockdep enabled

This one's different from my last dmesg, no more hung tasks, rather looks like the i915_gem_object_put_pages() bug the others experienced.

HTH anyway,
Christian

Revision history for this message
In , legolas558 (legolas558) wrote :

I have enabled lockdep debugging and I also have an early BUG dump "BUG: key dd9e5288 not in .data!" like Christian, so that can be ignored.

Fixing this last bug has become very important because with the updates of last week Xorg v1.6 (and related packages and the old intel driver) is crashing badly, so it can no longer be used.

I am now using Xorg 1.7.6 with the VESA driver, and that is rock-solid

Revision history for this message
Hector Avila (compugeek3264) wrote :

The workaround did not work on my Gateway M305CRV with Intel 82852. I still get GPU lockups and the Xorg crashes that result from it.

Here's my Xorg log for anyone who wants it;

Revision history for this message
Hector Avila (compugeek3264) wrote :

does not*

Revision history for this message
Bryce Harrington (bryce) wrote :

reopening due to user feedback. I suspect disabling DRI might have helped, but is not a sufficient solution.

Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Fix Released → Triaged
Revision history for this message
Vlad Pes (pesotsky-web) wrote :

1. this workaround did not work on my Toshiba Portege M100 with
        00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)
+
2. when Ubuntu's xorg crashes and automatically restarts in low graphic mode, it is impossible to start xorg on the parallel installation of debian 5.0.4. Debian reports errors of xorg in text mode (--> https://bugs.launchpad.net/bugs/532452).

Revision history for this message
Bryce Harrington (bryce) wrote :

I wish we had more testing evidence to base this decision on, but I've posted a kernel bug report requesting KMS disablement on three of the older 8xx cards: lp #563277

We've already sent up a fair plentitude of bug reports to upstream, so I'm hopeful that they'll come up with fixes to this and to KMS, so we can re-enable in meerkat, or maybe even in 10.04.1, but we'll have to see how things go.

Revision history for this message
Hector Avila (compugeek3264) wrote :

I apologize for posting the wrong Xorg.0.log file on one of my last comments.

00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 01)
00:02.1 Display controller [0380]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 01)
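
Since several reports in this thread differ only in the chip revision, here is a small Python sketch for extracting the vendor:device ID and revision from lines like the two above. It assumes the bracketed ID format shown here (as printed by lspci -nn); the function name is made up for illustration.

```python
import re

# Matches the trailing "[8086:3582] (rev 01)" part of an lspci -nn line.
# The class code "[0300]" has no colon inside the brackets, so it is skipped.
_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]\s*\(rev ([0-9a-f]{2})\)")


def parse_lspci_line(line):
    """Return (vendor, device, revision) as hex strings, or None."""
    match = _ID_RE.search(line)
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)


# Line copied from this comment.
SAMPLE = (
    "00:02.0 VGA compatible controller [0300]: Intel Corporation "
    "82852/855GM Integrated Graphics Device [8086:3582] (rev 01)"
)

if __name__ == "__main__":
    print(parse_lspci_line(SAMPLE))
```

The device ID 3582 is the same value that shows up as "PCI ID" in i915_error_state, so the two can be cross-checked.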

Revision history for this message
Rick Spencer (rick-rickspencer3) wrote :

This is an unfortunate situation. There is a non-trivial number of users with 845 and 855 chips who are impacted by a regression in stability in the current X stack when running 3D and KMS.

We have opted for a "stability first" approach for these users. We will disable 3D and KMS for these chips in Lucid final release. This will have the unfortunate effect of disabling compiz. This will introduce a functional regression. So we will be sacrificing functionality for these users in favor of stability. This is a painful choice to make, but we feel that stability must trump functionality when we are forced to make such choices.

We will be pursuing functional fixes. However, we will do this outside the main release, for example in a PPA. If we are able to provide a fix that delivers stability and functionality, we will consider this a potential SRU in 10.04.1.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

Good news about the put_pages BUG_ON: There's another bug report
indicating that this is not a problem in my patch but also exists in the
mainline kernel:

https://bugzilla.kernel.org/show_bug.cgi?id=15664

My patch (especially the hack to stress test the gtt) just makes it more
likely.

Bad news: I still have no clue what's going on.

To all those who are hitting this problem: What mesa release are you using
and are you using a compositing window manager that uses OpenGL? I have
an idea ...

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #106)
> To all those who are hitting this problem: What mesa release are you using
> and are you using a compositing window manager that uses OpenGL? I have
> an idea ...

Using XFCE with mesa/libgl 7.7.1, no compositing at all. If you want I can enable Option "Composite" "Disable" in xorg.conf

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #107 from legolas558 <email address hidden> 2010-04-15 03:41:55 PDT ---
> (In reply to comment #106)
> > To all those who are hitting this problem: What mesa release are you using
> > and are you using a compositing window manager that uses OpenGL? I have
> > an idea ...
>
> Using XFCE with mesa/libgl 7.7.1, no compositing at all. If you want I can
> enable Option "Composite" "Disable" in xorg.conf

Arrgh, whatever, my theory just went bust.

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

(In reply to comment #106)
> To all those who are hitting this problem: What mesa release are you using
> and are you using a compositing window manager that uses OpenGL? I have
> an idea ...

Mesa 7.7.1 and running compiz 0.8.4...

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=35065)
only call put_pages when gtt_space != NULL

Ok, this might be the first real stab at that dreaded put_pages BUG. Everyone who's hitting this problem, please apply this patch on top of whatever kernel most easily reproduces the problem.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #110)
> Created an attachment (id=35065) [details]
> only call put_pages when gtt_space != NULL
>
> Ok, this might be the first real stab at that dreaded put_pages BUG. Everyone
> who's hitting this problem, please apply this patch on top of whatever kernel
> most easily reproduces the problem.

It seems much more stable now.

gtt failures:
0 / 114688
max retries:
66.696711 0
203.340132 5
284.831640 6
570.264962 7

I also tested it through Murphy's law by trying to do something important with the wine application: no hangups up to now.

Let's see what happens in the next couple of days. For now I'd say FIXED.

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

Created an attachment (id=35073)
dmesg output snippet with i915_gem_tiling.c warning

With all three patches (the v7, locking and gtt_space!=NULL) atop a drm-intel-next kernel it seems to run stably. I have not run into any crashes (so far...). However, while playing around with RandR rotation, I got the attached warning. X continues running.

Again, I don't know if this is in any way related, but maybe it helps...

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #112 from Christian Beier <email address hidden> 2010-04-15 15:19:40 PDT ---
> With all three patches (the v7, locking and gtt_space!=NULL) atop a
> drm-intel-next kernel it seems to run stably. I have not run into any crashes
> (so far...). However, while playing around with RandR rotation, I got the
> attached warning. X continues running.

Looks like the ddx is not properly disabling bo reuse on the framebuffer.
Please retest with the latest version of xf86-video-intel and libdrm. If
the problem persists, please open a new bug report, this is definitely
something else.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

Created an attachment (id=35087)
new patch against current drm-intel-next

I've beaten the patch into shape and killed all the debug hacks. Patch is against current drm-intel-next. But portions of it are already submitted upstream, so it might no longer apply in a few days. I'll try to rebase asap when that happens.

Performance should be about the same as old ums code or unpatched kernel - but stable ;)

Everyone who's still testing these patches and has not yet reported their tested-by line, please do so now. I intend to submit this sometime next week.

Please test this patch thoroughly.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #114)
> Created an attachment (id=35087) [details]
> new patch against current drm-intel-next
>
> Performance should be about the same as old ums code or unpatched kernel - but
> stable ;)
>
General performance has indeed increased; however, I can clearly notice "hiccups" during major load on the GPU, such as when starting Mozilla or wine apps. This was not noticeable with the previous v7+locking patchset.

As you said, it is like the old UMS code, and glxgears FPS is actually comparable (if not better), so the only minor issue is the hiccups that I am experiencing. These hiccups also hang the mouse for some seconds, so I suppose it is some locking on the GPU pipe, but I can't read the changes in your v8 patch, so these are just suppositions.

Anyway the patch is perfectly mature for upstream in my opinion, and is indeed the best one we have up to now. Please grab my Tested-By line from previous comments.

Revision history for this message
Anand Kumria (wildfire) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

Hi guys,

It is a shame that things are not working. Unfortunately reverting
things has also broken things.

Just a data point - reverting KMS has now impacted me (with KMS
things would work occasionally).

I am no longer able to boot to an operational graphical environment.

i855M (PCI ID: 8086:3582, subsystem: 1028:018d) rev 02

kernel: linux-image-2.6.32-21-generic (2.6.32-21.32)
gdm: 2.30.0-0ubuntu5
X: xserver-xorg-video-intel (2:2.9.1-3ubuntu4), intel-gpu-tools (1.0.2+git20100324-0ubuntu1)

Anand

On Thu, Apr 15, 2010 at 7:25 AM, Rick Spencer
<email address hidden> wrote:
> This is an unfortunate situation. There is a non-trivial number of users
> with 845 and 855 chips who are impacted by regressions in stability in
> the current X stack when running 3D and KMS.
>
> We have opted for a "stability first" approach for these users. We will
> disable 3D and KMS for these chips in the Lucid final release. This will
> have the unfortunate effect of disabling compiz. This will introduce a
> functional regression. So we will be sacrificing functionality for these
> users in favor of stability. This is a painful choice to make, but we
> feel that stability must trump functionality when we are forced to make
> such choices.
>
> We will be pursuing functional fixes. However, we will do this outside
> the main release, for example in a PPA. If we are able to provide a fix
> that delivers stability and functionality, we will consider this a
> potential SRU in 10.04.1.
>
> --
> MASTER: [i855] GPU lockup (apport-crash)
> https://bugs.launchpad.net/bugs/541511
> You received this bug notification because you are a direct subscriber
> of a duplicate bug.
>
> Status in X.org xf86-video-intel: Confirmed
> Status in “xserver-xorg-video-intel” package in Ubuntu: Triaged
> Status in “xserver-xorg-video-intel” source package in Lucid: Triaged
>
> Bug description:
> Binary package hint: xserver-xorg-video-intel
>
> This is a MASTER bug report, i.e. not a real bug report, but a tool to help manage other bug reports.
>
> Most bug reports on i855 are probably due to the CPU/GPU incoherency problem that is now consolidated upstream at http://bugs.freedesktop.org/show_bug.cgi?id=27187 (which was split off from a bug report for i845). For now, we mark all automatically reported GPU lockups on i855 as duplicates of this unless there is a reason not to. There are some tests you may do to help upstream with this issue, and I will come back with instructions here. For those of you who know how to patch and compile a kernel you may look at comment #30 (and #6 for what kind of feedback they want) in the upstream bug report. Actually, if someone could volunteer to build an ubuntu-packaged kernel with this patch for others to test, that would be nice.
>
> There is a similar master bug report for i845 at bug 541492.
>
>
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/xserver-xorg-video-intel/+bug/541511/+subscribe
>
>

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

(In reply to comment #114)
> Performance should be about the same as old ums code or unpatched kernel - but
> stable ;)

Yeah, with the v8 patch everything is more or less on par with the old UMS code performance-wise. It seems to be stable as well: I have been running with compiz for a few days, with no crashes or hiccups.

Thumbs up!
Christian

Revision history for this message
fossfreedom (fossfreedom) wrote :

Likewise, I am using an i855GM: a low-resolution boot splash is displayed and then the screen goes black. No GDM.

I've had to resort to the 32-20 kernel as well as the 2:2.9.1-3ubuntu1 driver to get back to a functioning system.

Revision history for this message
Anand Kumria (wildfire) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

On Sun, Apr 18, 2010 at 7:18 PM, DavidM <email address hidden> wrote:
> Likewise, I am using an i855GM: a low-resolution boot splash is displayed and
> then the screen goes black. No GDM.
>
> I've had to resort to the 32-20 kernel as well as the 2:2.9.1-3ubuntu1
> driver to get back to a functioning system.

Thanks for that info. Looking at the changelog either 2:2.9.1-3ubuntu1
or 2:2.9.1-3ubuntu2 ought to work.

Unfortunately, for me, it appears all older packages have disappeared
from the mirrors.

Anand

Revision history for this message
Geir Ove Myhr (gomyhr) wrote :

> Unfortunately, for me, it appears all older packages have disappeared
> from the mirrors.

xserver-xorg-video-intel 2:2.9.1-3ubuntu2~gomyhr1~clipsolids in my standard PPA (https://launchpad.net/~gomyhr/+archive/standard) should be functionally equivalent to 2:2.9.1-3ubuntu2, since I put it up to test that patch. Don't count on it staying there, since I will replace it whenever I need something else tested.

Revision history for this message
Anand Kumria (wildfire) wrote :

Hi Geir,

On Mon, Apr 19, 2010 at 12:26 AM, Geir Ove Myhr <email address hidden> wrote:
>> Unfortunately, for me, it appears all older packages have disappeared
>> from the mirrors.
>
> xserver-xorg-video-intel 2:2.9.1-3ubuntu2~gomyhr1~clipsolids in my
> standard PPA (https://launchpad.net/~gomyhr/+archive/standard) should be
> functionally equivalent to 2:2.9.1-3ubuntu2, since I put it up to test
> that patch. Don't count on it staying there, since I will replace it
> whenever I need something else tested.

Thanks for that!

Kernel 2.6.32-21-generic:
 - Confirmed works if 'startx' is done manually.
 - With 'text' on the kernel command line, and gdm then started manually ('start gdm') after logging in as root.
Kernel 2.6.32-19-generic:
 - Causes plymouth to crash. Starting gdm manually is NOT successful, and X does not work either.
Kernel 2.6.32-18-generic:
 - Same as with 2.6.32-19-generic.

Cheers,
Anand

Revision history for this message
In , Stefan Glasenhardt (glasen) wrote :

(In reply to comment #114)
> Created an attachment (id=35087) [details]
> new patch against current drm-intel-next

Hi Daniel,

Is it possible to merge the changes into a clean patch which applies to kernel version 2.6.32 and 2.6.33 (.32 preferred)?

I have spent two hours fiddling around trying to get your patch to apply cleanly to the latest Lucid kernel. It compiles, but the patch changes so many things (splitting of files, etc.) that I don't have the slightest idea what you have really changed and what the patch really does.

Can you please post the code snippets which were introduced by you to solve the crashes?

Greetings Stefan

Revision history for this message
Vlad Pes (pesotsky-web) wrote :

I've searched for the information and this is what I've found (maybe it'll help):

http://bugs.archlinux.org/task/16974:
"Anyway the i855GM issues have just been radically fixed by Daniel Vetter, you can find the patch and the complete kernel sources archive here:http://www.iragan.com/linux/i855GM/"

If this doesn't help, (I don't know much about it but I've thought maybe this could be the solution): make alternative xorg + intel-driver for 855gm (which works) as a manual installation available for those who have such hardware.

Cheers,
Vlad

Revision history for this message
Geir Ove Myhr (gomyhr) wrote :

Vlad, thank you for taking an interest in helping. We have followed the upstream bug report (freedesktop-bugs #27187 at the top of this page) and we know about the fix.

> "Anyway the i855GM issues have just been radically fixed by Daniel Vetter, you can find the patch and the complete kernel sources archive here:http://www.iragan.com/linux/i855GM/"

Two key words here are "radically" and "kernel". Daniel has rewritten some parts of the kernel in order to achieve this, and the fix is based on what will be kernel 2.6.35, which is much newer than what is in Lucid. So there are two problems: 1. The fix changes a lot of things and could lead to regressions for others, which is not something we want right before the release of Lucid. 2. The fix doesn't necessarily apply to the current Lucid kernel. After the release, we may be able to backport, test, and update if it doesn't lead to problems for anyone else.

> If this doesn't help, (I don't know much about it but I've thought maybe this could be the solution): make alternative xorg + intel-driver for 855gm (which works) as a manual installation available for those who have such hardware.

It is the kernel that needs an alternative version; changing xorg will not fix this. It's not too hard to compile, and I would already have done this if I had an i386 installation of Lucid available.

Revision history for this message
In , Arthur Spitzer (arthapex) wrote :

(In reply to comment #114)
> Created an attachment (id=35087) [details]
> new patch against current drm-intel-next
> ...
> Please test this patch thoroughly.
I've been testing it on my Travelmate 660 since Friday evening, and it has worked wonderfully. Here are my specs:
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
00:02.1 Display controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)

I've started xcompmgr on top of openbox. Every night four glxgears instances were running. In daily use, Firefox (with a lot of Flash videos), a high-resolution trailer of Iron Man 2 in mplayer, Eclipse and a transparent console were running without a freeze.
I would say it works! Thanks!

Tested-By: Arthur Spitzer <email address hidden> (Acer Travelmate 660)

Revision history for this message
In , Geir Ove Myhr (gomyhr) wrote :

(In reply to comment #117)
> Is it possible to merge the changes into a clean patch which applies to kernel
> version 2.6.32 and 2.6.33 (.32 preferred)?
> I have spent two hours fiddling around trying to get your patch to apply cleanly
> to the latest Lucid kernel.

Stefan, for the Lucid kernel we would want a patch against the 2.6.33 kernel, since Ubuntu (and some other distros) use 2.6.32 kernels with drm from 2.6.33.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #115)
> (In reply to comment #114)
> > Created an attachment (id=35087) [details] [details]
> > new patch against current drm-intel-next
> >
> > Performance should be about the same as old ums code or unpatched kernel - but
> > stable ;)
> >
> General performance has indeed increased; however, I can clearly notice
> "hiccups" during major load on the GPU, such as when starting Mozilla or wine
> apps. This was not noticeable with the previous v7+locking patchset.
>
> As you said, it is like the old UMS code, and glxgears FPS is actually comparable
> (if not better), so the only minor issue is the hiccups that I am experiencing.
> These hiccups also hang the mouse for some seconds, so I suppose it is some
> locking on the GPU pipe, but I can't read the changes in your v8 patch, so these
> are just suppositions.
>
Daniel, the patch fixed every bug; I almost forgot that I was running a testing stack. Regarding the hiccups: they are most probably perfectly normal and were concealed in previous versions of the patch because the overall performance was slower, so the hiccups could not be "felt".

Please add my Tested-by line, it's ready for me.

Revision history for this message
Anand Kumria (wildfire) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

Hi,

On Mon, Apr 19, 2010 at 7:39 AM, Geir Ove Myhr <email address hidden> wrote:
> Vlad, thank you for taking an interest in helping. We have followed the
> upstream bug report (freedesktop-bugs #27187 at the top of this page)
> and we know about the fix.
>
>> "Anyway the i855GM issues have just been radically fixed by Daniel Vetter, you can find the patch and the complete kernel
>> sources archive here:http://www.iragan.com/linux/i855GM/"

I took a look over the patches; whilst there is a radical re-working
of things - what is interesting is seeing what was shipped
(supposedly, this is from the comment and not verified) in Fedora 13.
They have a 2 line patch.

From http://www.iragan.com/linux/i855GM/old_patches/drm-intel-big-hammer.patch

RedHat patch used in Fedora Core 13.

This patch prevents instantaneous crashes when starting Xorg.

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 37427e4..08af9db 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2553,6 +2553,11 @@ i915_gem_execbuffer(struct drm_device *dev, void *data,

  mutex_lock(&dev->struct_mutex);

+ /* We don't get the flushing right for these chipsets, use the
+ * big hamer for now to avoid random crashiness. */
+ if (IS_I85X(dev) || IS_I865G(dev))
+ wbinvd();
+
  i915_verify_inactive(dev, __FILE__, __LINE__);

  if (dev_priv->mm.wedged) {

> It is the kernel that needs an alternative version, changing xorg will
> not fix this. It's not too hard to compile, and I would already have
> done this if I had an i386 installation of Lucid available.

It may be worthwhile asking the kernel guys about incorporating this
into the kernel. I am compiling 2.6.32-21-generic to test this out to
see if it works for me. I'll report back in a few hours.

Cheers,
Anand

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #117 from Stefan Glasenhardt <email address hidden> 2010-04-18 15:32:41 PDT ---
> Can you please post the code snippets which were introduced by you to solve
> the crashes?

As a preview, my local topic branch is available at

http://cgit.freedesktop.org/~danvet/drm/log/?h=stuff/i8xx_cache_coherency_for_oga

The relevant patches start after "Enable distcc". On my further merge
plans: As already said, I hope to send the last patch pile (containing the
real fix) for review in a few days, pending merging of the previous
submissions. If it survives review intact I'll backport just the fix for
.34 and earlier kernels.

So taking testing/release delays on each stage (-next, .34, -stable) into
account, expect a few weeks before this hits a stable kernel near you,
best-case scenario.

Revision history for this message
Chris Halse Rogers (raof) wrote :

I'm aware of that “big hammer” patch from the upstream bug. The problem is that it's not a fix - reports on the upstream bug indicate that it reduces the incidence of crashes, and the analysis by upstream is that it simply introduces a delay which makes it less likely to trigger the problem.

It was decided that as this was not going to be sufficient we wouldn't pursue this patch for Lucid. We'll be looking at pulling Daniel Vetter's fix into a PPA kernel at some point, and given sufficient testing it *might* be an SRU for 10.04.

Revision history for this message
In , Indan (indan) wrote :

Created an attachment (id=35159)
dmesg of a failed chipset flush warning

Since the libdrm and intel driver updates my system seems to be rock solid.
That said, I tried your patches to see if it made any change. v7 had horrible
performance, as expected, but v8 is considerably slower than unpatched too. E.g.
"time dmesg" in rxvt takes a lot longer (2x or more) and it all feels slightly
sluggish, not snappy as it is when unpatched. On the upside, it seems the text corruption is fixed by v8, though I'm not totally sure.

On the downside, just when I thought everything was fine, I got the following warning (for the first time):

WARNING: at /home/indan/src/linux-2.6/drivers/char/agp/intel-gtt.c:1007 intel_i830_chipset_flush+0x2e3/0x32d()
Hardware name: 2371GHG
i8xx chipset flush failed, expected: 827, cpu_read: 315

So it seems we're not there yet.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #122 from Indan Zupancic <email address hidden> 2010-04-19 05:36:35 PDT ---
> Since the libdrm and intel driver updates my system seems to be rock solid.
> That said, I tried your patches to see if it made any change. v7 had horrible
> performance, as expected, but v8 is considerably slower than unpatched too.
> E.g.
> "time dmesg" in rxvt takes a lot longer (2x or more) and it all feels slightly
> sluggish, not snappy as it is when unpatched. On the upside, it seems the text
> corruption is fixed by v8, though I'm not totally sure.

This is just with the kernel changed, right? Because 2.11 has taken a
rather severe hit against 2.10 for i8xx chipsets on some workloads (I'm
working on fixing it).

> On the downside, just when I thought everything was fine, I got the following
> warning (for the first time):
>
> WARNING: at /home/indan/src/linux-2.6/drivers/char/agp/intel-gtt.c:1007 intel_i830_chipset_flush+0x2e3/0x32d()
> Hardware name: 2371GHG
> i8xx chipset flush failed, expected: 827, cpu_read: 315
>
> So it seems we're not there yet.

Depends. It's definitely just a failed chipset flush (I've checked the
offset). But given enough time and testers, this is somewhat expected
because this patch just implements a probabilistic chipset flush. Tallying
all the chipset flushes of all testers easily gives on the order of 100mm
successful ones. Now if yours is the only one that failed, that's not a
problem. Please keep an eye on this and report any recurrences - some
more tuning might be called for (perhaps even a module parameter).

Also please report if the glyph corruptions show up again.

Revision history for this message
In , Indan (indan) wrote :

(In reply to comment #123)
> This is just with the kernel changed, right? Because 2.11 has taken a
> rather severe hit against 2.10 for i8xx chipsets on some workloads (I'm
> working on fixing it).

Yes, all userspace is unchanged since I switched to 2.11 and newer libdrm.

I didn't notice any regressions with 2.11 compared to 2.10 though, but I'm
only using 2D with xcompmgr -a running.

> Depends. It's definitely just a failed chipset flush (I've checked the
> offset). But given enough time and testers, this is somewhat expected
> because this patch just implements a probabilistic chipset flush. Tallying
> all the chipset flushes of all testers easily gives on the order of 100mm
> successful ones. Now if yours is the only one that failed, that's not a
> problem. Please keep an eye on this and report any recurrences - some
> more tuning might be called for (perhaps even a module parameter).

Well, it's curious I never got it with the v7 patch, while I ran that one for days.

A module parameter to dis/enable this canary stuff would be good, it just seems to slow things down for me without improving anything.

I wonder if it's in any way significant that the difference between the expected 827 and cpu_read 315 is precisely 512... Did anyone try to increase I830_MCH_WRITE_BUFFER_SIZE to something bigger?

Looking at the commit, especially the description, it seems like there's no way to do proper chipset flushes. Maybe hunt down and confront an Intel developer? Or avoid the need to do flushes, but that's probably unrealistic. On the other hand, if you can't really flush, you can't really depend on it either.

> Also please report if the glyph corruptions show up again.

I will.

Okay, while writing this I got a second warning:

i8xx chipset flush failed, expected: 4043, cpu_read: 3531

The difference is again exactly 512.

Maybe the chipset flushing is fine, but there's a different bug making it seem to fail?

Revision history for this message
In , Bonbons67 (bonbons67) wrote :

Created an attachment (id=35167)
failed flush with

Yesterday I had a failed flush as well, the only one since I applied the patch in attachment #35087.
Currently running:
  x11-base/xorg-server-1.7.6
  x11-libs/libdrm-2.4.20
  media-libs/mesa-7.8.1
  xf86-video-intel at commit 80f52482c7cde000a76b91fe3d8b6c16baf2141f
                             XvMC: fix memory overflow
                             8 April 2010, by Daniel Vetter

Revision history for this message
In , Geir Ove Myhr (gomyhr) wrote :

(In reply to comment #125)
> Created an attachment (id=35167) [details]
[30125.064301] i8xx chipset flush failed, expected: 648447, cpu_read: 647935
The difference is again 512, which suggests that it is usually/always bit 9 (counting from 0, since 512 = 2^9) that comes out wrong(?)
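
For anyone wanting to check these numbers, here is a minimal Python sketch that pulls the two values out of each reported warning and computes both their difference and their XOR. The regular expression is written against the message text quoted in this bug, and the function name `analyse` is made up for this example; neither comes from the kernel patch.

```python
import re

# Pattern for the warning text as quoted in this thread (an assumption about
# the exact format; adjust it if your dmesg output differs).
FLUSH_FAILED = re.compile(
    r"i8xx chipset flush failed, expected: (\d+), cpu_read: (\d+)"
)


def analyse(line):
    """Return (expected - cpu_read, expected XOR cpu_read), or None if no match."""
    match = FLUSH_FAILED.search(line)
    if match is None:
        return None
    expected, cpu_read = int(match.group(1)), int(match.group(2))
    return expected - cpu_read, expected ^ cpu_read


# The three failures reported so far in this bug.
REPORTS = [
    "i8xx chipset flush failed, expected: 827, cpu_read: 315",
    "i8xx chipset flush failed, expected: 4043, cpu_read: 3531",
    "[30125.064301] i8xx chipset flush failed, expected: 648447, cpu_read: 647935",
]

if __name__ == "__main__":
    for line in REPORTS:
        delta, xor = analyse(line)
        print(f"delta={delta} xor={xor:#x}")
```

All three reports give a difference of 512. The XOR is 0x200 for the first two but 0x600 for the third, so in that report more than one bit differs, which may point to an arithmetic offset of 512 rather than a single flipped bit.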

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #124 from Indan Zupancic <email address hidden> 2010-04-19 11:26:15 PDT ---
> > Depends. It's definitely just a failed chipset flush (I've checked the
> > offset). But given enough time and testers, this is somewhat expected
> > because this patch just implements a probabilistic chipset flush. Tallying
> > all the chipset flushes of all testers easily gives on the order of 100mm
> > successful ones. Now if yours is the only one that failed, that's not a
> > problem. Please keep an eye on this and report any recurrences - some
> > more tuning might be called for (perhaps even a module parameter).
>
> Well, it's curious I never got it with the v7 patch, while I ran that one for
> days.
>
> A module parameter to dis/enable this canary stuff would be good, it just seems
> to slow things down for me without improving anything.
>
> I wonder if it's in any way significant that the difference between the
> expected 827 and cpu_read 315 is precisely 512... Did anyone try to increase
> I830_MCH_WRITE_BUFFER_SIZE to something bigger?

The fact that it's 512 shows that the problem is a failed cacheflush and
nothing else (this is actually what I've checked). The chipset flush
checker changes the place it writes the check value every chipset flush.
And it reuses the same place every 512th chipset flush. So when the
chipset flush fails, the old value is there, which should be exactly 512
less than what's expected.
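
As a rough illustration of that explanation, the following toy Python model writes an incrementing check value into one of 512 slots per flush and reads it back. It is only a sketch: the class name `CanaryRing`, the `SLOTS` constant and the `succeeds` flag are invented for this example, and the real checker in the kernel patch is not reproduced here.

```python
SLOTS = 512  # assumed from "reuses the same place every 512th chipset flush"


class CanaryRing:
    """Toy model of the flush check described above; not the kernel code."""

    def __init__(self):
        # What a CPU read would currently return for each slot.
        self.visible = [None] * SLOTS
        self.counter = 0

    def flush(self, succeeds=True):
        """Write the next check value and return (expected, cpu_read)."""
        self.counter += 1
        slot = self.counter % SLOTS
        if succeeds:
            self.visible[slot] = self.counter
        # On a failed flush the new value never becomes visible, so the read
        # returns whatever the slot held from its previous use.
        return self.counter, self.visible[slot]


if __name__ == "__main__":
    ring = CanaryRing()
    for _ in range(826):
        ring.flush()
    print(ring.flush(succeeds=False))  # (827, 315)
```

In this model a single failed flush on a slot that has been used before always reads back a value exactly 512 lower than expected, which matches the 827/315 report. A failure on a slot's first use would read back nothing at all, a case the sketch represents as `None`.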

> Looking at the commit, especially the description, it seems like there's no way
> to do proper chipset flushes. Maybe hunt down and confront an Intel developer?
> Or avoid the need to do flushes, but that's probably unrealistic. On the other
> hand, if you can't really flush, you can't really depend on it either.

Well, there is _no_ way to do a reliable flush. And the hw docs explicitly
say so. But we need to move stuff in/out of the graphics mem (i.e. the
gtt). The other option would be to copy stuff in/out, which is even worse:
- Wastes memory (actually simply uses twice as much for everything).
- Would be even slower than what my hack currently does.

And to add insult to injury, some of the chipsets from the 2nd gen (i8xx)
suffer from other cache coherency problems in addition to this.

> > Also please report if the glyph corruptions show up again.
>
> I will.
>
> Okay, while writing this I got a second warning:
>
> i8xx chipset flush failed, expected: 4043, cpu_read: 3531

Ok, that's bad. Can you change the following define in
include/drm/intel-gtt.h and see whether you still get failed chipset
flushes?

-#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
+#define I830_CC_CANARY_FLOCK_GTT_PAGES 16

The whole thing makes somewhat more sense this way around, anyway.

Oh, and add some details about your box, please (brand&model + cpu,
mostly, the rest is all in the dmesg, anyway).

Revision history for this message
Brian Rogers (brian-rogers) wrote :

For those who need it, I have a kernel with Daniel Vetter's patch in my experimental PPA: https://launchpad.net/~brian-rogers/+archive/experimental

Revision history for this message
In , Tony W (tonywhite100) wrote :

OK Guys, I've tried the :
fix-i8xx-gtt-cache-coherency-v7.patch
locking_for_chipset_flush.patch
&
gtt_space_null_means_no_pages_ref.patch
patches against 2.6.34-rc3 for about a week now.

I have :
Intel Corporation 82852/855GM Integrated Graphics Device (rev 02) (prog-if 00 [VGA controller])

libdrm 2.4.18
xorg-x11-drv-intel-2.9.1
xorg-x11-server-Xorg-1.7.6

The patch seems quite stable, although I just experienced a (Non fatal) Crash.
So nice job guys, it's much better.

The crash I just got looks like :

Apr 19 20:28:48 m3n kernel: [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
Apr 19 20:28:48 m3n kernel: render error detected, EIR: 0x00000000
Apr 19 20:28:51 m3n kdm[1368]: X server for display :0 terminated unexpectedly

It is the first time it has crashed using the patches. The screen just went black and input was totally lost; I was able to safely shut down the machine by pressing the power button once. I am also seeing lots of stuff like :

Apr 19 16:35:59 m3n kernel: chipset flush no. 2752512, max retries 7
Apr 19 16:42:43 m3n kernel: chipset flush no. 2768896, max retries 7
Apr 19 16:45:03 m3n kernel: chipset flush no. 2785280, max retries 7
Apr 19 16:45:58 m3n kernel: chipset flush no. 2801664, max retries 7
Apr 19 16:47:54 m3n kernel: chipset flush no. 2818048, max retries 7
Apr 19 16:49:08 m3n kernel: chipset flush no. 2834432, max retries 7
Apr 19 16:50:26 m3n kernel: chipset flush no. 2850816, max retries 7
Apr 19 16:52:57 m3n kernel: chipset flush no. 2867200, max retries 7
Apr 19 16:56:29 m3n kernel: chipset flush no. 2883584, max retries 7
Apr 19 16:57:46 m3n kernel: chipset flush no. 2899968, max retries 7
Apr 19 16:58:27 m3n kernel: chipset flush no. 2916352, max retries 7
Apr 19 17:02:14 m3n kernel: chipset flush no. 2932736, max retries 7
Apr 19 17:05:27 m3n kernel: chipset flush no. 2949120, max retries 7
Apr 19 17:10:07 m3n kernel: chipset flush no. 2965504, max retries 7
Apr 19 17:12:00 m3n kernel: chipset flush no. 2981888, max retries 7
Apr 19 17:15:20 m3n kernel: chipset flush no. 2998272, max retries 7
Apr 19 17:17:07 m3n kernel: chipset flush no. 3014656, max retries 7
Apr 19 17:17:54 m3n kernel: chipset flush no. 3031040, max retries 7
Apr 19 17:18:57 m3n kernel: chipset flush no. 3047424, max retries 7
Apr 19 17:26:08 m3n kernel: chipset flush no. 3063808, max retries 7
Apr 19 17:26:37 m3n kernel: chipset flush no. 3080192, max retries 7
Apr 19 17:28:20 m3n kernel: chipset flush no. 3096576, max retries 7
Apr 19 17:28:55 m3n kernel: chipset flush no. 3112960, max retries 7
Apr 19 17:29:58 m3n kernel: chipset flush no. 3129344, max retries 7
Apr 19 17:30:20 m3n kernel: chipset flush no. 3145728, max retries 7
Apr 19 17:30:44 m3n kernel: chipset flush no. 3162112, max retries 7
Apr 19 17:31:26 m3n kernel: chipset flush no. 3178496, max retries 7
Apr 19 17:31:54 m3n kernel: chipset flush no. 3194880, max retries 7
Apr 19 17:35:13 m3n kernel: chipset flush no. 3211264, max retries 7
Apr 19 17:35:34 m3n kernel: chipset flush no. 3227648, max retries 7
Apr 19 17:37:11 m3n kernel: chipset flush no. 3244032, max retries 7
Apr 19 17:38:20 m3n kernel: chipset flush no. 3260416, max retries 7
Apr 19 17:...
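
In the excerpt above the flush counter grows by 16384 from one line to the next, so the timestamps give a rough flush rate. A minimal sketch for working that out from such lines (the regular expression and function name are made up for this example, and it assumes all lines fall on the same day):

```python
import re
from datetime import datetime

LINE = re.compile(
    r"^\w{3}\s+\d+\s+(\d\d:\d\d:\d\d) .*chipset flush no\. (\d+), max retries (\d+)"
)


def flush_rates(log_text):
    """Return (flushes, seconds, flushes_per_second) between consecutive lines."""
    samples = []
    for line in log_text.splitlines():
        m = LINE.match(line)
        if m:
            t = datetime.strptime(m.group(1), "%H:%M:%S")
            samples.append((t, int(m.group(2))))
    rates = []
    for (t0, n0), (t1, n1) in zip(samples, samples[1:]):
        seconds = (t1 - t0).total_seconds()
        if seconds > 0:  # skip lines logged within the same second
            rates.append((n1 - n0, seconds, (n1 - n0) / seconds))
    return rates


sample = """\
Apr 19 16:35:59 m3n kernel: chipset flush no. 2752512, max retries 7
Apr 19 16:42:43 m3n kernel: chipset flush no. 2768896, max retries 7
Apr 19 16:45:03 m3n kernel: chipset flush no. 2785280, max retries 7
"""
for flushes, seconds, rate in flush_rates(sample):
    print(f"{flushes} flushes in {seconds:.0f} s ({rate:.1f}/s)")
```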

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #128 from Tony White <email address hidden> 2010-04-19 13:04:24 PDT ---
> OK Guys, I've tried the :
> fix-i8xx-gtt-cache-coherency-v7.patch
> locking_for_chipset_flush.patch
> &
> gtt_space_null_means_no_pages_ref.patch
> patches against 2.6.34-rc3 for about a week now.
>
> I have :
> Intel Corporation 82852/855GM Integrated Graphics Device (rev 02) (prog-if 00
> [VGA controller])
>
> libdrm 2.4.18
> xorg-x11-drv-intel-2.9.1
> xorg-x11-server-Xorg-1.7.6

Thanks a lot for testing. Unfortunately the versions you're using are
known-broken. Please retest with the latest&greatest (currently libdrm
2.4.20 and xf86-video-intel 2.11). Also, when the gpu hangs (as indicated
by the hangcheck timer elapsed error in the dmesg) always grab an
i915_error_state (from the dri directory of the debugfs filesystem). That
file contains a dump of the gpu state when it died with all the
necessary info to debug such a hang (the dmesg only tells that the gpu
died, but misses all the other needed info).
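
For testers who want to script this, a minimal sketch of saving the dump right after a hang. The default path is an assumption (debugfs mounted at /sys/kernel/debug, GPU on DRM minor 0) and may differ on your system; the helper name is made up for this example. Reading debugfs normally requires root.

```python
import shutil
import time
from pathlib import Path

# Assumption: debugfs is mounted at /sys/kernel/debug and the GPU is DRM minor 0.
DEFAULT_SOURCE = Path("/sys/kernel/debug/dri/0/i915_error_state")


def save_error_state(source=DEFAULT_SOURCE, dest_dir=Path.home()):
    """Copy the error state dump to a timestamped file and return its path."""
    source = Path(source)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = Path(dest_dir) / f"i915_error_state-{stamp}.txt"
    with source.open("rb") as src, dest.open("wb") as dst:
        shutil.copyfileobj(src, dst)
    return dest
```

Calling save_error_state() as root after the hangcheck error shows up in dmesg leaves a file in the home directory that can be attached to the bug.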

Revision history for this message
In , Geir Ove Myhr (gomyhr) wrote :

(In reply to comment #129)
> > --- Comment #128 from Tony White 2010-04-19
> > libdrm 2.4.18
> > xorg-x11-drv-intel-2.9.1
> Thanks a lot for testing. Unfortunately the versions you're using are
> known-broken. Please retest with the latest&greatest (currently libdrm
> 2.4.20 and xf86-video-intel 2.11).

Daniel, would you be able to give a list of commits that fix this kind of bug? Ubuntu is currently frozen and has those versions, but it would be nice to have a list of candidate patches for updates. We already have
[0c47195ca805881e3fbd5b9224be5c930feeeb8c] i830: Clip solid fills to surface

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #130 from Geir Ove Myhr <email address hidden> 2010-04-19 13:53:18 PDT ---
> Daniel, would you be able to give a list of commits that fix this kind of bugs?
> Ubuntu is currently frozen and has those versions, but it would be nice to have
> a list of candidate patches for updates. We already have
> [0c47195ca805881e3fbd5b9224be5c930feeeb8c] i830: Clip solid fills to surface

For a definite answer, please ask Chris, but a quick scan shows the
following commit since libdrm 2.4.18 as an important fix
 - a4041e096ce0faea3dd39b4d78014d45a8cacec0 (intel: Repeat execbuffer if
   interrupted by signal)

Revision history for this message
In , Indan (indan) wrote :

(In reply to comment #127)
> The fact that it's 512 shows that the problem is a failed cacheflush and
> nothing else (this is actually what I've checked). The chipset flush
> checker changes the place it writes the check value every chipset flush.
> And it reuses the same place every 512th chipset flush. So when the
> chipset flush fails, the old value is there, which should be exactly 512
> less than what's expected.

Yeah, I figured it would be that, reading through your old comments.

By the way, I think I got those failed flushes without xcompmgr running.
(I killed it to see if there was any difference.) That might explain why I
didn't see failed flushes before, xcompmgr is more or less always running.

My case might be related to suspend, because both failures happened within
a minute or so from resume.

I wish I knew a way to trigger it easily, now it takes days to test anything.

> Well, there is _no_ way to do a reliable flush. And the hw docs explicitly
> say so. But we need to move stuff in/out of the graphics mem (i.e. the
> gtt). The other option would be to copy stuff in/out, which is even worse:
> - Wastes memory (actually simply uses twice as much for everything).
> - Would be even slower than what my hack currently does.
>
> And to add insult to injury, some of the chipsets from the 2nd gen (i8xx)
> suffer from other cache coherency problems in addition to this.

What I don't understand is why your patch slows things down so much for me,
it seems to do only a few thousand flushes anyway.

I guess copying around is what the old drivers did?

Some random ideas:

- Increase I830_MCH_WRITE_BUFFER_SIZE?

- Instead of writing zeroes, actually change the content of the flush page.
  Flushing caches doesn't seem to do much if the new content is the same as
  the old one?

- The text you quoted in one of your commit messages said that the memory
  content isn't coherent, but it didn't say anything about the mapping itself.
  Can't you update the gtt mapping to effectively flush it? I mean, if you
  move pages out of the gtt and back in, shouldn't that flush the old content?
  Maybe move it to a different index, e.g. insert new mapping to the start
  instead of the end, in case the hw caches it by address+index. Similar to
  Chris Wilson's gtt disabling thing, but instead of disabling, altering it
  in a smart, flush causing way.

If the problem is that the flush is needed to avoid the hardware from writing
stale data to old gtt mapped physical memory:

- If an entry is added, there should be no need for a flush, because all the
  memory is still valid. If an entry is removed, the gpu can continue to write
  to those pages. What about copying the content to a new physical page and
  keeping the original page for a while until the gpu is done with it?

(I don't know what I'm talking about, just trying to inspire you to come up
with some genius plan to solve all problems. :-)

> Ok, that's bad. Can you change the following define in
> include/drm/intel-gtt.h and see whether you still get failed chipset
> flushes?
>
> -#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
> +#define I830_CC_CANARY_FLOCK_GTT_PAGES 16
>
> The whole stuff make ...

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #132)
> (In reply to comment #127)
> > Oh, and add some details about your box, please (brand&model + cpu,
> > mostly, the rest is all in the dmesg, anyway).
>
> See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
> stepping 6: It has clflush).
>
> But the hangs are gone, so I'm happy. I prefer slight glyph corruption that
> goes
> away when I cause a refresh (e.g. increase text size) with snappy performance
> to
> the sluggishness caused by the current patch.

I also own an 855GM (rev 02), but I had no glyph corruption with patch v6; without the locking patch I experienced crashes, so the most recent patch is really necessary for me, although I'd also like to see it more performant. But first comes reliability, and right now it's not crashing anymore.

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #132 from Indan Zupancic <email address hidden> 2010-04-19 15:26:36 PDT ---
> What I don't understand is why your patch slows things down so much for me,
> it seems to do only a few thousand flushes anyway.

Well, worst-case a flush can take 1 ms.
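
To put that figure next to "a few thousand flushes", a small worked calculation. The flush counts are illustrative, not measurements, and 1 ms is the worst case, so these are upper bounds only:

```python
WORST_CASE_FLUSH_MS = 1.0  # worst-case figure quoted above


def worst_case_stall_seconds(n_flushes, per_flush_ms=WORST_CASE_FLUSH_MS):
    """Upper bound on the total time spent flushing, in seconds."""
    return n_flushes * per_flush_ms / 1000.0


for n in (1_000, 5_000, 16_384):
    print(f"{n} flushes -> up to {worst_case_stall_seconds(n):.1f} s")
```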

> I guess copying around is what the old drivers did?

Nope. But for various reasons it changed mappings _much_ less. So much
less likely to crash.

> Some random ideas:
>
> - Increase I830_MCH_WRITE_BUFFER_SIZE?

Tried. Given up at 64 kb.

> - Instead of writing zeroes, actually change the content of the flush page.
> Flushing caches doesn't seem to do much if the new content is the same as
> the old one?

Patch does that atm for all writes. Furthermore I've never seen hw that
clever (it's a totally worthless optimization, usually).

> - The text you quoted in one of your commit messages said that the memory
> content isn't coherent, but it didn't say anything about the mapping itself.
> Can't you update the gtt mapping to effectively flush it? I mean, if you
> move pages out of the gtt and back in, shouldn't that flush the old content?
> Maybe move it to a different index, e.g. insert new mapping to the start
> instead of the end, in case the hw caches it by address+index. Similar to
> Chris Wilson's gtt disabling thing, but instead of disabling, altering it
> in a smart, flush causing way.

Well, that's exactly where the shit usually hits the fan. Furthermore, at
least on i845 there are chipset errata that say (no joke) if you change a
mapping shortly before the gpu reads stuff from it, it may read adjacent
pages. Chris is trying to battle that one. Oh, and no, rewriting the gtt
entries doesn't flush data (only tlb, but not everywhere, see above).

> If the problem is that the flush is needed to avoid the hardware from writing
> stale data to old gtt mapped physical memory:
>
> - If an entry is added, there should be no need for a flush, because all the
> memory is still valid. If an entry is removed, the gpu can continue to write
> to those pages. What about copying the content to a new physical page and
> keeping the original page for a while until the gpu is done with it?

Something similar is already done. Look for scratch_page in intel-gtt.c

> > Ok, that's bad. Can you change the following define in
> > include/drm/intel-gtt.h and see whether you still get failed chipset
> > flushes?
> >
> > -#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
> > +#define I830_CC_CANARY_FLOCK_GTT_PAGES 16
> >
> > The whole stuff makes somewhat more sense this way around, anyway.
>
> I will try this later, first I'm going to try without your latest commit
> ("fix i85x gtt chipset flush") to see how it behaves without that stuff,
> both performance and amount of failed flushes.

If your X40 is anything like mine, you're in for a bad surprise :(

> > Oh, and add some details about your box, please (brand&model + cpu,
> > mostly, the rest is all in the dmesg, anyway).
>
> See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
> stepping 6: It has clflush).

Thanks, I'm regularly losing my overview with all the different testers on
this bug ;)

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #133 from legolas558 <email address hidden> 2010-04-19 15:48:18 PDT ---
> I also own an 855GM (rev 02), but I had no glyph corruption with patch v6;
> without the locking patch I experienced crashes, so the most recent patch is
> really necessary for me, although I'd also like to see it more performant. But
> first comes reliability, and right now it's not crashing anymore.

Yep, I want to get this right first before performance tuning starts. But
I have already a few ideas how to improve the current situation.
- The current chipset flush always flushes both directions, but we usually
  only need one direction flushed. This is especially important because
  the slower flush is in gtt->cpu direction, which isn't performance
  critical at all.
- atm the driver executes enormous amounts of unnecessary flushes.
  Batching them up should fix this.
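
A toy model of those two ideas, only to make them concrete. This is not driver code; the class and names are hypothetical. Flush requests are merged per direction, and a single flush covering only the requested directions is carried out when the batch is submitted.

```python
from enum import Flag


class Direction(Flag):
    NONE = 0
    CPU_TO_GTT = 1
    GTT_TO_CPU = 2


class FlushBatcher:
    """Toy model: remember which directions are dirty, flush once per batch."""

    def __init__(self):
        self.pending = Direction.NONE
        self.performed = []  # log of the flushes actually carried out

    def request(self, direction):
        # Merging is just an OR, so repeated requests cost nothing extra.
        self.pending |= direction

    def submit(self):
        # Flush only the directions requested since the last submit.
        if self.pending:
            self.performed.append(self.pending)
        self.pending = Direction.NONE


batcher = FlushBatcher()
for _ in range(100):
    batcher.request(Direction.CPU_TO_GTT)
batcher.submit()
print(len(batcher.performed), batcher.performed[0])
```

In this model a hundred requests in one direction collapse into one flush, and the slower gtt->cpu direction is never touched unless something asked for it.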

Revision history for this message
Christiansen (happylinux) wrote :

Just for the record, I have to back up Anand Kumria on comment #84.

I too am no longer able to start X (locks up/freezes completely) since kernel 2.6.32-21.32 on a ThinkPad with this adapter:

00:02.0 VGA compatible controller [0300]: Intel Corporation 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)
Subsystem: IBM Device [1014:0557]

Lucid was reinstalled from the (yesterday's) latest LiveCD 2010.04.16 and all available updates applied as of 2010.04.19. After downgrading to kernel 2.6.32-21.31 from the recovery console, I'm able to start X again.

Revision history for this message
In , Renegabriels (renegabriels) wrote :

(In reply to comment #110)
> Created an attachment (id=35065) [details]
> only call put_pages when gtt_space != NULL
>
> Ok, this might be the first real stab at that dreaded put_pages BUG. Everyone
> who's hitting this problem, please apply this patch on top of whatever kernel
> most easily reproduces the problem.

I've been hitting this bug at least once the last couple of days (my system crashed a couple of times, and only once I was able to extract logs). With this patch applied, I haven't seen this bug resurface so far.

However, my system crashes when starting doomsday or warzone2100 (both OpenGL games). I hadn't noticed this before, and it is probably unrelated to this coherency bug. Where should I file this bug report? Mesa? Dmesg says:

[drm:i915_gem_do_execbuffer] *ERROR* Invalid object handle 48 at index 0

X log says:

[ 5132.404] (EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Bad file descriptor.

Revision history for this message
In , Brian Rogers (brian-rogers) wrote :

OpenGL hangs could be bug 26557.

Try reverting commit b4a6169412819cc3a027c6a118f0537911145a30.

Revision history for this message
In , Brian Rogers (brian-rogers) wrote :

That's a Mesa commit, BTW.

Revision history for this message
Gavin Jones (gavinj44) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

Same problem here, on a Toshiba A200 Portege laptop with the 855GM
card in it.

Gavin

On 19 Apr 2010, at 17:40, Christiansen <email address hidden> wrote:

> Just for the record, I have to back up Anand Kumria on comment #84.
>
> I too am no longer able to start X (locks up/freezes completely) since
> kernel 2.6.32-21.32 on a ThinkPad with this adapter:
>
> 00:02.0 VGA compatible controller [0300]: Intel Corporation
> 82852/855GM Integrated Graphics Device [8086:3582] (rev 02)
> Subsystem: IBM Device [1014:0557]
>
> Lucid was reinstalled from the (yesterday's) latest LiveCD 2010.04.16 and
> all available updates applied as of 2010.04.19. After downgrading to kernel
> 2.6.32-21.31 from the recovery console, I'm able to start X again.

Revision history for this message
In , r6144 (rainy6144) wrote :

(I don't own an 855 and my 845 machine is not available right now, so this is just wild speculation.)

Could some of the chipset buffers be indexed by the SDRAM bank number (and maybe even the row (side) number)? I'm imagining a scenario where the CPU and the GTT sides have separate SDRAM write buffers that are not kept coherent (their access to the actual RAM can be arbitrated), and each write buffer has one or two cache lines for each bank; this might be a relatively easy way to allow access to different banks in parallel. There seem to be 4 banks on the 845, and the bank number can sit anywhere between bits 11-12 and bits 14-15, depending on the DRAM modules installed; perhaps the situation is similar on the 855 as well. If this is the case, 16 physically contiguous pages should cover all banks, while non-contiguous ones might not if we are particularly unlucky in intel_i830_setup_flush(), which is called when resuming.

To test this theory, maybe we can print the physical addresses (those within the System RAM range in /proc/iomem) of the allocated i8xx_pages. Then, when we see retried or even failed flushes, perhaps some patterns can be observed.
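
A small sketch of the coverage argument, under the assumptions stated above (4 banks, bank number taken from two adjacent address bits, 4 KiB pages). The 64-byte step, the start address and the function names are made up for this example:

```python
PAGE_SIZE = 4096
CACHE_LINE = 64  # assumed granularity for walking through a page
NUM_BANKS = 4    # "4 banks on the 845", per the text above


def banks_covered(page_addrs, low_bit):
    """Return the set of bank numbers touched by the given pages.

    The bank number is taken to be address bits low_bit and low_bit + 1.
    """
    banks = set()
    for base in page_addrs:
        for offset in range(0, PAGE_SIZE, CACHE_LINE):
            banks.add(((base + offset) >> low_bit) & (NUM_BANKS - 1))
    return banks


def contiguous(start, n_pages):
    """Physical addresses of n_pages contiguous pages starting at start."""
    return [start + i * PAGE_SIZE for i in range(n_pages)]


for low_bit in (11, 14):
    for n in (8, 16):
        covered = banks_covered(contiguous(0x1234000, n), low_bit)
        print(f"bank bits {low_bit}-{low_bit + 1}, {n} contiguous pages: {sorted(covered)}")
```

Sixteen contiguous pages span 64 KiB, so every combination of bits up to bit 15 occurs. Scattered pages give no such guarantee: in this model, pages spaced 64 KiB apart all land in the same bank.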

Revision history for this message
In , Daniel-ffwll (daniel-ffwll) wrote :

> --- Comment #139 from <email address hidden> 2010-04-19 20:32:56 PDT ---
> Could some of the chipset buffers be indexed by the SDRAM bank number (and
> maybe even the row (side) number)? I'm imagining a scenario where the CPU and
> the GTT sides have separate SDRAM write buffers that are not kept coherent
> (their access to the actual RAM can be arbitrated), and each write buffer has
> one or two cache lines for each bank; this might be a relatively easy way to
> make simultaneous access to different banks in parallel. There seems to be 4
> banks on the 845, and the bank number can be between bits 11-12 and 14-15,
> depending on the DRAM modules installed; perhaps the situation is similar on
> the 855 as well. If this is the case, 16 physically contiguous pages should
> cover all banks, while non-contiguous ones might not be so if we are
> particularly unlucky in intel_i830_setup_flush(), which is called when
> resuming.

Neat idea. I'll look into allocating the pages as one big chunk (ie higher
order alloc). But that doesn't explain why the problem seems to happen
only after a resume - the pages don't get reallocated on resume (look for
"goto setup" in intel_i830_setup_flush).

Revision history for this message
Valentijn Sessink (valentijn) wrote :

IBM X40 laptop, 82852/855GM rev. 02, also: no working graphical environment since 2.6.32-21; I manually installed 2.6.32.20 to get a working environment. Unfortunately, there's no crash information whatsoever, the system just freezes hard, without logs or otherwise usable information. Setting i915.modeset=0 does not help.

Revision history for this message
royden (ryts) wrote :

Setting i915.modeset=1 via grub at boot resulted in a successful launch of X & gdm for me (given the same symptoms as V. Sessink, i.e. no log indicators of the problem).

Revision history for this message
In , legolas558 (legolas558) wrote :

18 chipset failures in 1h of uptime with a resume from hibernation (seems totally unrelated for me).

[ 140.678789] i8xx chipset flush failed, expected: 4642, cpu_read: 4130
[ 382.334636] i8xx chipset flush failed, expected: 32422, cpu_read: 31910
[ 916.360151] i8xx chipset flush failed, expected: 85629, cpu_read: 85117
[ 1461.747517] i8xx chipset flush failed, expected: 142082, cpu_read: 141570
[ 2256.590632] i8xx chipset flush failed, expected: 196727, cpu_read: 196215
[ 4106.345442] i8xx chipset flush failed, expected: 267271, cpu_read: 266759
[ 5147.195196] i8xx chipset flush failed, expected: 309181, cpu_read: 308669
[ 6185.589716] i8xx chipset flush failed, expected: 354133, cpu_read: 353621
[ 8005.430094] i8xx chipset flush failed, expected: 437064, cpu_read: 436552
[ 8114.898367] i8xx chipset flush failed, expected: 444113, cpu_read: 443601

no "max retries" line.
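
A quick way to check such warnings against the "exactly 512 less than expected" explanation given earlier in this bug. The regular expression and function name are made up for this sketch; the sample lines are copied from the log above.

```python
import re

FAILED = re.compile(r"i8xx chipset flush failed, expected: (\d+), cpu_read: (\d+)")


def flush_failure_deltas(dmesg_text):
    """Return expected - cpu_read for every failed-flush line found."""
    return [int(e) - int(r) for e, r in FAILED.findall(dmesg_text)]


sample = """\
[ 140.678789] i8xx chipset flush failed, expected: 4642, cpu_read: 4130
[ 382.334636] i8xx chipset flush failed, expected: 32422, cpu_read: 31910
[ 8114.898367] i8xx chipset flush failed, expected: 444113, cpu_read: 443601
"""
print(flush_failure_deltas(sample))  # prints [512, 512, 512]
```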

Xorg 1.7.6
libdrm 2.4.19
mesa 7.7.1

Revision history for this message
Bryce Harrington (bryce) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

We don't need further confirmations from people seeing the issue.

Also, we don't need further reports of how it was worked around; we know
there's several ways of working around it. Unfortunately what works for
one person doesn't for another.

This has been an extremely frustrating bug from a developer perspective,
probably almost as frustrating as it is from a user perspective. It
seems whenever we make a change to fix something for one set of users,
it just causes breakage for some other set. There does not seem to be
any particular combination of knob settings that makes things functional
for *all* users.

So what we've opted to do is turn off KMS for this hardware but pretty
much leave all other settings to defaults. So people for whom this
configuration works will have 3D and all the usual -intel functionality,
just not the boot prettiness. For the set of users that find this is
not a good configuration, we've documented the issue in the release
notes with a link to the various workarounds people have found, here:

  https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes

If anyone discovers additional workarounds, or has ideas on improving
this documentation, please feel welcome to edit this page. It may help
your fellow 8xx users.

It is our hope that there will come to be upstream fixes that are viable
to backport. If we get enough fixes that we feel confident, we *might*
re-enable KMS for 8xx chips on lucid at some point.

There are likely to be a lot of patches flying around. If you wish to
provide them in a PPA, that's cool. Just be mindful that our goal
ultimately is to get a fix into Lucid, and time you can put towards that
goal could help a lot.

The nature of this bug is such that it's really sensitive to
conditions. So you may find a configuration or patch that makes the
issue totally go away on your system and someone else's, but breaks
things on 3 other people's systems with exactly the same hardware. So
getting a patch that fixes it for *everyone* is going to be really
tough.

Revision history for this message
Dimitry Verkholashin (dimitrygv) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

C'mon, you know it is a security bug that relates to Xorg (Intel i915GM),
GDM, Compiz, Emerald and the Ubuntu update manager (every crash is tied to
the update manager, or maybe to the way it uses the network and its choice of
algorithms to verify the integrity sums of packages). Every symptom has been
reproduced and can be reproduced again. I am running on some legacy equipment.
Let's see if it works on UNIX without GDM.

On Wed, Apr 21, 2010 at 2:16 AM, Bryce Harrington <email address hidden> wrote:

> We don't need further confirmations from people seeing the issue.
>
> Also, we don't need further reports of how it was worked around; we know
> there's several ways of working around it. Unfortunately what works for
> one person doesn't for another.
>
> This has been an extremely frustrating bug from a developer perspective,
> probably almost as frustrating as it is from a user perspective. It
> seems whenever we make a change to fix something for one set of users,
> it just causes breakage for some other set. There does not seem to be
> any particular combination of knob settings that makes things functional
> for *all* users.
>
> So what we've opted to do is turn off KMS for this hardware but pretty
> much leave all other settings to defaults. So people for whom this
> configuration works will have 3D and all the usual -intel functionality,
> just not the boot prettiness. For the set of users that find this is
> not a good configuration, we've documented the issue in the release
> notes with a link to the various workarounds people have found, here:
>
> https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes
>
> If anyone discovers additional workarounds, or has ideas on improving
> this documentation, please feel welcome to edit this page. It may help
> your fellow 8xx users.
>
> It is our hope that there will come to be upstream fixes that are viable
> to backport. If we get enough fixes that we feel confident, we *might*
> re-enable KMS for 8xx chips on lucid at some point.
>
> There are likely to be a lot of patches flying around. If you wish to
> provide them in a PPA, that's cool. Just be mindful that our goal
> ultimately is to get a fix into Lucid, and time you can put towards that
> goal could help a lot.
>
> The nature of this bug is such that it's really sensitive to
> conditions. So you may find a configuration or patch that makes the
> issue totally go away on your system and someone else's, but breaks
> things on 3 other people's systems with exactly the same hardware. So
> getting a patch that fixes it for *everyone* is going to be really
> tough.

Revision history for this message
pdecat (pdecat) wrote :

On my legacy hardware (Asus S1300N from 2003) with Lucid fully updated, the graphical login screen won't show without KMS.

Applying option 3 solves it :
echo options i915 modeset=1 | sudo tee /etc/modprobe.d/i915-kms.conf

If I then click on my login entry, Xorg crashes immediately before I can enter my password.
The screen flickers several times then I am informed that Ubuntu works in lower graphics mode.

More details about my hardware :

$ lspci -vs 00:02
00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)
        Subsystem: Holco Enterprise Co, Ltd/Shuttle Computer Device 3141
        Flags: bus master, fast devsel, latency 0, IRQ 28
        Memory at fd000000 (64-bit, non-prefetchable) [size=4M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at ff00 [size=8]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915

00:02.1 Display controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)
        Subsystem: Holco Enterprise Co, Ltd/Shuttle Computer Device 3141
        Flags: fast devsel
        Memory at fda00000 (64-bit, non-prefetchable) [disabled] [size=1M]
        Capabilities: <access denied>

Regards,
Patrick.

Revision history for this message
pdecat (pdecat) wrote :

Oops, wrong lspci output, here's the correct one :

$ lspci -vs 00:02
00:02.0 VGA compatible controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
        Subsystem: ASUSTeK Computer Inc. Device 1712
        Flags: bus master, fast devsel, latency 0, IRQ 5
        Memory at f0000000 (32-bit, prefetchable) [size=128M]
        Memory at ffa80000 (32-bit, non-prefetchable) [size=512K]
        I/O ports at dc00 [size=8]
        Capabilities: [d0] Power Management version 1
        Kernel driver in use: i915
        Kernel modules: i915

00:02.1 Display controller: Intel Corporation 82852/855GM Integrated Graphics Device (rev 02)
        Subsystem: ASUSTeK Computer Inc. Device 1712
        Flags: bus master, fast devsel, latency 0
        Memory at e8000000 (32-bit, prefetchable) [size=128M]
        Memory at ff980000 (32-bit, non-prefetchable) [size=512K]
        Capabilities: [d0] Power Management version 1

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

Created an attachment (id=35211)
dmesg of deadlock with v8 patch

Hi,
with the v8 patch applied atop a drm-intel-next kernel (which is not too recent, around 2.5 weeks old), I got an apparent deadlock some time after resume. Lockdep is actually turned on (according to dmesg), so I really don't know why it says it's off.

Kernel 2.6.34-rc2 from drm-intel-next with v8 patch applied, lockdep enabled.
Xserver 1.7.6
libdrm 2.4.18
intel-drv 2.11

Dunno if it's related to the patch...

Cheers,
   Christian

Revision history for this message
In , Beier-informatik (beier-informatik) wrote :

(From update of attachment 35073)
resolved by updating to intel 2.11.

Revision history for this message
In , legolas558 (legolas558) wrote :

Created an attachment (id=35214)
DRI debugfs after overlay crash

Revision history for this message
In , legolas558 (legolas558) wrote :

Created an attachment (id=35215)
gtt flush failures (not fatal)

I have attached the flush failures found in dmesg (probably not related to the overlay crash) and the DRI debugfs after a crash happening when watching videos.

Looks like the overlay bug is not yet fixed. I had to watch the entire video collection of The Rockets, but finally I got an overlay filled with a nice blue, music still playing but Xorg inevitably dead. I could access a VT and take the dump, but any further attempt to restart Xorg was failing miserably.

The Xorg log was being filled indefinitely with these lines:

(EE) intel(0): Failed to submit batch buffer, expect rendering corruption or even a frozen display: Input/output error.

Nothing else was relevant there.

I think the flush failures are not tied to the overlay crash (I also have them when the system doesn't crash), but anyway I have added them here.

Revision history for this message
Daniel Baumann (dnjl) wrote :

I'm trying to backport Daniel Vetter's patches (http://cgit.freedesktop.org/~danvet/drm/log/?h=stuff/i8xx_cache_coherency_for_oga) to the current Lucid kernel (2.6.32-21.32). It does not seem too complex, as Lucid's kernel already carries the drm of kernel 2.6.33.

The experimental result can be found here:
https://launchpad.net/~dnjl/+archive/experimental/+packages?field.name_filter=linux&field.status_filter=published&field.series_filter=lucid

Changes summary:

linux (2.6.32-21.32+i8xxkmsfix2~dnjl1) lucid; urgency=critical

  [ Daniel Baumann ]
  * Revert "SAUCE: i915 KMS -- support disabling KMS for known broken devices"
    - LP: #563277
  * Revert "SAUCE: i915 KMS -- blacklist i830"
    - LP: #542208, #563277
  * Revert "SAUCE: i915 KMS -- blacklist i845g"
    - LP: #541492, #563277
  * Revert "SAUCE: i915 KMS -- blacklist i855"
    - LP: #511001, #541511, #563277
  * x86, lib: Add wbinvd smp helpers (stolen from 2.6.34)
  * Applied patches done by Daniel Vetter to fix above problems:
    - agp/intel-gtt: fix i85x gtt chipset flush
    - agp/intel-gtt: extract mch buffer flush in i830 chipset flush
    - agp/intel-gtt: check cache-coherency on i830 class chipsets
    - drm/i915: add locking around chipset flush
    - agp/intel-gtt: steal the last gtt page
    - agp/intel-gtt: kill previous_size assignments
    - agp/intel-gtt: kill intel_i830_tlbflush
    - agp/intel: make intel-gtt.c into a real source file
    - agp/intel: split out gmch/gtt probe, part 2
    - agp/intel: split out gmch/gtt probe, part 1
    - drm/intel: kill mutli_gmch_chip
    - agp/intel: uncoditionally reconfigure driver on resume
    - agp/intel: split out the GTT support
    - agp/intel: introduce intel-agp.h header file

On my system (Dell Latitude X300 with 855GM) I'm running this kernel together with the libdrm and Xorg intel driver from xorg-edgers (https://launchpad.net/~xorg-edgers/+archive/ppa). With that I can boot and log in without any kernel or module parameters.
However, desktop effects do not work and it is not stable at all. Crashes do not hard-freeze the whole system, though; only X crashes sporadically, so it is possible to grab the debugging data in sysfs.

Anyone who is interested, please test this. It's not perfect, I know, but it runs, somewhat.

Does anyone have any hints on how to take this further?

Revision history for this message
Daniel Baumann (dnjl) wrote :
tags: added: patch
Revision history for this message
Geir Ove Myhr (gomyhr) wrote :

Daniel Vetter plans to backport the fix. Here is what he writes (comment #121 upstream):

As already said, I hope to send the last patch pile (containing the
real fix) for review in a few days, pending merging of the previous
submissions. If it survives review intact I'll backport just the fix for
.34 and earlier kernels.

So taking testing/release delays on each stage (-next, .34, -stable) into
account, expect a few weeks before this hits a stable kernel near you,
best-case scenario.

Revision history for this message
In , legolas558 (legolas558) wrote :

(In reply to comment #145)
> I think the flush failures are not tied to the overlay crash (I also have them
> when the system doesn't crash), but anyway I have added them here.

Some more information: while watching videos there are occasional glitching overlay frames (psychedelic colors, only 1 frame), sometimes interleaved with a blue fill like the one appearing when it gives up in a total Xorg crash. The blue frames seem to appear more frequently when Xorg is more prone to fatally giving up, but never seen them more than twice before a total Xorg crash.

Shall we split this bug into a new one?

Revision history for this message
In , Indan (indan) wrote :

(In reply to comment #146)
> (In reply to comment #145)
> > I think the flush failures are not tied to the overlay crash (I also have them
> > when the system doesn't crash), but anyway I have added them here.
>
> Some more information: while watching videos there are occasional glitching
> overlay frames (psychedelic colors, only 1 frame), sometimes interleaved with a
> blue fill like the one appearing when it gives up in a total Xorg crash. The
> blue frames seem to appear more frequently when Xorg is more prone to fatally
> giving up, but never seen them more than twice before a total Xorg crash.
>
> Shall we split this bug into a new one?

I think you're hitting the bug I had, which should be fixed by upgrading to
the 2.11 intel driver and libdrm 2.4.20.

Revision history for this message
Chris Halse Rogers (raof) wrote :

Since the DRI disablement patch didn't help as much as was hoped, unconditionally disables 3D, and will make it harder to grab a proper fix from upstream, it has been decided to revert it and instead document the various different workarounds for the Lucid release notes.

The attached debdiff drops the DRI disablement patch.

Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package xserver-xorg-video-intel - 2:2.9.1-3ubuntu5

---------------
xserver-xorg-video-intel (2:2.9.1-3ubuntu5) lucid; urgency=low

  * Drop debian/patches/107_disable_dri_on_845_855.patch:
    + This attempt to work around the crashes on i845 and i855 documented in
      LP: #541492 and LP: #541511 simply shuffled the brokeness around. It
      hasn't helped enough, and it unconditionally disables 3D which worked for
      some users before.
 -- Christopher James Halse Rogers <email address hidden> Mon, 26 Apr 2010 10:14:02 +1000

Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Triaged → Fix Released
Geir Ove Myhr (gomyhr)
Changed in xserver-xorg-video-intel (Ubuntu Lucid):
status: Fix Released → Triaged
Revision history for this message
Steve Langasek (vorlon) wrote :

Chris, what needs to be added to the release notes for this?

Changed in ubuntu-release-notes:
status: New → Incomplete
Revision history for this message
Bryce Harrington (bryce) wrote :

I'd put a general purpose "8xx freezes" note in there a while back, and it looks perfectly well suited for this. It does not itemize the workarounds except for using vesa, but provides a link to a wiki page with workarounds. I think that's probably better since it gives us some flexibility in steering users more accurately as fixes become available.

So I'm closing the release-notes task as good-nuff.

Changed in ubuntu-release-notes:
status: Incomplete → Fix Released
Revision history for this message
In , Indan (indan) wrote :

Created an attachment (id=35329)
GPU hang with newest driver and libdrm. v8 without extra flushing patch.

(In reply to comment #147)
> I think you're hitting the bug I had, which should be fixed by upgrading to
> the 2.11 intel driver and libdrm 2.4.20.

Okay, after a week or two (?) of running v8 without the last extra flushing commit I finally got a hung GPU again. So it seems there is a corner case left for this particular bug somewhere.

Last time I counted I got around 2% failed flushes, but otherwise the system was rock solid. Text corruption was rare too, though I think it did happen the day the GPU hung.

Dump was taken with this script:

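#!/bin/bash
# Poll the i915 "wedged" flag in debugfs and save a debug dump once the GPU hangs.
PATH="/bin:/usr/bin"
# Mounting by mount point alone relies on an fstab entry for /mnt/debug.
mount /mnt/debug
cd /tmp/ || exit 1

while true; do
 # The flag reads 0 while the GPU is not wedged.
 if grep -q 0 /mnt/debug/dri/0/i915_wedged; then
  sleep 1
 else
  # Collect kernel log, X log and the DRI debugfs files into one tarball.
  mkdir dump
  dmesg > dump/dmesg
  cp /var/log/Xorg.0.log dump/
  cp -a /mnt/debug/dri/0/* dump/
  tar czf dump.tgz dump
  rm -rf dump
  mv dump.tgz /home/indan/
  sync
  exit
 fi
done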

Revision history for this message
In , Indan (indan) wrote :

(In reply to comment #134)
> > --- Comment #132 from Indan Zupancic <email address hidden> 2010-04-19 15:26:36 PDT ---
> > What I don't understand is why your patch slows things down so much for me,
> > it seems to do only a few thousand flushes anyway.
>
> Well, worst-case a flush can take 1 ms.

That would explain it yes.

[cut]
> > If the problem is that the flush is needed to avoid the hardware from writing
> > stale data to old gtt mapped physical memory:
> >
> > - If an entry is added, there should be no need for a flush, because the all
> > memory is still valid. If an entry is removed, the gpu can continue to write
> > to those pages. What about copying the content to a new physical page and
> > keeping the original page for a while until the gpu is done with it?
>
> Something similar is already done. Look for scratch_page in intel-gtt.c

But if done properly the need for flushing would go away altogether. Considering it's quite stable here without those extra flushes, perhaps it's easier to fix the corner cases that still need flushing instead of making flushing reliable?

> > > Ok, that's bad. Can you change the following define in
> > > include/drm/intel-gtt.h and see whether you still get failed chipset
> > > flushes?
> > >
> > > -#define I830_CC_CANARY_FLOCK_GTT_PAGES 8
> > > +#define I830_CC_CANARY_FLOCK_GTT_PAGES 16
> > >
> > > The whole stuff make somewhat more sense this way around, anyway.
> >
> > I will try this later, first I'm going to try without your latest commit
> > ("fix i85x gtt chipset flush") to see how it behaves without that stuff,
> > both performance and amount of failed flushes.
>
> If your X40 is anything like mine, you're in for a bad surprise :(

Dmesg is full of backtraces, but other than that it's quite stable.
Performance is good too again.

Next week I should have a bit more time to read the code and do more testing.

> > > Oh, and add some details about your box, please (brand&model + cpu,
> > > mostly, the rest is all in the dmesg, anyway).
> >
> > See my first post: Thinkpad X40, 855GM (rev 02), Pentium M (family 6, model 13,
> > stepping 6: It has clflush).
>
> Thanks, I'm regularly losing my overview with all the different testers on
> this bug ;)

No problem, you're doing great. :-)

Revision history for this message
Tom (tom6) wrote :

1650 Ati card. LiveCd is fine. Installing gets me to login screen fine, no trouble

After login screen it goes black, never seems to reach desktop. Tried pressing F2 and typing "sudo reboot" but nothing happened. Keyboard lights were not working either. I could have stuffed up on the typing tho.

I thought it was because the 1650 is obsolete and unsupported by ati now. Their legacy driver broke my 9.04 system but so did the OpenSource ati driver. Somehow 9.04 seemed to work fine with defaults tho?? lol

10.04 only works in "Failsafe" low graphics mode, but this is set to 1280x1024 rather than my preferred 1024x768. I'm still playing around trying to get the resolution down from low graphics mode!! lol, all fun & games. I have a dual-boot (with 9.04) so my machine is still usable :)

Good luck and regards from
Tom :)

Revision history for this message
Bryce Harrington (bryce) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

On Thu, Apr 29, 2010 at 04:57:49PM -0000, Tom wrote:
> 1650 Ati card.

This bug report is *only* for the i855 intel graphics card.

GPU lockup bugs are basically indistinguishable from a user point of
view, but they're almost always hardware-specific.

Revision history for this message
Filippo neri (fneri23) wrote :

My situation:
System: Thinkpad X40 Intel(R) Pentium(R) M processor 1.40GHz
Memory: 1.5 GiB, Intel 855GM Chipset
Ubuntu Release 10.04(lucid) installed from the beta2 CD and updated to the latest packages
(including lucid-proposed and lucid-backports.)
I have installed the following kernels:
2.6.32-19-generic
2.6.32-21-generic
2.6.32-22-generic
2.6.34-020634rc5-generic
(I upgraded to 2.6.32-22-generic today. )
Results (VERY repeatable!)
2.6.32-19-generic → OK
2.6.32-21-generic → Black screen
2.6.32-22-generic → Black screen
2.6.34-020634rc5-generic → OK
By OK I mean everything seems to work fine, and I can use the “Normal” Visual Effects setting.

filippo@X40:~$ uname -a
Linux X40 2.6.34-020634rc5-generic #020634rc5 SMP Tue Apr 20 10:07:04 UTC 2010 i686 GNU/Linux

filippo@X40:~$ dmesg | grep agp
[ 2.827117] Linux agpgart interface v0.103
[ 2.946581] agpgart-intel 0000:00:00.0: Intel 855GM Chipset
[ 2.947454] agpgart-intel 0000:00:00.0: detected 8060K stolen memory
[ 2.969120] agpgart-intel 0000:00:00.0: AGP aperture is 128M @ 0xe0000000

As far as I am concerned, this is a kernel bug introduced between 2.6.32-19 and 2.6.32-21,
but unfortunately, still present in 2.6.32-22. I have the choice of using 2.6.32-19, but I
will use 2.6.34-020634rc5 for the next few weeks to see if any problem arises.

Revision history for this message
Alban (seza) wrote :

Same here: I'm staying on 2.6.34-rc5, and everything works fine with it.

Revision history for this message
Chris Halse Rogers (raof) wrote : Re: [Ubuntu-x-swat] [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

For those who found that 2.6.32-19 (and -20) worked, but -21 and -22
don't work, it's likely that simply re-enabling KMS as recorded on
https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes will resolve your
problem.
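
Before changing anything, it can help to see what the running kernel was actually booted with. The following is only a sketch: kms_cmdline_state is a hypothetical helper name, and it inspects nothing but the kernel command line, so an option set in /etc/modprobe.d or a built-in blacklist will not show up here.

```shell
#!/bin/bash
# Report how KMS is configured on a kernel command line.
# kms_cmdline_state is a hypothetical helper; it reads /proc/cmdline by
# default, or the file given as $1.
kms_cmdline_state() {
    local cmdline
    cmdline="$(cat "${1:-/proc/cmdline}")"
    # Pad with spaces so each parameter can be matched as a whole word.
    case " $cmdline " in
        *" nomodeset "*)      echo "disabled (nomodeset)" ;;
        *" i915.modeset=0 "*) echo "disabled (i915.modeset=0)" ;;
        *" i915.modeset=1 "*) echo "enabled (i915.modeset=1)" ;;
        *)                    echo "not set on command line" ;;
    esac
}
```

If the answer is "not set on command line", the driver's own default (or a modprobe option) decides, which is the case the wiki page's workarounds address.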

Revision history for this message
Dana Olson (adolson) wrote :

Can't install Ubuntu 10.04 final due to X crashing due to this bug.

A workaround is found in this thread: http://ubuntuforums.org/showthread.php?t=1465883

I was able to install, but now am stuck in 640x480 resolution.

Revision history for this message
Dana Olson (adolson) wrote :

Sorry, I posted my comment #114 before seeing the rest of the comments in the thread. I thought I had seen them all, but apparently not.

I followed some advice (re-enable kms, disable dri and change driver to intel in xorg.conf) from Chris' link in his comment #113, and *so far* I am not having any lockups, and my screen res is back to 1280x1024.
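
For anyone wanting to try the same xorg.conf change, here is a rough sketch of the Device section involved, written out by a small shell function. The function name write_intel_nodri_conf and the Identifier string are arbitrary choices for illustration, and the function overwrites the target file, so back up any existing xorg.conf first.

```shell
#!/bin/bash
# Write a minimal xorg.conf that selects the intel driver with DRI off.
# write_intel_nodri_conf is a hypothetical helper; the Identifier string
# is arbitrary. Pass the target path as $1 (the default,
# /etc/X11/xorg.conf, needs root and is overwritten).
write_intel_nodri_conf() {
    local conf="${1:-/etc/X11/xorg.conf}"
    cat > "$conf" <<'EOF'
Section "Device"
    Identifier "Configured Video Device"
    Driver     "intel"
    Option     "DRI" "false"
EndSection
EOF
}
```

Writing to a scratch path first (for example write_intel_nodri_conf /tmp/xorg.conf) makes it easy to review the result before copying it into place.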

Revision history for this message
Vilius (vilius) wrote :

Just upgraded my ThinkPad R50e from 9.10 to 10.04. Cannot get any further than black screen.

> 1. After power on your PC, press shift (keep press) until see boot loader menu. Choose a recovery mode option.
Nothing happens if I hold Shift. How do I get into the recovery mode?

Revision history for this message
pvanderploeg (pieter-nescio) wrote :

I have a Dell Latitude D400 with the i855 graphics card. The Lucid live CD does not boot: just the word Ubuntu with the 5 red dots, and after a short while a blank screen, the CD spins down and that's it. I reinstalled Karmic, and then upgraded to Lucid today (April 30). The system did not boot anymore and I had to reinstall Karmic. My other laptop (D410) with an i915 graphics card does run the live CD OK, but I will wait with the upgrade.

Revision history for this message
Elena M. Lopez ("Nena") (axolote) wrote :

@pvanderploeg,
I have the same issue with my D400 with the same graphics card. I just tried the workaround listed in posts #33 and #34 in the following thread, and it worked for me (http://ubuntuforums.org/showthread.php?t=1465883&page=4).

Basically, what I did was as follows:

Step 1: When booting into the live CD, I pressed Tab, and added the i915.modeset=1 option to the boot command line.
Refer to thread/post: http://ubuntuforums.org/showpost.php?p=9203466&postcount=33

Step 2: I installed Lucid with no problems, but when restarted, I of course ran into the same boot issue. So, I needed to make the changes in step 1 permanent.

Step 3: I restarted into the live environment again (as in step 1), mounted my local root partition, and as root, did two things in the /etc/ directory.
  a) permanently added "i915.modeset=1" (w/o quotes) to the grub configuration file (/etc/default/grub).
 Refer to thread/post: http://ubuntuforums.org/showpost.php?p=9171045&postcount=11 (as redirected from http://ubuntuforums.org/showpost.php?p=9203466&postcount=34)

  b) created the configuration file /etc/modprobe.d/i915-kms.conf and added the line "options i915 modeset=1" (w/o quotes) according to the first workaround listed on this page:
https://wiki.ubuntu.com/X/Bugs/Lucidi8xxFreezes

And now everything works. Finally. And life is good again. If you have any questions, feel free to PM me through launchpad. Although I am not an expert, I can at least explain what worked for me, although I couldn't tell you why. :)
 ~Elena
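
Step 3 can be sketched as a small shell function run from the live environment. This is an illustration under assumptions, not a tested recipe: enable_i915_kms is a made-up name, /mnt is an assumed mount point for the installed root partition, and the kernel command line takes the dotted form i915.modeset=1 (as in step 1) while the modprobe file uses "options i915 modeset=1".

```shell
#!/bin/bash
# Persist i915.modeset=1 on an installed system mounted at $1.
# enable_i915_kms is a hypothetical helper; /mnt is an assumed mount point.
enable_i915_kms() {
    local root="${1:-/mnt}"
    local grub="$root/etc/default/grub"

    # Step 3b: module option read by modprobe when i915 is loaded
    mkdir -p "$root/etc/modprobe.d"
    echo 'options i915 modeset=1' > "$root/etc/modprobe.d/i915-kms.conf"

    # Step 3a: prepend the boot parameter unless it is already present
    if [ -f "$grub" ] && ! grep -q 'i915\.modeset=1' "$grub"; then
        sed 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&i915.modeset=1 /' "$grub" \
            > "$grub.new" && mv "$grub.new" "$grub"
    fi
}
```

The edit to /etc/default/grub only takes effect once update-grub has been run on the installed system; the modprobe.d file is picked up the next time the i915 module is loaded.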

Revision history for this message
Gustavo (gstv.inc) wrote :

My HP 1130us laptop, 82852/855GM:

Linux 2.6.32-21-generic: does NOT work for me, freezes on the purple screen

Linux 2.6.32-21-generic (recovery mode): does NOT work for me, freezes on the purple screen

Linux 2.6.32-19-generic (recovery mode): works
Linux 2.6.32-19-generic: works
I don't know why, but I still use -19 since -21 does not work.

I do not know much, but I would like to know why something that worked before now has problems. A system should first give importance to basic operation such as video, sound, and booting without crashing. Intel GPUs deserve great attention because they are simpler and less powerful, and PCs with these GPUs are cheaper than those with Nvidia; many people bought these PCs for the low price.
That suggests there are many more low-end PCs with these GPUs out there, and everyone tends to test new systems on their simplest PC before using them on their high-end PCs.
When I heard that Linux was running smoothly on my old PC, it caught my interest, because many people see it as a way to rescue old hardware, run a test server at home, and many other things. I hope that problems like these become rare.

Revision history for this message
Jim Brumbaugh (bleumyst) wrote : Re: [Bug 541511] Re: MASTER: [i855] GPU lockup (apport-crash)

It's escape, not shift. And you only need to do this if your GRUB
automatically goes right into the OS.

If you successfully are at the GRUB menu, select the second line,
"recovery mode" rather than the normal boot and select "failsafe
graphics mode" from the menu which eventually appears.

  http://www.howtogeek.com/howto/ubuntu/show-the-grub-menu-by-default-on-ubuntu/

On Fri, Apr 30, 2010 at 4:50 AM, Vilius <email address hidden> wrote:
> Just upgraded my ThinkPad R50e from 9.10 to 10.04. Cannot get any
> further than black screen.
>
>> 1. After power on your PC, press shift (keep press) until see boot loader menu. Choose a recovery mode option.
> Nothing happens if I hold Shift. How do I get into the recovery mode?

Revision history for this message
dan rhodes (daniel-r-rhodes) wrote :

What a disaster, as predicted. i915.modeset=1 works for me (though it didn't when I first had the problem last week).

Revision history for this message
Dana Olson (adolson) wrote :

As a follow-up, my system ran fine for a little bit with the fixes I mentioned in post #115, but then after a while the screen went black and I couldn't do anything but hold the power button in to power off.

The system wouldn't even reboot with Alt+SysRq - K, S, U, B...

This is not very fun - I am trying to set up Ubuntu for a Windows user, trying to convert her, but if the system randomly crashes like this, the conversion is bound to fail.

Filippo neri (fneri23)
Changed in ubuntu-release-notes:
status: Fix Released → In Progress
status: In Progress → Fix Committed
Revision history for this message
pvanderploeg (pieter-nescio) wrote :

@Elena, I tried your procedure and it worked. Great. Thank you very much.

Revision history for this message
Ciccio (franapoli) wrote :

On my Portege M100 (82852/855GM) the following worked:

1) upgraded the kernel to 2.6.34-020634rc5-generic
2) removed "nomodeset" from the kernel boot parameters

Both steps were necessary.
Thank you everyone for the hints.
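
Step 2 might look like the sketch below if the parameter was set in the GRUB defaults file. That location is an assumption (the parameter could equally have been typed at the boot prompt), and remove_nomodeset is a hypothetical helper name.

```shell
#!/bin/bash
# Strip the "nomodeset" word from a GRUB defaults file.
# remove_nomodeset is a hypothetical helper; pass the file as $1
# (default /etc/default/grub, which needs root to modify).
remove_nomodeset() {
    local grub="${1:-/etc/default/grub}"
    # Handle the word after a space, at the start of the value, or alone.
    sed -e '/^GRUB_CMDLINE_LINUX/s/ nomodeset\([ "]\)/\1/' \
        -e '/^GRUB_CMDLINE_LINUX/s/"nomodeset /"/' \
        -e '/^GRUB_CMDLINE_LINUX/s/"nomodeset"/""/' \
        "$grub" > "$grub.new" && mv "$grub.new" "$grub"
}
```

As with any change to /etc/default/grub, update-grub has to be run afterwards for the new command line to be used at the next boot.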

Revision history for this message
N8N (njnagel) wrote :

I can confirm that this bug still exists on the final release version of 10.04 (upgraded on the 29th from 9.10). A previously functioning Gateway 7330GZ laptop with 82852/855GM graphics suddenly would go blank and freeze after the GRUB screen. I could boot into safe mode. The options to reconfigure graphics and troubleshoot the problem in the safe mode menu do not do anything. Removing and reinstalling xserver-xorg and xserver-xorg-video-intel (using the default versions in Synaptic) did not change anything. I have temporarily copied the contents of xorg.conf.failsafe to a new xorg.conf file (none previously existed?) until this can be resolved. Update Manager shows no updates available as I type this.