(Needs a 3.10.5 kernel) saucy/raring has frequent image corruption (intel, sna)

Bug #1189850 reported by Sebastien Bacher on 2013-06-11
This bug affects 8 people
Affects                              Status         Importance   Assigned to   Milestone
xf86-video-intel                     Fix Released   Medium
xserver-xorg-video-intel (Ubuntu)                   High         Unassigned

Bug Description

It seems similar to bug #1144558, which was supposed to be fixed. I was never able to trigger the issue easily in raring, but in saucy it's quite easy to see in Firefox's tab summary or in Chromium's URL bar.

I'm attaching a screenshot showing the issue.

ProblemType: Bug
DistroRelease: Ubuntu 13.10
Package: xserver-xorg-video-intel 2:2.21.9-0ubuntu1
ProcVersionSignature: Ubuntu 3.9.0-4.9-generic 3.9.4
Uname: Linux 3.9.0-4-generic i686
.tmp.unity.support.test.0:

ApportVersion: 2.10.2-0ubuntu1
Architecture: i386
CompizPlugins: [core,composite,opengl,compiztoolbox,decor,snap,gnomecompat,mousepoll,place,session,resize,move,wall,grid,imgpng,vpswitch,unitymtgrabhandles,regex,animation,expo,fade,workarounds,scale,ezoom,unityshell]
CompositorRunning: compiz
CompositorUnredirectDriverBlacklist: '(nouveau|Intel).*Mesa 8.0'
CompositorUnredirectFSW: true
Date: Tue Jun 11 12:20:23 2013
DistUpgraded: Fresh install
DistroCodename: saucy
DistroVariant: ubuntu
EcryptfsInUse: Yes
ExtraDebuggingInterest: No
GraphicsCard:
 Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02) (prog-if 00 [VGA controller])
   Subsystem: Dell Device [1028:040a]
InstallationDate: Installed on 2010-10-09 (975 days ago)
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release i386 (20101007)
MachineType: Dell Inc. Latitude E6410
MarkForUpload: True
PccardctlIdent:
 Socket 0:
   no product info available
PccardctlStatus:
 Socket 0:
   no card
PlymouthDebug: Error: [Errno 13] Permission denied: '/var/log/plymouth-debug.log'
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.9.0-4-generic root=UUID=555ebc11-d747-44d3-af56-5e7d17851ce3 ro quiet splash vt.handoff=7
SourcePackage: xserver-xorg-video-intel
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 11/30/2011
dmi.bios.vendor: Dell Inc.
dmi.bios.version: A11
dmi.board.name: 0HNGW4
dmi.board.vendor: Dell Inc.
dmi.chassis.type: 9
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvrA11:bd11/30/2011:svnDellInc.:pnLatitudeE6410:pvr0001:rvnDellInc.:rn0HNGW4:rvr:cvnDellInc.:ct9:cvr:
dmi.product.name: Latitude E6410
dmi.product.version: 0001
dmi.sys.vendor: Dell Inc.
version.compiz: compiz 1:0.9.9~daily13.04.18.1~13.04-0ubuntu1
version.libdrm2: libdrm2 2.4.45-2ubuntu1
version.libgl1-mesa-dri: libgl1-mesa-dri 9.1.3-0ubuntu2
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 9.1.3-0ubuntu2
version.xserver-xorg-core: xserver-xorg-core 2:1.13.3-0ubuntu10
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.7.3-0ubuntu2b2
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:7.1.0-0ubuntu2
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.21.9-0ubuntu1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.7-0ubuntu1

Created attachment 75706
lspci -vvv

Hello.

Since I upgraded from version 2.20.18 of the intel driver, page previews in Firefox have been rendered improperly (see attached screenshot). The intel driver versions tested are 2.20.{18,19} and 2.21.{0,2,3}; the Firefox versions are 17.0-19.0. I don't think it is a Firefox issue, since it is completely gone after downgrading back to 2.20.18.

My system is Gentoo amd64, currently with the latest Firefox and intel driver. My current kernel version is 3.8 and it is vanilla. I am using SNA acceleration.
If there is any additional info that would be helpful, I am ready to provide it.

Created attachment 75707
glxinfo -l -t

Created attachment 75708
Screenshot with example of corrupted rendering

I still haven't been able to reproduce this one yet. Do you have a foolproof (and remember just how big a fool I am!) recipe?

This issue happens occasionally, but I don't have a 100% reproducible way to show it. One of the most successful ways to reproduce it is:

1. fill all the `speed dial` buttons (previews on about:newtab) in Firefox with something reasonably heavy, not plain-text pages (on my machine it is a couple of youtube pages, the web interface to SAGE, a couple of redmines, etc.)
2. close all tabs except one; this last tab should be the about:newtab page
3. middle-click all the previews one by one as fast as you can, so the pages begin to load in the background
4. now hit Ctrl+W until you close everything, including the about:newtab page where you started. You shouldn't wait until all the pages you opened in step 3 have loaded.
5. now open about:newtab again, and there is a good chance that some of the previews will be corrupted. Sometimes there is no corruption, but a preview is displayed in the wrong position, for example two different sites share the same preview image.

Another way to reproduce:

1. make at least one `speed dial` button (preview on about:newtab) in Firefox filled with any kind of preview, just any site you want
2. close all tabs except one and this last one tab should be about:newtab page
3. go to http://www.dreamworksanimation.com/ and add it to bookmarks, then close the tab (sorry, bookmarking is the only way I know to make a specific site show up in previews)
4. open about:newtab again and remove any preview image from it by pressing [X]
5. open bookmarks and drag the dreamworksanimation bookmark you made in step 3 into the slot freed in step 4
6. now visit http://www.dreamworksanimation.com/ so that Firefox generates a preview
7. close the tab and open about:newtab again. The preview for dreamworksanimation should be corrupted

Sorry if the descriptions are a bit messy. Also I don't have any other issues with firefox sites rendering, just issues with rendering previews. I wish there was an easier way to reproduce it.

I did git bisecting between 2.20.18 and 2.20.19 and the result is this commit:

dc643ef753bcfb69685f1eb10828d0c8f830c30e is the first bad commit
commit dc643ef753bcfb69685f1eb10828d0c8f830c30e
Author: Chris Wilson <email address hidden>
Date: Thu Jan 17 12:27:55 2013 +0000

    sna: Apply read-only synchronization hints for move-to-cpu

    Signed-off-by: Chris Wilson <email address hidden>

:040000 040000 0f53950ba9a9756a39722f12c322c2d629c1a2a4 d5ff0a7307cc718ee94c78ee2fb1c9bf6158ed91 M src

As this bug is not 100% reproducible, it could have slipped past me during some bisect runs; however, it is something to start with. What do you think? Could this commit lead to the rendering problems I have?
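Because the corruption does not show up on every attempt, a bisect step can wrongly mark a bad commit as good. A rough way to size that risk is sketched below. It assumes each attempt triggers the bug independently with probability `p_repro` on a bad build; both the independence and any value of `p_repro` are assumptions for illustration, not measurements from this report, and the function names are hypothetical.

```python
import math


def clean_runs_needed(p_repro: float, max_false_good: float) -> int:
    """Consecutive clean attempts needed before marking a commit 'good',
    so that a truly bad commit slips through with probability at most
    max_false_good.

    Assumes each attempt reproduces the bug independently with probability
    p_repro on a bad build (a hypothetical model, not measured here).
    """
    if not 0.0 < p_repro < 1.0:
        raise ValueError("p_repro must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    # A bad commit survives n attempts with probability (1 - p_repro) ** n.
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_repro))


def false_good_probability(p_repro: float, attempts: int) -> float:
    """Chance that a bad commit shows no corruption in `attempts` tries."""
    return (1.0 - p_repro) ** attempts
```

For example, if the recipe triggers the corruption about half of the time, `clean_runs_needed(0.5, 0.05)` says five clean attempts are needed per bisect step to keep the chance of a wrong "good" verdict under 5%.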

There was a related bug, fixed with

commit 19bd005056a2083de64753681b96716996e4237d
Author: Chris Wilson <email address hidden>
Date: Fri Feb 22 12:05:04 2013 +0000

    sna: Avoid migrating and making the GPU bo busy prior to mmapping it

    References: https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1131134
    Signed-off-by: Chris Wilson <email address hidden>

that I thought was already in 2.21.3 and so you had tested it. It is actually in master, so can you try compiling from git and checking if that fixes the issue?

I'll admit I cannot fully explain how that prevented the corruption, as the damage should have been migrated and then the kernel should have stalled upon the read... But it did have an effect and prevented a similar issue that bisected to the same commit.

(In reply to comment #6)
> It is
> actually in master, so can you try compiling from git and checking if that
> fixes the issue?

I've just tested master and the issue is still there.

Created attachment 75871
xf86-video-intel-2.21.3-revert-dc643ef753bcfb69685f1eb10828d0c8f830c30e.patch

With this patch applied on top of xf86-video-intel-2.21.3 the problem is gone (at least I tried hard to reproduce it, but failed). This patch simply reverts commit dc643ef753bcfb69685f1eb10828d0c8f830c30e mentioned above.

Can you try converting each of those kgem_bo_sync__cpu_full() back to kgem_bo_sync__cpu() individually and see if we can narrow it down to one particular path?

Created attachment 75892
Force CPU synchronisation after writes

Another test to try.

(In reply to comment #11)
> Created attachment 75892
> Force CPU synchronisation after writes
>
> Another test to try.

With this patch applied on top of 2.21.3 the problem seems to be fixed.

Created attachment 75920
kgem_bo_sync__cpu_full-revert-bad.patch

(In reply to comment #10)
> Can you try converting each of those kgem_bo_sync__cpu_full() back to
> kgem_bo_sync__cpu() individually and see if we can narrow it down to one
> particular path?

With this patch on top of 2.21.3 I hit the bug almost immediately. In this case I left the first kgem_bo_sync__cpu_full() as is and converted only the second one.

Created attachment 75921
kgem_bo_sync__cpu_full-revert-good.patch

(In reply to comment #10)
> Can you try converting each of those kgem_bo_sync__cpu_full() back to
> kgem_bo_sync__cpu() individually and see if we can narrow it down to one
> particular path?

With this patch on top of 2.21.3 I was no longer able to reproduce the bug. In this case I converted the first kgem_bo_sync__cpu_full() and left the second one as is.

I've looked through all callers to see if I can find one that missed the MOVE_WRITE, to no avail. I've double-checked the kernel to see if there is a loophole, again to no avail. So I'm a little bit lost as to where the missed synchronisation is coming from, and I haven't yet thought of a good test to force/catch an error.

In the meantime, I've applied one minor tweak to xf86-video-intel.git,

commit 60ec35b8d25ecfabf1744ea7bc81109d7f2a90e2
Author: Chris Wilson <email address hidden>
Date: Tue Mar 5 11:14:37 2013 +0000

    sna: Be explicit when checking for an idle bo after CPU synchronisation

Do you mind giving that a quick test?

Also one other test is to try with the drm-intel-next kernel.

(In reply to comment #15)
> I've looked through all callers to see if I can find one that missed the
> MOVE_WRITE to no avail. I've double checked the kernel to see if there is a
> loop hole, again to no avail. So I'm a little bit lost to see where the
> missed synchronisation is coming from, and I haven't yet thought of a good
> test to force/catch an error.
>
> In the meantime, I've applied one minor tweak to xf86-video-intel.git,
>
> commit 60ec35b8d25ecfabf1744ea7bc81109d7f2a90e2
> Author: Chris Wilson <email address hidden>
> Date: Tue Mar 5 11:14:37 2013 +0000
>
> sna: Be explicit when checking for an idle bo after CPU synchronisation
>
> Do you mind giving that a quick test?

OK, I'll test it later today

(In reply to comment #16)
> Also one other test is to try with the drm-intel-next kernel.

Could you please give me a quick link to their git repo?
Would 3.9-rc1 be enough?

Created attachment 76040
Disable read-read optimisations

And one last request, can you please test this patch as a temporary solution?

(In reply to comment #20)
> Created attachment 76040
> Disable read-read optimisations
>
> And one last request, can you please test that this patch as a temporary
> solution?

This patch also fixes the issue. It was tested on the 3.7.10 kernel, as were all the previous patches. Now I'm going to try drm-intel-next.

(In reply to comment #21)
> (In reply to comment #20)
> > Created attachment 76040
> > Disable read-read optimisations
> >
> > And one last request, can you please test that this patch as a temporary
> > solution?
>
> This patch also fixes the issue. It was tested on 3.7.10 kernel as well as
> all previous patches. Now gonna try with drm-intel-next.

Thanks. In the meantime, I'm going to push the temporary workaround - obviously I still hope to find the real bug.

(In reply to comment #16)
> Also one other test is to try with the drm-intel-next kernel.

Ok, just tried out today's drm-intel-next kernel and was unable to reproduce this bug anymore. This sounds like good news.

(In reply to comment #23)
> (In reply to comment #16)
> > Also one other test is to try with the drm-intel-next kernel.
>
> Ok, just tried out today's drm-intel-next kernel and was unable to reproduce
> this bug anymore. This sounds like good news.

Oh, wait, I forgot to rebuild xf86-video-intel without the patch. Sorry. Will try vanilla now.

/o\ Can you confirm that result with vanilla xf86-video-intel?

(In reply to comment #25)
> /o\ Can you confirm that result with vanilla xf86-video-intel?

Sorry to disappoint you, but the issue is reproducible with vanilla xf86-video-intel and drm-intel-next.

(In reply to comment #22)
> Thanks. In the meantime, I'm going to push the temporary workaround -
> obviously I still hope to find the real bug.

Is there a way I can help? Attach some debug info or test something?

(In reply to comment #27)
> (In reply to comment #22)
> > Thanks. In the meantime, I'm going to push the temporary workaround -
> > obviously I still hope to find the real bug.
>
> Is there a way I can help? Attach some debug info or test something?

If you change the define in src/sna/sna_accel.c:

diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
index ae6d3c1..5edad51 100644
--- a/src/sna/sna_accel.c
+++ b/src/sna/sna_accel.c
@@ -57,7 +57,7 @@
 #define FORCE_INPLACE 0
 #define FORCE_FALLBACK 0
 #define FORCE_FLUSH 0
-#define FORCE_FULL_SYNC 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=61628 */
+#define FORCE_FULL_SYNC 0

 #define DEFAULT_TILING I915_TILING_X

that restores the buggy behaviour. Could you keep running with that patch and with --enable-debug, to check whether any assertions are triggered, and see how things progress?
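As a rough mental model of what FORCE_FULL_SYNC toggles, here is a toy sketch. It is not the driver's code: `ToyBuffer`, `sync_for_cpu` and `cpu_write_is_safe` are hypothetical names, and the semantics are only inferred from the commit titles quoted in this thread ("read-only synchronization hints", "read-read optimisations").

```python
from dataclasses import dataclass


@dataclass
class ToyBuffer:
    """Toy stand-in for a GPU buffer object; not the driver's kgem_bo."""
    gpu_reads_pending: int = 0   # GPU jobs still reading the buffer
    gpu_writes_pending: int = 0  # GPU jobs still writing the buffer


def sync_for_cpu(bo: ToyBuffer, full: bool) -> None:
    """Wait for the GPU before the CPU touches the buffer.

    full=True  : wait for every pending GPU access (reads and writes).
    full=False : read-only hint, wait for pending GPU writes only, so the
                 GPU may keep reading while the CPU reads too.
    """
    bo.gpu_writes_pending = 0
    if full:
        bo.gpu_reads_pending = 0


def cpu_write_is_safe(bo: ToyBuffer) -> bool:
    """A CPU write is only safe once no GPU job still uses the buffer."""
    return bo.gpu_reads_pending == 0 and bo.gpu_writes_pending == 0
```

In this model the read-only path is correct as long as the CPU really only reads. If some path writes after a read-only sync while a GPU read is still in flight, `cpu_write_is_safe` returns False, which is the kind of missed synchronisation being hunted here. Forcing the full sync everywhere hides such a path at the cost of extra stalls.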

(In reply to comment #28)
> If you can keep running with that patch
> and with --enable-debug to check if any assertions are triggered and see how
> things progress.

OK, I did what you said, powered on and started watching Xorg.0.log.

The first thing I did was to open Firefox and trigger this issue several times - no output.
Then I tried to simulate a typical workflow, i.e. I opened the programs I use on a daily basis and did some things in them, like checking mail and browsing a couple of webpages - still no output.
Then I decided to close them and return to Firefox, triggered this issue several more times, opened a couple of heavy tabs with flash, and suddenly caught this:

(EE) [mi] EQ overflowing. Additional events will be discarded until existing events are processed.
(EE)
(EE) Backtrace:
(EE) 0: /usr/bin/X (xorg_backtrace+0x34) [0x5969b4]
(EE) 1: /usr/bin/X (mieqEnqueue+0x263) [0x5776c3]
(EE) 2: /usr/bin/X (0x400000+0x4fcd4) [0x44fcd4]
(EE) 3: /usr/lib64/xorg/modules/input/evdev_drv.so (0x7f236e1d0000+0x6208) [0x7f236e1d6208]
(EE) 4: /usr/bin/X (0x400000+0x7a477) [0x47a477]
(EE) 5: /usr/bin/X (0x400000+0xa5527) [0x4a5527]
(EE) 6: /lib64/libpthread.so.0 (0x3a9c400000+0x10bf0) [0x3a9c410bf0]
(EE) 7: /lib64/libc.so.6 (ioctl+0x7) [0x3a9bce3437]
(EE) 8: /usr/lib64/libdrm.so.2 (drmIoctl+0x28) [0x3fd3c040d8]
(EE) 9: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f236fd9c000+0x1c1a0) [0x7f236fdb81a0]
(EE) 10: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f236fd9c000+0x1d9f7) [0x7f236fdb99f7]
(EE) 11: /usr/lib64/xorg/modules/drivers/intel_drv.so (0x7f236fd9c000+0x4fe3a) [0x7f236fdebe3a]
(EE) 12: /usr/bin/X (BlockHandler+0x44) [0x43f224]
(EE) 13: /usr/bin/X (WaitForSomething+0x11d) [0x593e7d]
(EE) 14: /usr/bin/X (0x400000+0x3ade2) [0x43ade2]
(EE) 15: /usr/bin/X (0x400000+0x29b5a) [0x429b5a]
(EE) 16: /lib64/libc.so.6 (__libc_start_main+0xed) [0x3a9bc2460d]
(EE) 17: /usr/bin/X (0x400000+0x29eb1) [0x429eb1]
(EE)
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause. It is a victim.
[ 8739.251] [mi] Increasing EQ size to 512 to prevent dropped events.
[ 8739.251] [mi] EQ processing has resumed after 64 dropped events.
[ 8739.251] [mi] This may be caused my a misbehaving driver monopolizing the server's resources.

After that I tried to reproduce this trace by opening the same tabs and triggering the issue again and again, but without any luck. Is this stack trace useful in any way?

Hmm, I expect dmesg to contain a GPU hang and /sys/kernel/debug/0/i915_error_state to be populated, mind attaching it?

(In reply to comment #30)
> Hmm, I expect dmesg to contain a GPU hang and
> /sys/kernel/debug/0/i915_error_state to be populated, mind attaching it?

Too bad I turned off my machine later after I've caught that stack trace, so I can't give you the dump of i915_error_state, but I was checking both dmesg and xsession-errors and there was nothing unusual and no signs of error output from i915.

I'll try to catch it again and if I do I'll attach dmesg and dump of i915_error_state here.

*** Bug 61610 has been marked as a duplicate of this bug. ***

(In reply to comment #28)
> If you change the define in src/sna/sna_accel.c:
>
> diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
> index ae6d3c1..5edad51 100644
> --- a/src/sna/sna_accel.c
> +++ b/src/sna/sna_accel.c
> @@ -57,7 +57,7 @@
> #define FORCE_INPLACE 0
> #define FORCE_FALLBACK 0
> #define FORCE_FLUSH 0
> -#define FORCE_FULL_SYNC 1 /*
> https://bugs.freedesktop.org/show_bug.cgi?id=61628 */
> +#define FORCE_FULL_SYNC 0
>
> #define DEFAULT_TILING I915_TILING_X
>
> that restores the buggy behaviour. If you can keep running with that patch
> and with --enable-debug to check if any assertions are triggered and see how
> things progress.

I've been running this way ever since you've asked me, but that stack trace was the only one I was able to trigger, though improper rendering happened a lot.
I am positive that when I caught that trace there were no errors in dmesg.

Now 2.21.4 is out and I will continue trying to catch something, though since it happens only in firefox, maybe there is an issue somewhere else?
What versions of firefox, cairo and gtk do you have?

Also I've noticed this message in .xsession-errors whenever I move previews in Firefox:

(firefox:3574): GdkPixbuf-CRITICAL **: gdk_pixbuf_new: assertion `width > 0' failed

This happens both with FORCE_FULL_SYNC 0 and 1.

I've been primarily using iceweasel (based on ff10) with the system cairo as that is many times faster for gfx. But I've also been using the bloated ff from ubuntu and fedora on different systems (and they use the ancient cairo embedded into firefox). There are a lot of differences in cairo between those versions, so it would not surprise me if it was a bug specific to an older cairo. But I've hoped to have seen it by now as well. :|

I've just tested the binary Firefox builds from their site. I tried the latest versions of the 16, 17, 18 and 19 branches and was able to trigger the issue in all of them.

Will play with cairo versions now, my current cairo is 1.10.2 with some distro patches on top.

Just note well that all firefox post version-10 use their builtin version of cairo. In order to use system cairo, firefox needs a patch to remove its reliance upon non-upstreamed API.

Tested firefox-19.0.2 with all available versions of cairo from repos: 1.10.2, 1.12.8, 1.12.10, 1.12.12. Issue is reproducible with all versions.

(In reply to comment #36)
> Just note well that all firefox post version-10 use their builtin version of
> cairo. In order to use system cairo, firefox needs a patch to remove its
> reliance upon non-upstreamed API.

Thanks for info, though I am using Gentoo and use Firefox built from sources on my machine and it is distro-patched to link against system-wide cairo so it's fine.

Hmmm, that's news to me. Do you have a link to the patches they apply against firefox?

Or a simple test is something like: http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/ which should be CPU bound in Xorg and not firefox.

Also, I've noticed that the "disable read-read optimisations" patch does practically the same thing as converting kgem_bo_sync__cpu_full back to kgem_bo_sync__cpu (I may be wrong here, but it looks that way to me). I will not question this, as you are the developer and know best, but as the tests have shown, only one particular branch of kgem_bo_sync__cpu_full triggers this issue; see kgem_bo_sync__cpu_full-revert-bad.patch. Maybe you could add some asserts in that branch? I will apply them and give you some more info.

(In reply to comment #38)
> Hmmm, that's news to me. Do you have a link to the patches they apply
> against firefox?

http://mirror.yandex.ru/gentoo-distfiles/distfiles/firefox-19.0-patches-0.3.tar.xz

> Or a simple test is something like:
> http://ie.microsoft.com/testdrive/Performance/ParticleAcceleration/ which
> should be CPU bound in Xorg and not firefox.

Well, I've visited this link and see some spherical thingy made of particles. What should I check?

(In reply to comment #40)
> Well, I've visited this link and see some spherical thingy made of
> particles. What should I check?

Just look at top; For this particular benchmark, it should be ratelimited by the Xorg process not firefox - or better look at sudo perf top, if firefox is hitting pixman functions, it is a bad firefox.

Seems like gentoo has the right patch though, it should be fine. Now if only the other distros also used that patch :(

(In reply to comment #42)
> Seems like gentoo has the right patch though, it should be fine. Now if only
> the other distros also used that patch :(

So, should I check top or not? I am a bit confused about what exactly "ratelimited by the Xorg process not firefox" means. I am building perf right now, though.

(In reply to comment #41)
> (In reply to comment #40)
> > Well, I've visited this link and see some spherical thingy made of
> > particles. What should I check?
>
> Just look at top; For this particular benchmark, it should be ratelimited by
> the Xorg process not firefox - or better look at sudo perf top, if firefox
> is hitting pixman functions, it is a bad firefox.

When running this demo in firefox `# perf top` says "42% libpixman-1.so.0.29.2" and this line sits on top of the list. Does that mean bad firefox? :(

Only if that pixman time is inside firefox and not Xorg... Have gentoo also disabled server-side gradients in cairo?

(In reply to comment #45)
> Only if that pixman time is inside firefox and not Xorg...

I am not familiar with this tool. How do I check this?

> Have gentoo also
> disabled server-side gradients in cairo?

Yes, part of changelog:

10 Sep 2010; Samuli Suominen <email address hidden>
+cairo-1.10.0-r2.ebuild, +files/cairo-1.10.0-buggy_gradients.patch:
Do not use server-side gradients. It hurts performance, and causes bad
rendering on at least nvidia. Bug 336696.

And this patch is still applied on top of the cairo version I am running now, though the maintainers added an option to disable it in the latest version in the tree. It is enabled by default, though, so I also tested this version with gradients disabled. Should I check without it?

Yeah, that gradient patch dramatically hurts performance on Nvidia and Intel systems, whilst having little impact on EXA systems. Kill that patch with fire.

(In reply to comment #48)
> Yeah, that gradient patch dramatically hurts performance on Nvidia and Intel
> systems, whilst having little impact on EXA systems. Kill that patch with
> fire.

Tested without this patch, but the issue is still present.

What do you think about comment #39? And how can I check if pixman time shown in `perf top` belongs to Xorg or Firefox? (see comment #46)

If you have the ncurses gui, the second column shows you the "comm" i.e. the process name. Similarly in the perf report.
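The attribution being asked for amounts to splitting the samples that land in one library by the process they were taken in. A minimal sketch of that bookkeeping follows; `share_by_process` is a hypothetical helper, and the (comm, dso) pairs are assumed to have been extracted from the profiler by some other means.

```python
from collections import Counter
from typing import Iterable, Tuple


def share_by_process(samples: Iterable[Tuple[str, str]], dso: str) -> dict:
    """Split the samples that landed in `dso` by process name (comm).

    `samples` is an iterable of (comm, dso) pairs, one per profiler sample.
    Returns {comm: fraction of that dso's samples}.
    """
    counts = Counter(comm for comm, sample_dso in samples if sample_dso == dso)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {comm: n / total for comm, n in counts.items()}
```

If most of the pixman samples carry the firefox comm, the rendering is happening client-side in the browser; if they carry the X server's comm, it is happening in Xorg.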

I'm trying to install gentoo to see if that helps (the prospect of a modern ff using system cairo is very appealing).

(In reply to comment #51)
> If you have the ncurses gui, the second column shows you the "comm" i.e. the
> process name. Similarly in the perf report.

Oh, finally, I was able to get it. Yes, that pixman rendering belongs to the Firefox process, not Xorg. There is somehow no "comm" column in my perf-top, but the ncurses gui allows zooming into threads, and that's the solution.

> I'm trying to install gentoo to see if that helps (the prospect of a modern
> ff using system cairo is very appealing).

That's nice to hear :) We have a handbook which covers most aspects of installation, but if you get stuck somewhere feel free to send me an e-mail, I'll be glad to help you.

Reading http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/firefox-19.0.2.ebuild?view=markup it seems that the use of system-cairo has been dropped. Which is a shame.

On the positive news though the latest unstable cairo has dropped the buggy gradients patch (unless legacy-drivers is set).

(In reply to comment #53)
> Reading
> http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/
> firefox-19.0.2.ebuild?view=markup it seems that the use of system-cairo has
> been dropped. Which is a shame.

Well, you've seen the patches applied on top of firefox and support for system cairo is there. Out of curiosity I've run some initial steps of firefox build and here's a bit filtered result:

grep cairo /var/tmp/portage/www-client/firefox-19.0.2/temp/build.log
 * 6009_fix_system_cairo_support.patch ...
    --enable-system-cairo system_libs
    --enable-default-toolkit=cairo-gtk2 mozilla.org default
  --enable-system-cairo
  --enable-default-toolkit=cairo-gtk2
checking for cairo >= 1.10... yes
checking CAIRO_CFLAGS... -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libdrm -I/usr/include/libpng15
checking CAIRO_LIBS... -lcairo
checking for cairo-tee >= 1.10... yes
checking CAIRO_TEE_CFLAGS... -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libdrm -I/usr/include/libpng15
checking CAIRO_TEE_LIBS... -lcairo
checking for cairo-xlib-xrender >= 1.10... yes
checking CAIRO_XRENDER_CFLAGS... -I/usr/include/cairo -I/usr/include/glib-2.0 -I/usr/lib64/glib-2.0/include -I/usr/include/pixman-1 -I/usr/include/freetype2 -I/usr/include/libdrm -I/usr/include/libpng15
checking CAIRO_XRENDER_LIBS... -lcairo -lXrender -lX11

and this is output from already built firefox I am running now:

ldd /usr/lib/firefox/libxul.so | grep cairo
        libcairo.so.2 => /usr/lib64/libcairo.so.2 (0x00007f205d497000)
        libpangocairo-1.0.so.0 => /usr/lib64/libpangocairo-1.0.so.0 (0x00007f2059cc2000)

So system-wide cairo is enabled at build time, and it is really there, as shown by ldd.
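The ldd check above can be scripted. The sketch below assumes the usual `name => path (address)` line shape seen in the output quoted above; `resolved_path` is a hypothetical helper name.

```python
from typing import Optional


def resolved_path(ldd_output: str, soname: str) -> Optional[str]:
    """Return the path that `soname` resolves to in `ldd` output, or None."""
    for line in ldd_output.splitlines():
        parts = line.split()
        # Expected shape: "<soname> => <path> (<load address>)"
        if len(parts) >= 3 and parts[0] == soname and parts[1] == "=>":
            return parts[2]
    return None
```

Note that ldd also lists indirect dependencies, so this confirms where libcairo.so.2 resolves rather than which library pulls it in; the configure log is the stronger evidence that the build used --enable-system-cairo.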

(In reply to comment #53)
> Reading
> http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/
> firefox-19.0.2.ebuild?view=markup it seems that the use of system-cairo has
> been dropped. Which is a shame.

You do not see anything like "we're enabling system cairo here ..." directly in the ebuild because it is done inside mozcoreconf-2.eclass, which is inherited by mozconfig-3.eclass, which is in turn inherited by the firefox ebuild. Inheriting an eclass can be thought of as a pretty close equivalent of using the #include directive in C.

(In reply to comment #53)
> Reading
> http://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo-x86/www-client/firefox/
> firefox-19.0.2.ebuild?view=markup it seems that the use of system-cairo has
> been dropped. Which is a shame.

And the last one, you can find sources of eclasses in your $PORTDIR/eclass dir which is most probably /usr/portage/eclass.

P.S. sorry for a burst of comments.

Ok, I have ff-19 built at last using gentoo ~amd64 on a lowly ilk. It seems to be doing the right thing regarding using system-cairo and server-side gradients. Next step is to piece together enough components to see if I can reproduce the bug.

(In reply to comment #57)
> Ok, I have ff-19 built at last using gentoo ~amd64 on a lowly ilk. It seems
> to be doing the right thing regarding using system-cairo and server-side
> gradients. Next step is to piece together enough components to see if I can
> reproduce the bug.

Ok, tell me what info I should provide and I'll post it.

As a first step, my firefox and xf86-video-intel USE-flags are:
x11-drivers/xf86-video-intel-2.21.4 was built with the following:
USE="dri sna udev xvmc -glamor -uxa"

www-client/firefox-19.0.2 was built with the following:
USE="alsa dbus gstreamer jit libnotify minimal (multilib) pgo system-jpeg wifi -bindist -custom-cflags -custom-optimization -debug (-selinux) -startup-notification -system-sqlite" ABI_X86="64" LINGUAS="ru -af -ak -ar -as -ast -be -bg -bn_BD -bn_IN -br -bs -ca -cs -csb -cy -da -de -el -en_GB -en_ZA -eo -es_AR -es_CL -es_ES -es_MX -et -eu -fa -fi -fr -fy_NL -ga_IE -gd -gl -gu_IN -he -hi_IN -hr -hu -hy_AM -id -is -it -ja -kk -km -kn -ko -ku -lg -lt -lv -mai -mk -ml -mr -nb_NO -nl -nn_NO -nso -or -pa_IN -pl -pt_BR -pt_PT -rm -ro -si -sk -sl -son -sq -sr -sv_SE -ta -ta_LK -te -th -tr -uk -vi -zh_CN -zh_TW -zu"
CFLAGS="-march=core2 -mtune=generic -pipe -mno-avx"
CXXFLAGS="-march=core2 -mtune=generic -pipe -mno-avx"

I was able to reproduce that stack trace from the Xorg log, and the intel driver is not at fault here at all.

I found out that the cause is a fast-spinning mouse wheel. I have a mouse with a wheel that can be scrolled in a 'free roam' mode, without those 'clicks', you know. If I scroll too fast, that stack trace appears. As before, dmesg is clean of any i915 errors and no error state was caught. So that stack trace is not related to the bug at all.

I'm still using the optimized flushes on all of my machines and have yet to encounter corruption. :|

Well, I am still experiencing this issue even with latest intel driver :(

Are you running Gentoo now? What is your setup? Could you please give me the output of `emerge --info firefox` and `emerge --info xf86-video-intel`?

I haven't tried Firefox 20 yet though. Could it be the issue in Firefox itself?

Same issue with firefox 20 and xf86-video-intel 2.21.5

Hello.

At last, there is some positive progress! I still see corrupted rendering of certain elements on some pages from time to time, but at least I haven't seen any completely corrupted previews for a while, as I did before. Portions of previews can be corrupted, but only those parts which are rendered corrupted while browsing. So now there are no previews consisting of complete garbage.
(Both previews and pages are rendered via same drawWindow function in firefox as far as I can tell from sources)

Updates that introduced(?) these changes:

libdrm 2.4.43 -> 2.4.44
xorg-server 1.13.1 -> 1.13.4
GTK+ 2.24.16 -> 2.24.17
agg 2.5 -> 2.5-r2 (nothing big, maintainer changed couple of build options; added in the list because I use gnash in Firefox which uses agg, so maybe somehow connected)

There were other updates, but these are the only changes that are possibly related to the effects I see. I was (and still am) running xf86-video-intel-2.21.6 with FORCE_SYNC disabled the whole time.

That's unexpected - those updates should have had no impact upon this issue. :|

(In reply to comment #64)
> That's unexpected - those updates should have had no impact upon this issue.
> :|

Nevertheless, the overall look and feel in firefox has improved somehow. I have now updated mesa to 9.1.2 and the kernel to 3.9.0, and these positive effects are preserved.

The situation is much better now than it was when I opened this bug: I no longer get random huge screen corruption in firefox, either in thumbnails or during normal browsing. I can still trigger this issue and get a corrupted page preview, but it doesn't interfere with browsing. All other applications are unaffected.

Since things are quite good now, maybe it is a good idea to re-enable those optimizations? What do you think? It looks like I am the only one who has this issue :(

Ok, having made a new release, it is time to see if anyone else is seeing this bug:

commit 8e42637050275945200797538a34c13c90b295cc
Author: Chris Wilson <email address hidden>
Date: Tue May 21 11:13:03 2013 +0100

    sna: Re-enable read-read optimisations

(In reply to comment #66)
> Ok, having made a new release, it is time to see if anyone else is seeing
> this bug:
>
> commit 8e42637050275945200797538a34c13c90b295cc
> Author: Chris Wilson <email address hidden>
> Date: Tue May 21 11:13:03 2013 +0100
>
> sna: Re-enable read-read optimisations

Thank you. I'll update this bug with any new info if I notice any changes bad or good.

Sebastien Bacher (seb128) wrote :

It doesn't seem to be happening if I use uxa in xorg.conf
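
For anyone wanting to try the same workaround, a minimal xorg.conf sketch follows. The AccelMethod option is the one documented in the intel(4) man page; the Identifier string is an arbitrary placeholder, and the rest of the file is assumed to be absent or default.

```
Section "Device"
    Identifier "Intel Graphics"
    Driver     "intel"
    Option     "AccelMethod" "uxa"
EndSection
```

Restarting X is needed for the change to take effect. Switching the value back to "sna" (or removing the Option line, where sna is the build default) returns to the path discussed in this bug.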

Chris Wilson (ickle) wrote :

It's the read-read optimisation that was re-enabled. Not sure where the root cause is as reproducing it reliably and quickly is tricky.

Chris Wilson (ickle) wrote :

i.e. not even sure if it is not a kernel bug.

Sebastien Bacher (seb128) wrote :

I can trigger the bug quite easily in saucy (opening a tab in firefox does it almost every time); let me know if I can help by testing or providing debug information

Chris Wilson (ickle) wrote :

The corruption you see in the new tab panel happens when that image is generated and stored to disk. Debugging the corruption as it occurs is the tricky part.

Chris Wilson (ickle) wrote :

My guess is that there is a path with a missing sync point, like this one:

diff --git a/src/sna/sna_accel.c b/src/sna/sna_accel.c
index 69a151c..be73e27 100644
--- a/src/sna/sna_accel.c
+++ b/src/sna/sna_accel.c
@@ -4861,6 +4861,13 @@ sna_copy_boxes(DrawablePtr src, DrawablePtr dst, GCPtr gc,
    }
   }

+ RegionTranslate(region, src_dx, src_dy);
+ ret = sna_drawable_move_region_to_cpu(&src_pixmap->drawable,
+ region, MOVE_READ);
+ RegionTranslate(region, -src_dx, -src_dy);
+ if (!ret)
+ goto fallback;
+
   if (alu != GXcopy) {
    PixmapPtr tmp;
    struct kgem_bo *src_bo;

Sebastien Bacher (seb128) wrote :

Thanks Chris, I've tried your git commit [1] and it fixes the issue for me

[1] http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=7d91051c50210560dbd93a9e36f30d9f74ce9133

Changed in xserver-xorg-video-intel (Ubuntu):
importance: Undecided → High
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package xserver-xorg-video-intel - 2:2.21.9-0ubuntu2

---------------
xserver-xorg-video-intel (2:2.21.9-0ubuntu2) saucy; urgency=low

  * sna-make-sure-the-source-is-coherent.diff: Fix corruptions on firefox
    (LP: #1189850)
 -- Timo Aaltonen <email address hidden> Tue, 11 Jun 2013 20:10:33 +0300

Changed in xserver-xorg-video-intel (Ubuntu):
status: Fix Committed → Fix Released
Sebastien Bacher (seb128) wrote :

hum, that patch improved things (I don't get the issue on the new tab preview grid, where it was consistent before) but I just ran into a similar issue on a merge request url...

Chris Wilson (ickle) wrote :

Any chance you can capture a screenshot of it? Or failing that a photograph, and it would be very useful to know if it is not capturable. If you can reproduce it again, you can try FORCE_FULL_SYNC again, that would identify whether it is the same issue.

Chris Wilson (ickle) wrote :

I remembered another path that reads from a source CPU pixmap that made a few presumptions about coherency, so please also test with 15b92c9.

Simon K (octav14n) wrote :

I'm also getting this bug from time to time (it's gotten better with this #9 update though).
Screenshot is attached.

Before, I even got this "jittery" graphics bug on websites, not only on about:newtab. Since then, however, I haven't seen that behavior again.

Is my screenshot showing a result of this bug? Or do I have to open a new one?
Btw. Firefox seems to swap the images randomly? The YouTube-Preview is actually the same Preview as "heise" is using?! Is this a separate bug?

Chris Wilson (ickle) wrote :

The newtab looks like this bug, and image swapping is the same (the timing was just right to get a recognisable image of the same size).

Sebastien Bacher (seb128) wrote :

@Chris: I will take a screenshot if that happens again

I'm running an update with http://cgit.freedesktop.org/xorg/driver/xf86-video-intel/commit/?id=15b92c98755c709f41e59baeb206e5a3e56e3178 and didn't see the issue yet, seems good so far, thanks for the work on fixing that bug ;-)

(In reply to comment #68)
> It's back:
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/
> 1189850

Thanks for the link. I've tried today's xf86-video-intel git, which includes the commit marked as the solution at the link you provided. I was unable to reproduce this issue, but I cannot say for sure that it is fixed, as with recent changes this bug appears much more rarely on my machine than before. It may reappear later, but I hope it won't. I'll post any new info here.

Sebastien Bacher (seb128) wrote :

hum, it worked great for the whole day today, until now when I ran into that

Changed in xserver-xorg-video-intel (Ubuntu):
status: Fix Released → Triaged
Sebastien Bacher (seb128) wrote :

sorry, ignore the previous comment, I was running with "#define FORCE_FULL_SYNC 1", that seems a different issue

Changed in xserver-xorg-video-intel:
importance: Unknown → Medium
status: Unknown → Confirmed
bugbot (bugbot) on 2013-06-13
tags: added: corruption

ok, new update ... sorry, my testing the other day was incorrect, I still had the FORCE_FULL_SYNC define set from a previous test. Using git master from intel I still see frequent corruption on e.g. the new tab screen :-(

Chris Wilson (ickle) wrote :

Sebastien, once 2.21.10 hits saucy, can you please let me know how we are faring here? And please include the latest steps to trigger the bug.

Chris Wilson (ickle) wrote :

kernel commit 22fd5ca947b58901927d100d2b1aa0f1672b3435
Author: Chris Wilson <email address hidden>
Date: Fri Jun 28 16:54:08 2013 +0100

    drm/i915: Only clear write-domains after a successful wait-seqno

Changed in xserver-xorg-video-intel (Ubuntu):
status: Triaged → Fix Committed

commit 22fd5ca947b58901927d100d2b1aa0f1672b3435
Author: Chris Wilson <email address hidden>
Date: Fri Jun 28 16:54:08 2013 +0100

    drm/i915: Only clear write-domains after a successful wait-seqno

    In the introduction of the non-blocking wait, I cut'n'pasted the wait
    completion code from normal locked path. Unfortunately, this neglected
    that the normal path returned early if the wait returned early. The
    result is that read-only waits may return whilst the GPU is still
    writing to the bo.

    Fixes regression from
    commit 3236f57a0162391f84b93f39fc1882c49a8998c7 [v3.7]
    Author: Chris Wilson <email address hidden>
    Date: Fri Aug 24 09:35:09 2012 +0100

        drm/i915: Use a non-blocking wait for set-to-domain ioctl

    Signed-off-by: Chris Wilson <email address hidden>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=66163
    Cc: <email address hidden>
    Signed-off-by: Daniel Vetter <email address hidden>
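
The failure mode described in this commit message can be illustrated with a toy model. This is only a sketch under stated assumptions, not the i915 code: `struct toy_bo`, `toy_wait`, `toy_wait_buggy` and `toy_wait_fixed` are hypothetical names invented for illustration, and the "interrupted" flag stands in for any wait that returns early with an error.

```c
#include <stdbool.h>
#include <errno.h>

/* Hypothetical stand-in for a GPU buffer object; not the i915 struct. */
struct toy_bo {
    bool gpu_write_pending;   /* true while the GPU may still write to it */
};

/* Hypothetical wait: returns 0 when the wait ran to completion, or
 * -EINTR when it returned early. Only a return of 0 means the pending
 * GPU write has actually finished. */
static int toy_wait(struct toy_bo *bo, bool interrupted)
{
    (void)bo;
    return interrupted ? -EINTR : 0;
}

/* Buggy shape: the completion bookkeeping runs even if the wait failed,
 * so the buffer is marked idle while the GPU may still be writing. */
static int toy_wait_buggy(struct toy_bo *bo, bool interrupted)
{
    int ret = toy_wait(bo, interrupted);
    bo->gpu_write_pending = false;   /* wrong when ret != 0 */
    return ret;
}

/* Fixed shape: return early on failure and clear the write state only
 * after a successful wait. */
static int toy_wait_fixed(struct toy_bo *bo, bool interrupted)
{
    int ret = toy_wait(bo, interrupted);
    if (ret)
        return ret;
    bo->gpu_write_pending = false;
    return 0;
}
```

In this model, an interrupted call to `toy_wait_buggy` leaves `gpu_write_pending` cleared, so a later read-only wait would see nothing to wait for and could return while the write is still in flight. `toy_wait_fixed` keeps the flag set in that case, so the next wait still blocks.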

This bug just reappeared with xf86-video-intel-2.21.10. Next thing I am going to try is this commit you've posted above.

Changed in xserver-xorg-video-intel:
status: Confirmed → Fix Released

(In reply to comment #70)
> commit 22fd5ca947b58901927d100d2b1aa0f1672b3435
> Author: Chris Wilson <email address hidden>
> Date: Fri Jun 28 16:54:08 2013 +0100
>
> drm/i915: Only clear write-domains after a successful wait-seqno
>
> In the introduction of the non-blocking wait, I cut'n'pasted the wait
> completion code from normal locked path. Unfortunately, this neglected
> that the normal path returned early if the wait returned early. The
> result is that read-only waits may return whilst the GPU is still
> writing to the bo.
>
> Fixes regression from
> commit 3236f57a0162391f84b93f39fc1882c49a8998c7 [v3.7]
> Author: Chris Wilson <email address hidden>
> Date: Fri Aug 24 09:35:09 2012 +0100
>
> drm/i915: Use a non-blocking wait for set-to-domain ioctl
>
> Signed-off-by: Chris Wilson <email address hidden>
> Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=66163
> Cc: <email address hidden>
> Signed-off-by: Daniel Vetter <email address hidden>

Yes, this commit fixes the issue for me (on 3.10 kernel with this patch only).

Thanks a lot for your help!

I've also been experiencing this exact issue on xubuntu 13.04. I hadn't ever seen it until 13.04.

Chris Wilson (ickle) on 2013-07-17
summary: - saucy has frequent image corruption (intel, sna)
+ (Needs 3.10.2) saucy has frequent image corruption (intel, sna)
Robert Hooker (sarvatt) on 2013-07-19
summary: - (Needs 3.10.2) saucy has frequent image corruption (intel, sna)
+ (Needs a 3.10.3 kernel) saucy has frequent image corruption (intel, sna)

I suggest applying the same patch in Raring.

summary: - (Needs a 3.10.3 kernel) saucy has frequent image corruption (intel, sna)
+ (Needs a 3.10.3 kernel) saucy/raring has frequent image corruption
+ (intel, sna)
tags: added: raring
Chris Wilson (ickle) on 2013-08-03
Changed in xserver-xorg-video-intel (Ubuntu):
status: Fix Committed → Fix Released

I'm experiencing this bug (or something very similar) on my X220, and with bug 1211754 seeming to trigger it about every 10-15 mins, it's pretty annoying. I'm running 3.10.3-031003-generic from the kernel ppa and xserver-xorg-video-intel version 2:2.21.12-1ubuntu1.

See photo attached.

Chris Wilson (ickle) wrote :

That looks like a different bug - the invalid fence after resume.

summary: - (Needs a 3.10.3 kernel) saucy/raring has frequent image corruption
+ (Needs a 3.10.5 kernel) saucy/raring has frequent image corruption
(intel, sna)
Sebastien Bacher (seb128) wrote :

(just as a follow up some time later, now that the updated version landed in saucy, things seem to work great, I've not seen any corruption issue recently)
