[gen4] Corruption in Chrome omni bar results using Intel SNA

Bug #1098489 reported by Rohan Garg
This bug affects 3 people
Affects                            Status   Importance  Assigned to  Milestone
xserver-xorg-video-intel (Ubuntu)  Invalid  Low         Unassigned

Bug Description

After enabling Intel SNA the favicons in the omnibar search results are corrupted.
I've attached a screenshot of the issue to the bug report; note that the favicon of the first result is corrupted.

This bug is probably the same as bug #1098334, more information may be seen there.

ProblemType: Bug
DistroRelease: Ubuntu 13.04
Package: xorg 1:7.7+1ubuntu4
ProcVersionSignature: Ubuntu 3.7.0-7.15-generic 3.7.0
Uname: Linux 3.7.0-7-generic x86_64
ApportVersion: 2.8-0ubuntu1
Architecture: amd64
Date: Fri Jan 11 15:31:21 2013
EcryptfsInUse: Yes
InstallationDate: Installed on 2012-12-03 (38 days ago)
InstallationMedia: Kubuntu 13.04 "Raring Ringtail" - Alpha amd64+mac (20121203)
MarkForUpload: True
SourcePackage: xorg
UpgradeStatus: No upgrade log present (probably fresh install)
---
ApportVersion: 2.8-0ubuntu1
Architecture: amd64
CompizPlugins: No value set for `/apps/compiz-1/general/screen0/options/active_plugins'
CompositorRunning: None
DistUpgraded: Fresh install
DistroCodename: raring
DistroRelease: Ubuntu 13.04
DistroVariant: kubuntu
EcryptfsInUse: Yes
ExtraDebuggingInterest: No
GraphicsCard:
 Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller [8086:0116] (rev 09) (prog-if 00 [VGA controller])
   Subsystem: Apple Inc. Device [106b:00dc]
InstallationDate: Installed on 2012-12-03 (38 days ago)
InstallationMedia: Kubuntu 13.04 "Raring Ringtail" - Alpha amd64+mac (20121203)
MachineType: Apple Inc. MacBookPro8,2
MarkForUpload: True
Package: xserver-xorg-video-intel 2:2.20.17-0ubuntu1
PackageArchitecture: amd64
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.7.0-7-generic root=/dev/mapper/kubuntu-root ro quiet splash i915.lvds_channel_mode=2 i915.modeset=1 i915.lvds_use_ssc=0 vt.handoff=7
ProcVersionSignature: Ubuntu 3.7.0-7.15-generic 3.7.0
Tags: raring kubuntu single-occurrence
Uname: Linux 3.7.0-7-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: adm cdrom dip lpadmin plugdev sambashare sudo
XorgConf:
 Section "Device"
   Identifier "Card0"
   Driver "intel"
   Option "AccelMethod" "sna"
 EndSection
dmi.bios.date: 01/24/12
dmi.bios.vendor: Apple Inc.
dmi.bios.version: MBP81.88Z.0047.B27.1201241646
dmi.board.asset.tag: Base Board Asset Tag#
dmi.board.name: Mac-94245A3940C91C80
dmi.board.vendor: Apple Inc.
dmi.board.version: MacBookPro8,2
dmi.chassis.type: 10
dmi.chassis.vendor: Apple Inc.
dmi.chassis.version: Mac-94245A3940C91C80
dmi.modalias: dmi:bvnAppleInc.:bvrMBP81.88Z.0047.B27.1201241646:bd01/24/12:svnAppleInc.:pnMacBookPro8,2:pvr1.0:rvnAppleInc.:rnMac-94245A3940C91C80:rvrMacBookPro8,2:cvnAppleInc.:ct10:cvrMac-94245A3940C91C80:
dmi.product.name: MacBookPro8,2
dmi.product.version: 1.0
dmi.sys.vendor: Apple Inc.
version.compiz: compiz N/A
version.ia32-libs: ia32-libs N/A
version.libdrm2: libdrm2 2.4.40-1
version.libgl1-mesa-dri: libgl1-mesa-dri 9.0.1-0ubuntu1
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 9.0.1-0ubuntu1
version.xserver-xorg-core: xserver-xorg-core 2:1.13.1.901-0ubuntu1
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.7.3-0ubuntu2
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:7.0.0-0ubuntu1
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.20.17-0ubuntu1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:1.0.6-0ubuntu1
xserver.bootTime: Fri Jan 11 16:00:31 2013
xserver.configfile: /etc/X11/xorg.conf
xserver.errors:

xserver.logfile: /var/log/Xorg.0.log
xserver.version: 2:1.13.1.901-0ubuntu1
xserver.video_driver: intel

Revision history for this message
In , Joe Peterson (bx09m7e-say-zwnfzpm) wrote :

Using the xf86-video-intel driver in SNA mode (using version 2.20.9, but I've seen it back to 2.20.7 as well - I have not tested SNA in earlier versions), chromium tabs exhibit a strange "blotchy" flickering of text/graphics as the mouse is moved over them. Note that this may vary depending on the text (or more probably the length of the text) in the tab.

I wonder if the transparent gradient/fade of the text in these tabs plays a part here.

I will attach a video screen capture of this happening. One URL that causes it is the URL of a previous bug I reported: https://bugs.freedesktop.org/show_bug.cgi?id=55484 (the tab in the video uses this one).

[From my lspci: VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)]

Revision history for this message
In , Joe Peterson (bx09m7e-say-zwnfzpm) wrote :

Created attachment 67928
Video screen capture showing the chromium tab flicker

Revision history for this message
In , Chris Wilson (ickle) wrote :

Can you please check that the attached video is the correct one? mplayer doesn't complain, but doesn't show anything either.

Also can you please attach your Xorg.0.log so that I know what hardware you have, and which WM you are using (and any compositing options)?

Revision history for this message
In , Joe Peterson (bx09m7e-say-zwnfzpm) wrote :

Created attachment 67929
Video screen capture showing the chromium tab flicker

Trying this attachment again - this time specifying binary manually... the previous upload seemed to alter the file.

Revision history for this message
In , Joe Peterson (bx09m7e-say-zwnfzpm) wrote :

Created attachment 67930
Xorg.0.log file from session showing problem

Revision history for this message
In , Joe Peterson (bx09m7e-say-zwnfzpm) wrote :

I am using openbox as a WM. I have not selected any special (non-default) compositing options that I know of.

Revision history for this message
In , Chris Wilson (ickle) wrote :

What it actually looks like is that the gradient is misapplied whilst rendering the glyphs.

Can you please test with either downgrading pixman to 0.26 or compile -intel with -UHAS_PIXMAN_GLYPHS?

Revision history for this message
In , Chris Wilson (ickle) wrote :

And for an extra level of paranoia, can you also check if the error persists if you do a debug build (unoptimized, -O0) of pixman, xserver, and -intel? Having been burnt by bugs uncovered by aggressive compiler optimisations before, it helps to keep me calm to have a sanity check. ;-)

Revision history for this message
In , Joe Peterson (bx09m7e-say-zwnfzpm) wrote :

Ok, I downgraded to pixman-0.26 (which then causes -intel to be built without HAS_PIXMAN_GLYPHS). I also compiled xorg-server and -intel with -O0. However, I was unable to compile pixman with -O0, because configure failed (checking for MMX support).

The same problem persists even with the above done. Let me know if I should try anything else, or if this is enough to verify that it's not pixman...

Revision history for this message
In , Chris Wilson (ickle) wrote :

Yes, that is enough to rule out the new pixman_glyph_t routines and enough to show that it is not some random miscompilation. Which means we^W I need to look harder.

Revision history for this message
In , Chris Wilson (ickle) wrote :

First of all, let's disable acceleration of glyphs:

diff --git a/src/sna/sna_glyphs.c b/src/sna/sna_glyphs.c
index 53494e3..4e510a4 100644
--- a/src/sna/sna_glyphs.c
+++ b/src/sna/sna_glyphs.c
@@ -69,7 +69,7 @@

 #include <mipict.h>

-#define FALLBACK 0
+#define FALLBACK 1
 #define NO_GLYPH_CACHE 0
 #define NO_GLYPHS_TO_DST 0
 #define NO_GLYPHS_VIA_MASK 0

That will tell us whether the corruption occurs as we render the glyphs using the GPU or as we upload. Similarly working through each of the NO_* options thereafter would be very helpful to identify which path in particular is affected.

Secondly,

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index ceef528..f901008 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -1863,7 +1863,7 @@ gen4_composite_picture(struct sna *sna,
        if (picture->pDrawable == NULL) {
                int ret;

-               if (picture->pSourcePict->type == SourcePictTypeLinear)
+               if (picture->pSourcePict->type == SourcePictTypeLinear && 0)
                        return gen4_composite_linear_init(sna, picture, channel,
                                                          x, y,
                                                          w, h,
@@ -2046,7 +2046,6 @@ check_gradient(PicturePtr picture)
 {
        switch (picture->pSourcePict->type) {
        case SourcePictTypeSolidFill:
-       case SourcePictTypeLinear:
                return false;
        default:
                return true;

will confirm whether it is the gradient that is implicated in this bug.

Revision history for this message
Rohan Garg (rohangarg) wrote :
Revision history for this message
Rohan Garg (rohangarg) wrote :

~ » lspci -nn | grep VGA | sed 's/.*Intel Corporation //'
2nd Generation Core Processor Family Integrated Graphics Controller [8086:0116] (rev 09)

Timo Aaltonen (tjaalton)
affects: xorg (Ubuntu) → xserver-xorg-video-intel (Ubuntu)
Revision history for this message
Rohan Garg (rohangarg) wrote : BootDmesg.txt

apport information

tags: added: apport-collected kubuntu single-occurrence
description: updated
Revision history for this message
Rohan Garg (rohangarg) wrote : BootLog.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : CurrentDmesg.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : Dependencies.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : DpkgLog.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : LightdmDisplayLog.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : LightdmGreeterLog.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : LightdmGreeterLogOld.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : LightdmLog.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : Lspci.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : Lsusb.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : ProcEnviron.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : ProcInterrupts.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : ProcModules.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : UdevDb.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : UdevLog.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : XorgLog.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : XorgLogOld.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : Xrandr.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : xdpyinfo.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : xserver.devices.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : xserver.outputs.txt

apport information

Revision history for this message
Rohan Garg (rohangarg) wrote : Re: Corruption in Chrome omni bar results using Intel SNA

I'm seeing some severe performance issues now: the framerate has dropped to 35 FPS (down from 60 FPS earlier). I'm attaching the appropriate files.

Revision history for this message
Rohan Garg (rohangarg) wrote :
Revision history for this message
Rohan Garg (rohangarg) wrote :

Chrome seems to have some glitches in the tabs as well

Rohan Garg (rohangarg)
tags: removed: single-occurrence
Revision history for this message
jhfhlkjlj (fdsuufijjejejejej-deactivatedaccount) wrote :

This strongly resembles bug 1098334.

Chris Wilson (ickle)
Changed in xserver-xorg-video-intel (Ubuntu):
assignee: nobody → Chris Wilson (ickle)
status: New → Confirmed
status: Confirmed → In Progress
Chris Wilson (ickle)
summary: - Corruption in Chrome omni bar results using Intel SNA
+ [gen4] Corruption in Chrome omni bar results using Intel SNA
Changed in xserver-xorg-video-intel:
importance: Unknown → Medium
status: Unknown → Confirmed
Changed in xserver-xorg-video-intel:
status: Confirmed → In Progress
description: updated
Revision history for this message
In , Chris Wilson (ickle) wrote :

My experiments so far indicate that the errors only happen with rendering to the uncached frontbuffer, are not influenced by the number of rectangles in each primitive (though the failure does occur at different frequencies) and do not respond to adding extra MI_FLUSH. I don't think attaching examples of fail will help, as the next step will be trying to dissect the GPU state and work out why fragments fail inside a single shader.

Revision history for this message
In , Zdenek Kabelac (zdenek-kabelac) wrote :

Well here is just another one -

$ ./basic-copyarea
Opened connection to :0 for testing.
Testing setting of single pixels (root): passed [1 iterations x 4096]
Testing area sets (root): passed [1 iterations x 4096]
Testing area fills (root, using pixmap source): passed [1 iterations x 4096]
Testing area fills (root, using window source): passed [1 iterations x 4096]
Testing setting of single pixels (child): passed [1 iterations x 4096]
Testing area sets (child): passed [1 iterations x 4096]
Testing area fills (child, using pixmap source): passed [1 iterations x 4096]
Testing area fills (child, using window source): passed [1 iterations x 4096]
Testing setting of single pixels (pixmap): passed [1 iterations x 4096]
Testing area sets (pixmap): passed [1 iterations x 4096]
Testing area fills (pixmap, using pixmap source): passed [1 iterations x 4096]
Testing setting of single pixels (root): passed [2 iterations x 2048]
Testing area sets (root): failed to set pixel (0,0) to 0072dc99 [0e72dc99], found 00000000 [00000000] instead
00000000 00000000 00000000 0e72dc99 0e72dc99 0e72dc99
5146daef 5146daef 5146daef 0e72dc99 0e72dc99 0e72dc99
5146daef 5146daef 5146daef 0e72dc99 0e72dc99 0e72dc99

I'm unsure whether it's related to anything, and the test seems to be lengthy,
but there seems to be some observable pattern.

Anyway, could I help here by testing some patch, or do you get the same errors yourself on your available hardware?

Revision history for this message
In , Chris Wilson (ickle) wrote :

The tests are intentionally overkill - they are also intended to try and test handling of large batches, as well as generally stress the system.

I hadn't noticed the basic-copyarea fail. That does look to be different. So far, the failure pattern had seemed to be a subspan doesn't get written (i.e. a single instance of one thread failed to execute correctly in the GPU, which I think could explain the general bug here.) basic-copyarea looks like a larger scale failure. And, yes, it does fail on my gm45 as well.

Revision history for this message
In , Harald Judt (hjudt) wrote :

SNA has grown worse now. For about six weeks, I've been seeing corruption in the terminal (cursor not disappearing or not showing at all, more text missing until marking with the mouse, ...). I've switched to UXA, and nothing bad is visible there.

Revision history for this message
In , Harald Judt (hjudt) wrote :

With kernel-3.12.2 and current git, the situation now is much better. The corruptions happen less often (still often enough to be noticed though), and are usually confined to single characters, or only parts of single characters. While I'm writing this, I can notice what appears to be some slight corruption in the lines above which I've already written, but they are always gone immediately. Maybe today's just a lucky day, or the situation has improved.

What's more, the issue with the cursor not disappearing or not showing that I mentioned in my previous post no longer exists, so SNA on this gen4 is usable again.

Revision history for this message
In , Chris Wilson (ickle) wrote :

As a check that all problems are the same, can people who are still affected by this do a test with

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index a87af39..86c37d6 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -63,7 +63,7 @@
 #define NO_FILL_BOXES 0
 #define NO_VIDEO 0

-#define MAX_FLUSH_VERTICES 6
+#define MAX_FLUSH_VERTICES 1

 #define GEN4_GRF_BLOCKS(nreg) ((nreg + 15) / 16 - 1)

and see if that cures the last of the corruptions.

Revision history for this message
In , Combuster-archlinux (combuster-archlinux) wrote :

Just to report, maybe this isn't an Intel driver bug at all. At home I have an ATI R270X (radeonsi driver) and an Intel HD2000, at work an X4500, and I get hit by this bug on all of them.

It is a small, single-character graphic corruption, easily triggered by scrolling through a window with lots of small text (tailing a log in debug mode in a terminal). But I also see it appear and disappear as I'm typing this comment.

I'm attaching the screenshots (/dev/shm is corrupted on black one).

@Chris, I'll try applying patches right now and report back.

Revision history for this message
In , Combuster-archlinux (combuster-archlinux) wrote :

Created attachment 90707
/dev/shm corruption

Revision history for this message
In , Combuster-archlinux (combuster-archlinux) wrote :

I don't want to speak too soon but it seems that the latest patch fixes the problem.

First I tried the latest git version and it didn't help; I noticed corrupted fonts as soon as I logged on and started poking at things in the terminal.

After I've applied the patch in comment #129 and restarted, no corruption at all (at least for now, it usually can be reproduced right after logging in).

Chris, any thoughts on why this is also happening with the radeon opensource driver?

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #132)
> Chris, any thoughts on why is this happening with radeon opensource driver
> also ?

I was hoping that it would be a bug in a common component, but in this case it sounds like they have a similar bug in managing internal GPU state.

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #132)
> I don't want to speak too soon but it seems that the latest patch fixes the
> problem.

Let it run for a day or so to be sure. The other thing that is worth checking is whether setting MAX_FLUSH_VERTICES to 2 is also stable, or 4 etc. Setting it to 1 has a major impact on performance (we are roughly an order of magnitude slower at rendering than what can be expected).
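
To get a feel for why a small MAX_FLUSH_VERTICES is so costly, here is a toy model in C. It is not the driver's code: `count_flushes` is a hypothetical helper, and it assumes one budget unit per rectangle and a budget that is refilled after every flush, loosely following the way the driver resets its remaining-vertex counter to MAX_FLUSH_VERTICES.

```c
#include <assert.h>

/*
 * Toy model (assumption, not driver code): count how many intermediate
 * flushes are emitted when `nrects` rectangles are drawn and at most
 * `max_per_flush` of them may be queued between two flushes.
 */
unsigned count_flushes(unsigned nrects, unsigned max_per_flush)
{
    unsigned rem = max_per_flush; /* budget left before the next flush */
    unsigned flushes = 0;
    unsigned i;

    assert(max_per_flush > 0);

    for (i = 0; i < nrects; i++) {
        if (rem == 0) {           /* budget exhausted: flush, then refill */
            flushes++;
            rem = max_per_flush;
        }
        rem--;                    /* queue one more rectangle */
    }
    return flushes;
}
```

Under these assumptions, 12 rectangles need 11 intermediate flushes with a budget of 1, 5 with a budget of 2, and only 1 with a budget of 6.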

Revision history for this message
In , Zdenek Kabelac (zdenek-kabelac) wrote :

Maybe this bug could be tracked down more easily when the number of vertices is actually much higher, since in that case it seems to fail almost immediately.

I understand there is some maximum queue size the GPU can handle, but the engine should be able to track the size of all commands and not outgrow it.

So what else could break ordering?

As it's very easy to trigger the problem with higher max - maybe it could be used to deduce which part of code unexpectedly touches the command queue ?

(but of course it's just my naive assumption about how the SNA driver works).

Revision history for this message
In , Chris Wilson (ickle) wrote :

The issue is that the VUEs (each a memory slot used by the GPU for a vertex entry) are reused by a second thread before the first thread is complete, causing the first thread to generate invalid texture coordinates and corrupt rendering. That is a hardware read-write hazard bug (or at least I have not found any controls in the EU state to prevent it).
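
The general shape of such a read-after-overwrite hazard can be sketched with a toy model in C. This is only an illustration under stated assumptions, not a model of the real hardware: `count_stale_reads`, the ring-style slot assignment, and the fixed consumer `delay` are all hypothetical.

```c
#include <assert.h>

#define MAX_SLOTS 64

/*
 * Toy model (assumption, not real hardware behaviour): vertex i is written
 * to slot (i % nslots). Its consumer reads that slot only after `delay`
 * further vertices have been written. Returns how many consumers find a
 * value in their slot that belongs to a different vertex.
 */
unsigned count_stale_reads(unsigned nverts, unsigned nslots, unsigned delay)
{
    unsigned slot[MAX_SLOTS];
    unsigned stale = 0;
    unsigned i;

    assert(nslots > 0 && nslots <= MAX_SLOTS);

    for (i = 0; i < nverts + delay; i++) {
        if (i < nverts)
            slot[i % nslots] = i;      /* producer writes vertex i */
        if (i >= delay) {
            unsigned v = i - delay;    /* consumer for vertex v runs now */
            if (slot[v % nslots] != v) /* slot was reused before the read */
                stale++;
        }
    }
    return stale;
}
```

In this model no read goes stale while the consumer delay stays below the number of slots; once a slot can be reused before its reader runs (for example 8 vertices, 4 slots, delay 4), the first 4 consumers see another vertex's data.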

Revision history for this message
In , Zdenek Kabelac (zdenek-kabelac) wrote :

(In reply to comment #136)
> The issue that the VUE (which is a memory slot used by the GPU for a vertex
> entry) are reused by a second thread before the first thread is complete,
> causing the first thread to generate invalid texture coordinates and corrupt
> rendering. That is a hardware read-write hazard bug (or at least I have not
> found any controls in the EU state to prevent it).

I don't think it's that easy to explain -

The typical problem in my case is - when I freshly start the Xsession -
I do not observe any rendering bug. I need to use this machine for a while,
to start to get those errors.

So if there would be some 'easy to trigger hardware race' - it should be reproducible all the time.

But in my case it seems the probability increases heavily over time - maybe with the number of cached BO segments?

Or maybe it depends on how the memory order is set - i.e. the problem is triggered when certain memory read pattern start to appear ?

Since once the issue of flicker starts - it then happens all the time - and then suddenly it disappears for a while again ?

So maybe some memory offset/alignment of the buffers holding vertices needs to be prohibited?

Revision history for this message
In , Chris Wilson (ickle) wrote :

Nope. The behaviour of this bug is very well characterised by the above analysis.

Revision history for this message
In , Zdenek Kabelac (zdenek-kabelac) wrote :

Well, I'm curious how that explains the following. I've taken current git, changed the value to '44', and I'm typing this text. I can see a lot of errors while typing, but they seem to be limited to certain regions of the displayed text.

The text is not corrupted everywhere, only in a particular place at a particular time. It's a weird defect to observe, and it also happens only for some specific use cases.

For example, my text editor 'fte', which uses X drawing code, has absolutely no issue: all characters are always correct.
But in Firefox, typing in this Bugzilla, I can see a lot of changing letters (usually everything after the 10th letter on the line changes weirdly, as if the font cache were not working correctly).

Typing on the keyboard shows rendering bugs; as soon as the right mouse button is pressed, everything is instantly redrawn correctly and the popup window is shown.

Another thing I've noticed: I have about 20 lines of email headers in Thunderbird, and only the 4th line shows problems with letters. Even when I just move the mouse over the Thunderbird window, this text is continuously modified, but everything else in that window is fine.

So if you say there is a hardware bug, how is it that it can so easily render everything else correctly?

Why do only certain portions of text have distortions? Why isn't it randomly spread over the whole screen (which I'd have expected for timing collisions)?

Revision history for this message
In , Zdenek Kabelac (zdenek-kabelac) wrote :

Just to add some more comment on MAX_FLUSH_VERTICES: when set to 96 it gives the highest throughput on x11perf -aa10text, 3.7MChar/s; using any higher value doesn't make any difference (so the maximum seems to be somewhere in the range (64, 96]).

Also when this 3.7MChar is rendered - the parallel move of some other window on the screen becomes very slow - so engine gets overloaded ??

Or the Xserver generates such long queue of events it becomes so slow ?

Revision history for this message
In , Harald Judt (hjudt) wrote :

(In reply to comment #134)
> (In reply to comment #132)
> > I don't want to speak too soon but it seems that the latest patch fixes the
> > problem.
>
> Let it run for a day or so to be sure. The other thing that is worth
> checking is whether setting MAX_FLUSH_VERTICES to 2 is also stable, or 4
> etc. Setting it to 1 has a major impact on performance (we are roughly an
> order of magnitude slower at rendering than what can be expected).

Setting MAX_FLUSH_VERTICES to 1 fixes the problem here, with the mentioned noticeable performance loss. IIRC, hibernating/resuming can accelerate the appearance of the bug, but I'm not quite sure about this; it might also be coincidence.

I will now try to decrease the value from the default one to find the sweet spot.

Revision history for this message
In , Chris Wilson (ickle) wrote :

*** Bug 71773 has been marked as a duplicate of this bug. ***

Revision history for this message
In , Harald Judt (hjudt) wrote :

All tests with MAX_FLUSH_VERTICES greater than 2 reveal those single garbled glyphs. I'm still testing with a value of 2. There isn't much noticeable difference between 3, 4, 5.

00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03) (prog-if 00 [VGA controller])
        Subsystem: Dell Device 0276

Revision history for this message
In , Lonnie Lee Best (launchpad-startport) wrote :

Here's another bug report regarding this same issue:
https://bugs.launchpad.net/bugs/1227569

Revision history for this message
In , Edward Sheldrake (ejs1920) wrote :

Created attachment 91383
gedit in openbox with 2.99.907 GM45 SNA

I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed to the bad commit as:
9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit
commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed
Author: Chris Wilson <email address hidden>
Date: Mon Dec 16 11:39:20 2013 +0000

    sna/gen4: Sacrifice performance to workaround render corruption

The problems are that lines of text keep disappearing (and reappearing) in gedit, and occasionally the screen becomes unresponsive for a short time and these messages appear in dmesg:

[ 1702.349954] [drm] stuck on render ring
[ 1702.349966] [drm] capturing error event; look for more information in /sys/class/drm/card0/error
[ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x32a1000 ctx 0) at 0x32a1110

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #145)
> [ 1702.349954] [drm] stuck on render ring
> [ 1702.349966] [drm] capturing error event; look for more information in
> /sys/class/drm/card0/error
> [ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside
> bo (0x32a1000 ctx 0) at 0x32a1110

Attach the error state.

Revision history for this message
In , Edward Sheldrake (ejs1920) wrote :

Created attachment 91388
intel_error_decode output

I saved the output from intel_error_decode but I didn't save the raw error data.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Worth trying just:

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index 637137e..dc80de3 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -660,9 +660,11 @@ inline static int gen4_get_rectangles(struct sna *sna,
                if (rem <= 0) {
                        if (sna->render.vertex_offset) {
                                gen4_vertex_flush(sna);
-                               if (gen4_magic_ca_pass(sna, op))
+                               if (gen4_magic_ca_pass(sna, op)) {
+                                       OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
                                        gen4_emit_pipelined_pointers(sna, op, op->op,
                                                                     op->u.gen4.wm_kernel);
+                               }
                        }
                        OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
                        rem = MAX_FLUSH_VERTICES;

if you are happy that it reproduces reliably.

Revision history for this message
In , Edward Sheldrake (ejs1920) wrote :

(In reply to comment #148)
> Worth trying just:
>
> diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
> index 637137e..dc80de3 100644
> --- a/src/sna/gen4_render.c
> +++ b/src/sna/gen4_render.c
> @@ -660,9 +660,11 @@ inline static int gen4_get_rectangles(struct sna *sna,
> if (rem <= 0) {
> if (sna->render.vertex_offset) {
> gen4_vertex_flush(sna);
> - if (gen4_magic_ca_pass(sna, op))
> + if (gen4_magic_ca_pass(sna, op)) {
> + OUT_BATCH(MI_FLUSH |
> MI_INHIBIT_RENDER_CACHE_FLUSH);
> gen4_emit_pipelined_pointers(sna,
> op, op->op,
>
> op->u.gen4.wm_kernel);
> + }
> }
> OUT_BATCH(MI_FLUSH | MI_INHIBIT_RENDER_CACHE_FLUSH);
> rem = MAX_FLUSH_VERTICES;
>
> if you are happy that it reproduces reliably.

This change did not fully solve the problem. One text file in gedit displayed as blank initially for a bit, although things then seemed fine. But the freeze with "[drm] stuck on render ring" happened during a second run of gtkperf - while the "GtkDrawingArea - Text" test was running.

But I didn't spot any corrupted characters while running 2.99.907 with MAX_FLUSH_VERTICES set back to 6 - although I hadn't been running that for very long, and I only see a single garbled char occasionally, and I think they only appear in Firefox.

Revision history for this message
In , Edward Sheldrake (ejs1920) wrote :

Created attachment 91389
intel_error_decode output

the decoded error that occurred while running gtkperf, 2.99.907 with the change in comment #148

Revision history for this message
In , Edward Sheldrake (ejs1920) wrote :

(In reply to comment #149)
>
> But I didn't spot any corrupted characters while running 2.99.907 with
> MAX_FLUSH_VERTICES set back to 6

And now I have spotted the single character corruption with 2.99.907 + MAX_FLUSH_VERTICES set to 6.

Revision history for this message
In , Harald Judt (hjudt) wrote :

The bug occurs with MAX_FLUSH_VERTICES = 2 too, both in firefox and xterm, so unfortunately setting it to 1 seems to be the only solution here.

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #145)
> Created attachment 91383 [details]
> gedit in openbox with 2.99.907 GM45 SNA
>
> I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed
> to the bad commit as:
> 9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit
> commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed
> Author: Chris Wilson <email address hidden>
> Date: Mon Dec 16 11:39:20 2013 +0000
>
> sna/gen4: Sacrifice performance to workaround render corruption
>
> The problems are lines of text keep disappearing (and reappearing) in gedit,
> and the occasionally the screen becomes unresponsive for a short time and
> these messages appear in dmesg:
>
> [ 1702.349954] [drm] stuck on render ring
> [ 1702.349966] [drm] capturing error event; look for more information in
> /sys/class/drm/card0/error
> [ 1702.354334] [drm:i915_set_reset_status] *ERROR* render ring hung inside
> bo (0x32a1000 ctx 0) at 0x32a1110

Ok, I think I have this fixed:

commit 9d8473c5d9489db439aca73f470bda29a22ebab6
Author: Chris Wilson <email address hidden>
Date: Tue Jan 7 13:43:35 2014 +0000

    sna/gen4: Check for available batch space before restoring state after CA pass

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=73348
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55500
    Signed-off-by: Chris Wilson <email address hidden>

Revision history for this message
In , Zdenek Kabelac (zdenek-kabelac) wrote :

(In reply to comment #153)
> (In reply to comment #145)
> > Created attachment 91383 [details]
> > gedit in openbox with 2.99.907 GM45 SNA
> >
> > I am finding that GM45 SNA seems unusable with 2.99.907 - git bisect pointed
> > to the bad commit as:
> > 9289e2c56b7f0cc78c5123691ad96611f0e04bed is the first bad commit
> > commit 9289e2c56b7f0cc78c5123691ad96611f0e04bed
> > Author: Chris Wilson <email address hidden>
> > Date: Mon Dec 16 11:39:20 2013 +0000
> >
> > sna/gen4: Sacrifice performance to workaround render corruption

Interestingly, this commit has increased the number of wrongly rendered characters. I admit I'm overriding MAX_FLUSH_VERTICES to 9 (since I don't like the slowness with 1), so I've been living with a faster desktop and occasional wrong characters on the screen. With this commit, though, the visual problems have increased to a level that makes reading noticeably difficult, which will probably force me to switch to the very slow '1' for MAX :(...

(I'm not sure whether it's the direct cause, since before this I had been using a roughly 3-week-old version of the git tree.)
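
To make the speed trade-off behind MAX_FLUSH_VERTICES concrete, here is a toy model, assuming only that a flush is forced once a fixed number of queued items is reached. It is not taken from gen4_render.c; `struct vertex_stream`, `stream_flush()` and `stream_add()` are hypothetical, and `max_pending` merely plays the role of the real define.

```c
#include <stddef.h>

struct vertex_stream {
    size_t pending;     /* items queued since the last flush */
    size_t flushes;     /* total number of flushes issued */
    size_t max_pending; /* stands in for MAX_FLUSH_VERTICES */
};

/* Flush whatever is queued; an empty queue costs nothing. */
static void stream_flush(struct vertex_stream *s)
{
    if (s->pending == 0)
        return;
    s->pending = 0;
    s->flushes++;
}

/* Queue n items, flushing every time the limit is reached. */
static void stream_add(struct vertex_stream *s, size_t n)
{
    for (; n > 0; n--) {
        s->pending++;
        if (s->pending >= s->max_pending)
            stream_flush(s);
    }
}
```

In this model, queueing 18 items costs 18 flushes with a limit of 1 but only 2 with a limit of 9, which is the kind of difference that shows up as the slowness described above.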

In , Zdenek Kabelac (zdenek-kabelac) wrote :

A good illustration: before, with MAX 9, gnome-terminal running top (for example) did not seem to render broken characters; now it shows a lot of garbled characters.

On the other hand, MAX 1 now seems smoother than before, so apart from some dramatic performance drops in benchmark tools like 'x11perf -aa10text' it seems quite usable.

In , Zdenek Kabelac (zdenek-kabelac) wrote :

Bad news: I'm now in fact able to spot badly rendered characters in Firefox with MAX 1 as well.

So something went wrong....

In , Chris Wilson (ickle) wrote :

Something worth experimenting with is detuning the GPU, e.g.:

diff --git a/src/sna/gen4_render.c b/src/sna/gen4_render.c
index e239c21..bc6af68 100644
--- a/src/sna/gen4_render.c
+++ b/src/sna/gen4_render.c
@@ -52,7 +52,7 @@
  */
 #define FORCE_SPANS 0
 #define FORCE_NONRECTILINEAR_SPANS -1
-#define FORCE_FLUSH 1 /* https://bugs.freedesktop.org/show_bug.cgi?id=55500 */
+#define FORCE_FLUSH 0 /* https://bugs.freedesktop.org/show_bug.cgi?id=55500 */

 #define NO_COMPOSITE 0
 #define NO_COMPOSITE_SPANS 0
@@ -74,7 +74,7 @@
 #define URB_CS_ENTRIES 0

 #define URB_VS_ENTRY_SIZE 1
-#define URB_VS_ENTRIES 32
+#define URB_VS_ENTRIES 16

 #define URB_GS_ENTRY_SIZE 0
 #define URB_GS_ENTRIES 0
@@ -83,7 +83,7 @@
 #define URB_CL_ENTRIES 0

 #define URB_SF_ENTRY_SIZE 2
-#define URB_SF_ENTRIES 64
+#define URB_SF_ENTRIES 1

 /*
  * this program computes dA/dx and dA/dy for the texture coordinates along
@@ -93,9 +93,9 @@
 #define SF_KERNEL_NUM_GRF 16
 #define PS_KERNEL_NUM_GRF 32

-#define GEN4_MAX_SF_THREADS 24
-#define GEN4_MAX_WM_THREADS 32
-#define G4X_MAX_WM_THREADS 50
+#define GEN4_MAX_SF_THREADS 8
+#define GEN4_MAX_WM_THREADS 16
+#define G4X_MAX_WM_THREADS 16

 static const uint32_t ps_kernel_packed_static[][4] = {
 #include "exa_wm_xy.g4b"

penalvch (penalvch)
Changed in xserver-xorg-video-intel (Ubuntu):
assignee: Chris Wilson (ickle) → nobody
importance: Undecided → Low
status: In Progress → Incomplete
In , Zdenek Kabelac (zdenek-kabelac) wrote :

Created attachment 91613
Grabbed snapshot with patch from comment 157

As can be seen, yes: with little effort I'm able to capture broken characters with the given patch compiled in.

All that is needed is to start editing a dialog in Firefox and add/remove random characters at various places.

Rohan Garg (rohangarg) wrote :

Issue seems to be fixed in Trusty.

Changed in xserver-xorg-video-intel (Ubuntu):
status: Incomplete → Invalid
In , Harald Judt (hjudt) wrote :

After updating to current git, the situation has become worse and I now see more visible corruption even with MAX_FLUSH_VERTICES=1. I also no longer have to wait some hours for it to appear as before; it is visible right after starting X.

In , Chris Wilson (ickle) wrote :

Typical, it appeared stable through my firefox testing, but if you try reverting b7565a26401e283df94b68019e8093f8104428f4, I expect the corruptions to disappear again.

In , Zdenek Kabelac (zdenek-kabelac) wrote :

(In reply to comment #160)
> Typical, it appeared stable through my firefox testing, but if you try
> reverting b7565a26401e283df94b68019e8093f8104428f4, I expect the corruptions
> to disappear again.

Yep, correct: reverting this commit makes MAX_FLUSH_VERTICES 1 produce correct rendering again, even though it is again noticeably slower. So now it's clear why I considered it usable with MAX 1 before (in my comment 155).

So yep, revert & MAX 1 works again, but it's quite slow.

Is it now any easier to deduce which operation is causing such bad memory interaction?

It seems the 'synchronization' is really needed only at very specific moments, when the GPU is producing memory corruption on the screen. But how to catch which moment?

I still suspect the memory layout of those memory objects, since when I see corruption it is usually in specific places (e.g. while editing this Firefox input widget, only certain characters at certain positions are rendered with errors).

How can I try increasing the memory alignment?
(e.g. each object only at a 16KB boundary?)
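
For reference, rounding an offset up to a 16KB boundary is plain power-of-two arithmetic. The sketch below shows only that arithmetic; where such an alignment would have to be applied in the driver or kernel is not something this thread establishes, and `align_up()` and `is_aligned()` are hypothetical helpers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round value up to the next multiple of alignment.
 * alignment must be a non-zero power of two. */
static uint64_t align_up(uint64_t value, uint64_t alignment)
{
    return (value + alignment - 1) & ~(alignment - 1);
}

/* True if value is already a multiple of alignment (a power of two). */
static bool is_aligned(uint64_t value, uint64_t alignment)
{
    return (value & (alignment - 1)) == 0;
}
```

With a 16KB (0x4000) alignment, an offset of 0x5001 rounds up to 0x8000, while an offset that is already on a boundary is returned unchanged.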

penalvch (penalvch)
Changed in xserver-xorg-video-intel:
importance: Medium → Undecided
status: In Progress → New
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in xorg (Ubuntu):
status: New → Confirmed
penalvch (penalvch)
affects: xserver-xorg-video-intel → xorg (Ubuntu)
Changed in xorg (Ubuntu):
status: New → Invalid
no longer affects: xorg (Ubuntu)