screen artifacts after resume, part row of pixels in error (945GM)

Bug #91966 reported by lcampagn
Affects                             Status         Importance   Assigned to   Milestone
xf86-video-intel                    Fix Released   Medium
xserver-xorg-video-intel (Ubuntu)   Fix Released   High         Unassigned

Bug Description

Binary package hint: xserver-xorg-video-i810

I have a Lenovo X60 tablet with an SXGA+ (1400x1050) screen running the latest feisty. The graphics controller is reported by lspci as "Intel Corporation Mobile 945GM/GMS/940GML Express Integrated Graphics Controller (rev 03)" (8086:27a2), and I'm using the i810 driver + DRI in my xorg.conf.

When I resume from suspend, I get a strange artifact on my screen. It appears that the pixels in row 975 from column 0 to 375 are swapped with the pixels in row 752 from column 1024 to 1400. Sometimes applications will get very messy because of this, for example scrolling my browser window causes a big smear on the right side of colors that are taken from the bottom-left of the screen. I can refresh the screen to get rid of the smear, but the original two lines mentioned above always remain swapped.

I have tried turning off 915resolution (I need this to achieve the full screen resolution), tried changing various video-related settings in /etc/default/acpi-support, and obviously tried restarting the X server.

ProblemType: Bug
Architecture: i386
Date: Tue Mar 13 12:53:37 2007
DistroRelease: Ubuntu 7.04
Uname: Linux underwood 2.6.20-9-generic #2 SMP Mon Feb 26 03:01:44 UTC 2007 i686 GNU/Linux

Revision history for this message
newsie (spam-newsie) wrote :

having the same issue with an identical configuration. please let me know if you need more info.

Revision history for this message
lcampagn (luke-campagnola) wrote :

This problem can be avoided by using s2ram (package uswsusp) to suspend, but of course this causes all kinds of other problems. I'm trying to see if I can get the system suspend script to implement VGA state saving the way s2ram does.

Revision history for this message
Robert Thau (rst-alum) wrote :

FYI, I am seeing very similar artifacts after suspend on a Panasonic CF-Y5 running Kubuntu feisty, so it's not just the Thinkpads. FWIW, even starting a new X server (separate session).

Revision history for this message
Robert Thau (rst-alum) wrote :

Ummm... even starting a new X server (separate session) does not eliminate the artifact. Sorry 'bout that.

Revision history for this message
Manfred Haelters (manfred-haelters) wrote :

Same problem here... Also on Thinkpad X60 tablet

Revision history for this message
lcampagn (luke-campagnola) wrote :

It looks like this driver will eventually replace the i810 driver, and the problem exists in both so I'm moving the bug report to the intel driver.

Changed in xserver-xorg-video-intel:
status: New → Confirmed
Revision history for this message
Bryce Harrington (bryce) wrote :

Is this issue still occurring on gutsy-beta?

Revision history for this message
Eitan Isaacson (eeejay) wrote :

Yes.

I forgot all about it in Feisty since I was using my own built xserver-xorg-video-i810 version 1.9.93, since the bundled i810 driver (1.7.*) had this bug.
Now in Gutsy I am using the xserver-xorg-video-intel driver, and it gives me exactly the same artifact (see screenshot).

Revision history for this message
In , Eitan-isaacson (eitan-isaacson) wrote :

After resuming from suspend I get a pixel-line of the screen which is
copied to another location of the screen (see attached PNG). This bug is similar to bug 9381 which I also encountered when I was using i810_drv which is the default in Ubuntu Feisty. I fixed the problem in Feisty by building and installing version 1.9.93 of the i810 driver. It seems that this problem has recurred in the intel driver.

This is a Fujitsu Lifebook S7110.

Other interesting info I was asked to provide last time:

[ 34.480000] [drm] Initialized drm 1.1.0 20060810
[ 34.480000] [drm] Initialized i915 1.6.0 20060119 on minor 0

Revision history for this message
In , Eitan-isaacson (eitan-isaacson) wrote :

Created an attachment (id=11890)
Screenshot of corruption

Revision history for this message
In , Eitan-isaacson (eitan-isaacson) wrote :

Sorry for the spam.
The driver version:
xserver-xorg-video-intel 2:2.1.1-0ubuntu4

Revision history for this message
In , Eitan-isaacson (eitan-isaacson) wrote :

I found the actual git changeset that caused this regression:
a4f1a7872f6f959bb4bc6568face710bee3589de

Revision history for this message
Eitan Isaacson (eeejay) wrote : Re: screen artifacts after resume with i810 driver

I pinpointed the changeset in which this happens on the xorg git repo. This is the fd.o bug I opened:
https://bugs.freedesktop.org/show_bug.cgi?id=12667

Revision history for this message
Sam Peterson (peabodyenator) wrote :

I am experiencing the exact same problem and I have a screen shot of my own to add

Revision history for this message
Sam Peterson (peabodyenator) wrote :

Forgot to mention that I'm on a compaq presario c571nr with the 945GM chipset.

Revision history for this message
unggnu (unggnu) wrote :

@eeejay
Could you please recheck it with final Gutsy, maybe with the Live CD? If it still happens could you please post your /var/log/Xorg.0.log and try it with disabled DRI?

Changed in xserver-xorg-video-intel:
status: Confirmed → Incomplete
Revision history for this message
Eitan Isaacson (eeejay) wrote :

I'm updated to final Gutsy, but I also tried the final live CD as you suggested.
No luck.
Attached is the log, as per your request.

Revision history for this message
Eitan Isaacson (eeejay) wrote :

Oh, I forgot to mention: When I disable DRI I don't get any artifacts. I'll attach a log for that too.

Revision history for this message
Peter Clifton (pcjc2) wrote :

Can you attach your /etc/default/acpi-support config file?

You might like to play with the options there and see if they have any effect. On my 945GM, I have to POST the device after resume or nasty timing is sent to the LCD (very bad for them), but otherwise there is some flexibility in setting whether the PCI config is written back, or the vbetool program is used to save and restore video state. I don't need either of these for my laptop, but it may well depend on your BIOS.

Once this bug manifests, does it always persist?

There is a register dumping utility as part of the intel driver source which might be useful to run before and after the problem appears for a comparison. As it's DRI related though, it could possibly be that the state which is being messed about is in the DRI client, or the chip's 3D engine, so we won't see those in the normal register dump. (If necessary, I can supply a compiled version of the dump utility, or point you where to get the sources).

Also, if altering the settings in /etc/default/acpi-support doesn't help, could you try the package at:
http://www2.eng.cam.ac.uk/~pcjc2/ubuntu/xserver-xorg-video-intel_2.1.1-0ubuntu10~pcjc2_i386.deb

This has some unreleased fixes, one of which relates to 3D / DRI state after resume and manifested as texture corruption on 855 hardware.

Revision history for this message
Peter Clifton (pcjc2) wrote :

Apologies, that is the wrong link and doesn't include the 3D fix. This one:

http://www2.eng.cam.ac.uk/~pcjc2/ubuntu/xserver-xorg-video-intel_2.1.1-0ubuntu10~pcjc2.2_i386.deb

Revision history for this message
Juha Tiensyrjä (juha-tiensyrja) wrote :

The above package does not fix the problem for me, nor does disabling DRI.

I have a Lenovo ThinkPad X60s using an external monitor, and when the bug manifests, it does persist.

Distro: Ubuntu 7.10 (Gutsy)
uname -a: Linux tele 2.6.22-14-generic #1 SMP Sun Oct 14 23:05:12 GMT 2007 i686 GNU/Linux
lspci:
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)

Can I provide any additional info?

Revision history for this message
In , hey560 (hey560) wrote :

I can confirm this bug has resurfaced. It was fixed in the 1.9.9x branch as described in bug #9381.

I see this bug on Ubuntu Gutsy Gibbon (7.10) final using xserver-xorg-driver-intel version 2.1.1.

Revision history for this message
Peter Clifton (pcjc2) wrote : Re: screen artifacts after resume with i810 driver

With the changeset pinpointed, it should be a lot easier to fix. Thanks eeejay (I'd missed that comment before now).

Peter Clifton (pcjc2)
Changed in xserver-xorg-video-intel:
status: Incomplete → Confirmed
Revision history for this message
Eitan Isaacson (eeejay) wrote :

Unlike for Juha, disabling DRI and then doing suspend/resume actually helps for me.
All of your suggestions Peter, and the package you pointed at don't help.

Revision history for this message
Juha Tiensyrjä (juha-tiensyrja) wrote :

Actually, disabling DRI did help.

Sorry about that, I didn't realise I needed to put a 'Disable "dri"' in my xorg.conf, instead I just commented out everything that said DRI in it.

Revision history for this message
In , Gordon Jin (gordon-jin) wrote :

So it's said to be a regression caused by Eric's commit.

Eitan, could you attach X log?

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Bug #134464 is marked as a duplicate of this bug. They are related, but I don't think the corrupt row of pixels has anything to do with suspend/resume. I think that's a spurious coincidence.

I see the same screen corruption people have described (and shown in pictures), but _without_ suspend/resume.

The corrupt part row of pixels is visible after first starting X, as soon as I enable dual head mode with xrandr.

Suspend/resume makes no difference.

I tried Peter Clifton's package linked earlier here and that doesn't fix the corruption for me.

As I reported in #134464, Bryce Harrington's git head package from October 4th does fix the screen corruption for me. (It also fixes Xv). I didn't change X or ACPI config in any way. No need to disable DRI, touch ACPI, etc.

So I suggest other people try this package, and see if it fixes the problem for them:

http://people.ubuntu.com/~bryce/Testing/intel/xserver-xorg-video-intel_2.1.1~git20071004-0ubuntu1_i386.deb

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Although the row of corrupt pixels is fixed for me by the above package, 3d acceleration is now broken. This message is present in Xorg.0.log, and wasn't before: "(EE) AIGLX error: drmMap of framebuffer failed (Invalid argument)(EE) AIGLX: reverting to software rendering".

In fact, glxgears is much slower than I would expect even for software rendering, managing about 3fps with the default window size. So slow that it can't count: it reports 330 fps, but it's drawing about 3! I wasn't impressed by the X server crashing on the third run of glxgears either. (Could it be software renderer bugs are getting missed now 3d users tend not to use it?)

So, the above package isn't a proper fix. It may give clues to someone about where to find the fix for the corrupt row of pixels. Or, it may be that whatever causes AIGLX to fail also masks the cause of corruption now.

Revision history for this message
Eitan Isaacson (eeejay) wrote :

I suspect that the reason the corrupt row is disappearing is because DRI is disabled when the server reverts to software rendering. This is something we already knew: disabling DRI makes the screen artifact go away.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

You might find this bug report useful: https://bugs.freedesktop.org/show_bug.cgi?id=12667

I notice that the git changeset which is said to cause the regression (in that report), adds a check and message "Non-contiguous GTT entries".

Interestingly, the package from Bryce (which fixes the screen but breaks 3d) for me results in this message in my Xorg.0.log. Specifically, I get an error message:

(EE) intel(0): Non-contiguous GTT entries: (6295552,0x163ffbe000) vs (131072,0x 3f820000)
(WW) intel(0): xf86AllocateGARTMemory: allocation of 1536 pages failed
        (Cannot allocate memory)

This doesn't appear in my Xorg.0.log with the stock Gutsy driver.
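
As a side note on reading that error line: the first number in each pair is an exact multiple of 4096. Here is a minimal Python sketch that pulls the pairs out of the line and converts them to page indices. It assumes (my guess, not confirmed from the driver source) that the first number is a byte offset into the graphics aperture and that pages are 4096 bytes; the function name is made up for illustration.

```python
import re

PAGE_SIZE = 4096  # assumed GTT page size in bytes

# The error line exactly as quoted from Xorg.0.log above.
LOG_LINE = ("(EE) intel(0): Non-contiguous GTT entries: "
            "(6295552,0x163ffbe000) vs (131072,0x 3f820000)")

def parse_gtt_pairs(line):
    """Return [(offset, address), ...] found in a 'Non-contiguous GTT entries' line."""
    # The second pair is printed with padding after '0x', so allow whitespace there.
    found = re.findall(r"\((\d+),0x\s*([0-9a-fA-F]+)\)", line)
    return [(int(offset), int(address, 16)) for offset, address in found]

pairs = parse_gtt_pairs(LOG_LINE)
for offset, address in pairs:
    page, remainder = divmod(offset, PAGE_SIZE)
    print(f"offset {offset} -> page {page} (remainder {remainder}), address {address:#x}")
```

Under those assumptions the two offsets land on page boundaries, at pages 1537 and 32.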

Revision history for this message
Peter Maydell (pmaydell) wrote :

>disabling DRI makes the screen artifact go away.

I'm not so sure. I've seen this corruption on my multihead configuration, and that never has DRI enabled, because the total width is > 2048 pixels so the Intel driver is forced to disable DRI on the 945 chipset...

Revision history for this message
Peter Clifton (pcjc2) wrote :

Is there any commonality amongst the people who see this bug as to whether they are on 32bit or 64 bit systems?

Revision history for this message
Peter Maydell (pmaydell) wrote :

32 bit here.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote : Re: [Bug 91966] Re: screen artifacts after resume, part row of pixels in error (945GM)

Peter Maydell wrote:
> >disabling DRI makes the screen artifact go away.
>
> I'm not so sure. I've seen this corruption on my multihead
> configuration, and that never has DRI enabled, because the total width
> is > 2048 pixels so the Intel driver is forced to disable DRI on the 945
> chipset...

I'm intrigued by you getting > 2048 pixels width. Whenever I've
tried, xrandr has simply refused, so I have to have dual head with one
screen below the other. I also have 945GM.

Let's try it now:

jamie@amilo:~$ xrandr --output TMDS-1 --auto --output LVDS --auto --right-of TMDS-1
xrandr: screen cannot be larger than 1680x1880 (desired size 2960x1050)

Nope, doesn't work. How do you get it > 2048 pixels width?

I have the impression it's 3d specifically which is disabled by the
2048 pixel limit, but not DRI/DRM in general and all the different
style of memory management which goes with it?

-- Jamie

Revision history for this message
Peter Clifton (pcjc2) wrote :

Jamie.. 32bit or 64bit?

(EE) intel(0): Non-contiguous GTT entries: (6295552,0x163ffbe000) vs (131072,0x 3f820000)
                                                                                             ^ Greater than 32 bits... (ok, a 32bit processor can do this math, but it is an interesting result given the code)

The code operates using the uint64_t type, however on anything but G33 class and 965 HW, the page table bits which correspond to an address space > 32bits are masked out. Something "strange" is going on with the above message. By my reading of the code, this shouldn't happen if you've got a 945GM
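
To make the "greater than 32 bits" point concrete, a small Python sketch follows. The 32-bit mask is only a stand-in for the masking described above, not the driver's actual code, and the helper name is hypothetical.

```python
MASK_32 = 0xFFFFFFFF  # assumption: stand-in for "address bits above 32 are masked out"

def fits_in_32_bits(address):
    """True if the address survives a 32-bit mask unchanged."""
    return address == (address & MASK_32)

first = 0x163ffbe000   # first address from the error message
second = 0x3f820000    # second address from the error message

print(first.bit_length(), fits_in_32_bits(first))
print(second.bit_length(), fits_in_32_bits(second))
print(hex(first & MASK_32))
```

The first address needs 37 bits, so it could not have come out of a value that had already been masked to 32 bits; the second needs only 30.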

This commit looks generally interesting...

commit f3168e3b0c5664a322ca6bb1c81fc94844cb30ab
Author: Eric Anholt <email address hidden>
Date: Wed May 2 14:08:30 2007 -0700

    Disable non-working GTT decoding on i830, and fix map/unmap of GTT.

Specifically the latter part, which adds these lines (I'm ignoring the 830 changes since this bug is hitting 945 hardware):

@@ -584,6 +587,15 @@ I830UnmapMMIO(ScrnInfoPtr pScrn)
    xf86UnMapVidMem(pScrn->scrnIndex, (pointer) pI830->MMIOBase,
                   I810_REG_SIZE);
    pI830->MMIOBase = NULL;
+
+   if (IS_I9XX(pI830)) {
+      if (IS_I965G(pI830))
+         xf86UnMapVidMem(pScrn->scrnIndex, pI830->GTTBase, 512 * 1024);
+      else {
+         xf86UnMapVidMem(pScrn->scrnIndex, pI830->GTTBase,
+                         pI830->FbMapSize / 1024);
+      }
+   }
 }

 static Bool
@@ -2279,6 +2291,9 @@ I830ScreenInit(int scrnIndex, ScreenPtr pScreen, int argc, char **argv)
       pI830->used3D = pI8301->used3D;
    }

+   /* Need MMIO mapped to do GTT lookups during memory allocation. */
+   I830MapMMIO(pScrn);
+
 #if defined(XF86DRI)
    /*
     * If DRI is potentially usable, check if there is enough memory available
@@ -2419,6 +2434,8 @@ I830ScreenInit(int scrnIndex, ScreenPtr pScreen, int argc, char **argv)
       allocation_done = TRUE;
    }

+   I830UnmapMMIO(pScrn);
+
    i830_describe_allocations(pScrn, 1, "");

    if (!IS_I965G(pI830) && pScrn->displayWidth > 2048) {

Revision history for this message
Peter Clifton (pcjc2) wrote :

I didn't quite realise what the above commit tells us (forgot to check when that commit was in the source, and whether it was included in 2.1.1: Answer.. it was)

commit f3168e3b0c5664a322ca6bb1c81fc94844cb30ab
Author: Eric Anholt <email address hidden>
Date: Wed May 2 14:08:30 2007 -0700

    Disable non-working GTT decoding on i830, and fix map/unmap of GTT.

commit a4f1a7872f6f959bb4bc6568face710bee3589de
Author: Eric Anholt <email address hidden>
Date: Mon Apr 30 17:13:09 2007 -0700

    Allow physical-memory allocations within stolen memory.

So, eeejay, I'm not entirely surprised there was a regression at the a4f1... commit, the more interesting question.. was it fixed again later with f3168... (and broken again at some later point?)

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Peter Clifton wrote:
> Jamie.. 32bit or 64bit?

32-bit CPU, Core Duo.

> (EE) intel(0): Non-contiguous GTT entries: (6295552,0x163ffbe000) vs (131072,0x 3f820000)
> ^ Greater than 32 bits... (ok, a 32bit processor can do this math, but it is an interesting result given the code)
>
> The code operates using the uint64_t type, however on anything but G33
> class and 965 HW, the page table bits which correspond to an address
> space > 32bits are masked out. Something "strange" is going on with the
> above message. By my reading of the code, this shouldn't happen if
> you've got a 945GM

That's a pretty clear bug then. It's always nice to have a clear bug.

Revision history for this message
Peter Clifton (pcjc2) wrote :

Jamie, do you consistently get the above error about non-contiguous GTT entries? This code doesn't actually look to have changed since 2.1.1, although lots of other stuff dealing with memory allocation has.

Some possibilities:
Subtle 64bit code bug with the 64bit data-types used, caused by incompatible build / target systems (I don't know enough about gcc or 64bit archs to say really).
Smashed stack / memory corruption in the code somewhere
Bad logic detecting which chips don't have >32bit page table entries.

When I get a moment, I'll write a program to dump out the page-table on the 945 series hardware, and we can take a look what these areas look like.

eeejay, are you willing to keep testing git driver versions past commit a4f1a7872f6f... ?

Revision history for this message
Peter Maydell (pmaydell) wrote :

>How do you get it > 2048 pixels width?

Your xorg.conf has to be set up with a Virtual screen size at least as big as the largest you want to resize it to. See my xorg.conf over on bug 134464. (PS: careful with running OpenGL programs in that configuration, last time I tried it it made the X server dump core [known bug, in launchpad somewhere])

Revision history for this message
In , Jesse Barnes (jbarnes-virtuousgeek) wrote :

A log from before a4f1a7872f6f959bb4bc6568face710bee3589de and after might be helpful too, since they may have different memory layouts.

Revision history for this message
Peter Clifton (pcjc2) wrote :

Can people experiencing this issue try running the register dumper program I've hacked together from the "intel_reg_dumper" utility in the intel driver source tree...

I'm attaching it here, as external access to where I usually host packages for testing is down at the moment.

Revision history for this message
Peter Clifton (pcjc2) wrote :

Compiled binary produced with the above. (Needs to be run as root to memory map the registers required).

Revision history for this message
Peter Clifton (pcjc2) wrote :

Sam, (Picking on you because your screen-res looks the same as mine, and as such the debugging is more similar), can you post your Xorg.0.log with the option

Option "ModeDebug" "True"

in the device section for the video card in the xorg.conf?

Thanks, Peter C.

Revision history for this message
Peter Clifton (pcjc2) wrote :

Sorry, actually, it's eeejay's screenshot which matches my resolution.

Revision history for this message
Eitan Isaacson (eeejay) wrote :

Here is my X log after a suspend/resume, with the debug option enabled.

Revision history for this message
Eitan Isaacson (eeejay) wrote :

And Peter, here is the register dump I made with your patch.

Cheers!

Revision history for this message
Peter Clifton (pcjc2) wrote :

Thanks... as hoped, it was very interesting...

At least one of the corruptions starts at the very end of stolen memory, pagetable entry 1982.

pagetable entry 1981, value 0x87ffbd001
pagetable entry 1982, value 0x87ffbe001 <-------- Last physical page in the stolen address range? (Also mapped to 35654 other pages!)
pagetable entry 1983, value 0x82fe5a001

Interestingly, the page of RAM mapped by that entry is repeated in MANY other mappings, as a scratch page to avoid crashing if a write were made to some invalid page in the page table.

That page also crops up for some reason in the "middle" of stolen ram:

pagetable entry 1536, value 0x87fe00001
pagetable entry 1537, value 0x87ffbe001 <--------- Why is this here??
pagetable entry 1538, value 0x87fe02001

(NB, I don't think it's a coincidence that the other page address where I calculated the corruption was seen was 1536.??)

This is NOT the case in my page table, where the above entries are contiguous, and the last stolen page of ram is NOT used as the scratch page.
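
For anyone who wants to look for such repeats in their own dump, here is a rough Python sketch. It assumes the dump contains lines in the "pagetable entry N, value 0x..." form quoted above; the function name is made up for illustration, and the sample is just the six entries quoted in this comment.

```python
import re
from collections import defaultdict

# Matches lines such as "pagetable entry 1537, value 0x87ffbe001".
ENTRY_RE = re.compile(r"pagetable entry (\d+), value (0x[0-9a-fA-F]+)")

def find_shared_pages(dump_text):
    """Map each page-table value used more than once to the entries that use it."""
    entries_by_value = defaultdict(list)
    for match in ENTRY_RE.finditer(dump_text):
        index = int(match.group(1))
        value = int(match.group(2), 16)
        entries_by_value[value].append(index)
    return {v: idx for v, idx in entries_by_value.items() if len(idx) > 1}

# Sample: only the entries quoted in this comment, not a full dump.
SAMPLE = """\
pagetable entry 1536, value 0x87fe00001
pagetable entry 1537, value 0x87ffbe001
pagetable entry 1538, value 0x87fe02001
pagetable entry 1981, value 0x87ffbd001
pagetable entry 1982, value 0x87ffbe001
pagetable entry 1983, value 0x82fe5a001
"""

shared = find_shared_pages(SAMPLE)
for value, entries in shared.items():
    print(f"{value:#x} is mapped by entries {entries}")
```

On a full dump the scratch page will legitimately show up many times, so the interesting hits are the ones that fall inside the framebuffer range.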

Next debugging steps... figure out if this is a BIOS bug, or an AGP driver bug messing up the page table.

Revision history for this message
In , hey560 (hey560) wrote :

There is a very active discussion of this bug on Ubuntu Launchpad (ubuntubug 91966):

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/91966

It contains Xorg.0.log files with debug on showing the symptoms of this bug.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote : Re: [Bug 91966] Re: screen artifacts after resume, part row of pixels in error (945GM)

Peter Clifton wrote:
> Jamie, do you consistently get the above error about non-contiguous GTT
> entries? This code doesn't actually look to have changed since 2.1.1,
> although lots of other stuff dealing with memory allocation has.

No. Since the previous report, I have rebooted and used the same
driver a few times, and none of those times reported non-contiguous GTT.

Yet, it was quite consistent when I first installed it and restarted X
a few times.

What seems most likely, then, is that running that driver a _second_
time (after it's run once, or after another driver) produces that
non-contiguous GTT and doesn't fix it.

I've now switched back to Gutsy's own driver, because I did have
problems with the driver that removes the screen artifact. (Suspend
was completely unreliable, tending to crash with the too-familiar
"text mode garbage blocks of colour and stuck system", and also the
DVI output was unstable: Occasionally there was a glitch in the
picture, and after 30-60 minutes, the picture would disappear
completely. Gutsy's driver seems 95% stable with suspend, and no DVI
glitch).

Following your ideas in another comment, I looked closer at the screen
artifact, which is the subject of my next comment.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

> Compiled binary produced with the above. (Needs to be run as root to
> memory map the registers required).
>
> ** Attachment added: "intel_reg_dumper"
> http://launchpadlibrarian.net/10214393/intel_reg_dumper

Unfortunately that does not run on Gutsy:

jamie@amilo:~$ sudo ./intel_reg_dumper
./intel_reg_dumper: error while loading shared libraries: libpciaccess.so.0: cannot open shared object file: No such file or directory

Can you provide a statically linked version, or say which binary
package needs to be installed for that lib, or simply provide the
binary library file? Thanks.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Following an idea from some other comment, I decided to look at the
screen corruption more carefully.

I thought it was drawing the wrong thing sometimes, but not always,
because I could drag a window over the affected row and it would be
corrected.

But on looking closer, I see there are _two_ corrupt partial rows, and
at least one of them is _always_ corrupt.

The form of the corruption is that whatever is drawn in one row,
immediately appears in the other, and vice versa. In other words,
both rows always have exactly the same pixels as each other.

So when I drag a window over the first row, it appears to clean that
row up, but in fact the _other_ row gets an exact copy of the pixels
drawn to the first row while the window is being dragged over it.

And vice versa: when I drag a window over the second row, the first
row gets an exact copy of the pixels drawn to the second row. So both
rows always contain identical pixels, but which values depends on
which area of the screen was drawn to most recently.

This does look like it might be a memory mapping problem, perhaps
something like a page shared incorrectly.
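
To illustrate why a shared page would behave this way, here is a toy Python model. It is not how the hardware or the driver is implemented; the class and the page names are invented purely to show two framebuffer pages resolving to the same physical page.

```python
class ToyFramebuffer:
    """Toy model: framebuffer pages reach memory through a page table."""

    def __init__(self, page_table):
        self.page_table = page_table  # framebuffer page -> physical page
        self.physical = {}            # physical page -> contents

    def write(self, fb_page, pixels):
        self.physical[self.page_table[fb_page]] = pixels

    def read(self, fb_page):
        return self.physical.get(self.page_table[fb_page])

# Pages 1 and 2 are (incorrectly) mapped to the same physical page.
fb = ToyFramebuffer({0: "phys_a", 1: "phys_shared", 2: "phys_shared"})

fb.write(1, "window pixels")
print(fb.read(2))   # page 2 now shows what was drawn to page 1

fb.write(2, "desktop pixels")
print(fb.read(1))   # and page 1 shows what was drawn to page 2
```

Whichever of the two pages was drawn last decides what both show, which matches the behaviour described above.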

To help debug, I looked at the coordinates.

I have a dual head setup, with LVDS below TMDS-1. LVDS is 1280x800,
TMDS-1 is 1680x1050. xrandr says:

Screen 0: minimum 320 x 200, current 1680 x 1850, maximum 1680 x 1880
LVDS connected 1280x800+400+1050
TMDS-1 connected 1680x1050+0+0

The two corrupted part rows are on TMDS-1, at these coordinates:

    (1024,736) to (1679,736)
    (0,959) to (655,959)

Perhaps the coordinates will relate to addresses in framebuffer memory
in some useful way for debugging.

I was a little surprised to find they are 656 pixels wide - that seems
a peculiar number.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

> jamie@amilo:~$ sudo ./intel_reg_dumper
> ./intel_reg_dumper: error while loading shared libraries: libpciaccess.so.0: cannot open shared object file: No such file or directory

Fixed by (duh) installing the libpciaccess0 package.

The register and page table dump is attached, and it's quite revealing.
My result is basically the same as Peter's analysis of eeejay's register dump.

The coordinates of my display corruption, (1024..1679, 736) and (0..655, 959) correspond _exactly_ with page table entries 1537 and 1982. This is calculated from DSPABASE and DSPASTRIDE.
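
A sketch of that calculation in Python. The actual DSPABASE and DSPASTRIDE values are not quoted in this comment, so the base (0x40000), stride (8192 bytes) and 4 bytes per pixel below are assumptions, picked because they reproduce the page numbers reported here. The function name is made up.

```python
PAGE_SIZE = 4096
BYTES_PER_PIXEL = 4     # assumption: 32-bit colour depth
ASSUMED_STRIDE = 8192   # assumption: stands in for DSPASTRIDE (bytes per row)
ASSUMED_BASE = 0x40000  # assumption: stands in for DSPABASE (byte offset)

def page_for_pixel(x, y, base=ASSUMED_BASE, stride=ASSUMED_STRIDE):
    """Page-table entry covering pixel (x, y) under the assumed layout."""
    offset = base + y * stride + x * BYTES_PER_PIXEL
    return offset // PAGE_SIZE

first_row = {page_for_pixel(x, 736) for x in (1024, 1679)}
second_row = {page_for_pixel(x, 959) for x in (0, 655)}
just_left = page_for_pixel(1023, 736)

print(first_row, second_row, just_left)
```

With those numbers a page holds 1024 pixels, so the first corrupt run starts exactly on a page boundary at column 1024, and its 656-pixel width is simply the visible remainder of the 1680-pixel row (1680 - 1024 = 656).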

The cause of the corruption is without doubt, page table entry 1537, shown in context:

pagetable entry 1534, value 0x83fdfe001
pagetable entry 1535, value 0x83fdff001
pagetable entry 1536, value 0x83fe00001
pagetable entry 1537, value 0x83ffbe001 <- anomaly
pagetable entry 1538, value 0x83fe02001
pagetable entry 1539, value 0x83fe03001
pagetable entry 1540, value 0x83fe04001
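
One way to pick out an entry like that automatically, sketched in Python. It assumes the dump line format shown above and that neighbouring framebuffer entries normally step by 0x1000; the function name is hypothetical.

```python
import re

PAGE_STEP = 0x1000  # assumption: consecutive entries normally differ by one page
ENTRY_RE = re.compile(r"pagetable entry (\d+), value (0x[0-9a-fA-F]+)")

def find_stray_entries(dump_text):
    """Return entries that break an otherwise contiguous run of three."""
    entries = [(int(i), int(v, 16)) for i, v in ENTRY_RE.findall(dump_text)]
    stray = []
    for (ia, va), (ib, vb), (ic, vc) in zip(entries, entries[1:], entries[2:]):
        consecutive = ib == ia + 1 and ic == ib + 1
        neighbours_agree = vc == va + 2 * PAGE_STEP
        # The middle entry is stray if its neighbours line up but it does not.
        if consecutive and neighbours_agree and vb != va + PAGE_STEP:
            stray.append(ib)
    return stray

SAMPLE = """\
pagetable entry 1534, value 0x83fdfe001
pagetable entry 1535, value 0x83fdff001
pagetable entry 1536, value 0x83fe00001
pagetable entry 1537, value 0x83ffbe001
pagetable entry 1538, value 0x83fe02001
pagetable entry 1539, value 0x83fe03001
pagetable entry 1540, value 0x83fe04001
"""

stray = find_stray_entries(SAMPLE)
print(stray)
```

This only catches a single odd entry between two well-behaved neighbours; it would not flag a region that is scattered throughout.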

Page table entry 1982 maps to the same physical page, and this is why both areas of the display always show the same pixels:

pagetable entry 1979, value 0x83ffbb001
pagetable entry 1980, value 0x83ffbc001
pagetable entry 1981, value 0x83ffbd001
pagetable entry 1982, value 0x83ffbe001

The page table after 1982 is all over the place, despite being in the middle of the framebuffer. This doesn't cause screen artifacts, but it might be indicative of other problems which are hidden, if it's not supposed to be like this:

pagetable entry 1983, value 0x836d55001
pagetable entry 1984, value 0x836d20001
pagetable entry 1985, value 0x836d4d001
pagetable entry 1986, value 0x836d4e001
pagetable entry 1987, value 0x836d66001
pagetable entry 1988, value 0x836dcf001

Also as with Peter's comment, the suspect entry's value is 0x83ffbe001, and it is used by many later page table entries too. (But it's not the only value which is used by many page table entries; 0x8371cc001 is also.)

So there we go. Corrupt page table (or erroneous allocation) is the reason for the display corruption, but what is the reason for the corrupt page tables?

Revision history for this message
Peter Clifton (pcjc2) wrote :

[snip] what is the reason for the corrupt page tables?

Urm... pass on that one ;)

Jamie? Does the page table look the same before suspend? (Or when the bug is not showing?)
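
For comparing a dump taken before suspend with one taken after, something along these lines might do. It is a Python sketch assuming the "pagetable entry N, value 0x..." line format; the function names are made up, and the "before" value in the sample is invented for illustration (only the "after" value comes from this thread).

```python
import re

ENTRY_RE = re.compile(r"pagetable entry (\d+), value (0x[0-9a-fA-F]+)")

def parse_dump(text):
    """Return {entry index: value} for every page-table line in a dump."""
    return {int(i): int(v, 16) for i, v in ENTRY_RE.findall(text)}

def changed_entries(before_text, after_text):
    """Return {entry index: (before, after)} for entries whose value differs."""
    before, after = parse_dump(before_text), parse_dump(after_text)
    return {
        i: (before.get(i), after.get(i))
        for i in sorted(set(before) | set(after))
        if before.get(i) != after.get(i)
    }

# Hypothetical "before" dump: entry 1537 holds an invented contiguous value.
BEFORE = """\
pagetable entry 1536, value 0x83fe00001
pagetable entry 1537, value 0x83fe01001
pagetable entry 1538, value 0x83fe02001
"""
AFTER = """\
pagetable entry 1536, value 0x83fe00001
pagetable entry 1537, value 0x83ffbe001
pagetable entry 1538, value 0x83fe02001
"""

changes = changed_entries(BEFORE, AFTER)
for index, (old, new) in changes.items():
    print(f"entry {index}: {old:#x} -> {new:#x}")
```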

Since the page-table can be accessed via mapped memory in the video-card's memory/IO/register space, it seems possible that an errant 3D instruction / other GPU command could actually trample it. Or.. (perhaps more likely), it could be the allocator routines in the driver / DRI manager / "AGP" driver etc..

As far as I could tell, the entries are removed and re-added after suspend / resume, so it doesn't seem like it would be the BIOS's fault for messing them up.. however if the BIOS is bad, and executes SMI code on some system event, it does have the ability to trample pretty much anything.

I have 945GM and don't see this bug, haven't been able to reproduce it (even trying to force the exact same resolution, reproduction steps etc..). Which does somewhat suggest that the bug is at least triggered by something not common to all machines showing this bug.

The Intel driver guys know about this bug, and hopefully they will have better insight as to the cause. I've passed along the useful information from the page-table dumping (and a link to this bug).

Revision history for this message
Phi (ladyphi) wrote :

Aha! I finally found the right bug report. I have a Fujitsu Lifebook s7110 with an Intel 945GM/GMS graphics chip that has the exact same symptoms as described here. I have seen everything from pixel swapping after resume (and also when I first boot up, but only if GRUB times out...odd) to "text mode garbage blocks of colour and stuck system" when I switched drivers in Feisty to fix it. Been searching for a solution for ages.

Since this bug is off to Intel now, I'm not sure how to help at the moment. I'll keep watching this thread.

Revision history for this message
Phi (ladyphi) wrote :

I'll post the output of intel_reg_dumper without the swapped pixels (before suspend), as well as after a suspend-resume with swapped pixels. Here is the before.

Revision history for this message
Phi (ladyphi) wrote :

And here is the after.

If it helps, my screen resolution is 1440x1050 and I'm running up to date Xubuntu 7.10.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Found it!
I'll put y'all out of your misery. The workaround is:

    Edit /etc/default/acpi-support, to comment out "POST_VIDEO=true".

This is hardly a general solution for Gutsy (then again, acpi-support seems to be a seething mass of hacks upon hacks already, it would seem to fit in somewhere...).

Basically, it's not suspend/resume which causes the page table corruption, it's the "vbetool post" command. You can cause identical effects without suspending, using this command from inside a terminal window:

    sudo sh -c 'chvt 1; sleep 2; vbetool post; vbetool vbestate restore </var/lib/acpi-support/vbestate; sleep 2; chvt 7'

If you run that command after booting and starting X, from a terminal window, then when it finally returns to showing X, the screen artifact will be present, and it may be causing more serious random memory corruption behind the scenes (which hasn't been noticed so far).

If you don't run that command during suspend/resume (by editing /etc/default/acpi-support), then there's no screen artifact or page table corruption.
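
The edit itself is a one-line change. Here is a minimal Python sketch of it that works on a string, so it can be tried without touching the real file. The function name and the OTHER_OPTION line in the sample are placeholders, not taken from the actual acpi-support file.

```python
def comment_out_post_video(config_text):
    """Prefix any active POST_VIDEO= line with '#', leaving other lines alone."""
    result = []
    for line in config_text.splitlines(keepends=True):
        if line.lstrip().startswith("POST_VIDEO="):
            line = "#" + line
        result.append(line)
    return "".join(result)

# Placeholder content standing in for /etc/default/acpi-support.
SAMPLE = "OTHER_OPTION=true\nPOST_VIDEO=true\n"

edited = comment_out_post_video(SAMPLE)
print(edited, end="")
```

Running it on already-edited text changes nothing, because a line that starts with "#" no longer matches.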

The other vbetool commands: vbetool vbemode get; vbetool vbestate restore don't seem to affect the page table *in my tests*. The vbestate restore could in principle clobber the page table; this is no guarantee that it doesn't.

vbetool vbestate restore is still required, unfortunately, by ACPI resume, to restore the text console. It does *not* seem to be required for X to resume, only the text console fails to display if you disable this command.

Perhaps the cleanest fix is either (1) to have "vbetool post" save and restore the page table around POST, or (2) for the kernel to implement memory protection when executing VBE calls and detect such changes and update its own structures. Is the page table a generic AGP page table, independent of the video driver, which the kernel tracks?

Now that I have a fix which works on my laptop, I'm happy. (Just comment out POST_VIDEO in /etc/default/acpi-support). Still, it will be nice when the bug is fixed in the general distribution, without a laptop-specific workaround.

Peter Clifton wrote:
> [snip] what is the reason for the corrupt page tables?
> Urm... pass on that one ;)
> Jamie? Does the page table look the same before suspend? (Or when the bug is not showing?)

At first I thought I saw the screen corruption after boot, without suspending, but when I tried it today I was surprised to see no corruption after a fresh boot (and power cycle), and then it appeared after suspend/resume... I guess I was wrong about it appearing straight away, then.

(The reason I didn't respond earlier was because I didn't reboot my laptop for a couple of weeks, as I had work on it I didn't want to lose track of. It's nice to see that suspend/resume are reliable enough to use them several times a day for a couple of weeks now! Earlier versions of this driver often crashed during suspend, or during switch to console. It seems reliable now.)

Then I suspected one of the steps in the ACPI process, and looked in the scripts in /etc/acpi/{suspend,resume}.d for clues. Then I systematically did some tests with your intel_reg_dumper, and here's what I found....

Attached is a...

Revision history for this message
lcampagn (luke-campagnola) wrote :

That fixed it for me! I'm sure I fussed with that option before, but I'm
sure glad you did too :)

On Nov 18, 2007 11:47 PM, Jamie Lokier <email address hidden> wrote:

> Found it!
> I'll put y'all out of your misery. The workaround is:
>
> Edit /etc/default/acpi-support, to comment out "POST_VIDEO=true".
>

Revision history for this message
Eitan Isaacson (eeejay) wrote :

Wow Jamie, thanks for all that.
I don't understand 3/4 of it, but at least I got rid of that artifact :)

Revision history for this message
Sam Peterson (peabodyenator) wrote :

This is great guys!!! Works like a charm for me. Glad this is fixed.

Revision history for this message
Phi (ladyphi) wrote :

This works for me as far as suspend/resume goes (yay!), but oddly I still get the artefact if I let GRUB time out when booting. It makes me wonder what's different between booting normally, and having a timer run out, then booting. Perhaps it's a separate issue from this bug.

Revision history for this message
Gordon Jin (gordon-jin) wrote :

Jesse made some suspend/resume fixes in the 2.2 release. Could any of you validate with the 2.2 release or git tip?

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Phi wrote:
> This works for me as far as suspend/resume goes (yay!), but oddly I
> still get the artefact if I let GRUB time out when booting. It makes me
> wonder what's different between booting normally, and having a timer run
> out, then booting. Perhaps it's a separate issue from this bug.

Amazingly, I've just noticed the same thing.

All my tests when I wrote that long post were from cold boot, after
power cycling, but I always booted from GRUB without waiting for the
timeout.

It's hard to think of any difference that may cause.

Today, I happened to do a power cycle again (battery flat), and this
time I let the GRUB boot timeout run out and... now I have the corrupt
row of pixels on my screen, from a fresh boot without suspending.

I've attached the register dump. This is after booting from GRUB with timeout, (then some probably irrelevant distractions involving single user mode), then to X. X started showing the same picture on internal and external displays. Then I did xrandr to start dual head, and saw the corrupt half-row of pixels. Previously xrandr hasn't changed the page table entries, so I'd guess it hasn't this time.

The register dump shows the usual anomaly at page table 1537, set to 0x83ffbe001, the same as entry 1982 (where it's the end of a series and looks normal).

Interestingly, unlike the state after suspend/resume/POST, entries 1537 and 1982 are the _only_ places where this value appears. Whereas suspend/resume/POST causes that value to appear in lots of entries.
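The check I was doing by eye can be sketched as a small filter: look for places where a given value shows up outside a long contiguous block. This is only a sketch under assumptions. It expects one "INDEX VALUE" pair per line, sorted by index, which is not necessarily the layout of the attached dump. The name find_isolated, the threshold, and the sample values below are all made up for illustration.

```shell
#!/bin/sh
# Sketch only: report runs of TARGET shorter than MIN_RUN entries.
# Assumed input (hypothetical format): "INDEX VALUE" per line, sorted by index.
# find_isolated is an illustrative name, not an existing tool.
find_isolated() {
    awk -v target="$1" -v min_run="$2" '
        # Print the run that just ended if it was suspiciously short.
        function report() {
            if (run > 0 && run < min_run)
                printf "entry %s: run of %d\n", start, run
            run = 0
        }
        # Force a string comparison so hex-looking fields are never
        # converted to numbers.
        ($2 "") == (target "") { if (run == 0) start = $1; run++; next }
        { report() }
        END { report() }
    '
}

# Made-up sample data: one isolated hit at index 1, then a block of three.
printf '%s\n' '0 0x00001001' '1 0x0badf001' '2 0x00002001' \
              '3 0x0badf001' '4 0x0badf001' '5 0x0badf001' |
    find_isolated 0x0badf001 3
```

With the sample above, only the lone entry at index 1 is reported; the block at indices 3 to 5 meets the threshold and is treated as normal.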

Before, I was beginning to suspect the VESA BIOS, but now I'm beginning to suspect the kernel again...

Phi, would you care to do some more tests to see if GRUB timeout _repeatably_ makes the difference?

Revision history for this message
hey560 (hey560) wrote :

The fix from Jamie worked for me too!

A developer over at freedesktop.org suggested trying version 2.2 of the xserver-xorg-video-intel driver. Supposedly there are some suspend/resume fixes.

I would try myself but I can't seem to backport it properly to gutsy.

Here's the link to the bug report: http://bugs.freedesktop.org/show_bug.cgi?id=12667#c8

Revision history for this message
Cay Horstmann (cay) wrote :

FWIW, the suggested fix (Edit /etc/default/acpi-support, comment out "POST_VIDEO=true") works for me as well. Thinkpad X60 with external 1680x1050 monitor, running Gutsy. Thanks so much!

Revision history for this message
Gordon Jin (gordon-jin) wrote :

Without response, I'm assuming it's fixed by 2.2 release. Please reopen if you still see it in the new driver.

Bryce Harrington (bryce)
Changed in xserver-xorg-video-intel:
importance: Undecided → High
Revision history for this message
Bryce Harrington (bryce) wrote :

In the upstream bug report, they say it's fixed in the -intel 2.2 release. Can someone test Hardy alpha2 or newer and verify the issue is fixed? If not, please give feedback on the upstream bug report.

Changed in xserver-xorg-video-intel:
status: Confirmed → Fix Committed
Changed in xserver-xorg-video-intel:
status: Unknown → Fix Released
Revision history for this message
Jamie Lokier (jamie-shareable) wrote : Re: [Bug 91966] Re: screen artifacts after resume, part row of pixels in error (945GM)

Bryce Harrington wrote:
> In the upstream bug report, they say it's fixed in the -intel 2.2
> release. Can someone test Hardy alpha2 or newer and verify the issue is
> fixed? If not, please give feedback on the upstream bug report.

I would test it.

But installing Hardy's X server seems to require a lot of packages to
be updated from Gutsy, and I'm not prepared to risk breaking my laptop
which I use for work for that.

If someone backports the -intel 2.2 server to a Gutsy-compatible
package, I'll be happy to re-test everything I tested before.

Btw, it's hard to see how the 2.2 server could properly fix the page
table corruption when it's not running.... although it's easy to see
how it could hide the symptom or repair the problem when the server is
running. The suspiciously corrupt page table occurs in console text
mode, too, although it has no visible effect.

-- Jamie

Revision history for this message
Timo Aaltonen (tjaalton) wrote :

Trusting upstream that this is fixed, so closing the bug. If you can reproduce it with Hardy, please reopen.

Changed in xserver-xorg-video-intel:
status: Fix Committed → Fix Released
Revision history for this message
Robert (robrwo) wrote :

I'm having a similar problem on my Lenovo ThinkPad Z60m.

The bottom 30 pixels or so are garbled after resuming. If I log off or restart X, the problem goes away.

Also, I can't move the mouse cursor onto that garbled area, and the area does not show up in screenshots.

Revision history for this message
Robert (robrwo) wrote :

Sometimes the bottom is black rather than garbled, but still inaccessible.

I note that when I switch users or log off, the bottom row is used by gdm. But when I log back in, sometimes the problem continues: the bottom row of pixels is inaccessible, and is either black or shows garbled screen artifacts.

I am using Xfce 4.4.1 on Xubuntu Gutsy.

Revision history for this message
Jamie Lokier (jamie-shareable) wrote :

Have just installed Hardy Beta. I can confirm, the bug appears to be fixed in Hardy.

At least, the "vbetool post" command doesn't produce the artifact, and uncommenting "POST_VIDEO=true" (i.e. undoing the workaround) then suspending doesn't produce the artifact either.

Checking the page tables with "intel_reg_dumper", there is no anomaly at page 1537 (nor elsewhere): the "unused page" address appears only in long contiguous blocks, as it should.

Excellent!

Revision history for this message
Jayen (j--n) wrote :

Hey, I'm a debian user, and with xorg's xf86 video intel driver (versions 2.2.1 and 2.3.1) and using grub2 to have a splash image, I don't get any real text in my X sessions. Only text from xterm and xemacs appears. Not windowmaker, gnome, kde, pidgin, firefox, thunderbird, etc. Does anyone have any ideas on what I can try, or where I can go for help?

Changed in xserver-xorg-video-intel:
importance: Unknown → Medium
Changed in xserver-xorg-video-intel:
importance: Medium → Unknown
Changed in xserver-xorg-video-intel:
importance: Unknown → Medium