Corrupt display after a while (after resume?) on intel graphics

Bug #1053959 reported by gcc
This bug affects 3 people
Affects                            Status        Importance  Assigned to  Milestone
xf86-video-intel                   Fix Released  High
xserver-xorg-video-intel (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

I'm using a Thinkpad X201 with Intel graphics:

chris@lap-x201:~$ lspci -vvv -s 00:02.0
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02) (prog-if 00 [VGA controller])
 Subsystem: Lenovo Device 215a
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 42
 Region 0: Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
 Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
 Region 4: I/O ports at 1800 [size=8]
 Expansion ROM at <unassigned> [disabled]
 Capabilities: <access denied>
 Kernel driver in use: i915
 Kernel modules: i915

chris@lap-x201:~$ lspci -vvvn -s 00:02.0
00:02.0 0300: 8086:0046 (rev 02) (prog-if 00 [VGA controller])
 Subsystem: 17aa:215a
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0
 Interrupt: pin A routed to IRQ 42
 Region 0: Memory at f2000000 (64-bit, non-prefetchable) [size=4M]
 Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
 Region 4: I/O ports at 1800 [size=8]
 Expansion ROM at <unassigned> [disabled]
 Capabilities: <access denied>
 Kernel driver in use: i915
 Kernel modules: i915

I usually notice this problem after several days of use, which would normally include several suspend and resume cycles and (dis)connections of an external monitor.

chris@lap-x201:~$ lsb_release -rd
Description: Ubuntu 12.04.1 LTS
Release: 12.04

chris@lap-x201:~$ apt-cache policy xserver-xorg-video-intel
xserver-xorg-video-intel:
  Installed: 2:2.17.0-1ubuntu4.1
  Candidate: 2:2.17.0-1ubuntu4.1
  Version table:
 *** 2:2.17.0-1ubuntu4.1 0
        500 http://gb.archive.ubuntu.com/ubuntu/ precise-updates/main i386 Packages
        100 /var/lib/dpkg/status
     2:2.17.0-1ubuntu4 0
        500 http://gb.archive.ubuntu.com/ubuntu/ precise/main i386 Packages

[ 75101.039] (II) intel(0): [DRI2] Setup complete
[ 75101.039] (II) intel(0): [DRI2] DRI driver: i965
[ 75101.039] (II) intel(0): Allocated new frame buffer 1280x800 stride 5120, tiled
[ 75101.050] (II) UXA(0): Driver registered support for the following operations:
[ 75101.050] (II) solid
[ 75101.050] (II) copy
[ 75101.050] (II) composite (RENDER acceleration)
[ 75101.050] (II) put_image
[ 75101.050] (II) get_image
[ 75101.050] (==) intel(0): Backing store disabled
[ 75101.050] (==) intel(0): Silken mouse enabled
[ 75101.050] (II) intel(0): Initializing HW Cursor
[ 75101.333] (II) intel(0): RandR 1.2 enabled, ignore the following RandR disabled message.
[ 75101.336] (==) intel(0): DPMS enabled
[ 75101.336] (==) intel(0): Intel XvMC decoder enabled
[ 75101.336] (II) intel(0): Set up textured video
[ 75101.341] (II) intel(0): [XvMC] xvmc_vld driver initialized.
[ 75101.341] (II) intel(0): direct rendering: DRI2 Enabled
[ 75101.341] (==) intel(0): hotplug detection: "enabled"

[ 75101.363] (II) AIGLX: enabled GLX_MESA_copy_sub_buffer
[ 75101.363] (II) AIGLX: enabled GLX_INTEL_swap_event
[ 75101.363] (II) AIGLX: enabled GLX_SGI_swap_control and GLX_MESA_swap_control
[ 75101.363] (II) AIGLX: GLX_EXT_texture_from_pixmap backed by buffer objects
[ 75101.363] (II) AIGLX: Loaded and initialized i965
[ 75101.363] (II) GLX: Initialized DRI2 GL provider for screen 0
[ 75101.364] (II) intel(0): Setting screen physical size to 338 x 211

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: xserver-xorg-video-intel 2:2.17.0-1ubuntu4.1
Uname: Linux 3.6.0-030600rc6-generic i686
.tmp.unity.support.test.0:

ApportVersion: 2.0.1-0ubuntu12
Architecture: i386
CompizPlugins: [core,composite,opengl,compiztoolbox,decor,grid,clone,regex,resize,animation,gnomecompat,snap,workarounds,vpswitch,fadedesktop,imgpng,mousepoll,move,place,ezoom,session,staticswitcher,fade,cube,scale,gears]
CompositorRunning: compiz
Date: Fri Sep 21 11:44:24 2012
DistUpgraded: 2012-08-25 13:19:45,909 DEBUG enabling apt cron job
DistroCodename: precise
DistroVariant: ubuntu
EcryptfsInUse: Yes
ExtraDebuggingInterest: Yes, even including gdb or git bisection work if needed
GraphicsCard:
 Intel Corporation Core Processor Integrated Graphics Controller [8086:0046] (rev 02) (prog-if 00 [VGA controller])
   Subsystem: Lenovo Device [17aa:215a]
MachineType: LENOVO 3323DAG
ProcEnviron:
 LANGUAGE=en_GB:en
 TERM=xterm
 PATH=(custom, user)
 LANG=en_GB.UTF-8
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-3.6.0-030600rc6-generic root=UUID=3d626876-a6a6-4378-b989-3acdf40c39c7 ro quiet splash resume=/dev/sda5 vt.handoff=7
SourcePackage: xserver-xorg-video-intel
UdevDb: Error: [Errno 2] No such file or directory
UpgradeStatus: Upgraded to precise on 2012-08-25 (26 days ago)
dmi.bios.date: 12/17/2010
dmi.bios.vendor: LENOVO
dmi.bios.version: 6QET62WW (1.32 )
dmi.board.name: 3323DAG
dmi.board.vendor: LENOVO
dmi.board.version: Not Available
dmi.chassis.asset.tag: No Asset Information
dmi.chassis.type: 10
dmi.chassis.vendor: LENOVO
dmi.chassis.version: Not Available
dmi.modalias: dmi:bvnLENOVO:bvr6QET62WW(1.32):bd12/17/2010:svnLENOVO:pn3323DAG:pvrThinkPadX201:rvnLENOVO:rn3323DAG:rvrNotAvailable:cvnLENOVO:ct10:cvrNotAvailable:
dmi.product.name: 3323DAG
dmi.product.version: ThinkPad X201
dmi.sys.vendor: LENOVO
version.compiz: compiz 1:0.9.7.8-0ubuntu1.4
version.libdrm2: libdrm2 2.4.32-1ubuntu1
version.libgl1-mesa-dri: libgl1-mesa-dri 8.0.3+8.0.2-0ubuntu3.2
version.libgl1-mesa-dri-experimental: libgl1-mesa-dri-experimental N/A
version.libgl1-mesa-glx: libgl1-mesa-glx 8.0.3+8.0.2-0ubuntu3.2
version.xserver-xorg-core: xserver-xorg-core 2:1.11.4-0ubuntu10.7
version.xserver-xorg-input-evdev: xserver-xorg-input-evdev 1:2.7.0-0ubuntu1.2
version.xserver-xorg-video-ati: xserver-xorg-video-ati 1:6.14.99~git20111219.aacbd629-0ubuntu2
version.xserver-xorg-video-intel: xserver-xorg-video-intel 2:2.17.0-1ubuntu4.1
version.xserver-xorg-video-nouveau: xserver-xorg-video-nouveau 1:0.0.16+git20111201+b5534a1-1build2

Revision history for this message
In , Mikopp (mikopp) wrote :

Created attachment 49568
Xorg Log

I have Windows 7 running in a VirtualBox VM (no acceleration defined). Sometimes it starts to flicker, blacks out, and repaints only the areas the mouse moves over. I see the following error in the Xorg log over and over again:

[107833.198] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

Eventually my Xorg freezes and crashes. Restarting Xorg does not work; I have to reboot to get it working again.

I have a dual screen setup, one running on HDMI and the other on DP. The problematic virtualbox runs on the DP.

lspci:
00:00.0 Host bridge: Intel Corporation Core Processor DRAM Controller (rev 02)
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82577LM Gigabit Network Connection (rev 05)
00:1a.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1b.0 Audio device: Intel Corporation Device 3b57 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 1 (rev 05)
00:1c.1 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 2 (rev 05)
00:1c.2 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 3 (rev 05)
00:1c.3 PCI bridge: Intel Corporation 5 Series/3400 Series Chipset PCI Express Root Port 4 (rev 05)
00:1d.0 USB Controller: Intel Corporation 5 Series/3400 Series Chipset USB2 Enhanced Host Controller (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation 5 Series/3400 Series Chipset LPC Interface Controller (rev 05)
00:1f.2 RAID bus controller: Intel Corporation Mobile 82801 SATA RAID Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 5 Series/3400 Series Chipset SMBus Controller (rev 05)
00:1f.6 Signal processing controller: Intel Corporation 5 Series/3400 Series Chipset Thermal Subsystem (rev 05)
02:00.0 Network controller: Intel Corporation WiFi Link 6000 Series (rev 35)
03:00.0 SD Host controller: Ricoh Co Ltd Device e822 (rev 01)
3f:00.0 Host bridge: Intel Corporation Core Processor QuickPath Architecture Generic Non-core Registers (rev 02)
3f:00.1 Host bridge: Intel Corporation Core Processor QuickPath Architecture System Address Decoder (rev 02)
3f:02.0 Host bridge: Intel Corporation Core Processor QPI Link 0 (rev 02)
3f:02.1 Host bridge: Intel Corporation Core Processor QPI Physical 0 (rev 02)
3f:02.2 Host bridge: Intel Corporation Core Processor Reserved (rev 02)
3f:02.3 Host bridge: Intel Corporation Core Processor Reserved (rev 02)

Revision history for this message
In , Mikopp (mikopp) wrote :

Created attachment 49571
dmesg

Lots of errors in dmesg too.

Revision history for this message
In , Mikopp (mikopp) wrote :

not sure if it is interesting but here is xrandr:

Screen 0: minimum 320 x 200, current 3200 x 1200, maximum 8192 x 8192
eDP1 connected (normal left inverted right x axis y axis)
   1366x768 60.2 +
   1024x768 60.0
   800x600 60.3 56.2
   640x480 59.9
VGA1 disconnected (normal left inverted right x axis y axis)
HDMI1 connected 1280x1024+1920+176 (normal left inverted right x axis y axis) 376mm x 301mm
   1280x1024 60.0*+ 75.0
   1152x864 75.0
   1024x768 75.1 60.0
   800x600 75.0 60.3
   640x480 75.0 60.0
   720x400 70.1
DP1 disconnected (normal left inverted right x axis y axis)
HDMI2 disconnected (normal left inverted right x axis y axis)
DP2 connected 1920x1200+0+0 (normal left inverted right x axis y axis) 519mm x 320mm
   1920x1200 60.0*+
   1600x1200 60.0
   1280x1024 75.0 60.0
   1152x864 75.0
   1024x768 75.1 60.0
   800x600 75.0 60.3
   640x480 75.0 60.0
   720x400 70.1

Revision history for this message
In , Chris Wilson (ickle) wrote :

Looks like something is leaking vma (might be through a bo leak?) and we exhaust our mmap space. Can you grab the contents of /sys/kernel/debug/dri/0/* after you hit this issue?
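
For anyone who wants to capture that directory in one go once the problem shows up, here is a minimal Python sketch. It assumes debugfs is mounted at /sys/kernel/debug and that the script runs as root; the function name and the default output directory are made up for illustration.

```python
from pathlib import Path


def snapshot_dri_debugfs(src="/sys/kernel/debug/dri/0", dest="dri-debug-snapshot"):
    """Copy every readable regular file in src into dest; return the names copied."""
    src_dir, dest_dir = Path(src), Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src_dir.iterdir()):
        if not entry.is_file():
            continue
        try:
            data = entry.read_bytes()
        except OSError:
            # Some debugfs entries cannot be read; skip them rather than abort.
            continue
        (dest_dir / entry.name).write_bytes(data)
        copied.append(entry.name)
    return copied
```

Calling snapshot_dri_debugfs() right after the errors start should leave a directory that can be archived and attached to the report.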

Revision history for this message
In , Mikopp (mikopp) wrote :

I will, but I cannot predict when it happens next, might be a couple of days.

Revision history for this message
In , Mikopp (mikopp) wrote :

oh that was too fast, I don't have debug in my kernel

Revision history for this message
In , Leho Kraav (lkraav) wrote :

apparently running into this with my i3 laptop.

every once in a while Xorg.0.log gets flooded with:

 * [943834.434] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

dmesg gets:

 * [943834.434] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

primary problem that i can connect this to is that web videos do not want to switch to full screen mode. quite likely the flood occurs when i try to press the full screen video button on youtube or whatnot. there's a slight flicker, then return to window. other than that, i'm *not* seeing crashing or hanging. so far.

here's what i have of the #intel-gfx conversation:

--- Log opened Wed Jul 27 14:25:20 2011
14:25 danvet>macmaN, is this on a 32bit install?

yes, pae enabled, 4gb ram total, no swap.

$ free
             total       used       free     shared    buffers     cached
Mem:       3810164    3002584     807580          0       5976    1583180
-/+ buffers/cache:    1413428    2396736
Swap:            0          0          0

14:27 danvet>macmaN, can you pastebin i915_gem_objects from debugfs?

/sys/kernel/debug/dri/0 $ cat i915_gem_objects
13273 objects, 247914496 bytes
1022 [855] objects, 87339008 [34508800] bytes in gtt
  6 [6] active objects, 5406720 [5406720] bytes
  6 [6] pinned objects, 4501504 [4501504] bytes
  1010 [843] inactive objects, 77430784 [24600576] bytes
  0 [0] freed objects, 0 [0] bytes
7 pinned mappable objects, 9744384 bytes
75 fault mappable objects, 380928 bytes
2147479552 [268435456] gtt total

14:31 danvet>macmaN, are you sometimes running more demanding stuff like games, hd video decoding?

no games, but hd video yes, off youtube, running xbmc, vlc every once in a while.

$ uptime
 12:52:18 up 18 days, 20:57, 3 users, load average: 0.74, 0.36, 0.32

logoffs from X are very rare, other than kernel upgrades/debugging, usually the machine goes into overnight suspends. this is the first time i've seen these errors, not sure if uptime or the quantity of "demanding stuff" has reached this far before.

16:59 danvet>macmaN exhausted the drm_mmap_offset address range of 4gb ...

... comment for ickle
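
To compare i915_gem_objects dumps like the one above across reports, the headline numbers can be pulled out with a small parser. This is only a sketch that matches the layout shown in this thread; the file format is not a stable interface and probably differs between kernel versions, and the function name is invented here.

```python
import re

# First line of the dump, e.g. "13273 objects, 247914496 bytes".
SUMMARY_RE = re.compile(r"^(\d+) objects, (\d+) bytes$")
# e.g. "  1010 [843] inactive objects, 77430784 [24600576] bytes".
INACTIVE_RE = re.compile(r"^(\d+) \[\d+\] inactive objects, (\d+) \[\d+\] bytes$")


def parse_gem_objects(text):
    """Return total and inactive object counts and byte sizes found in text."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = SUMMARY_RE.match(line)
        if m:
            result["objects"] = int(m.group(1))
            result["bytes"] = int(m.group(2))
            continue
        m = INACTIVE_RE.match(line)
        if m:
            result["inactive_objects"] = int(m.group(1))
            result["inactive_bytes"] = int(m.group(2))
    return result
```

Dividing "bytes" by "objects" gives a rough average object size, which helps tell many small leaked buffers apart from a few large ones.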

Revision history for this message
In , Leho Kraav (lkraav) wrote :

Created attachment 49660
/sys/kernel/debug/dri/0/vma per ickle's request

on further testing, full screen video is actually not a problem, i just played some vimeo stuff without issues.

but loading this page http://www.youtube.com/user/freedrumlessons is enough to start erroring. i have videos blocked out with noscript, so we don't even have to have video to get errors.

Revision history for this message
In , Leho Kraav (lkraav) wrote :

could this vma issue have been improved since 2.6.39.3ish? i noticed that nowhere in my bug comments did i specify what kernel i'm running and no one has asked. does it matter or is it for sure unfixed in 3.x?

$ uname -a
Linux travelmate 2.6.39-pf2 #3 SMP PREEMPT Sat Jul 9 15:16:49 EEST 2011 i686 Intel(R) Core(TM) i3 CPU U 330 @ 1.20GHz GenuineIntel GNU/Linux

Revision history for this message
In , Chris Wilson (ickle) wrote :

Looking at the vma report, it doesn't seem to be the issue per se. I'm trying to think of what could cause it otherwise. If you can pinpoint why later kernels appear to work better, that would help!

Revision history for this message
In , Leho Kraav (lkraav) wrote :

i have no idea about later kernels actually. after some very annoying btrfs BUG's i ran into in 2.6.38, i'm absolutely in love with the stability of this 2.6.39 setup. really don't have resources to take risks right away, but will keep the need in mind.

Revision history for this message
In , Mikopp (mikopp) wrote :

Created attachment 49837
/proc/dri/0

not sure if it helps but these are the proc dri files

Revision history for this message
In , Mikopp (mikopp) wrote :

Created attachment 49951
slabinfo after another incident of this

Revision history for this message
In , Mikopp (mikopp) wrote :

I get this now without VirtualBox. the screen does not blank, just parts of it do not draw until after I cross that area with my mouse. same error in the xorg log file

Revision history for this message
In , Leho Kraav (lkraav) wrote :

i have not seen this issue with 2.16.0 and post-2.16 git HEAD at all.

Revision history for this message
In , Mikopp (mikopp) wrote :

I updated to 2.16. I'll report back once it occurs again. I don't have a definitive test, but I know that it occurs if I have my VM running for more than 3 days, so let's see.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Honestly I can't think of a single change that should have impacted upon this bug. So please don't get your hopes up too much that it is fixed and stays fixed. :|

Revision history for this message
In , Mikopp (mikopp) wrote :

I have this now much more rapidly, without a VM to impact things.

I'm on xf86-video-intel-2.17 and kernel 3.1.10 now.
It now only takes a day of working with eclipse (seems to be tied to GTK, other SWT applications deliver the same result).

Even worse now, once this starts to happen, the X CPU usage regularly spikes to 100% for half a minute at a time. I get the same error in the Xorg log. At some point it gets so bad you can't work. Restarting X no longer works at that point; only a reboot helps.

reoccurring error in dmesg
[drm:i915_gem_create_mmap_offset] *ERROR* failed to allocate offset for bo 0
reoccurring in xorg log
[141361.521] (WW) intel(0): intel_uxa_prepare_access: bo map failed: No space left on device

This is becoming a real problem now and hits my working environment. Not sure why it became worse, but I think it is only after my recent update to the 3.x kernel series.
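
To put a number on how often the two messages above recur, a log can be scanned for them. A minimal Python sketch; the signature names and the function are illustrative, and the match is a plain substring test on the message text quoted in this report.

```python
from collections import Counter

# Substrings copied from the messages quoted in this report.
SIGNATURES = {
    "xorg_bo_map_enospc": "intel_uxa_prepare_access: bo map failed: No space left on device",
    "kernel_mmap_offset": "failed to allocate offset for bo",
}


def count_signatures(lines):
    """Count how many lines contain each known error signature."""
    counts = Counter()
    for line in lines:
        for name, needle in SIGNATURES.items():
            if needle in line:
                counts[name] += 1
    return counts
```

It accepts any iterable of lines, so an open file works, for example open("/var/log/Xorg.0.log") (the usual location, assumed here) or the saved output of dmesg.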

Revision history for this message
In , Chris Wilson (ickle) wrote :

The light at the end of the tunnel is a long way away on this one, I'm afraid. I've a series of kernel patches that should prevent the ENOSPC, but they are not ready for review, and depend on another series that is also not ready.

In the meantime, you could try enabling sna as that handles the bo cache completely differently and I hope doesn't quite get into the same trouble. The mmap address exhaustion is a real issue, though another possible workaround is to use a 64-bit kernel.

Revision history for this message
In , Mikopp (mikopp) wrote :

Could you please clarify on the SNA? I don't have Sandy Bridge, but an Intel(R) Arrandale.

On the other hand, if you have patches that I can try out and that you think work reasonably well, I am happy to do that.

Revision history for this message
In , Chris Wilson (ickle) wrote :

SNA works with all of our supported chipsets, and even on Ironlake is significantly faster than UXA. I am curious as to how it fares in this situation. The underlying problem still exists, just the usage of buffer objects might be sufficiently different to hide it.

The tree for testing the ENOSPC fixes is available from http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=reap-mmap-offsets. I don't pretend that is in a clean state at all. ;-) The patch of interest is http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=reap-mmap-offsets&id=c4a07eef055773efba7855bcaf5f26277695a5ae

Revision history for this message
In , Mikopp (mikopp) wrote :

ok I'm using sna now, let's see. I'll try your patch later on the weekend.

I also have another issue and wonder if it is related: since kernel 3.1, xrandr is taking a lot longer to do its thing (change screen arrangement and resolution), and it blocks even the mouse from working; it stops the whole of X for a full minute.

Should I report a separate issue, or can this be related?

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #22)
> ok I'm using sna now, let's see. I'll try your patch later on the weekend.
>
> I have also another issue and wonder if it is related, since kernel 3.1 xrandr
> is taken a lot longer to do its thing (change screen arrangement and
> resolution) and it blocks even the mouse from working, stops the whole X for a
> full minute.
>
> Should I report a separate issue or can this be related.

Whilst we know of a reason why xrandr is slow in general (probing of disconnected outputs causes timeouts rather than a quick "not detected"), I was not aware that the situation had got any worse with 3.2

Revision history for this message
In , Mikopp (mikopp) wrote :

It's 3.1.10, not yet 3.2, but yes, it got far worse. It used to be slow, but did not block X completely. If you let me know what kind of information you need, I will open a new ticket for that.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Just start the report. An Xorg.log with timings from 3.0, the bad Xorg.log with timings from 3.1.0, and an strace -tt of X from 3.1.0 would be useful. The first priority is just to open a ticket stating the problem so that we have it tracked and raise awareness of the issue.

Revision history for this message
In , Chris Wilson (ickle) wrote :

*** Bug 46044 has been marked as a duplicate of this bug. ***

Revision history for this message
In , Chris Wilson (ickle) wrote :

Also note that in bug 46044, we also hit the VFS file limit.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Can you please test whether disabling the bo cache is sufficient to avoid the issue:

commit 5b5cd6780ef7cae8f49d71d7c8532597291402d8
Author: Chris Wilson <email address hidden>
Date: Fri Feb 24 11:14:26 2012 +0000

    uxa: Add a option to disable the bo cache

    If you are suffering from regular X crashes and rendering corruption
    with a flood of ENOSPC or even EFILE reported in the Xorg.log, try
    adding this snippet to your xorg.conf:

    Section "Driver"
      Option "BufferCache" "False"
    EndSection

    References: https://bugs.freedesktop.org/show_bug.cgi?id=39552
    Signed-off-by: Chris Wilson <email address hidden>

Revision history for this message
In , Chris Wilson (ickle) wrote :

*** Bug 44185 has been marked as a duplicate of this bug. ***

Revision history for this message
In , dn (nobled) wrote :

(In reply to comment #28)
> Section "Driver"
> Option "BufferCache" "False"
> EndSection
X refused to start instead:

[546141.376] Parse error on line 2 of section Driver in file /etc/X11/xorg.conf
 "Driver" is not a valid section name.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Oops, Section "Device" not "Driver". You would have thought I would have checked before committing..

Revision history for this message
In , dn (nobled) wrote :

Okay, fixing that, it failed for a different reason:

[604303.989] Parse error on line 5 of section Device in file /etc/X11/xorg.conf
 This section must have an Identifier line.
[604303.990] (EE) Problem parsing the config file
[604303.990] (EE) Error parsing the config file
[604303.990]
Fatal server error:
[604303.990] no screens found

What does it mean by Identifier line? And are there any other requirements that are going to show up after this one gets fixed?

Revision history for this message
In , Chris Wilson (ickle) wrote :

Ok, the minimum complete snippet is:

Section "Device"
  Identifier "Device0"
  Driver "Intel"
  Option "BufferCache" "False"
EndSection
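
The two parse errors hit earlier in this thread (a section named "Driver", then a Device section with no Identifier) can be caught before restarting X with a rough pre-flight check. This is a hedged Python sketch, not a real xorg.conf parser: it only looks at Device sections, it assumes (as I understand the xorg.conf man page) that Identifier and Driver are both required there, and the function name is invented.

```python
import re

SECTION_RE = re.compile(r'^\s*Section\s+"([^"]+)"', re.IGNORECASE)
END_RE = re.compile(r'^\s*EndSection\b', re.IGNORECASE)
ENTRY_RE = re.compile(r'^\s*(\w+)\s+"')


def check_device_sections(text):
    """Return a list of human-readable problems found in the config text."""
    problems = []
    current = None      # name of the section being read, or None
    keywords = set()    # lower-cased entry keywords seen in that section
    for number, line in enumerate(text.splitlines(), start=1):
        m = SECTION_RE.match(line)
        if m:
            current, keywords = m.group(1), set()
            if current.lower() == "driver":
                problems.append(
                    f'line {number}: "Driver" is not a section name; use "Device"'
                )
            continue
        if END_RE.match(line):
            if current is not None and current.lower() == "device":
                for required in ("Identifier", "Driver"):
                    if required.lower() not in keywords:
                        problems.append(
                            f"line {number}: Device section has no {required} line"
                        )
            current = None
            continue
        m = ENTRY_RE.match(line)
        if current is not None and m:
            keywords.add(m.group(1).lower())
    return problems
```

An empty list means none of these particular mistakes were found; it does not mean the X server will accept the file.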

Revision history for this message
In , dn (nobled) wrote :

Okay, been running it for a few hours with the cache disabled. file-nr is still going up every time I check it even if I haven't done anything. xrestop shows about 55 MB in pixmaps, while i915_gem_objects shows 267 MB, and is also going up every time I check it.

Revision history for this message
In , dn (nobled) wrote :

Huh, just now I hit some *other* kind of limit before file-nr was even halfway to file-max. I guess it was ENOSPC instead of ENFILE?

Basically the exact symptoms described in bug 44185 -- with the addition that compiz coincidentally(?) crashed (bug 46303) while everything was screwing up, which made window decorations disappear as usual -- but once it automatically restarted, I was left staring at my desktop background with nothing else showing; all windows and the mouse had disappeared.

The numbers don't *seem* that different--I'm pretty sure it's gone past 1.6GB before, so no idea why it hit ENOSPC this time, and that one time weeks ago, and not the dozens of other times in between?

$ cat /proc/sys/fs/file-nr
314400 0 796780

$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
307611 objects, 1535336448 bytes
748 [731] objects, 95580160 [26685440] bytes in gtt
  3 [1] active objects, 3424256 [16384] bytes
  8 [8] pinned objects, 8704000 [8704000] bytes
  737 [722] inactive objects, 83451904 [17965056] bytes
  0 [0] freed objects, 0 [0] bytes
8 pinned mappable objects, 8704000 bytes
685 fault mappable objects, 3039232 bytes
2147479552 [268435456] gtt total
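
To check how close the system is to the file limit, the three fields of /proc/sys/fs/file-nr (allocated handles, allocated but unused handles, and the maximum) can be turned into a fraction. A small Python sketch with an invented helper name:

```python
def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr into named fields."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return {
        "allocated": allocated,
        "unused": unused,
        "max": maximum,
        # Share of the system-wide file handle limit currently allocated.
        "fraction_used": allocated / maximum,
    }
```

Applied to the reading above, fraction_used comes out below 0.5, which matches the "not even halfway to file-max" observation and suggests a different limit was hit that time.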

Revision history for this message
In , dn (nobled) wrote :

Whoops. I closed all my programs one by one, and the leak only occurs with gnome-system-monitor 2.28.2 running. With it closed, the number of BOs in i915_gem_objects is stable. Even when it's minimized, while it's running the climb in numbers is fairly steady.

Sorry I didn't test this properly before just now -- I forgot I had that running in the background, even.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Sounds like we have a lead at last \o/. Thanks.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Is this bug peculiar to gnome-system-monitor 2.28.2? The systems I have all have gnome-system-monitor 3.2 and I have not seen it misbehave yet (multiple generations, with and without compositing).

Revision history for this message
In , dn (nobled) wrote :

(In reply to comment #38)
> Is this bug peculiar to gnome-system-monitor 2.28.2? The systems I have all
> have gnome-system-monitor 3.2 and I have not seen it misbehave yet (multiple
> generations, with and without compositing).

Huh. I just tried it on another computer with Ubuntu 11.10 on it, which has 3.2.1, and I couldn't reproduce it there either. I'm gonna try booting an Ubuntu 11.04 live usb image, since that's what I'm currently running and seeing it on.

Revision history for this message
In , Mikopp (mikopp) wrote :

I have no gnome system monitor running, I'm on KDE and still have this problem. I do have gtk applications running though, mostly RCP/SWT based stuff. maybe it is something that these two have in common? I can also say that I always have graphic issues in these applications and not in the QT ones.

Revision history for this message
In , Chris Wilson (ickle) wrote :

I've applied some patches originally intended to aid in chasing this bug down, but since they proved to fix another bug they went straight to master. Can you please check out xf86-video-intel.git and monitor for the leak? Thanks.

Revision history for this message
In , dn (nobled) wrote :

Created attachment 58513
Xorg.0.log from machine where gnome-system-monitor causes leaks

...Okay, couldn't reproduce it off the LiveCD either. Here are all the differences I can think of between this laptop and that machine:

- Running 3.2 kernel
- Installed the slightly dated natty xorg-edgers repo: X server 1.10.4+, drivers and mesa from February git snapshots
- HD3000 hardware
- Running in the gnome 'fallback' compiz composited environment

I did install the xorg-edgers repo on the LiveCD and upgraded/killed/restarted X and all. Still couldn't reproduce it.

There isn't an x11trace equivalent to apitrace that would record what exactly gnome-system-monitor might be spamming the server with, is there?

(And it's not exclusive to gnome-system-monitor; the leaks are just accumulating much slower now. They only don't happen at all if I'm doing absolutely nothing but checking the BO count repeatedly.)

Revision history for this message
In , dn (nobled) wrote :

(In reply to comment #41)
Whoops, should've refreshed the page. Yeah will do.

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #42)
> There isn't an x11trace equivalent to apitrace that would record what exactly
> gnome-system-monitor might be spamming the server with, is there?

There is xtrace (or xscope) but no means to replay yet (at least that I know of).

Revision history for this message
In , dn (nobled) wrote :

Ah. I just restarted X with the driver from commit 0a8218a535babb5969a58c3a7da0215912f6fef8 -- leak still happens.

Revision history for this message
In , Chris Wilson (ickle) wrote :

The situation should be improved by

commit a14917eeb2cc160d13f4fddefe5f7f9c80953ce1
Author: Chris Wilson <email address hidden>
Date: Fri Feb 24 21:13:38 2012 +0000

    drm/i915: Release the mmap offset when purging a buffer

    If we discard a buffer due to memory pressure, also release its alloted
    mmap address space. As it may be sometime before userspace wakes up
    and notices that it has buffers to purge from its cache, we may waste
    valuable address space on unusable objects for a period of time.

    Signed-off-by: Chris Wilson <email address hidden>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=47738
    Signed-off-by: Daniel Vetter <email address hidden>

but I'm still searching for just why we end up with so many buffers.

Revision history for this message
In , Chris Wilson (ickle) wrote :

The only thing I've found so far...

commit a16616209bb2dcb7aaa859b38e154f0a10faa82b
Author: Chris Wilson <email address hidden>
Date: Sat Apr 14 19:03:25 2012 +0100

    uxa: Fix leak of glyph mask for unhandled glyph composition

    ==1401== 7,344 bytes in 34 blocks are possibly lost in loss record 570 of 58
    ==1401== at 0x4027034: calloc (in /usr/lib/valgrind/vgpreload_memcheck-am
    ==1401== by 0x8BE5150: drm_intel_gem_bo_alloc_internal (intel_bufmgr_gem.
    ==1401== by 0x899FC04: intel_uxa_create_pixmap (intel_uxa.c:1077)
    ==1401== by 0x89C2C41: uxa_glyphs (uxa-glyphs.c:254)
    ==1401== by 0x21F05E: damageGlyphs (damage.c:647)
    ==1401== by 0x218E06: ProcRenderCompositeGlyphs (render.c:1434)
    ==1401== by 0x15AA40: Dispatch (dispatch.c:439)
    ==1401== by 0x1499E9: main (main.c:287)

    Signed-off-by: Chris Wilson <email address hidden>

Could this seemingly insignificant path be the cause of your misery?

Revision history for this message
In , Leho Kraav (lkraav) wrote :

I am quite certain 2.17 +sna has never bombed on me with this. I'm now up to kernel 3.3.1.

OTOH I can't move up to 2.18.0, since Firefox gets corruption all over the place. Have not tested 2.18.0+3.3 yet, rather feels like I should be waiting for 2.18.1.

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #48)
> I am quite certain 2.17 +sna has never bombed on me with this. I'm now up to
> kernel 3.3.1.
>
> OTOH I can't move up to 2.18.0, since Firefox gets corruption all over the
> place. Have not tested 2.18.0+3.3 yet, rather feels like I should be waiting
> for 2.18.1.

Ah that means you are encountering the bug in 2.18.0-sna and so you won't be suffering from this bug any longer (as far as I can tell this is purely a UXA issue).

Revision history for this message
In , dn (nobled) wrote :

(In reply to comment #47)
> The only thing I've found so far...
>
> commit a16616209bb2dcb7aaa859b38e154f0a10faa82b
> Author: Chris Wilson <email address hidden>
> Date: Sat Apr 14 19:03:25 2012 +0100
>
> uxa: Fix leak of glyph mask for unhandled glyph composition
>
> Could this seemingly insignificant path be the cause of your misery?

The bad news is, that commit didn't fix it. The good news is, an earlier one *did*:

commit fde8a010b3d9406c2f65ee99978360a6ca54e006
Author: Chris Wilson <email address hidden>
Date: Fri Mar 30 12:47:21 2012 +0100

    uxa: Remove broken render glyphs-to-dst

    Reported-by: Vincent Untz <email address hidden>
    Reported-by: Robert Bradford <email address hidden>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=48045
    Signed-off-by: Chris Wilson <email address hidden>

I made sure; reverting that commit on top of master causes the leak to re-appear. (Only question is, was the leak in that function all along, or did it happen to call a leaky path somewhere else?)

Just a note on my testing: gnome-system-monitor only begins to leak after it's been running for 60 seconds while viewing the 'Resources' tab-- ie, enough time has passed to fill up the whole X axis of the rolling 'CPU History' graph. Once history starts falling off the back edge, the numbers start climbing.
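
The "numbers start climbing" check can be automated by sampling the object count at intervals. The following Python sketch assumes the first line of i915_gem_objects starts with the total object count, as in the dumps earlier in this thread; the function names, the sample count and the interval are arbitrary choices for illustration.

```python
import time


def read_object_count(path="/sys/kernel/debug/dri/0/i915_gem_objects"):
    """Return the total object count from the first line of the debugfs file."""
    with open(path) as f:
        return int(f.readline().split()[0])


def sample_counts(read_count, samples=10, interval=30.0, sleep=time.sleep):
    """Call read_count() `samples` times, sleeping `interval` seconds in between."""
    counts = []
    for i in range(samples):
        counts.append(read_count())
        if i + 1 < samples:
            sleep(interval)
    return counts


def looks_like_leak(counts, min_growth=1):
    """True if every sample grew by at least min_growth over the previous one."""
    deltas = [b - a for a, b in zip(counts, counts[1:])]
    return bool(deltas) and all(d >= min_growth for d in deltas)
```

Running looks_like_leak(sample_counts(read_object_count)) as root, once with the suspect program open and once with it closed, gives a crude yes/no comparison. A steady climb is only a hint, since caches can also grow for a while before levelling off.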

Revision history for this message
In , Chris Wilson (ickle) wrote :

Ah, it had a very, very similar bug. Almost as if I based both functions on the same skeleton code ;-)

(It leaked localSrc and localDst, if either had been allocated, whenever it decided that it would be unable to render the glyphs using the GPU.)
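
For readers unfamiliar with this class of bug, here is a minimal sketch of the pattern in Python. It is a toy model, not the driver code: the `Pool` class and both function names are hypothetical, and only the names `local_src` and `local_dst` echo the comment above. The "fixed" variant shows one generic way to close such a leak; the actual driver fix was to remove the function altogether.

```python
class Pool:
    """Hypothetical allocator that only counts live temporaries."""

    def __init__(self):
        self.live = 0

    def alloc(self):
        self.live += 1
        return object()

    def free(self, obj):
        self.live -= 1


def glyphs_leaky(pool, gpu_can_render):
    """Early return on the fallback path skips the cleanup."""
    local_src = pool.alloc()
    local_dst = pool.alloc()
    if not gpu_can_render:
        return False  # bug: local_src and local_dst are never freed
    pool.free(local_src)
    pool.free(local_dst)
    return True


def glyphs_fixed(pool, gpu_can_render):
    """Same logic, but cleanup runs on every exit path."""
    local_src = pool.alloc()
    local_dst = pool.alloc()
    try:
        return bool(gpu_can_render)
    finally:
        pool.free(local_src)
        pool.free(local_dst)


if __name__ == "__main__":
    leaky, fixed = Pool(), Pool()
    for _ in range(100):
        glyphs_leaky(leaky, gpu_can_render=False)
        glyphs_fixed(fixed, gpu_can_render=False)
    print("live temporaries, leaky:", leaky.live)
    print("live temporaries, fixed:", fixed.live)
```

Each fallback call in the leaky variant strands two temporaries, which matches the steady climb seen once the graph starts scrolling.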

Glad to have an answer finally.

Revision history for this message
gcc (chris+ubuntu-qwirx) wrote :

Screenshot displaying the corruption (marked in red).

bugbot (bugbot)
tags: added: corruption
tags: added: resume
Revision history for this message
In , W-florijn-k (w-florijn-k) wrote :

A patch referencing this bug report has been merged in Linux v3.7-rc1:

commit d8cb5086695dcdd076e911fc298a5a6701497371
Author: Chris Wilson <email address hidden>
Date: Sat Aug 11 15:41:03 2012 +0100

    drm/i915: Try harder to allocate an mmap_offset

Revision history for this message
gcc (chris+ubuntu-qwirx) wrote :

I think it's this bug:

http://lists.freedesktop.org/archives/intel-gfx/2012-April/016154.html
https://bugs.freedesktop.org/show_bug.cgi?id=39552
https://bugs.freedesktop.org/show_bug.cgi?id=46044 (duplicate)

as kern.log fills with the same message when it happens:

Oct 17 13:37:25 lap-x201 kernel: [266534.112127] [drm:drm_gem_create_mmap_offset] *ERROR* failed to allocate offset for bo 0

So I don't think it's a resume problem, only that it happens when X has been running for a long time (it happened again this morning and I left my laptop on overnight, not suspended).

Revision history for this message
In , Chris-freedesktop-org (chris-freedesktop-org) wrote :

I'm seeing bug #46044 and I'm not 100% convinced that it's a duplicate of this one, as marked.

I've been running xserver-xorg-video-intel 2:2.20.8-0ubuntu2.1~precise2, which I assume includes this patch, and seen the problem. Just downgraded to 2:2.17.0-1ubuntu4.2 as that was just released into precise updates.

I think it's closely related though. I do see the number of "inactive objects" in /sys/kernel/debug/dri/0/i915_gem_objects climbing sky-high if there's any animation in my AWN taskbar. The clock doesn't trigger it, but the dropbox sync icon does.

If I stop dropbox, it drops back down to normal levels; if I start dropbox and it's syncing (animated rotating arrows icon) the number of "inactive objects" grows by a few per second. I also see the same errors in /var/log/kern.log, over and over again, once the graphics corruption starts:

Oct 17 13:37:25 lap-x201 kernel: [266534.112127] [drm:drm_gem_create_mmap_offset] *ERROR* failed to allocate offset for bo 0

I've started syslogging the number of inactive objects to see if it reaches the same kind of heights. I've also logged a bug on Launchpad, so we can see whether a new driver release is needed in Ubuntu:

https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1053959

Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #53)
> Oct 17 13:37:25 lap-x201 kernel: [266534.112127]
> [drm:drm_gem_create_mmap_offset] *ERROR* failed to allocate offset for bo 0
>
> I've started syslogging the number of inactive objects to see if it reaches
> the same kind of heights. I've also logged a bug on Launchpad, so we can see
> whether a new driver release is needed in Ubuntu:
>
> https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/1053959

Could be this bug as the attached Xorg.log there indicates you are using 2.17.0 (I'd like to see your Xorg.log with 2.20.x if you have it, just to confirm) or it could just be a client pixmap leak. So also watch xrestop.

Revision history for this message
In , Chris-freedesktop-org (chris-freedesktop-org) wrote :

Created attachment 74005
Xorg.0.log with intel_drv.so 2.20.8

Sorry for the delay, it's awfully confusing that we're both called Chris Wilson, I didn't realise that you'd replied to my comment :)

Here is Xorg.0.log, hopefully showing that the driver running is 2.20.8 (so with the patch)?

I haven't been running xrestop, sorry. I've just started. Even if this is a client pixmap leak, surely one app shouldn't be able to bring down my desktop? Wouldn't that be a bug, perhaps in X itself rather than the Intel driver? But I've never seen anything like this happen with any other graphics card or system in 15 years of using Linux on the desktop.

I did notice that when my Chromium goes all black, if I make the window smaller then it starts working again, for a while, then it goes black again and I have to make it smaller again, and so on until I can't read web pages any more and I have to log out and back in again.

Changed in xserver-xorg-video-intel:
importance: Unknown → High
status: Unknown → Fix Released
Revision history for this message
In , Chris Wilson (ickle) wrote :

(In reply to comment #55)
> Created attachment 74005 [details]
> Xorg.0.log with intel_drv.so 2.20.8
>
> Sorry for the delay, it's awfully confusing that we're both called Chris
> Wilson, I didn't realise that you'd replied to my comment :)
>
> Here is Xorg.0.log, hopefully showing that the driver running is 2.20.8 (so
> with the patch)?
>
> I haven't been running xrestop, sorry. I've just started. Even if this is a
> client pixmap leak, surely one app shouldn't be able to bring down my
> desktop? Wouldn't that be a bug, perhaps in X itself rather than the Intel
> driver? But I've never seen anything like this happen with any other
> graphics card or system in 15 years of using Linux on the desktop.

It's a Denial-of-Service. There's a limited address space for mmapping of buffers, so that if the client does leak, eventually we will not be able to map a new buffer and it will remain blank. (KDE is full of such examples, or at least one sufficiently common one.)

We've hardened the kernel to recover as much of that space as possible, but that is limited by the guarantees given by the userspace API. On the other side, it is possible to use alternative fallback methods if the mmapping fails; that hardening is present in SNA. Nevertheless, the side-effects will remain unpleasant until the source of the bug is found.
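
To make the failure mode concrete, here is a toy model in Python of a fixed-size pool of mapping offsets shared by all clients. It is only a sketch under stated assumptions: the `OffsetSpace` class, the slot count, and the client names are hypothetical, and it does not model the real kernel allocator or the recovery and fallback paths mentioned above.

```python
class OffsetSpace:
    """Toy fixed-size pool of mmap offsets shared by every client."""

    def __init__(self, slots):
        self.free = slots
        self.owned = {}  # client name -> number of slots held

    def map_buffer(self, client):
        """Take one slot; False stands in for 'failed to allocate offset'."""
        if self.free == 0:
            return False
        self.free -= 1
        self.owned[client] = self.owned.get(client, 0) + 1
        return True

    def client_exit(self, client):
        """Assumption: everything a client held is released when it exits."""
        self.free += self.owned.pop(client, 0)


def simulate(slots=8):
    space = OffsetSpace(slots)
    frames = 0
    # A leaking client maps a new buffer per frame and never unmaps one.
    while space.map_buffer("leaky-applet"):
        frames += 1
    # A well-behaved client now cannot map anything: its window stays blank.
    blocked = not space.map_buffer("browser")
    # Once the leaker goes away, the shared space is usable again.
    space.client_exit("leaky-applet")
    recovered = space.map_buffer("browser")
    return frames, blocked, recovered


if __name__ == "__main__":
    print(simulate())
```

The point of the model is that the leaker and the victim need not be the same client, because the pool is shared.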

Revision history for this message
In , Chris-freedesktop-org (chris-freedesktop-org) wrote :

Hi Chris,

I'm afraid I don't understand the protocol/library/guarantees well enough to interpret what you're saying with 100% confidence.

The behaviour that I'm seeing is not consistent with one app DOSing itself. Chromium goes black, I restart Chromium (with the same tabs open), it's still black. I kill and restart the X server, restart Chromium again (with the same tabs open), and now it works again (for ~8-24 hours until the same thing happens again).

I think you're saying that individual clients are allowed to allocate pixmaps out of the (very) limited mmapped space that is video RAM for this card and shared between all apps. And it's possible for a client to leak this space (it seems that maybe AWN or some of its applets does this), which eventually results in a DoS for other X apps (gnome-terminal, chromium) and makes the desktop unusable.

My view is that if clients can do this, it represents a violation of X's responsibility to maintain stability of the desktop for all clients, in a way that doesn't seem to be consistent with X's behaviour with any other graphics driver.

If clients can quite easily exhaust that resource, I don't think they should be allowed to allocate it at all. Why does the X server allow clients to allocate direct mapping pixmaps? What if the X server managed the graphics card's mapped memory, and decided for itself which pixmaps are actually mapped into the limited space available?

One of the reasons that I prefer X over for example Windows is that it has always tried to protect itself against badly behaved apps crashing the desktop. If that is no longer the case, it annoys me quite a bit. Do you think this is a deeper bug in X that needs to be fixed?

Cheers, Chris.

Revision history for this message
In , Chris-freedesktop-org (chris-freedesktop-org) wrote :

I've verified that killing and restarting /usr/share/avant-window-navigator/applets/indicator-applet.desktop restores normal behaviour in other apps, so I don't have to restart the X server any more.

i915_gem_objects before and after:

chris@lap-x201:~$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
43847 objects, 103444480 bytes
2565 [1740] objects, 351232000 [209481728] bytes in gtt
  97 [32] active objects, 54587392 [13967360] bytes
  2468 [1708] inactive objects, 296644608 [195514368] bytes
7 pinned mappable objects, 12759040 bytes
66 fault mappable objects, 385024 bytes
2147483648 [268435456] gtt total

chris@lap-x201:~$ kill 5103

chris@lap-x201:~$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
2548 objects, 292028416 bytes
1694 [1186] objects, 253702144 [143339520] bytes in gtt
  49 [41] active objects, 46473216 [14962688] bytes
  1645 [1145] inactive objects, 207228928 [128376832] bytes
7 pinned mappable objects, 12759040 bytes
140 fault mappable objects, 4497408 bytes
2147483648 [268435456] gtt total

chris@lap-x201:~$ awn-applet -p /usr/share/avant-window-navigator/applets/indicator-applet.desktop -u 1347961205 -w 23068725 -i 1 &

chris@lap-x201:~$ sudo cat /sys/kernel/debug/dri/0/i915_gem_objects
1616 objects, 194662400 bytes
798 [651] objects, 150114304 [93024256] bytes in gtt
  70 [54] active objects, 46964736 [15323136] bytes
  728 [597] inactive objects, 103149568 [77701120] bytes
8 pinned mappable objects, 12775424 bytes
91 fault mappable objects, 815104 bytes
2147483648 [268435456] gtt total
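
For anyone who wants to track these numbers over time, here is a minimal Python sketch that pulls the "inactive objects" count and byte total out of a dump in the format shown above. The function name and regular expression are my own, and the sample strings are copied from the first two dumps; reading the real file needs root, so that part is left as a comment.

```python
import re

# Matches e.g. "  2468 [1708] inactive objects, 296644608 [195514368] bytes"
INACTIVE_RE = re.compile(
    r"^\s*(\d+) \[\d+\] inactive objects, (\d+) \[\d+\] bytes", re.M
)


def inactive_objects(dump):
    """Return (object count, bytes) from an i915_gem_objects dump."""
    match = INACTIVE_RE.search(dump)
    if match is None:
        raise ValueError("no 'inactive objects' line found")
    return int(match.group(1)), int(match.group(2))


# Excerpts from the dumps above, before and after killing the applet.
BEFORE = """2565 [1740] objects, 351232000 [209481728] bytes in gtt
  97 [32] active objects, 54587392 [13967360] bytes
  2468 [1708] inactive objects, 296644608 [195514368] bytes
"""
AFTER_KILL = """1694 [1186] objects, 253702144 [143339520] bytes in gtt
  49 [41] active objects, 46473216 [14962688] bytes
  1645 [1145] inactive objects, 207228928 [128376832] bytes
"""

if __name__ == "__main__":
    # On a live system (as root) the dump would come from:
    #   open("/sys/kernel/debug/dri/0/i915_gem_objects").read()
    before_count, before_bytes = inactive_objects(BEFORE)
    after_count, after_bytes = inactive_objects(AFTER_KILL)
    print("inactive objects freed:", before_count - after_count)
    print("inactive bytes freed:", before_bytes - after_bytes)
```

Running something like this from cron and writing the result to syslog would show whether the count climbs steadily while a particular applet is animating.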

Chris Wilson (ickle)
Changed in xserver-xorg-video-intel (Ubuntu):
status: New → Fix Released
Revision history for this message
In , Chris-freedesktop-org (chris-freedesktop-org) wrote :

Hi Chris,

Am I experiencing a bug in the X server then? Do you want me to open a new bug? Something is seriously wrong if one app is able to bring down my entire desktop by accident.

Cheers, Chris.

Revision history for this message
In , Chris Wilson (ickle) wrote :

Thinking about it, a bug against Xorg core to teach it per-client resource limits is actually not a bad idea. I would imagine that XACE, the security extension to X that already does all the permission checks, should be modifiable to also perform resource limit checks.

Revision history for this message
In , Chris-freedesktop-org (chris-freedesktop-org) wrote :

Thanks, filed bug #60925. Cheers, Chris.
