[i945GM] X server gets stuck in loop waiting on DRI, reads mouse events but ignores all socket I/O hence appears frozen

Bug #398968 reported by Jamie Lokier
This bug affects 2 people
Affects: xserver-xorg-video-intel (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: xserver-xorg-video-intel

This Intel freeze is one where the mouse kept working, and I was able to login over SSH and do some diagnostics which point to DRI. I have a 945GM chipset, on a laptop.

I was able to get some diagnostic output which might help identify the cause.

Kernel DRI is implicated: it's stuck in the kernel, but interruptible with a signal (in this type of freeze). But after turning off DRI, the Intel hardware is still in a stuck state, implying it's not necessarily a bug in kernel DRI, just getting stuck due to an unexpected hardware state, which X server initialisation does not clear. I ran GDB on an X server with DRI disabled in this state, and found which memory location it is polling.

All my X lockup problems started with the upgrade from Intrepid to Jaunty, very soon after Jaunty was released. I recall one lockup with mouse movement soon after the upgrade, but most have been complete freezes without even network access. This was a rare one with the mouse working, so I took a closer look.

I am not using Compiz, just a standard Ubuntu 9.04 GNOME desktop, and mostly just Firefox, Emacs and Gnome-Terminal. Ubuntu's System > Settings > Appearance > Visual Effects is set to None, and I'd not run any 3d applications.

Freeze in ioctl(/dev/dri/card0, DRM_I915_GEM_THROTTLE), mouse still works
=========================================================================


Since updating to Jaunty, I've seen freezes every few days or weeks on my laptop while using X. Usually there is nothing I can do except power cycle - no network, no keyboard response, Alt-SysRq not working. Occasionally though, the mouse keeps working.

Kernel version is: 2.6.28-13-generic
CPU is 32-bit x86, a Core Duo 2.0GHz with 2.5GB RAM.
Graphics is Intel 945GM according to Xorg.0.log.
Kernel package:
   linux-image-2.6.28-13-generic 2.6.28-13.45
X packages:
   xserver-xorg-video-intel 2:2.6.3-0ubuntu9.3
   xserver-common 2:1.6.0-0ubuntu14
   xserver-xorg 1:7.4~5ubuntu18
   xserver-xorg-core 2:1.6.0-0ubuntu14
   xserver-xorg-dev 2:1.6.0-0ubuntu14
   xorg 1:7.4~5ubuntu18


Like many people I've been having apparently random freezes with Jaunty on Intel graphics h/w, and more than one kind of freeze.

This time, mouse movement worked: specifically, the pointer moved around in response. However, there was no visible reaction from clicking or key presses or focus changes, and no other screen updates. Just pointer movement.

I straced the X server, and saw it blocked on ioctl(0x6458) on /dev/dri/card0; I think that's DRM_I915_GEM_THROTTLE for this chip.

Mouse movements interrupted the ioctl() with a SIGIO signal. It handled SIGIO, reading some events, then it went back to calling the DRI ioctl(). This must be why mouse movement worked.

See [X_DRI_freeze_strace.txt] for the strace, and [X_DRI_freeze_fds.txt] for the file descriptors.

During SIGIO, it called select() on the /dev/input/eventN only. It didn't select on any of the socket descriptors.

Because it didn't check any sockets, all attempts to communicate with the X server, including starting a new app from the text console, had no effect on the strace; it remained blocked in the DRI ioctl(). So it's not surprising there was no screen update from apps.

Clearly the problem was being stuck in the ioctl(11, 0x6458, 0), and clearly that is not _intended_ to block for long.

That was /dev/dri/card0, and I think the ioctl is DRM_I915_GEM_THROTTLE for this chip.
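
The guess about the ioctl number can be checked by splitting 0x6458 into its fields. This is a minimal sketch, assuming the usual Linux ioctl number layout on x86 and assuming the constants below (DRM type character 'd', DRM_COMMAND_BASE 0x40, DRM_I915_GEM_THROTTLE 0x18) match the drm.h and i915_drm.h headers for this kernel; they are written from memory, not taken from this report, so please verify them against your own headers.

```python
# Field widths of a Linux ioctl request number (asm-generic layout, used on x86).
IOC_NRBITS, IOC_TYPEBITS, IOC_SIZEBITS = 8, 8, 14

# Assumed values; check drm.h and i915_drm.h for the kernel in question.
DRM_IOCTL_BASE = ord("d")      # DRM ioctl "type" character
DRM_COMMAND_BASE = 0x40        # first driver-private DRM command number
DRM_I915_GEM_THROTTLE = 0x18   # i915 driver-private command index


def decode_ioctl(request):
    """Split an ioctl request number into (direction, size, type, nr)."""
    nr = request & ((1 << IOC_NRBITS) - 1)
    ioc_type = (request >> IOC_NRBITS) & ((1 << IOC_TYPEBITS) - 1)
    size = (request >> (IOC_NRBITS + IOC_TYPEBITS)) & ((1 << IOC_SIZEBITS) - 1)
    direction = request >> (IOC_NRBITS + IOC_TYPEBITS + IOC_SIZEBITS)
    return direction, size, ioc_type, nr


direction, size, ioc_type, nr = decode_ioctl(0x6458)
# Prints the type character, the driver-private command index, direction and size.
print(chr(ioc_type), hex(nr - DRM_COMMAND_BASE), direction, size)
```

Under those assumptions the type is 'd', the driver-private index is 0x18, and there is no argument data (direction 0, size 0), which is consistent with the third argument being 0 in the strace.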

Inside the ioctl(), the Xorg process was waiting in kernel function i915_wait_request(), according to "ps -opid,comm,wchan 3971". The numeric WCHAN is 0x6080e if that's helpful. See [X_DRI_freeze_wchan.txt].

It looks like the kernel is broken or the chip is stuck in some "impossible" way. Judging by the way the X server does not try to handle sockets while this call is blocked, I guess it is not supposed to block for long.

As a last resort, I killed the server: "kill 3971" didn't kill it. "kill -HUP 3971" didn't kill it. "kill -QUIT 3971" did.

The Xorg.0.log from this X server is [X_DRI_freeze_log.txt]. You can see from the end that there's nothing interesting at the time it gets stuck.

Once DRI is stuck, it stays stuck:
=================================

After killing it, gdm automatically started another X server. The new X server started well, judging by its log. But it only produced a black screen (not even the X stipple background). strace showed it stuck waiting for the same ioctl() as before, on /dev/dri/card0... gdm eventually gave up waiting for it.

The Xorg.log from the second X server is [X2_DRI_frozen_startup.txt].

Disabling DRI with [Option "DRI" "no"] broke the X server
=========================================================

So I edited /etc/X11/xorg.conf, adding 'Option "DRI" "no"', and killed the second X server with "kill -QUIT" to start a third.

The third X server failed to start. It didn't even get stuck, it just output some error messages and gave up. Perhaps you can't disable DRI with that option alone (see next section); is that another bug to report?

The Xorg.log from this X server is [X3_no_DRI_broken_hwstate.txt]

=> This log may be particularly useful because it says a few things about the stuck hardware state.

  (WW) intel(0): ESR is 0x00000001, instruction error
  (WW) intel(0): PRB0_CTL (0x0001f001) indicates ring buffer enabled
  (WW) intel(0): PRB0_HEAD (0x00000000) and PRB0_TAIL (0x00000050) indicate ring buffer not flushed
  (WW) intel(0): Existing errors found in hardware state.
  Error in I830WaitLpRing(), timeout for 2 seconds
  pgetbl_ctl: 0x9ffc0001 getbl_err: 0x00000000
  ipeir: 0x00000000 iphdr: 0x54300004
  LP ring tail: 0x00000050 head: 0x00000000 len: 0x0001f001 start 0x007bf000
  eir: 0x0000 esr: 0x0001 emr: 0xffff
  instdone: 0x4081 instpm: 0x0000
  memmode: 0x00000301 instps: 0x800f0044
  hwstam: 0xffff ier: 0x0000 imr: 0xffff iir: 0x0000
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20 acthd 0x9c764c
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20
(... above line repeated exactly 64 times ...)
          0001fffc: 00000000 MI_NOOP 1
  Ring end
  space: 130984 wanted 131064

  Fatal server error:
  lockup

  Please consult the The X.Org Foundation support
           at http://wiki.x.org
   for help.
  Please also check the log file at "/var/log/Xorg.0.log" for additional information.

  Error in I830WaitLpRing(), timeout for 2 seconds
  pgetbl_ctl: 0x9ffc0001 getbl_err: 0x00000000
  ipeir: 0x00000000 iphdr: 0x54300004
  LP ring tail: 0x00000050 head: 0x00000000 len: 0x0001f001 start 0x007bf000
  eir: 0x0000 esr: 0x0001 emr: 0xffff
  instdone: 0x4081 instpm: 0x0000
  memmode: 0x00000301 instps: 0x800f0044
  hwstam: 0xffff ier: 0x0000 imr: 0xffff iir: 0x0000
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20 acthd 0x9c764c
  Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20
(... above line repeated exactly 64 times ...)
          0001fffc: 00000000 MI_NOOP 1
  Ring end
  space: 130984 wanted 131064

  FatalError re-entered, aborting
  lockup
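
The "space" and "wanted" numbers in that dump can be reproduced from the head, tail and PRB0_CTL values alone. This is a minimal sketch, assuming the ring length is encoded in bits 12..20 of PRB0_CTL as (pages - 1) and that free space is computed as head - (tail + 8) with wrap-around; both are my reading of the driver, not something stated in the log, and the function names are made up for illustration.

```python
PAGE_SIZE = 4096


def ring_size_from_ctl(prb0_ctl):
    """Ring size in bytes, assuming bits 12..20 hold (length in pages - 1)."""
    return (((prb0_ctl >> 12) & 0x1FF) + 1) * PAGE_SIZE


def ring_space(head, tail, size):
    """Free bytes in the ring, keeping an 8-byte gap between tail and head."""
    space = head - (tail + 8)
    if space < 0:
        space += size  # wrap around the circular buffer
    return space


size = ring_size_from_ctl(0x0001F001)   # PRB0_CTL from the log
space = ring_space(0x0, 0x50, size)     # PRB0_HEAD and PRB0_TAIL from the log
# Prints ring size, current free space, and the free space of an empty ring.
print(size, space, size - 8)
```

With these assumptions the sketch gives a 131072-byte ring, 130984 bytes free, and 131064 bytes free when empty, matching "space: 130984 wanted 131064". The 80-byte shortfall is exactly the 0x50 bytes between head and tail, so the server appears to be waiting for the chip to consume commands that it never consumes.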

Disabling DRI even harder
=========================

'Option "DRI" "no"' didn't work, so I looked for anything else DRI-related in the xorg.conf, in case that helped disable it properly.

That resulted in commenting out these lines, which came with it by default:

(in section "Module")
# Load "dri"

(at the end)
#Section "DRI"
# Mode 0666
#EndSection

And then X was able to start without exiting quickly. Does that mean the 'Option' didn't disable DRI? Perhaps I should report this as a bug too?

This one started, but still didn't actually work. This time it got stuck in 100% CPU in userspace.

The log from this one is [X4_no_DRI_stuck_in_loop.txt].

Because it was stuck in userspace, I looked at it with GDB.

GDB said X was stuck inside intel_drv.so, in a short loop like this:

  0xb78b1830: mov (%edx),%eax
  0xb78b1832: test %eax,%eax
  0xb78b1834: jns 0xb78b1830

  %edx = 0xb785f200, and %eax = 0x0000000a.

It's polling the word at address 0xb785f200, waiting for bit 31 to become set.
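
For anyone not fluent in x86 assembly, the loop can be paraphrased as follows. This is a minimal sketch of the same test, not the driver's source; the function names are made up, and unlike the real loop this version gives up after a fixed number of reads.

```python
def sign_bit_set(word):
    """True when bit 31 of a 32-bit word is set (what `jns` checks via the sign flag)."""
    return bool(word & 0x80000000)


def poll_until_bit31(read_word, max_reads):
    """Read repeatedly until bit 31 is set.

    Returns the number of reads used, or None if max_reads is exhausted.
    The loop in intel_drv.so has no such limit, hence the 100% CPU spin.
    """
    for n in range(1, max_reads + 1):
        if sign_bit_set(read_word()):
            return n
    return None


# The stuck hardware returns 0x0000000a on every read, so the poll never succeeds.
print(poll_until_bit31(lambda: 0x0000000A, 1000))
```

Since 0x0000000a has only bits 1 and 3 set, the condition can never become true unless the hardware changes the value.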

So I looked up that address in /proc/26213/maps, while the process was still being debugged:

  b77fe000-b787e000 rw-s dc100000 00:00 7477 /sys/devices/pci0000:00/0000:00:02.0/resource0

The offset is 0xb785f200-0xb77fe000 == 0x61200.
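
The same subtraction can be done mechanically from the maps line, which is handy when checking several addresses. This is a minimal sketch; the helper name is made up for illustration.

```python
def offset_in_mapping(maps_line, address):
    """Return the offset of `address` from the start of the mapping in `maps_line`.

    `maps_line` is one line of /proc/<pid>/maps; only the first field
    (the "start-end" address range) is used.
    """
    start_s, end_s = maps_line.split()[0].split("-")
    start, end = int(start_s, 16), int(end_s, 16)
    if not start <= address < end:
        raise ValueError("address is outside this mapping")
    return address - start


line = ("b77fe000-b787e000 rw-s dc100000 00:00 7477 "
        "/sys/devices/pci0000:00/0000:00:02.0/resource0")
print(hex(offset_in_mapping(line, 0xB785F200)))
```

The mapping is 0x80000 bytes (512K) long, the same size as PCI region 0 on this device.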

The corresponding PCI device is:

  00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
          Subsystem: Fujitsu Siemens Computers Device 10ad
          Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
          Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
          Latency: 0
          Interrupt: pin A routed to IRQ 16
          Region 0: Memory at dc100000 (32-bit, non-prefetchable) [size=512K]
          Region 1: I/O ports at 1800 [size=8]
          Region 2: Memory at c0000000 (32-bit, prefetchable) [size=256M]
          Region 3: Memory at dc200000 (32-bit, non-prefetchable) [size=256K]
          Capabilities: [90] Message Signalled Interrupts: Mask- 64bit- Queue=0/0 Enable-
                  Address: 00000000 Data: 0000
          Capabilities: [d0] Power Management version 2
                  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                  Status: D0 PME-Enable- DSel=0 DScale=0 PME-
          Kernel modules: intelfb

So we conclude that with all of DRI disabled, intel_drv.so gets stuck polling the word at offset 0x61200 from PCI region 0 of a 945GM, waiting for bit 31 to become set. When stuck, it reads the value 0x0a every time, when stopped and viewed in GDB.

That register is called PP_STATUS in i810_reg.h, bit 31 is PP_ON; that file doesn't say what 0x0a means.

I don't know if it's a different register on 945GM. I only found an obvious definition in i810_reg.h.

Since this is with all of DRI disabled in this (fourth) run of X, it might be unrelated to the original lockup. It might be stuck due to some other part of the video card in an improper state due to previous X servers being killed. In particular, PP_STATUS is related to "panel power" as far as I can tell, and nothing to do with DRI - unless DRI gets stuck waiting for vsync or something like that.

But it's quite suspicious.

Afterwards
==========

After rebooting, X starts with the same settings (DRI completely disabled) and works fine. I've been using it like this ever since, and haven't had any lockups. That's 6 days ago.

The lockups were quite variable before, happening anywhere from every few days to about once every 2 weeks, and I haven't used the computer a lot since the bug was noticed.

If I don't get any lockups for a month with the new settings, we can be more confident that this one needs DRI enabled to trigger it.

Hope the debugging is useful, or even better if it's already fixed upstream :-)

Tags: jaunty
Jamie Lokier (jamie-shareable) wrote :

Here are some lines grepped from my kernel log at the time of this freeze.
I don't think they are related, but you never know.
I did a "grep i915 /var/log/kern.log".

The initial freeze happens after the entry at time 00:47:30, before the entry at 04:45:10. The entries from 04:45:10 are due to me starting more X servers.

Bryce Harrington (bryce) wrote :

Hi Jamie, excellent analysis work. I can share some additional background and analysis we've done on this class of bug.

A lot of people have found that the 2.6 driver series simply does not work properly for them, and opt to either upgrade to 2.7 or downgrade to 2.4. We have PPAs providing packages for each of these options, in the x-updates and x-retro PPAs here:

  https://edge.launchpad.net/~ubuntu-x-swat

If you want to focus on identifying the root cause of the issue, I've written up some additional troubleshooting tips at https://wiki.ubuntu.com/X/Troubleshooting/Freeze which may be of some use.

I also put together some tools that help diagnose X freeze errors:
  https://edge.launchpad.net/~ubuntu-x-swat/+archive/x-freeze-test/

Unfortunately, in the run-up to the Jaunty release we found that most of the patches we tried that fixed a problem for one group would cause unpredictable breakage for other folks with different hardware. So we're being extraordinarily conservative about introducing patches for Jaunty's -intel, and are focusing our efforts on getting the issues resolved in the newer versions of the code, so that Karmic is a solid upgrade for everyone.

On that note, I would also encourage you to try reproducing the issue on Karmic. Running a LiveCD of alpha-2 or a daily build might be worth trying, although I know these freeze bugs can often take many hours or several days to reproduce, so running a LiveCD may be impractical.

The good news though is that so far we're seeing huge improvements in the Karmic -intel driver. A ton of the most pervasive bugs have vanished, and while there are still of course some problems, they seem more of the corner case variety and seem to be getting good attention when reported upstream.

If you do want to help chase down a patch for Jaunty for this issue, I'll be happy to review and package it for Karmic and arrange to get sufficient testing that we can prove it to be safe. On the other hand, if you're satisfied with one of the other options listed above, we can consider that the fix.

Changed in xserver-xorg-video-intel (Ubuntu):
status: New → Incomplete
Bryce Harrington (bryce)
tags: added: jaunty
Bryce Harrington (bryce) wrote :

We're closing this bug since it has been some time with no response from the original reporter. However, if the issue still exists please feel free to reopen with the requested information. Also, if you could, please test against the latest development version of Ubuntu, since this confirms the bug is one we may be able to pass upstream for help.

Changed in xserver-xorg-video-intel (Ubuntu):
status: Incomplete → Invalid
Jamie Lokier (jamie-shareable) wrote :

I did the debugging work because I thought you (or upstream) might find it useful. Plenty of lockups have been reported but not many where it's possible to observe a faulty kernel or hardware state in detail.

Specifically, because of the clues in how the kernel or hardware state was broken and an X server restart didn't recover it, I thought you or upstream developers would find it useful.

But it appears that was a mistake since you weren't interested in analysis, just whether running Karmic for a month locks up less often. (Running it for a day would not be enough).

In other words, if I'd just reported it locking up, your response would have been exactly the same: try Karmic, see if it's more reliable.

I'm not sure what response you were waiting for, other than 'I ran Karmic and don't get (as many) lockups'.

Instead I turned off DRI entirely (on Jaunty), and I get far fewer lockups but have still had one, unfortunately. Presumably the one I had subsequently was a different kind, as the one in this report depends on DRI.

==> So, I have a question: If I do experience lockups when I eventually move to Karmic and re-enable DRI, and it's not a complete system freeze so possible to debug like in this report, should I spend the hours doing that or is it not useful here?

Thanks!

Btw, why is the new status Invalid and not Closed? This report may be useful to someone else who comes searching in future, especially if the problem reappears, and Invalid bugs tend to be skipped over when searching.

souplin (klage) wrote :

I'm having the same problems since an update for the jaunty xserver intel driver. After changing to karmic my desktop still freezes, although I can still move the mouse.
Any suggestions on how to get this fixed?
