[i945GM] X server gets stuck in loop waiting on DRI, reads mouse events but ignores all socket I/O hence appears frozen
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
xserver-xorg-video-intel (Ubuntu) | Invalid | Undecided | Unassigned |
Bug Description
Binary package hint: xserver-
This Intel freeze is one where the mouse kept working, and I was able to log in over SSH and run some diagnostics, which point to DRI. I have a 945GM chipset, on a laptop.
I was able to get some diagnostic output which might help identify the cause.
Kernel DRI is implicated: it's stuck in the kernel, but interruptible with a signal (in this type of freeze). But after turning off DRI, the Intel hardware is still in a stuck state, implying it's not necessarily a bug in kernel DRI, just getting stuck due to an unexpected hardware state, which X server initialisation does not clear. I ran GDB on an X server with DRI disabled in this state, and found which memory location it is polling.
All my X lockup problems started with the upgrade from Intrepid to Jaunty, very soon after Jaunty was released. I recall one lockup with mouse movement soon after the upgrade, but most have been complete freezes without even network access. This was a rare one with the mouse working, so I took a closer look.
I am not using Compiz, just a standard Ubuntu 9.04 GNOME desktop, and mostly just Firefox, Emacs and Gnome-Terminal. Ubuntu's System > Settings > Appearance > Visual Effects is set to None, and I'd not run any 3d applications.
Freeze in ioctl(/
=======
This Intel freeze is one where the mouse kept working, and I was able to log in over SSH and run some diagnostics, which point to DRI.
Since updating to Jaunty, I've seen freezes every few days or weeks on my laptop while using X. Usually there is nothing I can do except power cycle - no network, no keyboard response, Alt-SysRq not working. Occasionally though, the mouse keeps working.
Kernel version is: 2.6.28-13-generic
CPU is 32-bit x86, a Core Duo 2.0GHz with 2.5GB RAM.
Graphics is Intel 945GM according to Xorg.0.log.
Kernel package:
linux-
X packages:
xserver-
xserver-common 2:1.6.0-0ubuntu14
xserver-xorg 1:7.4~5ubuntu18
xserver-
xserver-xorg-dev 2:1.6.0-0ubuntu14
xorg 1:7.4~5ubuntu18
GPU according to Xorg.log: Intel 945GM.
Like many people I've been having apparently random freezes with Jaunty on Intel graphics h/w, and more than one kind of freeze.
This time, mouse movement worked: specifically, the pointer moved around in response. However, there was no visible reaction from clicking or key presses or focus changes, and no other screen updates. Just pointer movement.
I straced the X server, and saw it blocked on ioctl(0x6458) on /dev/dri/card0; I think that's DRM_I915_
Mouse movements interrupted the ioctl() with a SIGIO signal. It handled SIGIO, reading some events, then it went back to calling the DRI ioctl(). This must be why mouse movement worked.
See [X_DRI_
During SIGIO, it called select() on the /dev/input/eventN only. It didn't select on any of the socket descriptors.
Because it didn't check any sockets, all attempts to communicate with the X server, including starting a new app from the text console, had no effect on the strace; it remained blocked in the DRI ioctl(). So it's not surprising there was no screen update from apps.
Clearly the problem was being stuck in the ioctl(11, 0x6458, 0), and clearly that is not _intended_ to block for long.
That was /dev/dri/card0, and I think the ioctl is DRM_I915_
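The request number itself can be split into fields to see why it points at a DRM driver-specific call. This is only a minimal sketch, assuming the generic Linux ioctl bit layout used on x86 (8-bit command number, 8-bit type, 14-bit size, 2-bit direction) and the DRM convention that driver-specific commands start at 0x40. The helper name decode_ioctl is made up for illustration.

```python
# Sketch: split the ioctl request seen in strace into its encoded fields.
# Assumes the generic Linux layout (as on x86): nr in bits 0-7, type in
# bits 8-15, size in bits 16-29, direction in bits 30-31.

DRM_COMMAND_BASE = 0x40  # DRM convention: driver-specific commands start here


def decode_ioctl(request):
    """Return the fields packed into an ioctl request number."""
    return {
        "nr": request & 0xFF,
        "type": chr((request >> 8) & 0xFF),
        "size": (request >> 16) & 0x3FFF,
        "dir": (request >> 30) & 0x3,
    }


fields = decode_ioctl(0x6458)
driver_cmd = fields["nr"] - DRM_COMMAND_BASE

print(fields)           # nr is 0x58, type is 'd', size and dir are 0
print(hex(driver_cmd))  # 0x18, relative to the driver-specific base
```

The type character 'd' is the DRM ioctl base, and a zero size and direction mean the call carries no argument structure, which matches the 0 passed as the third argument in the strace. The driver-relative command 0x18 is the value to look up in the i915 DRM header to confirm the ioctl's name.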
Inside the ioctl(), the Xorg process was waiting in kernel function i915_wait_
It looks like the kernel is broken or the chip is stuck in some "impossible" way. Judging by the way the X server does not try to handle sockets while this call is blocked, I guess it is not supposed to block for long.
As a last resort, I killed the server: "kill 3971" didn't kill it. "kill -HUP 3971" didn't kill it. "kill -QUIT 3971" did.
The Xorg.0.log from this X server is [X_DRI_
Once DRI is stuck, it stays stuck:
=======
After killing it, gdm automatically started another X server. The new X server started well, judging by its log. But it only produced a black screen (not even the X stipple background). strace showed it stuck waiting for the same ioctl() as before, on /dev/dri/card0... gdm eventually gave up waiting for it.
The Xorg.log from the second X server is [X2_DRI_
Disabling DRI with [Option "DRI" "no"] broke the X server
=======
So I edited /etc/X11/xorg.conf, adding 'Option "DRI" "no"', and killed the second X server with "kill -QUIT" to start a third.
The third X server failed to start. It didn't even get stuck; it just output some error messages and gave up. Perhaps you can't disable DRI with that option alone (see next section); is that another bug to report?
The Xorg.log from this X server is [X3_no_
=> This log may be particularly useful because it says a few things about the stuck hardware state.
(WW) intel(0): ESR is 0x00000001, instruction error
(WW) intel(0): PRB0_CTL (0x0001f001) indicates ring buffer enabled
(WW) intel(0): PRB0_HEAD (0x00000000) and PRB0_TAIL (0x00000050) indicate ring buffer not flushed
(WW) intel(0): Existing errors found in hardware state.
Error in I830WaitLpRing(), timeout for 2 seconds
pgetbl_ctl: 0x9ffc0001 getbl_err: 0x00000000
ipeir: 0x00000000 iphdr: 0x54300004
LP ring tail: 0x00000050 head: 0x00000000 len: 0x0001f001 start 0x007bf000
eir: 0x0000 esr: 0x0001 emr: 0xffff
instdone: 0x4081 instpm: 0x0000
memmode: 0x00000301 instps: 0x800f0044
hwstam: 0xffff ier: 0x0000 imr: 0xffff iir: 0x0000
Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20 acthd 0x9c764c
Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20
(... above line repeated exactly 64 times ...)
0001fffc: 00000000 MI_NOOP 1
Ring end
space: 130984 wanted 131064
Fatal server error:
lockup
Please consult the The X.Org Foundation support
at http://
for help.
Please also check the log file at "/var/log/
Error in I830WaitLpRing(), timeout for 2 seconds
pgetbl_ctl: 0x9ffc0001 getbl_err: 0x00000000
ipeir: 0x00000000 iphdr: 0x54300004
LP ring tail: 0x00000050 head: 0x00000000 len: 0x0001f001 start 0x007bf000
eir: 0x0000 esr: 0x0001 emr: 0xffff
instdone: 0x4081 instpm: 0x0000
memmode: 0x00000301 instps: 0x800f0044
hwstam: 0xffff ier: 0x0000 imr: 0xffff iir: 0x0000
Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20 acthd 0x9c764c
Ring at virtual 0xa7794000 head 0x0 tail 0x50 count 20
(... above line repeated exactly 64 times ...)
0001fffc: 00000000 MI_NOOP 1
Ring end
space: 130984 wanted 131064
FatalError re-entered, aborting
lockup
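The numbers in the log above are internally consistent, which can be checked with a little arithmetic. The following is a sketch under stated assumptions, not the driver's actual code: it assumes the ring length in PRB0_CTL is stored in bits 12 to 20 as a page count minus one, that bit 0 is the enable bit, and that free space is computed as head minus (tail plus 8), wrapped by the ring size. These assumptions were chosen because they reproduce the logged values; the function and constant names are illustrative.

```python
# Sketch: recompute the ring-buffer figures printed in the Xorg log.
# The register layout and space formula are ASSUMPTIONS that happen to
# reproduce the logged numbers; they are not taken from the report.

PAGE_SIZE = 4096
RING_NR_PAGES_MASK = 0x001FF000  # assumed: (pages - 1) in bits 12-20
RING_ENABLE_BIT = 0x1            # assumed: bit 0 enables the ring


def ring_size_bytes(ctl):
    """Ring size in bytes derived from the control register value."""
    return (((ctl & RING_NR_PAGES_MASK) >> 12) + 1) * PAGE_SIZE


def ring_space(head, tail, size):
    """Free bytes in the ring, keeping an 8-byte gap behind the tail."""
    space = head - (tail + 8)
    if space < 0:
        space += size
    return space


PRB0_CTL, PRB0_HEAD, PRB0_TAIL = 0x0001F001, 0x00000000, 0x00000050

size = ring_size_bytes(PRB0_CTL)
space = ring_space(PRB0_HEAD, PRB0_TAIL, size)

print(bool(PRB0_CTL & RING_ENABLE_BIT))  # ring enabled, as the log warns
print(PRB0_TAIL // 4)                    # 20 dwords queued ("count 20")
print(space, size - 8)                   # 130984 131064 ("space"/"wanted")
```

Under these assumptions the server is waiting for the ring to drain completely (131064 bytes free), but 20 dwords are still queued and the head never moves from 0, so the wait can only time out.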
Disabling DRI even harder
=======
'Option "DRI" "no"' didn't work, so I looked for anything else DRI-related in the xorg.conf, in case that helped disable it properly.
That resulted in commenting out these lines, which came with it by default:
(in section "Module")
# Load "dri"
(at the end)
#Section "DRI"
# Mode 0666
#EndSection
And then X was able to start without exiting quickly. Does that mean the 'Option' didn't disable DRI? Perhaps I should report this as a bug too?
This one started, but still didn't actually work. This time it got stuck at 100% CPU in userspace.
The log from this one is [X4_no_
Because it was stuck in userspace, I looked at it with GDB.
GDB said X was stuck inside intel_drv.so, in a short loop like this:
0xb78b1830: mov (%edx),%eax
0xb78b1832: test %eax,%eax
0xb78b1834: jns 0xb78b1830
%edx = 0xb785f200, and %eax = 0x0000000a.
It's polling the word at address 0xb785f200, waiting for bit 31 to become set.
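For readers less used to x86 assembly, the three instructions can be restated in Python. This is only an illustrative model: `jns` jumps back while the sign flag is clear, which after `test %eax,%eax` means bit 31 of the word just read is 0. The function names and the bounded read count are assumptions added so that the example terminates; the real loop has no bound.

```python
# Sketch: model of the mov/test/jns loop found with GDB.
# `read_word` stands in for the memory-mapped read; names are illustrative.

BIT31 = 1 << 31


def would_loop_again(word):
    """True when `test %eax,%eax; jns` would jump back (bit 31 is 0)."""
    return (word & BIT31) == 0


def poll_until_bit31(read_word, max_reads):
    """Bounded version of the loop: number of reads used, or None."""
    for reads in range(1, max_reads + 1):
        if not would_loop_again(read_word()):
            return reads
    return None


# The value observed in %eax was 0x0000000a every time, so bit 31 never
# appears and the (unbounded) original loop spins forever.
print(poll_until_bit31(lambda: 0x0000000A, max_reads=1000))  # None
```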
So I looked up that address in /proc/26213/maps, while the process was still being debugged:
b77fe000-b787e000 rw-s dc100000 00:00 7477 /sys/devices/
The offset is 0xb785f200-
The corresponding PCI device is:
00:02.0 VGA compatible controller: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller (rev 03)
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Region 0: Memory at dc100000 (32-bit, non-prefetchable) [size=512K]
Region 1: I/O ports at 1800 [size=8]
Region 2: Memory at c0000000 (32-bit, prefetchable) [size=256M]
Region 3: Memory at dc200000 (32-bit, non-prefetchable) [size=256K]
Kernel modules: intelfb
So we conclude that with all of DRI disabled, intel_drv.so gets stuck polling the word at offset 0x61200 from PCI region 0 of a 945GM, waiting for bit 31 to become set. Every time I stopped the stuck process and inspected it in GDB, the value it had read was 0x0a.
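The offset arithmetic can be double-checked from the maps line and the polled address. This is a small sketch; parse_maps_line is a hypothetical helper, and the path in the maps line is truncated here exactly as it is in the report.

```python
# Sketch: derive the register offset from the /proc/<pid>/maps entry.
# The mapped path is truncated in the report and is left truncated here.

MAPS_LINE = "b77fe000-b787e000 rw-s dc100000 00:00 7477 /sys/devices/"
POLLED_ADDRESS = 0xB785F200  # value of %edx in the polling loop


def parse_maps_line(line):
    """Return (start, end, perms, file_offset) from one maps line."""
    fields = line.split()
    start_text, end_text = fields[0].split("-")
    return int(start_text, 16), int(end_text, 16), fields[1], int(fields[2], 16)


start, end, perms, file_offset = parse_maps_line(MAPS_LINE)

# The polled address must fall inside this mapping for the lookup to hold.
assert start <= POLLED_ADDRESS < end

print(hex(POLLED_ADDRESS - start))  # 0x61200, offset into the mapping
print(hex(end - start))             # 0x80000, i.e. 512K like Region 0
print(hex(file_offset))             # 0xdc100000, the Region 0 base address
```

The mapping is 512K long and its offset field equals the base of Region 0 in the lspci output, which is what ties the polled address to that PCI region.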
That register is called PP_STATUS in i810_reg.h, and bit 31 is PP_ON; that file doesn't say what 0x0a means.
I don't know if it's a different register on 945GM. I only found an obvious definition in i810_reg.h.
Since this is with all of DRI disabled in this (fourth) run of X, it might be unrelated to the original lockup. It might be stuck because some other part of the video hardware was left in an improper state when the previous X servers were killed. In particular, PP_STATUS is related to "panel power" as far as I can tell, and has nothing to do with DRI, unless DRI gets stuck waiting for vsync or something like that.
But it's quite suspicious.
Afterwards
==========
After rebooting, X starts with the same settings (DRI completely disabled) and works fine. I've been using it like this ever since, and haven't had any lockups. That was 6 days ago.
The lockups were irregular before, with anything from a few days to about 2 weeks between them, and I haven't used the computer a lot since the bug was noticed.
If I don't get any lockups for a month with the new settings, we can be more confident that this one needs DRI enabled to trigger it.
Hope the debugging is useful, or even better if it's already fixed upstream :-)
tags: | added: jaunty |
Here are some lines grepped from my kernel log at the time of this freeze.
I don't think they are related, but you never know.
I did a "grep i915 /var/log/kern.log".
The initial freeze happens after the entry at time 00:47:30, before the entry at 04:45:10. The entries from 04:45:10 are due to me starting more X servers.