X server crash: *** glibc detected *** free(): in valid next size (fast)

Bug #328035 reported by Matt Zimmerman on 2009-02-11
74
This bug affects 9 people
Affects Status Importance Assigned to Milestone
xorg-server (Ubuntu)
High
Bryce Harrington
Jaunty
High
Bryce Harrington

Bug Description

Binary package hint: xorg

The X server has been observed to crash on a variety of different intel chipsets. This most often happens during a resume (where the symptom is that the system resumes to a gdm login dialog rather than to the user session), but has been seen in other circumstances as well.

The crash is due to heap corruption detected by glibc. This is logged to /var/log/gdm/:0.log.1 and looks like this:

*** glibc detected *** /usr/X11R6/bin/X: free(): invalid next size (fast): 0x0000000005f124e0 ***
======= Backtrace: =========
/lib/libc.so.6[0x7f096ada6cb8]
/lib/libc.so.6(cfree+0x76)[0x7f096ada9276]
/usr/X11R6/bin/X(LogVMessageVerb+0x104)[0x4fb964]
/usr/X11R6/bin/X(xf86MsgVerb+0x8f)[0x48f26f]
/usr/X11R6/bin/X[0x4708a6]
/usr/X11R6/bin/X(xf86DeleteInput+0x1a)[0x491b3a]
/usr/X11R6/bin/X(DeleteInputDeviceRequest+0xf2)[0x496872]
/usr/X11R6/bin/X[0x46bf5e]
/usr/X11R6/bin/X[0x46c037]
/usr/lib/libhal.so.1[0x7f096bf53ee9]
/lib/libdbus-1.so.3(dbus_connection_dispatch+0x33b)[0x7f096bd1990b]
/lib/libdbus-1.so.3[0x7f096bd19ccb]
/usr/X11R6/bin/X[0x46b2fb]
/usr/X11R6/bin/X(WakeupHandler+0x4b)[0x451ebb]
/usr/X11R6/bin/X(WaitForSomething+0x1ef)[0x4ef41f]
/usr/X11R6/bin/X(Dispatch+0x80)[0x44e070]
/usr/X11R6/bin/X(main+0x3bd)[0x433ddd]
/lib/libc.so.6(__libc_start_main+0xe6)[0x7f096ad4d5a6]
/usr/X11R6/bin/X[0x433269]

ProblemType: Bug
Architecture: amd64
DistroRelease: Ubuntu 9.04
Package: xorg 1:7.4~5ubuntu12
ProcEnviron:
 LC_COLLATE=C
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/zsh
ProcVersion: Linux version 2.6.28-7-generic (buildd@yellow) (gcc version 4.3.3 (Ubuntu 4.3.3-3ubuntu2) ) #20-Ubuntu SMP Mon Feb 9 15:42:34 UTC 2009

SourcePackage: xorg
Uname: Linux 2.6.28-7-generic x86_64

[lspci]
00:00.0 Host bridge [0600]: Intel Corporation Mobile PM965/GM965/GL960 Memory Controller Hub [8086:2a00] (rev 0c)
     Subsystem: Lenovo Device [17aa:20b3]
00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller [8086:2a02] (rev 0c)
     Subsystem: Lenovo Device [17aa:20b5]

Related branches

Matt Zimmerman (mdz) wrote :
Matt Zimmerman (mdz) wrote :

I tried strace -s1024 but it seems it still doesn't expand the arguments to writev(). I'm not sure how to get more detail from this but I can reproduce it fairly easily. Even without suspending and resuming, it can happen when connecting a USB device (remote control).

Note that this is the same device as in bug 318261 and bug 311254, which may be related.

Alberto Milone (albertomilone) wrote :

Hi Matt, did you install xserver-xorg-core-dbg and tried what this page suggest?
https://wiki.ubuntu.com/X/Backtracing

Bryce Harrington (bryce) wrote :

Hmm, well it's definitely trying to put out a backtrace. Looks like something fails inside xf86Wakeup. Wish it gave an indication what the extension was that it's calling.

write(14, "======= Backtrace: =========\n"..., 29) = 29
writev(14, [{"/lib/libc.so.6"..., 14}, {"[0x"..., 3}, {"7f496c86fd48"..., 12}, {"]\n"..., 2}], 4) = 31
writev(14, [{"/lib/libc.so.6"..., 14}, {"("..., 1}, {"cfree"..., 5}, {"+0x"..., 3}, {"76"..., 2}, {")"..., 1}, {"[0x"..., 3}, {"7f496c872306"..., 12}, {"]\n"..., 2}], 9) = 43
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"("..., 1}, {"LogVMessageVerb"..., 15}, {"+0x"..., 3}, {"104"..., 3}, {")"..., 1}, {"[0x"..., 3}, {"4faae4"..., 6}, {"]\n"..., 2}], 9) = 50
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"("..., 1}, {"LogMessage"..., 10}, {"+0x"..., 3}, {"90"..., 2}, {")"..., 1}, {"[0x"..., 3}, {"4facc0"..., 6}, {"]\n"..., 2}], 9) = 44
writev(14, [{"/usr/lib/xorg/modules/extensions/"..., 43}, {"[0x"..., 3}, {"7f496bd16c03"..., 12}, {"]\n"..., 2}], 4) = 60
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"("..., 1}, {"xf86Wakeup"..., 10}, {"+0x"..., 3}, {"2b9"..., 3}, {")"..., 1}, {"[0x"..., 3}, {"485b19"..., 6}, {"]\n"..., 2}], 9) = 45
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"("..., 1}, {"WakeupHandler"..., 13}, {"+0x"..., 3}, {"4b"..., 2}, {")"..., 1}, {"[0x"..., 3}, {"451d2b"..., 6}, {"]\n"..., 2}], 9) = 47
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"("..., 1}, {"WaitForSomething"..., 16}, {"+0x"..., 3}, {"1ef"..., 3}, {")"..., 1}, {"[0x"..., 3}, {"4ee59f"..., 6}, {"]\n"..., 2}], 9) = 51
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"("..., 1}, {"Dispatch"..., 8}, {"+0x"..., 3}, {"80"..., 2}, {")"..., 1}, {"[0x"..., 3}, {"44def0"..., 6}, {"]\n"..., 2}], 9) = 42
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"("..., 1}, {"main"..., 4}, {"+0x"..., 3}, {"3bd"..., 3}, {")"..., 1}, {"[0x"..., 3}, {"433c5d"..., 6}, {"]\n"..., 2}], 9) = 39
writev(14, [{"/lib/libc.so.6"..., 14}, {"("..., 1}, {"__libc_start_main"..., 17}, {"+0x"..., 3}, {"e6"..., 2}, {")"..., 1}, {"[0x"..., 3}, {"7f496c8165a6"..., 12}, {"]\n"..., 2}], 9) = 55
writev(14, [{"/usr/X11R6/bin/X"..., 16}, {"[0x"..., 3}, {"4330e9"..., 6}, {"]\n"..., 2}], 4) = 27

Changed in xorg:
importance: Undecided → High
status: New → Triaged
Bryce Harrington (bryce) wrote :

Okay if I understand this correctly, X is going through its usual Wakeup handling, and one of the input extensions tries to write a message to the log, however something is wrong and in trying to free a string (maybe the tmpBuf in LogVMessageVerb?) it crashes. Then, I guess it is trying to write this backtrace to the log but of course can't do that, and I guess this triggers the heap corruption?

This seems to be an unusually severe crash, it's not surprising apport had trouble with it. If there's any way we can get a full backtrace, it would help fill in some of the gaps in this understanding.

Would it be possible for you to run X in gdb with a break on xf86Wakeup(), and step through the lines in that routine to see which line leads to the crash?

Matt Zimmerman (mdz) wrote :

Bryce, I think it's significantly simpler than that.

This stack trace isn't being written out by the X server but by glibc (note the different formatting). glibc notices that the heap is corrupt while attempting to satisfy a free(3) call, prints out a backtrace to /dev/tty, and aborts the process with SIGABRT. The X server's own signal handler isn't involved here, and there's only one crash.

The main obstacle I face in trying to debug this properly is that my text consoles don't work. They show a corrupt screen with some blue lines. I don't know if this is a recurrence of bug 312677 or something new, but if it were fixed, I could run X from the console and see the backtrace printed by glibc to /dev/tty.

Matt Zimmerman (mdz) wrote :

My text consoles are now working again (still not sure what the root cause was). So I now have the X server attached with gdb, and when this happens next, I'll get a full backtrace.

However, this backtrace will only be for the free(3) which triggers the abort. If my guess is right, then the corruption is happening sometime earlier, and may be very difficult to track down. Does the X server play nicely with valgrind?

Bryce Harrington (bryce) wrote :

Good to hear the text console issue has gone away. Issues with text consoles can sometimes be kernel issues; if it went away after upgrading the kernel, then that would be a pretty strong clue. We've not put in any changes to -intel recently that would have made that start working, and the only noteworthy xserver change was removing the maxclients change (which caused issues for -nvidia), although I'm doubtful that would have caused vt corruption. So, my guess there would be that either the issue is intermittent and will come back, or it was a kernel issue that is now fixed.

> However, this backtrace will only be for the free(3) which triggers the abort.
> If my guess is right, then the corruption is happening sometime earlier, and
> may be very difficult to track down. Does the X server play nicely with valgrind?

valgrind is definitely listed as an X testing tool (http://freedesktop.org/wiki/Software/XTesting) although I've tried running it myself and it just makes X flicker and turns my vt consoles yellow. It certainly flags a lot of issues, though. If you can get X to run under it enough to reproduce the issue, maybe it could give us useful info.

Download full text (7.0 KiB)

After a couple of days of normal use (including suspend/resume), I caught it in gdb. Again, the actual corruption may be happening elsewhere, but at least we can see exactly where the server is dying.

Program received signal SIGABRT, Aborted.
0x00007f7ccea1cfb5 in raise () from /lib/libc.so.6
(gdb) bt
#0 0x00007f7ccea1cfb5 in raise () from /lib/libc.so.6
#1 0x00007f7ccea1ebc3 in abort () from /lib/libc.so.6
#2 0x00007f7ccea5c228 in __libc_message () from /lib/libc.so.6
#3 0x00007f7ccea61cb8 in malloc_printerr () from /lib/libc.so.6
#4 0x00007f7ccea64276 in free () from /lib/libc.so.6
#5 0x00000000004faae4 in LogVMessageVerb (type=X_INFO, verb=3,
    format=0xa7e7f00 "intel(0): xf86BindGARTMemory: bind key %d at 0x%08lx (pgoffset %d)\n", args=0x7fffd8eed3f0) at ../../os/log.c:392
#6 0x000000000048e99b in xf86VDrvMsgVerb (scrnIndex=0, type=X_INFO, verb=3,
    format=0x57fe18 "xf86BindGARTMemory: bind key %d at 0x%08lx (pgoffset %d)\n", args=0x7fffd8eed3f0) at ../../../../hw/xfree86/common/xf86Helper.c:1260
#7 0x0000000000490340 in xf86DrvMsgVerb (scrnIndex=3739, type=3739, verb=6,
    format=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>)
    at ../../../../hw/xfree86/common/xf86Helper.c:1275
#8 0x00000000004749b4 in xf86BindGARTMemory (screenNum=<value optimized out>,
    key=0, offset=190644224)
    at ../../../../../hw/xfree86/os-support/linux/lnx_agp.c:304
#9 0x00007f7ccd4344dd in i830_bind_memory (pScrn=0x258df10, mem=0x25c6f20)
    at ../../src/i830_memory.c:210
#10 0x00007f7ccd434f5d in i830_bind_all_memory (pScrn=0x258df10)
    at ../../src/i830_memory.c:2024
#11 0x00007f7ccd42f190 in I830EnterVT (scrnIndex=0,
    flags=<value optimized out>) at ../../src/i830_driver.c:3587
#12 0x000000000048c392 in CMapEnterVT (index=0, flags=0)
---Type <return> to continue, or q <return> to quit---
    at ../../../../hw/xfree86/common/xf86cmap.c:455
#13 0x000000000049d679 in xf86XVEnterVT (index=0, flags=0)
    at ../../../../hw/xfree86/common/xf86xv.c:1228
#14 0x00007f7ccdf08c7f in glxDRIEnterVT (index=0, flags=0)
    at ../../glx/glxdri.c:858
#15 0x0000000000485c7c in xf86Wakeup (blockData=<value optimized out>,
    err=<value optimized out>, pReadmask=<value optimized out>)
    at ../../../../hw/xfree86/common/xf86Events.c:634
#16 0x0000000000451d2b in WakeupHandler (result=-1, pReadmask=0x7ddf20)
    at ../../dix/dixutils.c:418
#17 0x00000000004ee59f in WaitForSomething (pClientsReady=0x27cd6e0)
    at ../../os/WaitFor.c:231
#18 0x000000000044def0 in Dispatch () at ../../dix/dispatch.c:367
#19 0x0000000000433c5d in main (argc=10, argv=0x7fffd8eedb18,
    envp=<value optimized out>) at ../../dix/main.c:397
(gdb) bt full
#0 0x00007f7ccea1cfb5 in raise () from /lib/libc.so.6
No symbol table info available.
#1 0x00007f7ccea1ebc3 in abort () from /lib/libc.so.6
No symbol table info available.
#2 0x00007f7ccea5c228 in __libc_message () from /lib/libc.so.6
No symbol table info available.
#3 0x00007f7ccea61cb8 in malloc_printerr () from /lib/libc.so.6
No symbol table info available.
#4 0x00007f7ccea64276 in free () from /lib/libc.so.6
No symbol table info available.
#5 0x00000000004faae4 in LogVMessageVerb (type=X_INFO, ve...

Read more...

Matt Zimmerman (mdz) wrote :

This is still happening to me once out of every few resumes.

Bryce Harrington (bryce) on 2009-03-05
description: updated
Steve Langasek (vorlon) wrote :

Marked my bug (332347) as a duplicate of this one, and targeting to jaunty.

Martin Pitt (pitti) on 2009-03-06
Changed in xorg-server:
assignee: nobody → bryce
assignee: bryce → bryceharrington
Steve Langasek (vorlon) wrote :

Was going to bisect, then noticed that the first xserver-xorg-video-intel version where I started seeing the problem had a changelog entry describing a single fix related to switching video output - which is precisely what appears to trigger the bug for me. Have downgraded to 2:2.6.1-1ubuntu1, and so far X is stable for me again, in the face of being exercised with lid open/close, suspend/resume, toggling video output, etc.

No clue why that small patch from -1ubuntu2 is triggering heap corruption - clearly the corruption is happening in some other part of the code. But I'm about 98% certain that this is the change that introduced the problem. I'll continue to run with this version for the time being and try to trigger the crash in the usual ways (you know, suspend/resume, sneezing, etc).

Bryce Harrington (bryce) wrote :

Thanks for narrowing it to that. Given that it's been 5 days since the last comment, am I correct in assuming with that version reverted, that you're no longer seeing the issue?

Steve Langasek (vorlon) wrote :

Yes, this bug has not recurred since I downgraded.

Matt Zimmerman (mdz) wrote :

The changes in 2:2.6.1-1ubuntu2 (http://launchpadlibrarian.net/22429963/xserver-xorg-video-intel_2%3A2.6.1-1ubuntu1_2%3A2.6.1-1ubuntu2.diff.gz) seem so innocuous, I wonder if there isn't some secondary effect here.

Maybe a toolchain problem? Does recompiling 2:2.6.1-1ubuntu1 still produce a working binary?

Steve Langasek (vorlon) wrote :

I've rebuilt 1ubuntu1 here with current toolchain and am running it now. So far, the hokey pokey routine hasn't caused it to crash, but it's also only been an hour.

Bryce Harrington (bryce) wrote :

Fwiw, the change in 1u2 has been in the upstream git tree as well for some time now, and none of the referenced upstream bug #'s have further chatter.

Matt Zimmerman (mdz) wrote :

Suspecting a possible toolchain bug, I reviewed the relevant gcc changes:

https://edge.launchpad.net/ubuntu/+source/gcc-4.3

It looks like -1ubuntu2 will have been built prior to:

gcc-4.3 (4.3.3-3ubuntu4) jaunty; urgency=low

  * Fix PR target/39118 (wrong code on x86_64 with -O2 -fomit-frame-pointer).

 -- Matthias Klose < <email address hidden>> Wed, 11 Feb 2009 23:34:36 +0100

Note that the description of the bug is incorrect, based on the comments upstream: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=39118

According to the upstream bug, this "can cause subtle and unpredictable bugs depending upon when the kernel interrupts this code."

Steve, if you don't have any problems with you current test case, I think the next thing to try is a recompile of -1ubuntu5 with the current toolchain.

On Fri, Mar 13, 2009 at 09:16:54AM -0000, Matt Zimmerman wrote:
> Steve, if you don't have any problems with you current test case, I
> think the next thing to try is a recompile of -1ubuntu5 with the current
> toolchain.

Well, the problem is present in the -1ubuntu4 build also, and the build log
for that version (built on Mar 2) shows that gcc-4.3 4.3.3-5ubuntu1 was used
for the build, postdating the fix in 4.3.3-3ubuntu4.

I was guessing that this was simply a case of the seemingly-innocuous change
exposing a bug elsewhere in the code, possibly a bug that upstream has
already fixed. I have it on my todo list to build the latest upstream and
see if the bug is reproducible there, but I understand this depends on
having a bleeding-edge libdrm first - I was waiting for a new upstream
version to be uploaded to jaunty, but Bryce says even this may not be new
enough. Bryce, do you have a pointer to the upstream git for libdrm?

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

ok, I just had a crash with the rebuilt -1ubuntu1 version on resume from suspend. So it looks as though this might be a bug with the *current* toolchain. I'll try to rebuild the current version of the driver now with the older toolchain instead.

Matthias Klose (doko) wrote :

> According to the upstream bug, this "can cause subtle and unpredictable bugs
> depending upon when the kernel interrupts this code."

yes, the comment was before the second fix was checked in; the 4.3.3-3ubuntu4 upload includes both checkins.

Bryce Harrington (bryce) wrote :

> I was waiting for a new upstream version to be uploaded to jaunty, but Bryce says even this may not be new
enough.

AFAIK, what's in jaunty now should be new enough. I put 2.4.5 in last week. I found there was also a kernel dependency (which unfortunately I think the kernel team just didn't have the time to get to), but I was able to work around it with a tiny (upstream-approved) patch. -intel 2.6.3 is now uploaded as well, so I think everything you need should be in place; let me know if otherwise.

> Bryce, do you have a pointer to the upstream git for libdrm?

I think you probably don't need it, but in case you do, it's here:
http://cgit.freedesktop.org/mesa/drm/

On Thu, Mar 19, 2009 at 01:03:57AM -0000, Bryce Harrington wrote:
> > I was waiting for a new upstream version to be uploaded to jaunty, but Bryce says even this may not be new
> enough.

> AFAIK, what's in jaunty now should be new enough. I put 2.4.5 in last
> week. I found there was also a kernel dependency (which unfortunately I
> think the kernel team just didn't have the time to get to), but I was
> able to work around it with a tiny (upstream-approved) patch. -intel
> 2.6.3 is now uploaded as well, so I think everything you need should be
> in place; let me know if otherwise.

Ok. FWIW, the 2.6.1 package now appears to FTBFS against current jaunty
(QUIRK_BROKEN_ACPI_LID?), so I've built 2.6.3 instead using the old
toolchain and will test it out this evening once I stop needing X to be in
working shape to allow me to write email.

Thanks,
--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Okay, since it's sounding like the issue here is outside my normal purview I'm de-assigning myself this bug. Sounds like it may not be -intel either, but will leave it to someone else to assign it to the correct package. I'll stay subscribed to follow the progress though, so if there's anything I need to do for this one, let me know.

Changed in xserver-xorg-video-intel (Ubuntu Jaunty):
assignee: bryceharrington → nobody
Steve Langasek (vorlon) wrote :

After upgrading to xserver-xorg-video-intel 2:2.6.3-0ubuntu1 (the version from the archive, not one built with an older toolchain), I have so far been unable to reproduce the original crash. Matt, are you still seeing this with 2.6.3?

On Fri, Mar 20, 2009 at 08:19:36AM -0000, Steve Langasek wrote:
> After upgrading to xserver-xorg-video-intel 2:2.6.3-0ubuntu1 (the
> version from the archive, not one built with an older toolchain), I have
> so far been unable to reproduce the original crash. Matt, are you still
> seeing this with 2.6.3?

I have had several unexplained X server crashes this week, but I'm not sure
when I upgraded to 2:2.6.3-0ubuntu1. I upgraded to 2:2.6.3-0ubuntu2
yesterday and am presently running with that.

--
 - mdz

I have just confirmed that my X server still crashes with 2:2.6.3-0ubuntu2. The usual scenario is:

1. I watch a DVD using totem-xine
2. I close totem-xine
3. I suspend the laptop using Fn+F4
4. I go to sleep
5. In the morning, I resume the laptop using the power button
6. The screen flickers and I am presented with the gdm login screen, rather than my session (with the old X server having mysteriously vanished)

Today, I ran a suspend/resume stress test using the script in checkbox, but it completed successfully with no crashes. I've also tried suspending the system after running totem-xine (which works reliably), and even while totem-xine is running (the latter fails to go to sleep at all, but gracefully). I am still looking for a way to trigger this crash which does not require waiting overnight. Any guesses or suggestions would be appreciated.

Matt Zimmerman (mdz) wrote :

I should also mention that my usual setup is to use an external VGA display, with the internal LCD switched off using xrandr.

Nafallo Bjälevik (nafallo) wrote :

 Hi,

I seem to have what Matt is describing in his last step (6). I'm not watching movies and sleeping however, I rather work away with gajim, terminator, firefox and thunderbird opened. Suspend by closing the lid and then wake up by opening it in the next data centre (after travelling). I've been greeted by the GDM login screen the last few days.

00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)

And I'm seeing this in dmesg on VT switches:
[50311.489997] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0
[50317.837740] [drm:i915_get_vblank_counter] *ERROR* trying to get vblank count for disabled pipe 0

Nafallo Bjälevik (nafallo) wrote :

I should also mention this crash isn't always the case. Haven't really got any good statics on how often it happens either.

Matt Zimmerman (mdz) wrote :

Thanks to a patch from Kees Cook, users running current Jaunty should be able to tell if they have this bug by checking /var/log/gdm:

xorg-server (2:1.6.0-0ubuntu4) jaunty; urgency=low

  * Add 168_glibc_trace_to_stderr.patch:
    - Catch glibc internal abort traces on stderr instead of to the
      controlling terminal (LP: #345748).

 -- Kees Cook <email address hidden> Thu, 19 Mar 2009 22:39:33 -0700

Matt Zimmerman (mdz) wrote :

I'm definitely still seeing this with:

linux-image-2.6.28-11-generic 2.6.28-11.36
xserver-xorg-core 2:1.6.0-0ubuntu4
xserver-xorg-video-intel 2:2.6.3-0ubuntu2

Thanks to Kees' patch, I have the latest stack trace in the log (which is slightly different from the one I saw before):

*** glibc detected *** /usr/X11R6/bin/X: free(): invalid next size (fast): 0x000000001d2ca290 ***
======= Backtrace: =========
/lib/libc.so.6[0x7fae1c24fcb8]
/lib/libc.so.6(cfree+0x76)[0x7fae1c252276]
/usr/X11R6/bin/X(LogVMessageVerb+0x104)[0x4fb964]
/usr/X11R6/bin/X(xf86MsgVerb+0x8f)[0x48f26f]
/usr/X11R6/bin/X[0x4708a6]
/usr/X11R6/bin/X(xf86DeleteInput+0x1a)[0x491b3a]
/usr/X11R6/bin/X(DeleteInputDeviceRequest+0xf2)[0x496872]
/usr/X11R6/bin/X[0x46bf5e]
/usr/X11R6/bin/X[0x46c037]
/usr/lib/libhal.so.1[0x7fae1d3fcee9]
/lib/libdbus-1.so.3(dbus_connection_dispatch+0x33b)[0x7fae1d1c290b]
/lib/libdbus-1.so.3[0x7fae1d1c2ccb]
/usr/X11R6/bin/X[0x46b2fb]
/usr/X11R6/bin/X(WakeupHandler+0x4b)[0x451ebb]
/usr/X11R6/bin/X(WaitForSomething+0x1ef)[0x4ef41f]
/usr/X11R6/bin/X(Dispatch+0x80)[0x44e070]
/usr/X11R6/bin/X(main+0x3bd)[0x433ddd]
/lib/libc.so.6(__libc_start_main+0xe6)[0x7fae1c1f65a6]
/usr/X11R6/bin/X[0x433269]

Nafallo Bjälevik (nafallo) wrote :
Steve Langasek (vorlon) wrote :

Here's a recipe for running your X server under valgrind that might be helpful.

$ sudo dpkg-divert --local --rename --divert /usr/bin/X.valgrind-madness /usr/bin/X
$ sudo tee /usr/bin/X
#!/bin/sh
exec valgrind --error-limit=no --log-file=/var/log/Xorg-valgrind.log X.valgrind-madness "$@"
^D
$ sudo chmod a+x /usr/bin/X
$

I think that's sufficient to get X started with the right permissions from gdm. If you're starting it by some other method that depends on X's suid bit, you'll probably need to set valgrind itself suid root (unfortunately) to get the right permissions available.

To unwind, run:

$ sudo rm /usr/bin/X
$ sudo dpkg-divert --remove --local --rename /usr/bin/X
$

Matt Zimmerman (mdz) wrote :

Has anyone ever seen this happen outside of xf86Wakeup()?

It's curious that we always crash there, even though the point of failure is generally in the logging functions (which are called many other times). That points to the corruption happening sometime shortly before.

Luka Renko (lure) wrote :

I am having the same problem on up-to-date Kubuntu on x200s with Intel graphics. If suspend is done by PowerDevil (KDE4 power manager), either by closing lid or selecting Suspend from menu, I get KDM login screen after resume.

If I start suspend from command line (sudo pm-suspend), resume works and I get my KDE session back after resume.

Luka Renko (lure) wrote :

I get this error in /var/log/kdm.log:

X: ../../src/i830_batchbuffer.h:78: intel_batch_emit_dword: Assertion `pI830->batch_ptr != ((void *)0)' failed.

On Mon, Mar 23, 2009 at 08:47:59PM -0000, Luka Renko wrote:
> I get this error in /var/log/kdm.log:
>
> X: ../../src/i830_batchbuffer.h:78: intel_batch_emit_dword: Assertion
> `pI830->batch_ptr != ((void *)0)' failed.

You are experiencing a different bug; please file it separately with
"ubuntu-bug xserver-xorg-video-intel"

--
 - mdz

Matt Zimmerman (mdz) wrote :

I've just seen it again, with a backtrace similar to my previous one (i.e. DeleteInputDeviceRequest rather than i830_bind_memory). I don't understand why libdbus and libhal are in the stack trace, though; are these callbacks?

*** glibc detected *** /usr/X11R6/bin/X: free(): invalid next size (fast): 0x0000000005f124e0 ***
======= Backtrace: =========
/lib/libc.so.6[0x7f096ada6cb8]
/lib/libc.so.6(cfree+0x76)[0x7f096ada9276]
/usr/X11R6/bin/X(LogVMessageVerb+0x104)[0x4fb964]
/usr/X11R6/bin/X(xf86MsgVerb+0x8f)[0x48f26f]
/usr/X11R6/bin/X[0x4708a6]
/usr/X11R6/bin/X(xf86DeleteInput+0x1a)[0x491b3a]
/usr/X11R6/bin/X(DeleteInputDeviceRequest+0xf2)[0x496872]
/usr/X11R6/bin/X[0x46bf5e]
/usr/X11R6/bin/X[0x46c037]
/usr/lib/libhal.so.1[0x7f096bf53ee9]
/lib/libdbus-1.so.3(dbus_connection_dispatch+0x33b)[0x7f096bd1990b]
/lib/libdbus-1.so.3[0x7f096bd19ccb]
/usr/X11R6/bin/X[0x46b2fb]
/usr/X11R6/bin/X(WakeupHandler+0x4b)[0x451ebb]
/usr/X11R6/bin/X(WaitForSomething+0x1ef)[0x4ef41f]
/usr/X11R6/bin/X(Dispatch+0x80)[0x44e070]
/usr/X11R6/bin/X(main+0x3bd)[0x433ddd]
/lib/libc.so.6(__libc_start_main+0xe6)[0x7f096ad4d5a6]
/usr/X11R6/bin/X[0x433269]

description: updated
Matt Zimmerman (mdz) wrote :

I'm now running with valgrind, using a slightly modified version of Steve's script:

perseus:[~] cat /usr/bin/X
#!/bin/sh
savelog -c 7 /var/log/Xorg-valgrind.log
exec valgrind --error-limit=no --log-file=/var/log/Xorg-valgrind.log X.valgrind-madness "$@"

I tried a single suspend/resume cycle and it did not turn up any corruption. I will leave it until the next time it crashes and see if it detects anything.

Matt Zimmerman (mdz) wrote :

I just saw this bug in a new circumstance: during installation using Ubiquity on an Acer Aspire One with the following graphics chipset:

00:02.0 VGA compatible controller: Intel Corporation Mobile 945GME Express Integrated Graphics Controller (rev 03)
00:02.1 Display controller: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller (rev 03)

This is the first time I've confirmed it without a suspend/resume.

Matt Zimmerman (mdz) wrote :

The previous comment also means that this is not limited to 64-bit, which was also common in the other reports.

description: updated
summary: - *** glibc detected *** free(): invalid next size (fast) for xf86Wakeup()
- call
+ X server crash: *** glibc detected *** free(): in valid next size (fast)
Matt Zimmerman (mdz) wrote :

I just realized a new element that all of these scenarios have in common: the screensaver.

The screensaver is activated when coming back from resume.

The crash during installation happened at just about the point where the screensaver activated due to an activity timeout (I noticed during my second attempt that it activated at just about the point where it had crashed).

This would potentially explain why we've been unable to trigger the bug with suspend/resume stress testing, because there's no screensaver timeout.

This is just a guess at this point, but I think the coincidence justifies a closer look.

Chris Jones (cmsj) wrote :

Matt: if it is screensaver related, it's probably not because of some 3D parts - I have a blank screen and am seeing this, although I've yet to find a log of it (perhaps just because of a lack of retention of Xorg logs for very long, which is something I filed separately a while ago as Bug #274870 ).
It's odd that it doesn't happen more repeatably though, activating the screensaver happens every time I suspend, but I only get a crash on resume every couple of days.

Robert Jordens (jordens) wrote :

I get this crash on resume from time to time. But I also get a crash with the same symptoms on switching between the internal LVDS and an external VGA with xrandr but without a suspend/resume.

Tormod Volden (tormodvolden) wrote :

I got suspicious when I saw LogVMessageVerb in the backtrace, which I have touched recently myself... Sure enough I did not think about the possibility of clocks going backwards (which they sometimes do after resume) when I worked on the timestamp stuff and you end up with nice timestamps (like below) with no bounds checking. The tmpBuf is overflowed and free()'ing it messes up. Sorry.

[-1009303149.602078] (II) AIGLX: Suspending AIGLX clients for VT switch

The timestamp patch was supposed to be taken away before Beta anyway. Bryce, can you please remove it now, and I will fix the timestamp again for later.

On Tue, Mar 24, 2009 at 12:27:07PM -0000, Tormod Volden wrote:
> I got suspicious when I saw LogVMessageVerb in the backtrace, which I
> have touched recently myself... Sure enough I did not think about the
> possibility of clocks going backwards (which they sometimes do after
> resume) when I worked on the timestamp stuff and you end up with nice
> timestamps (like below) with no bounds checking. The tmpBuf is
> overflowed and free()'ing it messes up. Sorry.
>
> [-1009303149.602078] (II) AIGLX: Suspending AIGLX clients for VT switch

Thanks very much! I was getting pretty worried about this and hadn't been
able to track it down yet.

> The timestamp patch was supposed to be taken away before Beta anyway.
> Bryce, can you please remove it now, and I will fix the timestamp again
> for later.

Wouldn't it be better to fix the bug than to remove the patch? The
timestamps are useful.

--
 - mdz

Matt Zimmerman (mdz) wrote :

Assigning back to Bryce as this has turned out to be a server bug after all

Changed in xserver-xorg-video-intel (Ubuntu Jaunty):
assignee: nobody → bryceharrington
Tormod Volden (tormodvolden) wrote :

The timestamps are useful for testing, but we do not want to use a different log format than upstream (for upstreaming of bug reports, automatic processing of logs, googling etc).

I'll fix the bug anyway, we can use the timestamping in our development versions.

Tormod Volden (tormodvolden) wrote :

Here is a debdiff that disables the timestamp patch for now.

Tormod Volden (tormodvolden) wrote :

This alternative debdiff also fixes the timestamping, in that it deals with negative time deltas or large positive ones.

Tormod Volden (tormodvolden) wrote :

Changed it a bit so it resets the stopwatch when the time has gone backwards.

Bryce Harrington (bryce) wrote :

Interesting, it seems my initial analysis in comment #5 was not far off.

For now I've opted to drop the patch with tormod's debdiff in comment #51. I've uploaded this, but defer to slangasek's discretion in including it for beta. (I favor dropping it in beta myself but know that time is short, and while the crashes are serious and may lead to some extraneous bug reports during testing, I don't think it's going to critically inhibit testing activity.) If it doesn't go in for beta, it will go in post-beta.

Meanwhile, I think the patch should be changed to use safer string handling code. snprintf() has been suggested.

mdz has suggested this patch would be useful to include in the release. However, I have several concerns before doing this. First, we've been operating under the assumption that this was just test scaffolding code that would be dropped before the release, so it has not received the level of review that it needs. Second, as they say, "bugs tend to live in clusters", and since we've found several bugs in this patch to date I think it would be safe to assume there are more in there we've not yet found. Third, this patch has been beneficial for debugging purposes during development, but I'm uncertain if that need post-release is as strong; as tormod mentions in comment #50 there are some tradeoffs if we keep it. Possibly this could be addressed best by making it an xorg.conf togglable option.

mdz also suggested it should be posted upstream, which I think would net useful feedback for improving the patch. I would like to wait on doing this until the patch has been reviewed and updated to fix the aforementioned problems we already know about. I can't guess whether or not they will take the patch, but if nothing else perhaps there will be suggestions for improving it that we'd benefit from.

Meanwhile, it's curious that the code is getting negative timestamps. That secondary issue sounds like it may be worth some additional analysis. It sounds to me like a pointer is turning up uninitialized after resume, or there is an unsigned/signed or long/int type conversion going on. Either way, that could hint at further bugs that need to be resolved.

Bryce Harrington (bryce) wrote :

<kees> bryce: this updated patch for 160 will avoid the overflow creep: http://pastebin.osuosl.org/25076
<kees> bryce: but I don't think upstream will take it, due to the asprintf use.

Kees Cook (kees) wrote :

With whitespace intact...

For note, the "%5d" is the problem -- it can grow beyond 5 digits. I recommend asprintf, since it will calculate lengths automatically. If not, please always use snprintf.

Steve Langasek (vorlon) wrote :

On Tue, Mar 24, 2009 at 08:15:56PM -0000, Bryce Harrington wrote:
> Interesting, it seems my initial analysis in comment #5 was not far off.

> For now I've opted to drop the patch with tormod's debdiff in comment
> #51. I've uploaded this, but defer to slangasek's discretion in
> including it for beta. (I favor dropping it in beta myself but know
> that time is short, and while the crashes are serious and may lead to
> some extraneous bug reports during testing, I don't think it's going to
> critically inhibit testing activity.) If it doesn't go in for beta, it
> will go in post-beta.

Either we force retesting of all the ISOs, or we have to advise users to
upgrade X before they can submit meaningful bug reports. If the crashes
only happen once the timestamp overflows (~28h), it's highly unlikely that
this will impact liveCD users; users installing should be tracking the
post-beta updates anyway, and can be explicitly instructed to do so for the
X server as part of the beta errata.

So I'm opting not to reroll the CDs for this, but this change is on the
watch list for things to include if we do have to reroll.

> Meanwhile, it's curious that the code is getting negative timestamps.

Are we sure that it is, rather than simply overflowing the buffer due to
large positive timestamps?

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Matt Zimmerman (mdz) wrote :

Marking as Fix Committed, as slangasek said there is an upload in the queue for post-beta. Perhaps Bryce could paste the changelog entry here for reference

Changed in xorg-server (Ubuntu Jaunty):
status: Triaged → Fix Committed
Martin Pitt (pitti) wrote :

For the record, this is
http://launchpadlibrarian.net/24314246/xorg-server_1.6.0-0ubuntu5_source.changes

 xorg-server (2:1.6.0-0ubuntu5) jaunty; urgency=low
 .
   [Tormod Volden]
   * Disable 160_log_timestamping.patch before the beta, this was only used
     for testing, and the patch also has a serious stack corruption bug.
     (LP: #328035)
 .
   [Bryce Harrington]
   * Disable 999_default_modedebug_on.patch before the beta, this was used
     only for testing.
   * Add 169_mipointer_nullptr_checks.patch:
     - MIPOINTER() can return NULL in some circumstances, so always check
       its return value before dereferencing.
     - Fixes crash with keyrepeat on non-primary screen on -nvidia when
       using Xinerama (and probably other serious mipointer-related crashes)
       (LP: #324465)
   * Add 170_primary_pci_video_device.patch:
     - Patch from redhat
     - Fixes failure when multiple video cards are present. Xserver currently
       exits with an error "(EE) No device detected" in such cases.
       (LP: #267241)

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package xorg-server - 2:1.6.0-0ubuntu5

---------------
xorg-server (2:1.6.0-0ubuntu5) jaunty; urgency=low

  [Tormod Volden]
  * Disable 160_log_timestamping.patch before the beta, this was only used
    for testing, and the patch also has a serious stack corruption bug.
    (LP: #328035)

  [Bryce Harrington]
  * Disable 999_default_modedebug_on.patch before the beta, this was used
    only for testing.
  * Add 169_mipointer_nullptr_checks.patch:
    - MIPOINTER() can return NULL in some circumstances, so always check
      its return value before dereferencing.
    - Fixes crash with keyrepeat on non-primary screen on -nvidia when
      using Xinerama (and probably other serious mipointer-related crashes)
      (LP: #324465)
  * Add 170_primary_pci_video_device.patch:
    - Patch from redhat
    - Fixes failure when multiple video cards are present. Xserver currently
      exits with an error "(EE) No device detected" in such cases.
      (LP: #267241)

 -- Bryce Harrington <email address hidden> Tue, 24 Mar 2009 09:57:17 -0700

Changed in xorg-server:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.