Bug #111869 “gdb screws stacktraces when no debuginfo is present...” : Bugs : gdb package : Ubuntu

Revision history for this message

Sitsofe Wheeler (sitsofe) wrote on 2007-05-03:

#1

Thank you for your bug report.

Roman:
To the best of my knowledge this has always been the case. Without the symbols I guess it is hard to know exactly where stack frames start and stop and I guess if gdb gets one offset wrong that's pretty much it. As for the binary driver situation... well that's the way it goes.

Roman Kennke (roman-kennke) wrote on 2007-05-03:

#2

Thank you. I think you are right. I thought I remembered such a case showing somewhat better results, but that might have been luck, or maybe it wasn't the debuginfo-lacking binary at the stack top. Please close this bug then.

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-15:

#3

So - gdb is working less well than I remember it working in the past, and less well than it could be hoped I think :-)

This is really an analog of #339730, to which I will also attach my test case.

Suffice it to say that most apps that I connect gdb to fail to give a useful stack trace, unless I have full debuginfo installed for all the packages working from the bottom to the top of the stack.

As a banal example here is console kit:

(gdb) bt
#0 0xffffe430 in __kernel_vsyscall ()
#1 0xb7e163f9 in ioctl () from /lib/libc.so.6
#2 0x0805b810 in ?? ()
#3 0x0000000a in ?? ()
#4 0x00005607 in ?? ()
#5 0x00000030 in ?? ()
#6 0x00000030 in ?? ()
#7 0xb7f71770 in g__g_thread_lock () from /usr/lib/libglib-2.0.so.0
#8 0xb7f90ff4 in ?? () from /lib/libpthread.so.0
#9 0xb7f71770 in g__g_thread_lock () from /usr/lib/libglib-2.0.so.0
#10 0x08062444 in ?? ()
#11 0x08062420 in ?? ()
#12 0x08062513 in ?? ()
#13 0xb7a50b58 in ?? ()
#14 0xb7f70ff4 in ?? () from /usr/lib/libglib-2.0.so.0

now I install ConsoleKit-debuginfo and try again:

[Switching to thread 20 (Thread 0xb7a50b90 (LWP 2253))]#0 0xffffe430 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe430 in __kernel_vsyscall ()
#1 0xb7e163f9 in ioctl () from /lib/libc.so.6
#2 0x0805b810 in ck_wait_for_active_console_num (console_fd=10, num=48) at ck-sysdeps-unix.c:266
#3 0x080518a2 in vt_thread_start (data=0x8076918) at ck-vt-monitor.c:322
#4 0xb7f2039f in g_thread_create_proxy (data=0x8076928) at gthread.c:635
#5 0xb7f82175 in start_thread (arg=0xb7a50b90) at pthread_create.c:297
#6 0xb7e1ddde in clone () from /lib/libc.so.6

Same running process - same system, no other debuginfo installed [ NB. I have debuginfo for libc & gthread ] so ...

Anyhow - I created a more minimal test case, which I'll attach.

It would be -extremely- useful wrt. debugging things to have a fix here :-) [ also for SLED ]

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-15:

#4

Created an attachment (id=215511)
gdb test-case: run 'make'

I run make here and see:

(gdb) bt
#0 0xffffe430 in __kernel_vsyscall ()
#1 0x400db0f0 in __nanosleep_nocancel () from /lib/libc.so.6
#2 0x400daefe in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#3 0x0804858b in ?? ()
#4 0x00000000 in ?? ()

I would expect gdb to be able to wander back up the stack frames giving at least vaguely intelligent information for each frame - in particular for trace_two - which should have good debuginfo.

Note - if I remove the 'strip ./a.out' from the Makefile I get:

(gdb) bt
#0 0xffffe430 in __kernel_vsyscall ()
#1 0x400db0f0 in __nanosleep_nocancel () from /lib/libc.so.6
#2 0x400daefe in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#3 0x0804858b in trace_zero ()
#4 0x080485a5 in trace_one ()
#5 0x4001e4ac in trace_two (fn=0x804858d <trace_one>) at two.c:8
#6 0x080485b9 in trace_three ()
#7 0x080485d1 in main ()

which is much more like what I would expect to get.

HTH.

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-16:

#5

If you strip you don't have any debuginfo.

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-16:

#6

Naturally - however apparently we ship our packages with stripped libraries in them: and this causes gdb to become -far- less useful wrt. debugging almost anything: worse it's a regression - this used to work quite well.

Surely there is no conceptual reason why we can't walk back up the un-optimised [!] frame-pointer containing [!] code and give at least useful output where we have debuginfo; I would accept:

#0 0xffffe430 in __kernel_vsyscall ()
#1 0x400db0f0 in __nanosleep_nocancel () from /lib/libc.so.6
#2 0x400daefe in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:138
#3 0x0804858b in ??
#4 0x080485a5 in ??
#5 0x4001e4ac in trace_two (fn=0x804858d <trace_one>) at two.c:8
#6 0x080485b9 in ??
#7 0x080485d1 in main ()

but not giving up as we do at frame #4.

Really - having a system that is extremely hard to debug, and requires the installation of tons of mostly redundant debuginfo packages makes life rather harder than it should be [ not to mention the horrible truncation of the trace ]

Can you re-consider the priority change ? the evo. guys complain like mad about this on SLED10 - it makes their lives hell wrt. debugging.

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-16:

#7

Since there is no symbol at PC there is no way to find the beginning of the function for prologue analysis.

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-16:

#8

So - let me get this right; are you stating that:

* in order to generate a useful backtrace - in future it is always going to be necessary to install debuginfo packages from the bottom of the trace upwards ?

That would suck some big rocks and was historically not the case.

Also - it would be interesting to know how valgrind [ if you replace the sleep() with (*(int*)0) = 0; ] manages to generate a useful debug log:

==21438== Process terminating with default action of signal 11 (SIGSEGV)
==21438== Access not within mapped region at address 0x0
==21438== at 0x8048584: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x80485AF: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x40294AB: trace_two (two.c:8)
==21438== by 0x80485C3: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x80485DB: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x40655F5: (below main) (libc-start.c:220)

ie. much as my suggested trace above. how does valgrind do it ? [ cf. coregrind/m_stacktrace.c (get_StackTrace_wrk) ] - it certainly doesn't give up almost immediately and actually gets to 'main' :-)

Why is it that we cannot simply follow the %ebp chain up and get a load of values for IPs for each function call point & make some educated guess (as valgrind does) ?

I've tentatively re-opened - if I'm just being totally dim witted here :-) please do re-close, but I would really like to better understand the necessity of prologue analysis when the code is compiled on IA32 with -O0 and no sillies (eg. -fomit-frame-pointer) - valgrind's backtrace appears rather simpler & more robust.
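
As a rough illustration of the %ebp chain walk asked about above (this is not valgrind's or gdb's actual code), the following C sketch follows saved frame pointers through a simulated i386 stack. It assumes every function on the stack was compiled with a frame pointer, so that the word at %ebp holds the caller's %ebp and the word at %ebp+4 holds the return address. The names sim_stack, sim_read and walk_ebp_chain are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A simulated 32-bit stack: words[i] lives at address base + 4*i. */
struct sim_stack {
    uint32_t base;
    const uint32_t *words;
    size_t nwords;
};

/* Read one word; returns 0 on success, -1 if addr is misaligned or outside. */
static int sim_read(const struct sim_stack *s, uint32_t addr, uint32_t *out)
{
    if (addr < s->base || (addr - s->base) % 4 != 0)
        return -1;
    size_t idx = (addr - s->base) / 4;
    if (idx >= s->nwords)
        return -1;
    *out = s->words[idx];
    return 0;
}

/*
 * Follow the saved-%ebp links starting at ebp.
 * Assumed frame layout: [ebp] = caller's %ebp, [ebp+4] = return address.
 * Stores up to max return addresses in pcs and returns how many were found.
 */
size_t walk_ebp_chain(const struct sim_stack *s, uint32_t ebp,
                      uint32_t *pcs, size_t max)
{
    size_t n = 0;
    while (n < max) {
        uint32_t saved_ebp, ret;
        if (sim_read(s, ebp, &saved_ebp) != 0 ||
            sim_read(s, ebp + 4, &ret) != 0)
            break;              /* ran off the readable stack */
        pcs[n++] = ret;
        if (saved_ebp <= ebp)   /* the chain must move to higher addresses */
            break;
        ebp = saved_ebp;
    }
    return n;
}
```

Each collected return address could then be looked up in whatever symbols or debuginfo happen to exist, leaving "??" only for the frames that lack them.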

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-16:

#9

A debugger absolutely needs more than just a frame chain. There is no way this is going to work with all that missing information.

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-16:

#10

So - to try to expand and elaborate when you write the above what I hear is:

   "yes, we could provide a much more useful stack trace to the poor person
    trying to debug something - but we're not going to: because we want
    to show something else much less useful" ;-)

or do I mistake things ? ;-) valgrind can provide a useful trace for a poor hacker trying to find what went wrong and where; but gdb is committed to not do so ? just for reference since we have the valgrind trace above - let's see what gdb does when we get the same segv trapped:

Program received signal SIGSEGV, Segmentation fault.
0x08048584 in ?? ()
(gdb) bt
#0 0x08048584 in ?? ()
#1 0x080486b0 in _IO_stdin_used ()
#2 0x00000001 in ?? ()
#3 0x00000025 in ?? ()
#4 0xb7fec560 in ?? () from /lib/libc.so.6
#5 0x00000027 in ?? ()
#6 0xb7eac6c0 in ?? ()
#7 0xbf930898 in ?? ()
#8 0x080485b0 in ?? ()
#9 0x00000001 in ?? ()
#10 0x00000003 in ?? ()
#11 0xbf9308b8 in ?? ()
#12 0xb80134ac in trace_two (fn=0x1) at two.c:8
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Are you suggesting that this is a more useful view than:

==21438== at 0x8048584: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x80485AF: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x40294AB: trace_two (two.c:8)
==21438== by 0x80485C3: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x80485DB: (within /home/michael/gdb-testcase/a.out)
==21438== by 0x40655F5: (below main) (libc-start.c:220)

[ though I guess, due to some fluke of parameters passed (no 0's) we managed at least to get nice line number information for trace_two ].

If this is more useful, what is it ? :-) [ I assume it's just an unwind of the stack, frame by frame printed in hex with some guesses as to function addresses - until we hit a NULL - but how useful is that really honestly ? vs. being able to tell what was called from where ].

Or - are you suggesting that we shouldn't strip any of our binaries as we ship them - so we can get stack traces when things fail ? or ...

In Novell/SUSE Bugzilla #390722, Federico Mena-Quintero (federico-novell) wrote on 2008-05-16:

#11

I agree with Michael. Not being able to get at least an initial clue of where a program is crashing is *massively* inconvenient.

Bug reports then become useless, as everyone files stack traces that are full of NULLs. Then you must play Bugzilla ping-pong to get any useful information from users.

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-16:

#12

So - how about some hybrid - I assume the gdb developers think people want to see a raw dump of the stack, looking like stack frames; but why not interleave that - we may not have debug data - but we can guess which are the IP frames by following the %ebp chain; so something like this could be possible (?)

from 0x08048584: (within /home/michael/gdb-testcase/a.out)
0x080486b0 0x00000001 0x00000025 0xb7fec560 0x00000027 0xb7eac6c0 0xbf930898
from 0x080485AF: (within /home/michael/gdb-testcase/a.out)
0x080485b0 0x00000001 0x00000003 0xbf9308b8
from 0x040294AB: trace_two (two.c:8)
from 0x080485C3: (within /home/michael/gdb-testcase/a.out)
...
==21438== by 0x40655F5: (below main) (libc-start.c:220)

or something ?

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-19:

#13

Not a bug.

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-05-19:

#14

Of course it's a bug, at least a quality of implementation one. If valgrind
can do it, so can gdb. Even more so, because gdb _did_ do this in the past.
It certainly is a regression in gdb's stack walker, and you should try to find
out what caused it.

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-19:

#15

It's not going to work without that missing information.

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-05-19:

#16

I assume you haven't even looked at the issue. valgrind clearly _can_
provide a more useful backtrace, I'm not sure why you're ignoring this fact.
Hence a %ebp chain is in place, which gdb can use to skip frames where no
debug info exists. Currently gdb is so heavily confused if even _one_
intermediate frame has no debug info, that even higher frames that _do_
have debug info are not parsed anymore. That's the bug, gdb clearly can
do exactly the same as valgrind and use %ebp to skip the frame.
That would be much better than what it currently does. gdb clearly tries
to do something: it walks the stack somewhat, it just is heavily confused
by the addresses. If it absolutely would have to give up, then gdb
should just say so, instead of running wild in memory.

And the claim is, that gdb did this some time ago. At least earlier versions
were more useful even in absence of debug information, that's nothing you
can discuss away by claiming "it's impossible". It is possible, and this
bug report is a request to make it happen again.

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-19:

#17

It's not going to work without that missing information.

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-05-19:

#18

valgrind can do it, hence can gdb. If you think it can't, explain why.

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-19:

#19

It's not going to work without that missing information.

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-05-19:

#20

valgrind can do it, hence can gdb. If you think it can't, explain why.
Or better, try to fix gdb.

In Novell/SUSE Bugzilla #390722, Schwab-novell (schwab-novell) wrote on 2008-05-19:

#21

It's not going to work without that missing information.

In Novell/SUSE Bugzilla #390722, Lpechacek (lpechacek) wrote on 2008-05-20:

#22

Andreas, I don't think this bug should be closed as invalid. In principle, the presence of the frame pointer enables the debugging tool to walk the call chain up to main() without problems.

Moreover another debugging tool can do it. Why can't GDB? Please, be so kind as to elaborate, or at least don't close this bug as invalid. Thanks.

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-20:

#23

As another data-point for "It's not going to work without that missing information", here is gdb doing really rather well having jumped a load of mangled stack frames to a frame where there is debuginfo:

(gdb) bt
#0 0x08048584 in ?? ()
#1 0x080486b0 in _IO_stdin_used ()
#2 0x00000001 in ?? ()
#3 0x00000025 in ?? ()
#4 0x40183560 in ?? () from /lib/libc.so.6
#5 0x00000027 in ?? ()
#6 0x401876c0 in ?? ()
#7 0xbff1ee18 in ?? ()
#8 0x080485b0 in ?? ()
#9 0x00000001 in ?? ()
#10 0x00000003 in ?? ()
#11 0xbff1ee38 in ?? ()
#12 0x4001e4ac in trace_two (fn=0x1) at two.c:8
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb) up 12
#12 0x4001e4ac in trace_two (fn=0x1) at two.c:8
8 fn (1, 3);
(gdb) l
3 extern int trace_one (int a, unsigned int b);
4
5 void trace_two (int (*fn) (int, unsigned int))
6 {
7 fprintf (stderr, "this method should have nice debuginfo\n");
8 fn (1, 3);
9 }
(gdb)

Clearly examining (non-)frame #8 doesn't look so good:

#8 0x080485b0 in ?? ()
(gdb) l
Line number 10 out of range; two.c has 9 lines.

but - that's expected of course.

So: it seems that far from being hopeless without the missing information: if we can move into a valid frame higher up the stack that is associated with some debuginfo (and which can easily be determined from %ebp chaining): then we can in fact do some really rather useful debugging :-)

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-20:

#24

So - my interest piqued; I installed SuSE Linux 10.0 to play with: interestingly it is also worse than valgrind - but at least doesn't artificially truncate the stack trace as soon as it feels threatened by a NULL ;-)

Program received signal SIGSEGV, Segmentation fault.
0x0804854a in ?? ()
(gdb) bt
#0 0x0804854a in ?? ()
#1 0xbffa6a78 in ?? ()
#2 0x40080df2 in fwrite () from /lib/tls/libc.so.6
#3 0x08048576 in ?? ()
#4 0x00000001 in ?? ()
#5 0x00000003 in ?? ()
#6 0x4014a6c0 in ?? ()
#7 0x400196ec in ?? () from ./libtwo.so
#8 0x00000000 in ?? ()
#9 0x40015cc0 in ?? () from /lib/ld-linux.so.2
#10 0xbffa6a98 in ?? ()
#11 0x40018577 in trace_two (fn=0x1) at two.c:8
#12 0x40018577 in trace_two (fn=0x8048562 <_init+370>) at two.c:8
#13 0x0804858e in ?? ()
#14 0x08048562 in ?? ()
#15 0x080497f8 in ?? ()
#16 0xbffa6ab8 in ?? ()
#17 0x08048405 in _init ()
#18 0x080485b4 in ?? ()
#19 0x40145ff4 in ?? () from /lib/tls/libc.so.6
#20 0x08048630 in ?? ()
#21 0x00000000 in ?? ()
#22 0x40145ff4 in ?? () from /lib/tls/libc.so.6
#23 0x00000000 in ?? ()
#24 0x40015cc0 in ?? () from /lib/ld-linux.so.2
#25 0xbffa6b28 in ?? ()
#26 0x4003fea0 in __libc_start_main () from /lib/tls/libc.so.6
#27 0x4003fea0 in __libc_start_main () from /lib/tls/libc.so.6
#28 0x08048491 in ?? ()

Of course - the ideal trace would be a lot more like the valgrind trace ;-) but not giving up prematurely would be much appreciated too.

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-22:

#25

I managed to find a friendly gdb hacker (or two), who gave some helpful advice:

<mjw> -fomit-frame-pointer might be standard these days
<daney> Put a breakpoint on all 'call' and 'ret' instructions and record $sp each time you encounter one...
<mjw> Ah, it was the section above that: http://sourceware.org/gdb/current/onlinedocs/gdbint_3.html#SEC8
<mjw> " 3.1 Prologue Analysis " Clever stuff :)
<mjw> Anyway, I think fedora got this right. Just compile everything with -fasynchronous-unwind-tables then you don't need any heuristics.
<mjw> Although I think daney's solution is also pretty slick :)

Now, of course I'm clueless: do we compile currently with -fasynchronous-unwind-tables ?

<aph_> mjw: ah, but we don't emit unwinder data for the prologue, that's the reason gdb can't just use the unwinder data, now I remember
<mjw> Don't settle for -fexceptions! Go for the full monty! -fasynchronous-unwind-tables backtrace from any place!
<aph_> mjw: even with -fasynchronous-unwind-tables we still aren't accurate in the prologue
<mjw> aph_, I thought that was fixed?
<aph_> I thought it was deliberate to save a lot of space
<daney> Many distributions don't have any unwinding data for libraries written in C. This is the 21st. century. We have memory to burn now...
<aph_> I can't imagine why anyone would want to fix it
<mjw> because otherwise unwinding doesn't work reliably
<mjw> I did work on this for frysk. I thought we got it all right.
<aph_> it does, because you never unwind in a prologue
<mjw> I have to get my testcases, but I am pretty sure we have tests for stepping into and out a whole function.
<aph_> but you use the debuginfo, don't you?
* mjw was actually pretty proud that worked
<daney> Really if you are using gdb, you should have accurate .debug_frame data for everything. leave .eh_frame for the runtime unwinder.
<aph_> daney: yeah, exactly
<mjw> aph_, in order eh_frame, debug_frame or heuristics based on peeking at the last few words on the stack and see if that was an address that contained a call instruction.
<daney> Really everyone should be using a processor with fixed size instructions (like mips) where it is possible to do accurate unwinding with almost no meta data.
<mjw> yeah, instruction decoding on x86 is so not fun!

etc. etc.
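
The last-resort heuristic mjw mentions (look at a word on the stack and see whether it points just past a call instruction) might be sketched in C roughly as follows. This is a deliberately loose illustration over a simulated text segment, not frysk's or gdb's code: it only recognises the direct `call rel32` opcode (E8) and the indirect `call r/m32` form (FF with ModRM reg field 2), and it does not check that the instruction length matches the addressing mode, so false positives are possible. The names sim_text, byte_before, is_indirect_call and looks_like_return_address are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* A simulated text segment: bytes[i] lives at address base + i. */
struct sim_text {
    uint32_t base;
    const uint8_t *bytes;
    size_t len;
};

/* Fetch the byte `back` bytes before addr; returns -1 if out of range. */
static int byte_before(const struct sim_text *t, uint32_t addr, uint32_t back)
{
    if (addr < t->base || addr - t->base > t->len || addr - t->base < back)
        return -1;
    return t->bytes[addr - t->base - back];
}

/* True if an FF opcode `back` bytes before addr is followed by a ModRM
 * byte whose reg field is 2, i.e. the CALL r/m32 form. */
static int is_indirect_call(const struct sim_text *t, uint32_t addr,
                            uint32_t back)
{
    int op = byte_before(t, addr, back);
    int modrm = byte_before(t, addr, back - 1);
    return op == 0xFF && modrm >= 0 && ((modrm >> 3) & 7) == 2;
}

/* Guess whether addr could be a return address: does some call
 * instruction appear to end exactly at addr? */
int looks_like_return_address(const struct sim_text *t, uint32_t addr)
{
    /* Possible encoded lengths of CALL r/m32, by addressing mode. */
    static const uint32_t lengths[] = {2, 3, 4, 6, 7};

    if (byte_before(t, addr, 5) == 0xE8)    /* call rel32 is 5 bytes */
        return 1;
    for (size_t i = 0; i < sizeof lengths / sizeof lengths[0]; i++)
        if (is_indirect_call(t, addr, lengths[i]))
            return 1;
    return 0;
}
```

Because x86 instructions are variable length, a byte pattern like this can also occur in the middle of some other instruction, which is presumably part of why the participants rank it below .eh_frame and .debug_frame.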

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-05-23:

#26

Created an attachment (id=217718)
make gdb print %ebp based backtraces

The snippet from the conversation mostly doesn't apply in our cases.
First, due to stripping there is no .debug_frame section anymore, and
without exceptions of course also no .eh_frame (on i386 at least).

Which also means that we can't find the borders of the function a frame
is associated with, and hence can't do any prologue analysis on these
intermediate frames.

(The speculation about -fomit-frame-pointer doesn't apply either, because
we don't compile with that option on i386, otherwise also valgrind would be
lost)

What needs to happen is simply that the fallback unwinder (that gdb
correctly uses if no dwarf debug info exists for the frame at hand)
has to cope with this situation that it doesn't find function borders.
It partly already assumes that it then is a normal %ebp frame, and
if it already assumes so partly, we can also make use of it.

The patch in this attachment does that. It's only for frames where no
debuginfo exists, so it's a strict improvement to the current situation. The
only case where it could break (but no worse than before) is if such a frame
happens to be without a frame pointer, i.e. compiled with
-fomit-frame-pointer. We are lost then; for such frames we have no choice
other than to use debug info.
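
The decision described above can be pictured with a small C sketch. It is not the attached patch and does not use gdb's internal API; it only illustrates the fallback order discussed in this thread (unwind tables first, prologue analysis when a symbol gives the function start, otherwise assume a plain %ebp frame). The names regs, unwind_method, choose_unwinder and unwind_ebp_frame are hypothetical, as is the assumption that the caller has already read the two words stored at %ebp.

```c
#include <stdint.h>

/* Register state for one frame on i386. */
struct regs {
    uint32_t pc;   /* %eip */
    uint32_t sp;   /* %esp */
    uint32_t fp;   /* %ebp */
};

enum unwind_method {
    UNWIND_DEBUGINFO,     /* .debug_frame / .eh_frame covers this pc */
    UNWIND_PROLOGUE,      /* a symbol gives the function start to analyse */
    UNWIND_EBP_FALLBACK   /* nothing known: assume a standard %ebp frame */
};

/* Pick the best available strategy for a frame. */
enum unwind_method choose_unwinder(int has_unwind_info, int has_symbol)
{
    if (has_unwind_info)
        return UNWIND_DEBUGINFO;
    if (has_symbol)
        return UNWIND_PROLOGUE;
    return UNWIND_EBP_FALLBACK;
}

/*
 * Fallback step. at_fp[0] and at_fp[1] are the words stored at [%ebp]
 * and [%ebp+4], i.e. the caller's saved %ebp and the return address.
 */
void unwind_ebp_frame(const struct regs *cur, const uint32_t at_fp[2],
                      struct regs *caller)
{
    caller->fp = at_fp[0];
    caller->pc = at_fp[1];
    caller->sp = cur->fp + 8;   /* just above saved %ebp and return address */
}
```

If the frame was in fact built with -fomit-frame-pointer, the two words read here are arbitrary data and the result is garbage, which is the failure mode the comment above concedes.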

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-05-23:

#27

Packages are available for testing here:
http://www.go-oo.org/~michael/fixed-gdb/
I would heartily recommend them.

Here is my test experience with OpenOffice.org [ a package with a small 275Mb debuginfo RPM ]:

before patch:

(gdb) bt
#0 0xffffe430 in __kernel_vsyscall ()
#1 0xb6eb21d7 in *__GI___poll (fds=0x8641990, nfds=6, timeout=599) at ../sysdeps/unix/sysv/linux/poll.c:87
#2 0xb554f6f2 in ?? () from /usr/lib/libglib-2.0.so.0
#3 0x08641990 in ?? ()
#4 0x00000006 in ?? ()
#5 0x00000257 in ?? ()
#6 0x08641990 in ?? ()
#7 0x00000006 in ?? ()
#8 0x00000000 in ?? ()

after patch:

(gdb) bt
#0 0xffffe430 in __kernel_vsyscall ()
#1 0xb6eb21d7 in *__GI___poll (fds=0x8641990, nfds=6, timeout=599) at ../sysdeps/unix/sysv/linux/poll.c:87
#2 0xb554f6f2 in ?? () from /usr/lib/libglib-2.0.so.0
#3 0xb554f9d8 in g_main_context_iteration () from /usr/lib/libglib-2.0.so.0
#4 0xb5b3d68d in ?? () from /usr/lib/ooo-2.0/program/libvclplug_gtk680li.so
#5 0xb54f4751 in X11SalInstance::Yield () from /usr/lib/ooo-2.0/program/libvclplug_gen680li.so
#6 0xb7e06cf1 in Application::Yield () from /usr/lib/ooo-2.0/program/libvcl680li.so
#7 0xb7e06d3f in Application::Execute () from /usr/lib/ooo-2.0/program/libvcl680li.so
#8 0x08071c53 in desktop::Desktop::Main ()
#9 0xb7e0a27e in ?? () from /usr/lib/ooo-2.0/program/libvcl680li.so
#10 0xb7e0a41a in SVMain () from /usr/lib/ooo-2.0/program/libvcl680li.so
#11 0x08066c60 in main ()

apparently there is a way this will work.

Andreas - can we get this patch into OpenSUSE 11.0 ? - it would *really* help improve the quality of the product going forward I think.

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-05-23:

#28

Created an attachment (id=217741)
same patch, for gdb-6.8 (as in 11.0)

Equivalent patch for the gdb in 11.0.
mbuild job is knuth-matz-2, I'll submit this for 11.0, coolo can take it
or leave it :)

In Novell/SUSE Bugzilla #390722, Hpj (hpj) wrote on 2008-05-23:

#29

Mr. Matz, I worship the ground you walk on. Thank you for this fix.

In Novell/SUSE Bugzilla #390722, Federico Mena-Quintero (federico-novell) wrote on 2008-05-23:

#30

W00t. Add me to the list of Matz-worshippers. This is fantastic to have.

In Novell/SUSE Bugzilla #390722, Mwelinder (mwelinder) wrote on 2008-05-23:

#31

Upstream asap, please! This patch needs to make it into all distros
so we can forget there once was a lot of "??" traces. This is in the
same league as the invention of hot water.

Robert Collins (lifeless) wrote on 2008-05-24:

#32

This seems to be at least partially fixed by this patch:
https://bugzilla.novell.com/show_bug.cgi?id=390722

I'll see about porting it to ubuntu's gdb; in airport at the moment, may not have time.

Robert Collins (lifeless) wrote on 2008-05-24:

#33

Incorporate Michael Matz's patch (1.6 KiB, text/plain)

Untested debdiff attached.

Robert Collins (lifeless) wrote on 2008-05-24:

#34

take 2. (1.6 KiB, text/plain)

updated patch, has the advantage of actually applying. *blush*.

Robert Collins (lifeless) wrote on 2008-05-25:

#35

patch that builds (1.4 KiB, text/plain)

Ok, a patch that tries to apply the patch (garh I hate quilt/dpatch's UI). and the patch applies and builds.

Robert Collins (lifeless) wrote on 2008-05-25:

#36

Well the package builds and the gdb it creates appears to work; however I need to get a 386 chroot etc to test it properly.

amd64_frame_cache may need the same love; it seems to be doing less well than perhaps it can as well.

Martin Pitt (pitti) wrote on 2008-05-26:

#37

Taking for sponsoring.

Changed in gdb:
assignee:	nobody → pitti
status:	New → In Progress

Martin Pitt (pitti) wrote on 2008-05-26:

#38

debdiff for hardy-proposed (2.4 KiB, text/plain)

I ported the change to gdb/amd64-tdep.c, which seems to improve stack traces in an analogous way on amd64 as well.

A very simple test case is:

$ gdb /bin/bash
(gdb) run
[ press Control-C ]
(gdb) bt

With the updated gdb, the stack trace has far fewer ?? and more symbols.

Changed in gdb:
status:	In Progress → Fix Committed

Martin Pitt (pitti) wrote on 2008-05-26:

#39

Accepted into -proposed, please test and give feedback here

Martin Pitt (pitti) wrote on 2008-05-26:

#40

I also tested that the updated gdb still passes the testsuite in pkg-create-dbgsym, which exercises the installed gdb and checks that it produces good backtraces with the separate -dbgsym packages.

Changed in gdb:
assignee:	nobody → pitti
status:	New → In Progress

Bug Watch Updater (bug-watch-updater) on 2008-05-26

Changed in gdb:
status:	Unknown → Confirmed

Sebastien Bacher (seb128) wrote on 2008-05-26:

#41

confirmed on i386 that some non debug stacktraces have extra symbols and that debug stacktraces still work correctly

In Novell/SUSE Bugzilla #390722, Janne Karhunen (janne-karhunen) wrote on 2008-05-26:

#42

> (The speculation about -fomit-frame-pointer doesn't apply either, because
> we don't compile with that option on i386, otherwise also valgrind would be
> lost)

A while back I experimented with libsegfault by adding threading support to it and some other fancy stuff such as 'file:line' DWARF data extraction. While testing that prior to sending it out to libc-alpha I found the frame pointer trail to be highly unreliable even when 'omit-frame-pointer' is not being used. Heck, even libc sleep() seemed to be using ebp/rbp for something else, destroying the trail :/

Martin Pitt (pitti) wrote on 2008-05-27:

#43

Copied hardy-proposed version to intrepid.

Changed in gdb:
status:	In Progress → Fix Released

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-05-27:

#44

$ebp based frames are only highly unpredictable when they don't exist.
Unfortunately you can't detect this with certainty (i.e. if $ebp points
to the frame or is used for something else), that's where the problem is.
In normal compiled code (libc with its load of inline asm doesn't count
as that for some functions unfortunately) you can be sure that $ebp points
to a frame (if not compiled with omit-frame-pointer of course).

So, under the assumptions that we want this whole thing to get useful
backtraces out of segfaults, and that further such segfaults in libc are
not happening very often, but that they rather occur in application code
(without being called back from libc code, like with qsort), it seems sensible
to just rely on $ebp frames, even if it also has its share of problems.
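
To make the $ebp discussion above concrete, here is a minimal simulation of a frame-pointer walk. It is only a sketch, not gdb's code: the stack is modelled as a Python dict of made-up 32-bit addresses, and `walk_frame_pointers` is a hypothetical helper. It assumes the usual i386 layout where the word at $ebp holds the caller's saved $ebp and the next word holds the return address.

```python
WORD = 4  # 32-bit words, as on i386

def walk_frame_pointers(memory, ebp, max_frames=16):
    """Follow the saved-ebp chain and collect the return addresses found."""
    return_addresses = []
    while ebp != 0 and len(return_addresses) < max_frames:
        if ebp not in memory or ebp + WORD not in memory:
            break  # chain left the known stack: nothing more to recover
        return_addresses.append(memory[ebp + WORD])  # return address slot
        ebp = memory[ebp]                            # caller's saved ebp
    return return_addresses

# Hypothetical memory image: three well-formed frames plus a data buffer.
memory = {
    0x1000: 0x1020, 0x1004: 0x08048111,   # innermost frame
    0x1020: 0x1040, 0x1024: 0x08048222,
    0x1040: 0x0000, 0x1044: 0x08048333,   # outermost frame, chain ends
    0x2000: 0x0000000A, 0x2004: 0x00005607,  # plain data, not a frame
}

# ebp really points at a frame: the walk recovers all three callers.
good = walk_frame_pointers(memory, 0x1000)

# ebp was reused as a general-purpose register and points at data:
# the walker cannot tell, and reports the data word as a "return address".
bad = walk_frame_pointers(memory, 0x2000)

print([hex(a) for a in good])
print([hex(a) for a in bad])
```

The second call shows the problem described above: nothing in the memory image itself says whether $ebp points to a frame, so a walker that trusts it will report data words as return addresses.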

Revision history for this message

In Novell/SUSE Bugzilla #390722, Janne Karhunen (janne-karhunen) wrote on 2008-05-27:

#45

Right, the reason libc breaks the frame pointer trail so heavily is probably its excessive use of inline asm. These days quite a few libc functions are really inline asm wrappers around direct system calls (the socket functions are a great example of this) :/

Anyhoo, my conclusion about libsegfault was that it should probably be removed from libc completely (or rewritten using libunwind); it's that unreliable. So may the force be with you here...

Revision history for this message

Martin Pitt (pitti) wrote on 2008-05-29:

#46

Copied to hardy-updates.

Changed in gdb:
status:	Fix Committed → Fix Released

Revision history for this message

Paul Wise (Debian) (pabs) wrote on 2008-06-10:

#47

Has the patch been sent upstream, or at least to Debian?

Bug Watch Updater (bug-watch-updater) on 2008-06-16

Changed in gdb:
status:	Confirmed → Incomplete

Revision history for this message

Martin Pitt (pitti) wrote on 2008-09-06:

#48

Indeed I tried twice now to submit that bug and patch upstream, but this horrible piece of a bug tracker just swallows my reports: http://sourceware.org/cgi-bin/gnatsweb.pl?cmd=create&database=gdb.

I gave up and sent it to Debian now.

Bug Watch Updater (bug-watch-updater) on 2008-09-07

Changed in gdb:
status:	Unknown → New

Revision history for this message

In Novell/SUSE Bugzilla #390722, Aj-novell (aj-novell) wrote on 2008-10-24:

#49

Michael, Sankar, what is the status here?

Revision history for this message

In Novell/SUSE Bugzilla #390722, Janne Karhunen (janne-karhunen) wrote on 2008-10-24:

#50

FYI: there is a nice heuristic in libunwind for detecting a valid frame pointer (based on the size of the jump).
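
As an illustration of what a "size of the jump" check could look like, here is a rough sketch. It is not libunwind's actual code: the function names, the alignment test, and the `max_jump` threshold are all assumptions made for this example. The idea is that on a downward-growing stack, a caller's frame pointer should lie above the current one and not implausibly far away.

```python
def plausible_next_frame(current_fp, next_fp, max_jump=0x10000, align=4):
    """Accept next_fp only if it is aligned, above current_fp, and close by.

    max_jump and align are hypothetical tuning values for this sketch.
    """
    return next_fp % align == 0 and current_fp < next_fp <= current_fp + max_jump

def walk_checked(memory, fp, word=4, max_frames=16):
    """Walk the saved-fp chain, stopping at the first implausible jump."""
    return_addresses = []
    while len(return_addresses) < max_frames and fp in memory and fp + word in memory:
        return_addresses.append(memory[fp + word])
        next_fp = memory[fp]
        if not plausible_next_frame(fp, next_fp):
            break  # saved value does not look like a caller's frame
        fp = next_fp
    return return_addresses

# Hypothetical stack: two good frames, then a saved "frame pointer" of 0x0a,
# i.e. the register was holding ordinary data when it was pushed.
memory = {
    0x1000: 0x1020, 0x1004: 0x08048111,
    0x1020: 0x000A, 0x1024: 0x08048222,
}

trace = walk_checked(memory, 0x1000)
print([hex(a) for a in trace])
```

A check like this only stops the walk early instead of following garbage; it cannot recover the frames that lie beyond the broken link.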

Revision history for this message

In Novell/SUSE Bugzilla #390722, Mmeeks (mmeeks) wrote on 2008-10-24:

#51

Micha's patch is wonderful - IMHO we should ship that ASAP; then tackle Sankar's other problems separately as new bug(s).

Oh - and we should make gdb "actually work"(TM) in OS11.1 - so we have even a small chance of debugging things :-)

Revision history for this message

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-10-24:

#52

Created an attachment (id=247826)
ported patch for sles10-sp2 gdb

FWIW here's the patch backported to the SLES10 gdb. It's really trivial,
so I guess you had a similar one already. If that doesn't work we need
a better testcase, the one from Michael Meeks is fixed as far as I know.
mbuild is knuth-matz-23 .

Revision history for this message

In Novell/SUSE Bugzilla #390722, Matz (matz) wrote on 2008-10-24:

#53

The 11.1 gdb still contains this patch, so we should be fine.

Bug Watch Updater (bug-watch-updater) on 2008-10-24

Changed in gdb:
status:	Incomplete → Confirmed

Revision history for this message

In Novell/SUSE Bugzilla #390722, Psankar (psankar) wrote on 2009-01-07:

#54

I just verified in 11.1 and it works fine

Bug Watch Updater (bug-watch-updater) on 2009-06-02

Changed in gdb (Suse):
status:	Confirmed → Incomplete

Revision history for this message

In Novell/SUSE Bugzilla #390722, Ast-novell (ast-novell) wrote on 2010-04-29:

#55

What is the status here?

Revision history for this message

In Novell/SUSE Bugzilla #390722, Mrdocs-opensuse (mrdocs-opensuse) wrote on 2010-11-27:

#56

Fixed in 11.1

Bug Watch Updater (bug-watch-updater) on 2010-11-28

Changed in gdb (Suse):
importance:	Unknown → High
status:	Incomplete → Fix Released

Bug Watch Updater (bug-watch-updater) on 2011-08-11

Changed in gdb (Debian):
status:	New → Fix Released

	Status	Importance	Assigned to
gdb (Debian)	Fix Released	Unknown	debbugs #498030
gdb (Suse)	Fix Released	High	novell-bugs #390722
gdb (Ubuntu)	Fix Released	Undecided	Martin Pitt
Hardy	Fix Released	Undecided	Martin Pitt
Intrepid	Fix Released	Undecided	Martin Pitt
