Crashes when debugging multi-threaded applications

Bug #616006 reported by Ulrich Weigand
Affects: Linaro GDB
Status: Invalid
Importance: High
Assigned to: Ulrich Weigand

Bug Description

There appear to be fundamental issues preventing debugging of multi-threaded applications.

Failing test cases in that area include:
FAIL: gdb.threads/hand-call-in-threads.exp: hand call, thread 2
FAIL: gdb.threads/local-watch-wrong-thread.exp: set local watchpoint on *myp
FAIL: gdb.threads/print-threads.exp: all threads ran once (timeout)
FAIL: gdb.threads/schedlock.exp: step to increment (unlocked 2)
FAIL: gdb.threads/threxit-hop-specific.exp: get past the thread specific breakpoint
FAIL: gdb.base/watch_thread_num.exp: Watchpoint triggered iteration 2 (timeout)

Symptoms are somewhat unpredictable, but include the inferior crashing with SIGILL:

Breakpoint 2, thread_function0 (arg=0x0) at /home/uweigand/gdb-7.1.90/gdb/testsuite/gdb.threads/local-watch-wrong-thread.c:37^M
37 usleep (1); /* Loop increment 1. */^M
(gdb) PASS: gdb.threads/local-watch-wrong-thread.exp: continue to thread_function0
delete breakpoints^M
Delete all breakpoints? (y or n) y^M
(gdb) info breakpoints^M
No breakpoints or watchpoints.^M
(gdb) watch *myp^M
Watchpoint 3: *myp^M
(gdb) FAIL: gdb.threads/local-watch-wrong-thread.exp: set local watchpoint on *myp
continue^M
Continuing.^M
[New Thread 0x40596470 (LWP 15255)]^M
^M
Program received signal SIGILL, Illegal instruction.^M
0x000084f8 in thread_function0 (arg=0x0) at /home/uweigand/gdb-7.1.90/gdb/testsuite/gdb.threads/local-watch-wrong-thread.c:37^M
37 usleep (1); /* Loop increment 1. */^M
(gdb) FAIL: gdb.threads/local-watch-wrong-thread.exp: local watchpoint triggers

Definitely needs further analysis.

Changed in gdb-linaro:
assignee: nobody → Ulrich Weigand (uweigand)
Ulrich Weigand (uweigand) wrote :

I'm seeing those SIGILL failures on Michael's "pavo1" board, running a non-Linaro 2.6.32 kernel.
On Loic's BeagleBoard running linux-image-2.6.35-6-omap I'm *not* seeing those failures.

(There are still other failures in the thread tests, but those look much less severe, e.g. incorrect backtraces.)

Maybe there is an underlying kernel bug that got fixed in the meantime?

Ulrich Weigand (uweigand) wrote :

More details on just which tests fail on the two boards.

1. Fails on Loic's board only:

FAIL: gdb.threads/attachstop-mt.exp: attach3 to stopped bt
FAIL: gdb.threads/attachstop-mt.exp: attach4 to stopped bt

bt^M
#0 0xffff0520 in ?? ()^M
#1 0x4014d732 in __libc_enable_asynccancel () at ../nptl/cancellation.c:43^M
#2 0x00000000 in ?? ()^M
(gdb) FAIL: gdb.threads/attachstop-mt.exp: attach3 to stopped bt

2. Fails on both boards in same fashion:

FAIL: gdb.threads/hand-call-in-threads.exp: hand call, thread 2
FAIL: gdb.threads/hand-call-in-threads.exp: hand call, thread 3
FAIL: gdb.threads/hand-call-in-threads.exp: hand call, thread 4
FAIL: gdb.threads/hand-call-in-threads.exp: hand call, thread 5

call hand_call()^M
^M
Program received signal SIGSEGV, Segmentation fault.^M
0x00000000 in ?? ()^M

FAIL: gdb.threads/hand-call-in-threads.exp: dummy stack frame number, thread 2
FAIL: gdb.threads/hand-call-in-threads.exp: dummy stack frame number, thread 3
FAIL: gdb.threads/hand-call-in-threads.exp: dummy stack frame number, thread 4
FAIL: gdb.threads/hand-call-in-threads.exp: dummy stack frame number, thread 5

bt^M
#0 0x00000000 in ?? ()^M
#1 0x000085b8 in pthread_cond_init () at forward.c:117^M
#2 0x000085b8 in pthread_cond_init () at forward.c:117^M
Backtrace stopped: previous frame identical to this frame (corrupt stack?)^M

FAIL: gdb.threads/hand-call-in-threads.exp: all dummies popped

maint print dummy-frames^M
0x2e9200: id={stack=0x40995d00,code=0x8608,!special}^M
0x2d0b68: id={stack=0x40795d00,code=0x8608,!special}^M
0x2d0c38: id={stack=0x40595d00,code=0x8608,!special}^M
0x300288: id={stack=0x40395d00,code=0x8608,!special}^M
(gdb) FAIL: gdb.threads/hand-call-in-threads.exp: all dummies popped

3. Fails on both boards in different fashions:

FAIL: gdb.threads/local-watch-wrong-thread.exp: set local watchpoint on *myp
FAIL: gdb.threads/local-watch-wrong-thread.exp: local watchpoint triggers (timeout)
ERROR: Delete all breakpoints in delete_breakpoints (timeout)
UNRESOLVED: gdb.threads/local-watch-wrong-thread.exp: set local watchpoint on *myp, with false conditional (timeout)
FAIL: gdb.threads/local-watch-wrong-thread.exp: breakpoint on the other thread (timeout)
[...]

Loic's board: watchpoint never triggers
Michael's board: crashed with SIGILL

4. Fails on Loic's board only:

FAIL: gdb.threads/pthreads.exp: check backtrace from thread 2
FAIL: gdb.threads/pthreads.exp: apply backtrace command to all three threads

#0 0xffff0520 in ?? ()^M
#1 0x4014d732 in __libc_enable_asynccancel () at ../nptl/cancellation.c:43^M
#2 0x00000000 in ?? ()^M

5. Fails on Michael's board only:

FAIL: gdb.threads/print-threads.exp: all threads ran once (timeout)

Fails to run to completion for some reason.

6. Fails on Michael's board only:

FAIL: gdb.threads/schedlock.exp: step to increment (unlocked 2)
[...]

Program terminated with signal SIGILL, Illegal instruction.^M

7. Fails on both boards in the same way:

FAIL: gdb.threads/threxit-hop-specific.exp: get past the thread specific breakpoint

warning: Breakpoint address adjusted from 0x4002d29d to 0x4002d29c.^M
start_thread (arg=<value optimized out>) at pthread_create.c:285^M
285 pthread...

Ulrich Weigand (uweigand) wrote :

Failure 2 in the list in comment 2 is actually the same problem as Bug #615974: inferior calls from within interrupted system calls are broken.

Ulrich Weigand (uweigand) wrote :

Failure 7 in the list in comment 2 is a combination of two separate issues:

- The warning "Breakpoint address adjusted" is caused by arm_get_longjmp_target not removing the Thumb bit from the longjmp target address; the rest of the code doesn't cope well with breakpoint addresses with Thumb bit set.

- The stop in start_thread happens because I've installed the libc-dbg debug info packages, so that GDB actually considers the pthread implementation to be routines it ought to stop in. This is really a bug in the test case as-is. On the other hand, this doesn't show up elsewhere because usually libc debuginfo is not installed during GDB test suite runs; we only have to do it on Arm because of missing prologue parsing support for Thumb-2 code ...
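
To illustrate the first issue: on ARM, bit 0 of a code address marks Thumb mode and is not part of the instruction's location, so it has to be cleared before the address is used as a breakpoint address. The following is only a minimal sketch of that masking step. The helper names are hypothetical and this is not the actual arm_get_longjmp_target code.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not taken from GDB. */

/* Return nonzero if bit 0 (the Thumb mode marker) is set in ADDR. */
static inline int addr_has_thumb_bit(uint32_t addr)
{
    return (addr & UINT32_C(1)) != 0;
}

/* Clear bit 0 so the result is the real, halfword-aligned
   instruction address that a breakpoint can be placed on. */
static inline uint32_t strip_thumb_bit(uint32_t addr)
{
    return addr & ~UINT32_C(1);
}
```

Applied to the address from the warning above, strip_thumb_bit(0x4002d29d) yields 0x4002d29c, which is the adjustment GDB reports.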

Ulrich Weigand (uweigand) wrote :

Failures 1 and 4 in the list in comment 2 are caused by a problem with backtracing out of the "magic" 0xffff0000 kernel area.

The code at 0xffff0520 is a stub the kernel uses to implement ERESTART_RESTARTBLOCK system call restart handling. The problem is that this actually creates a (minimal) stack frame on the user stack, so it needs to be unwound.

However:
- there is no symbol or debug info for the 0xffff0000 area
- this area is not even readable via ptrace, so GDB cannot do any code analysis

Therefore, GDB must assume the function is frameless, which then fails.
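
As a rough sketch of what any special-casing would first have to detect, the check below tests whether a PC falls inside the vector page. It assumes the page starts at 0xffff0000 (as seen in the backtraces above) and spans a single 4 KiB page. That size, the macro names and the function name are assumptions made for illustration and are not taken from GDB or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: one 4 KiB page starting at 0xffff0000. */
#define ARM_VECTOR_PAGE_START UINT32_C(0xffff0000)
#define ARM_VECTOR_PAGE_SIZE  UINT32_C(0x1000)

/* Hypothetical helper: return true if PC lies in the vector page,
   where there are no symbols and no ptrace-readable code to analyze. */
static inline bool pc_in_vector_page(uint32_t pc)
{
    return pc >= ARM_VECTOR_PAGE_START
           && pc - ARM_VECTOR_PAGE_START < ARM_VECTOR_PAGE_SIZE;
}
```

With these assumptions, the frame #0 address 0xffff0520 is inside the page, while the caller address 0x4014d732 is not. Detecting the region is the easy part; knowing how to unwind from a given stub inside it is what needs kernel help or hard-coded knowledge.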

This could be fixed either by additional kernel support (e.g. the "magic" area could be converted into a real vDSO like on other platforms, which could include symbol and unwind data), or by hard-coding the addresses of those magic routines into GDB (which is not really appropriate, as those might change depending on the kernel version ...).

Yet another approach might be to change the kernel to avoid the need for creating this extra stack frame. On s390 we've solved a similar problem without this hack (potentially at the cost of a couple extra cycles in the syscall exit path, but it might be possible to avoid those).

Ulrich Weigand (uweigand) wrote :

Failure 3 just indicates that the test case is not supported on machines without hardware watchpoints. This should be fixed once we get kernel hardware watchpoint support on ARM (see blueprint hardware-breakpoint-support). For now, we should run the test suite with a board file that sets the gdb,no_hardware_watchpoints flag to suppress those tests.

Ulrich Weigand (uweigand) wrote :

Failure 8 has the same cause as failure 3 (missing hardware watchpoint support).

Ulrich Weigand (uweigand) wrote :

I've now opened separate bugs to track the issues described in comments #4 and #5:

Bug #620595 (gdb.threads/threxit-hop-specific.exp failure)
Bug #620611 (Unable to backtrace out of vector page 0xffff0000)

This means all problems seen on Loic's board are now described elsewhere.

The only remaining problem tracked in *this* bug is the SIGILL seen on pavo1.

Ulrich Weigand (uweigand) wrote :

Failure 5 seems to have been a transient timeout; I can no longer reproduce it on pavo1.

Changed in gdb-linaro:
importance: Undecided → High
Ulrich Weigand (uweigand) wrote :

Since I'm unable to reproduce the SIGILL problem with any current kernel, I'm closing this bug report now.

Changed in gdb-linaro:
status: New → Invalid