multithreaded ARM seg/longjmp causes uninitialized stack frame due to 0d10193870b5a81c3bce13a602a5403c3a55cf6c

Bug #823902 reported by Dr. David Alan Gilbert on 2011-08-10
This bug affects 2 people
Affects: QEMU
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,
  I've got an ARM multithreaded test program that I wrote as a gcc testcase (attached) that fails on QEMU; Firefox from Ubuntu ARM Maverick also fails in the same way. The failure is either a segfault or '*** longjmp causes uninitialized stack frame ***: ./arm-linux-user/qemu-arm terminated', and it fails every time.

The test works on real hardware - a dual core A9 panda board. Firefox in an ARM maverick chroot also fails in the same way and is fixed in the same way.

On 64bit Oneiric (i7-860 quad core) the backtrace from the seg looks like:
#0 __sigsetjmp () at ../sysdeps/x86_64/setjmp.S:26
#1 0x0000000060034cf4 in cpu_arm_exec (env=0x0) at /media/crypt/work/qemu/cpu-exec.c:233
#2 0x0000000060006467 in cpu_loop (env=0x6226d060) at /media/crypt/work/qemu/linux-user/main.c:599
#3 0x0000000060007984 in main (argc=<value optimised out>, argv=<value optimised out>, envp=<value optimised out>) at /media/crypt/work/qemu/linux-user/main.c:3567

On 32bit lucid (core2 duo dual core), when it gives the longjmp error it has taken a more tortuous route, but it looks like it originally took a seg at about the same place:
#0 pthread_cond_wait ()
    at ../nptl/sysdeps/unix/sysv/linux/i386/i486/pthread_cond_wait.S:123
#1 0x60000344 in exclusive_idle ()
    at /home/dg/linaro/git/qemu/linux-user/main.c:134
#2 start_exclusive () at /home/dg/linaro/git/qemu/linux-user/main.c:144
#3 stop_all_tasks () at /home/dg/linaro/git/qemu/linux-user/main.c:2996
#4 0x60016491 in force_sig (target_sig=6)
    at /home/dg/linaro/git/qemu/linux-user/signal.c:378
#5 0x60016f1d in queue_signal (env=0x639ff698, sig=6, info=0xb5610280)
    at /home/dg/linaro/git/qemu/linux-user/signal.c:451
#6 0x60017375 in host_signal_handler (host_signum=6, info=0xb561031c,
    puc=0xb561039c) at /home/dg/linaro/git/qemu/linux-user/signal.c:504
#7 <signal handler called>
#8 0x600c53d1 in raise ()
#9 0x6009a133 in abort ()
#10 0x600a0345 in __libc_message ()
#11 0x600b977c in __fortify_fail ()
#12 0x600b9717 in ____longjmp_chk ()
#13 0x600b9697 in __longjmp_chk ()
#14 0x6002b478 in cpu_loop_exit (env=0xb5611068)
    at /home/dg/linaro/git/qemu/cpu-exec.c:37
#15 0x6001d4ff in exception_action (host_signum=11, pinfo=0xb5610c8c,
    puc=0xb5610d0c) at /home/dg/linaro/git/qemu/user-exec.c:46
#16 handle_cpu_signal (host_signum=11, pinfo=0xb5610c8c, puc=0xb5610d0c)
    at /home/dg/linaro/git/qemu/user-exec.c:123
#17 cpu_arm_signal_handler (host_signum=11, pinfo=0xb5610c8c, puc=0xb5610d0c)
    at /home/dg/linaro/git/qemu/user-exec.c:186
#18 0x600172f6 in host_signal_handler (host_signum=11, info=0xb5610c8c,
    puc=0xb5610d0c) at /home/dg/linaro/git/qemu/linux-user/signal.c:492
#19 <signal handler called>
#20 0x60099ac6 in _setjmp ()
#21 0x6002b4eb in cpu_arm_exec (env=0x0)
    at /home/dg/linaro/git/qemu/cpu-exec.c:233
#22 0x600005bc in cpu_loop (env=0x639ff698)
    at /home/dg/linaro/git/qemu/linux-user/main.c:739
#23 0x60006134 in clone_func (arg=0xbfdcf95c)
    at /home/dg/linaro/git/qemu/linux-user/syscall.c:3953
#24 0x6008a8d0 in start_thread (arg=0xb5611b70) at pthread_create.c:300
#25 0x600b7f1e in clone ()

Things I've tried (with suggestions from Pete Maydell):

If I remove the 'env = cpu_single_env;' added by 0d10193870b5a81c3bce13a602a5403c3a55cf6c (tcg: Reload local variables after return from longjmp) the test works reliably (10 out of 10 passes) on 32bit Lucid and partially (7 out of 10 passes) on 64 bit Oneiric (some segs, some hangs).

If I make cpu_single_env thread local with __thread and leave 0d101... in, then again it works reliably on 32bit Lucid, and is flaky on 64 bit Oneiric (5 out of 10 passes; 2 hangs, 3 segs).
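To illustrate why reloading env from a process-wide global is unsafe once a second thread exists, here is a minimal sketch. It is not QEMU code: FakeEnv, shared_env, tls_env and reload_after_other_thread are hypothetical names, and the only thing it shows is the difference between a plain global and a __thread variable. It says nothing about the other failures seen on 64bit.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's per-CPU state; not the real structure. */
typedef struct { int id; } FakeEnv;

static FakeEnv *shared_env;        /* one slot for the whole process */
static __thread FakeEnv *tls_env;  /* one slot per thread (GCC/Clang extension) */

typedef struct { int shared_ok; int tls_ok; } ReloadResult;

/* Second thread: records its own env in both slots, as a CPU loop would. */
static void *other_thread(void *arg)
{
    shared_env = arg;
    tls_env = arg;
    return NULL;
}

/* The calling thread records its env, lets another thread run to completion,
 * then "reloads" its env from each slot and reports whether it got its own. */
static ReloadResult reload_after_other_thread(void)
{
    FakeEnv mine = { 1 }, theirs = { 2 };
    ReloadResult r;
    pthread_t t;
    int rc;

    shared_env = &mine;
    tls_env = &mine;

    rc = pthread_create(&t, NULL, other_thread, &theirs);
    assert(rc == 0);
    rc = pthread_join(t, NULL);
    assert(rc == 0);
    (void)rc;

    r.shared_ok = (shared_env == &mine);  /* overwritten by the other thread */
    r.tls_ok = (tls_env == &mine);        /* untouched: each thread has its own copy */

    /* Do not leave pointers to this function's locals behind. */
    shared_env = NULL;
    tls_env = NULL;
    return r;
}
```

In this sketch the value reloaded from the plain global belongs to the other thread, while the thread-local copy is still the caller's own.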

I've also tried using a volatile local variable in cpu_exec to hold a copy of env and restore that rather than cpu_single_env. With this it's solid on 32bit lucid and flaky on 64bit Oneiric; these failures on 64bit Oneiric look like it running off the end of the code buffer (all 0 code), jumping to non-existent code addresses, and a seg in tb_reset_jump_recursive2.
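For reference, the volatile-local approach looks roughly like the sketch below. This is an illustration only, not the actual cpu_exec code: FakeEnv, jmp_env, exit_loop and exec_once are hypothetical names, and the `env = NULL` assignment merely stands in for whatever the compiler might do to a non-volatile local between setjmp and longjmp.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's per-CPU state; not the real structure. */
typedef struct { int id; } FakeEnv;

static jmp_buf jmp_env;

/* Stand-in for leaving the execution loop via longjmp. */
static void exit_loop(void)
{
    longjmp(jmp_env, 1);
}

static FakeEnv *exec_once(FakeEnv *env0)
{
    FakeEnv *env = env0;
    /* volatile and never modified after setjmp, so its value is
     * well defined when setjmp returns for the second time. */
    FakeEnv * volatile saved_env = env0;

    if (setjmp(jmp_env) == 0) {
        /* A non-volatile local changed after setjmp has an indeterminate
         * value after longjmp; simulate it being lost. */
        env = NULL;
        exit_loop();
    }

    /* Back from longjmp: restore from the per-call copy, not from a global. */
    env = saved_env;
    assert(env != NULL);
    return env;
}
```

Because saved_env lives in the calling thread's own stack frame, the restored pointer cannot be another thread's env, which is the property the cpu_single_env reload lacks.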

With both __thread and the volatile local I still get failures on 64bit oneiric; they look mostly like they've run off the end of generated code (they're executing out of a buffer of all 0's).

(I also tried some of the 64bit tests on an EC2 Xen Natty VM with similar results).

My guess is I'm hitting multiple bugs here:
  1) The Lucid install is probably too old to hit the compiler bugs for which 0d101... is a fix - but it is in itself triggering a new bug on the old compiler.
  2) The 64bit Natty and Oneiric installs are new enough to hit the compiler bug for which 0d101 is a fix
  3) I'm probably hitting something else as well, my guess is that it could be bug 668799 but I'm not clear why it doesn't happen on my 32bit lucid install

Dave

Test built from source from this set of patches:
http://gcc.gnu.org/ml/gcc-patches/2011-07/msg02235.html

Note this needs the 64bit ARM sync patches for gcc to build

Peter Maydell (pmaydell) wrote :

If you roll back to commit 2b41f10e186ccb4f0058815161586f8d6d006ea3 what is the pass/fail rate? That ought to separate out new bugs caused by recent commits (including the 0d101 change which is definitely wrong since it assumes cpu_single_env is only being used by one thread) from random other multithreaded-user-mode problems like 668799.

volatile ought to work and be a conservative fix (although I'm not a fan of volatile and compilers notoriously can't get it right). Making cpu_single_env thread-local sounds like a reasonable idea for user-mode, but I think that the current iothread code assumes that there is only one running CPU and cpu_single_env is how you get at it from the iothread. So if we go in that direction it would require more analysis of code to figure out what it's doing with cpu_single_env.

2b41f... is a disaster on 64bit - 1 out of 10 pass; most of the others fail with:
qemu/user-exec.c:99: handle_cpu_signal: Assertion `({ unsigned long __guest = (unsigned long)(address) - guest_base; __guest < (1ul << 32); })' failed.

which I think is a segfault at host address 0 or thereabouts.
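The arithmetic behind that guess can be sketched as follows. This is only a rough model of the expression in the assertion, using fixed 64-bit types; host_addr_maps_to_guest and assert_guest_addr are hypothetical names and the guest_base values are made up. It assumes guest_base is non-zero: with a guest_base of 0, host address 0 would pass the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the shape of the asserted expression: translate a host address to
 * a guest address and check that it fits in a 32-bit guest address space. */
static bool host_addr_maps_to_guest(uint64_t host_addr, uint64_t guest_base)
{
    /* Unsigned subtraction wraps modulo 2^64, so a host address below
     * guest_base turns into a huge "guest" address. */
    uint64_t guest = host_addr - guest_base;
    return guest < (UINT64_C(1) << 32);
}

/* Same check expressed as an assertion, as in the failure message. */
static void assert_guest_addr(uint64_t host_addr, uint64_t guest_base)
{
    assert(host_addr_maps_to_guest(host_addr, guest_base));
}
```

With a hypothetical guest_base of 0x10000, a fault at host address 0 gives a wrapped value far above 1ul << 32, so the check fails, whereas a host address just above guest_base passes.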

Dave

Peter Maydell (pmaydell) wrote :

This is fixed by commits 754fd932/8a5f7b03a/b3c4bbe5 in master, and this fix will go into QEMU 1.0.

Changed in qemu:
status: New → Fix Committed
Peter Maydell (pmaydell) on 2011-12-15
Changed in qemu:
status: Fix Committed → Fix Released