rare hang in kill-non-lisp-thread.impure
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
SBCL | Fix Released | Undecided | Unassigned | | |
Bug Description
This test was stuck so I attached with gdb and invoked backtrace_
It seems to be stuck waiting on the Lisp semaphore within the test. The C thread was nonexistent.
using 150000 cells (0x124f80 bytes) for profile buffer @ 0x7ff17f1ac000
allocation profiler: 2 threads
// Running pure tests (LOAD-TEST)
// Running kill-non-
::: Running :KILL-NON-
0: fp=0x7ff17f92e330 pc=0xb80079d310 (FLET SB-UNIX::BODY :IN SB-THREAD:
1: fp=0x7ff17f92e438 pc=0xb80079d8f0 (FLET "WITHOUT-
2: fp=0x7ff17f92e510 pc=0xb80079d1be SB-THREAD:
3: fp=0x7ff17f92e5c0 pc=0xb8000a9de1 (FLET SB-THREAD:
4: fp=0x7ff17f92e648 pc=0xb800297258 (FLET "WITHOUT-
5: fp=0x7ff17f92e6d0 pc=0xb800297038 SB-THREAD:
6: fp=0x7ff17f92e7c0 pc=0xb8000a9bef SB-THREAD:
7: fp=0x7ff17f92e7f8 pc=0xb80079f53e SB-THREAD:
8: fp=0x7ff17f92e820 pc=0xb800a9d117 (LAMBDA () :IN "/usr/local/
9: fp=0x7ff17f92ea30 pc=0xb800a7a9cc TEST-UTIL::RUN-TEST
10: fp=0x7ff17f92eb08 pc=0xb800090c7d SB-INT:
11: fp=0x7ff17f92eb30 pc=0xb8006ff8ec SB-EXT::EVAL-TLF
12: fp=0x7ff17f92ed90 pc=0xb8000b3e11 (LABELS SB-FASL::EVAL-FORM :IN SB-INT:
13: fp=0x7ff17f92eea8 pc=0xb8000b371d (LAMBDA (SB-KERNEL::FORM &KEY :CURRENT-INDEX &ALLOW-OTHER-KEYS) :IN SB-INT:
14: fp=0x7ff17f92ef60 pc=0xb8000628c6 SB-C::%
15: fp=0x7ff17f92f130 pc=0xb8000b303f SB-INT:
16: fp=0x7ff17f92f248 pc=0xb8000b52b1 (LABELS SB-FASL:
17: fp=0x7ff17f92f2e0 pc=0xb8007cb808 SB-FASL:
18: fp=0x7ff17f92f3f8 pc=0xb8000b4da4 LOAD
19: fp=0x7ff17f92f450 pc=0xb800a90182 RUN-TESTS:
Changed in sbcl: | |
status: | Fix Committed → Fix Released |
This test is pretty lousy. It has (at least) 2 failure modes, only one of which I can explain so far.
It's trying to assert that a process-directed async signal is always redirected to a Lisp thread when the OS has chosen (arbitrarily) to deliver it to a native non-Lisp thread.
Problem #1 is that if the thread that the kernel chooses resignals to a Lisp thread _OTHER_THAN_ the thread on whose queue of interruptions the test pushed the action that would end the test, then the test hangs. In particular, the kernel could direct the signal to the finalizer thread, and the finalizer thread just drops the signal.
Problem #2 might have something to do with parallel-exec but I'm not 100% sure. What I am sure of is that SIGURG is appearing "too soon", before the test even starts. I inserted a tiny diff in src/runtime/ interrupt:
--- a/src/runtime/interrupt.c
+++ b/src/runtime/interrupt.c
@@ -357,6 +357,7 @@ static void record_signal(int sig, void* context)
 RECORD_SIGNAL(signal, void_context); \
 UNBLOCK_SIGSEGV(); \
 RESTORE_FP_CONTROL_WORD(context, void_context); \
+ if(signal==SIGURG) { char m[80];int n=snprintf(m,sizeof m,"sigurg pthr %lx %p\n",pthread_self(), current_thread);write(2,m,n); } \
 if (should_handle_in_this_thread(context)) {
 #define RESTORE_ERRNO
which printed the following sequence of events:
using 150000 cells (0x124f80 bytes) for profile buffer @ 0x7f9e8461f000
allocation profiler: 1 thread
sigurg pthr 7f9e8494f6c0 0x7f9e84b50080
// Running pure tests (LOAD-TEST)
// Running kill-non-lisp-thread.impure.lisp in COMPILE evaluator mode
::: Running :KILL-NON-LISP-THREAD
symbol=5025084F sem=1002821263 mutex-of sem=1002821223
sigurg pthr 7f9e844d96c0 (nil)
sigurg pthr 7f9e8494f6c0 0x7f9e84b50080
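For reading the trace lines above: each one has the form produced by the probe's format string, with the pthread id first and the runtime's per-thread pointer second. A minimal standalone sketch of such a probe follows. The helper names `format_probe` and `emit_probe` are hypothetical, and the cast of `pthread_t` to `unsigned long` assumes a platform such as Linux where that is meaningful.

```c
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: builds one trace line in the same format as the
 * probe, "sigurg pthr <pthread id in hex> <thread struct pointer>". */
static int format_probe(char *buf, size_t size,
                        unsigned long pthr, void *thread_struct)
{
    return snprintf(buf, size, "sigurg pthr %lx %p\n", pthr, thread_struct);
}

/* Hypothetical helper: writes the line straight to stderr with write(2),
 * avoiding buffered stdio. thread_struct would be NULL on a thread the
 * runtime does not know about. */
static void emit_probe(void *thread_struct)
{
    char m[80];
    int n = format_probe(m, sizeof m,
                         (unsigned long)pthread_self(), thread_struct);
    if (n > 0) {
        size_t len = (size_t)n < sizeof m ? (size_t)n : sizeof m - 1;
        ssize_t rc = write(2, m, len);
        (void)rc;   /* best effort: nothing useful to do on failure */
    }
}
```

Under that reading, a line whose second field prints as a null pointer would come from a thread with no associated thread structure, which is presumably the non-Lisp thread, but that interpretation is an inference from the format string rather than something the log states.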
Using gdb I determined that 0x7f9e84b50080 is the 'struct thread' for the finalizer thread.
So not only is this the same problem as #1, but a SIGURG appearing before the test starts should literally not be able to happen. It might be some weirdness of parallel-exec. Either way, it seems to reduce to problem #1.