SBCL

rare hang in kill-non-lisp-thread.impure

Bug #2060336 reported by Douglas Katzman on 2024-04-06

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	SBCL	Fix Released	Undecided	Unassigned

Bug Description

This test was stuck so I attached with gdb and invoked backtrace_from_fp($rbp,20,0).
It seems to be stuck waiting on the Lisp semaphore within the test. The C thread was nonexistent

using 150000 cells (0x124f80 bytes) for profile buffer @ 0x7ff17f1ac000
allocation profiler: 2 threads
// Running pure tests (LOAD-TEST)
// Running kill-non-lisp-thread.impure.lisp in COMPILE evaluator mode
::: Running :KILL-NON-LISP-THREAD
   0: fp=0x7ff17f92e330 pc=0xb80079d310 (FLET SB-UNIX::BODY :IN SB-THREAD::FUTEX-WAIT)
   1: fp=0x7ff17f92e438 pc=0xb80079d8f0 (FLET "WITHOUT-INTERRUPTS-BODY-1" :IN SB-THREAD::%CONDITION-WAIT)
   2: fp=0x7ff17f92e510 pc=0xb80079d1be SB-THREAD::%CONDITION-WAIT
   3: fp=0x7ff17f92e5c0 pc=0xb8000a9de1 (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD::%DECREMENT-SEMAPHORE)
   4: fp=0x7ff17f92e648 pc=0xb800297258 (FLET "WITHOUT-INTERRUPTS-BODY-24" :IN SB-THREAD::CALL-WITH-SYSTEM-MUTEX/ALLOW-WITH-INTERRUPTS)
   5: fp=0x7ff17f92e6d0 pc=0xb800297038 SB-THREAD::CALL-WITH-SYSTEM-MUTEX/ALLOW-WITH-INTERRUPTS
   6: fp=0x7ff17f92e7c0 pc=0xb8000a9bef SB-THREAD::%DECREMENT-SEMAPHORE
   7: fp=0x7ff17f92e7f8 pc=0xb80079f53e SB-THREAD::WAIT-ON-SEMAPHORE
   8: fp=0x7ff17f92e820 pc=0xb800a9d117 (LAMBDA () :IN "/usr/local/google/home/dougk/sbcl-charmander/tests/kill-non-lisp-thread.impure.lisp")
   9: fp=0x7ff17f92ea30 pc=0xb800a7a9cc TEST-UTIL::RUN-TEST
  10: fp=0x7ff17f92eb08 pc=0xb800090c7d SB-INT::SIMPLE-EVAL-IN-LEXENV
  11: fp=0x7ff17f92eb30 pc=0xb8006ff8ec SB-EXT::EVAL-TLF
  12: fp=0x7ff17f92ed90 pc=0xb8000b3e11 (LABELS SB-FASL::EVAL-FORM :IN SB-INT::LOAD-AS-SOURCE)
  13: fp=0x7ff17f92eea8 pc=0xb8000b371d (LAMBDA (SB-KERNEL::FORM &KEY :CURRENT-INDEX &ALLOW-OTHER-KEYS) :IN SB-INT::LOAD-AS-SOURCE)
  14: fp=0x7ff17f92ef60 pc=0xb8000628c6 SB-C::%DO-FORMS-FROM-INFO
  15: fp=0x7ff17f92f130 pc=0xb8000b303f SB-INT::LOAD-AS-SOURCE
  16: fp=0x7ff17f92f248 pc=0xb8000b52b1 (LABELS SB-FASL::LOAD-STREAM-1 :IN LOAD)
  17: fp=0x7ff17f92f2e0 pc=0xb8007cb808 SB-FASL::CALL-WITH-LOAD-BINDINGS
  18: fp=0x7ff17f92f3f8 pc=0xb8000b4da4 LOAD
  19: fp=0x7ff17f92f450 pc=0xb800a90182 RUN-TESTS::LOAD-TEST

Revision history for this message

Douglas Katzman (dougk) wrote on 2024-04-06:

this test is pretty lousy. It has (at least) 2 failures modes, only one of which I can explain so far.
It's trying to assert that a process-directed async signal always chooses to redirect the signal to a Lisp thread when the OS has chosen (arbitrarily) a native non-lisp thread.

Problem #1 is that if the thread that the kernel chooses to resignal to a Lisp thread _OTHER_THAN_ the thread on whose queue of interruptions the test pushed the action that would end the test, then the test hangs. In particular the kernel could direct the signal to the finalizer thread, and the finalizer thread just drops the signal.

Problem #2 might have something to do with parallel-exec but I'm not 100% sure. What I am sure of is that SIGURG is appearing "too soon", before the test even starts. I inserted a tiny diff in src/runtime/interrupt:

--- a/src/runtime/interrupt.c
+++ b/src/runtime/interrupt.c
@@ -357,6 +357,7 @@ static void record_signal(int sig, void* context)
         RECORD_SIGNAL(signal,void_context); \
         UNBLOCK_SIGSEGV(); \
         RESTORE_FP_CONTROL_WORD(context,void_context); \
+ if(signal==SIGURG) { char m[80];int n=snprintf(m,sizeof m,"sigurg pthr %lx %p\n",pthread_self(),current_thread);write(2,m,n); } \
         if (should_handle_in_this_thread(context)) {

#define RESTORE_ERRNO

which printed the following sequence of events:

using 150000 cells (0x124f80 bytes) for profile buffer @ 0x7f9e8461f000
allocation profiler: 1 thread
sigurg pthr 7f9e8494f6c0 0x7f9e84b50080
// Running pure tests (LOAD-TEST)
// Running kill-non-lisp-thread.impure.lisp in COMPILE evaluator mode
::: Running :KILL-NON-LISP-THREAD
symbol=5025084F sem=1002821263 mutex-of sem=1002821223
sigurg pthr 7f9e844d96c0 (nil)
sigurg pthr 7f9e8494f6c0 0x7f9e84b50080

Using gdb I determined that 0x7f9e84b50080 is the 'struct thread' for the finalizer thread.
So not only is this the same problem as #1, it literally can't happen that sigurg appeared before the test starts. However it might be weirdness of parallel-exec. Either way it seems to reduce to problem #1

--- a/src/runtime/interrupt.c
+++ b/src/runtime/interrupt.c
@@ -357,6 +357,7 @@ static void record_signal(int sig, void* context)
         RECORD_SIGNAL(signal,void_context);                     \
         UNBLOCK_SIGSEGV();                                      \
         RESTORE_FP_CONTROL_WORD(context,void_context);          \
+        if(signal==SIGURG) { char m[80];int n=snprintf(m,sizeof m,"sigurg pthr %lx %p\n",pthread_self(),current_thread);write(2,m,n); } \
         if (should_handle_in_this_thread(context)) {
 
 #define RESTORE_ERRNO 
 
which printed the following sequence of events:

Revision history for this message

Stas Boukarev (stassats) wrote on 2024-04-06:

I've seen it hang plenty of times, and outside of parallel-exec.

Revision history for this message

Douglas Katzman (dougk) wrote on 2024-04-16:

haven't seen it hang since https://sourceforge.net/p/sbcl/sbcl/ci/ad27c236e326

Changed in sbcl:
status:	New → Fix Committed

Stas Boukarev (stassats) on 2024-04-29

Changed in sbcl:
status:	Fix Committed → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.