Steel Bank Common Lisp

"deadlock cycle" in threads.impure.lisp (:HASH-CACHE :SUBTYPEP)+:PARALLEL-DEFCLASS

Reported by 3b on 2011-07-08
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
SBCL
Critical
Unassigned

Bug Description

The (:HASH-CACHE :SUBTYPEP) test in threads.impure.lisp doesn't wait for the threads to exit, so they are still running when the :PARALLEL-DEFCLASS test runs. Something about the combination sometimes triggers

WARNING: DEADLOCK CYCLE DETECTED:

   #<SB-THREAD:THREAD "d1" waiting for:
        #<MUTEX "GC lock"
            owner: #1=#<SB-THREAD:THREAD "d2" #2=waiting for: # {10040A8601}>>
      {10040A8431}>

   #<SB-THREAD:THREAD "d2" waiting for:
        #<MUTEX "World Lock"
            owner: #1=#<SB-THREAD:THREAD "d1" #2=waiting for: # {10040A8431}>>
      {10040A8601}>

END OF CYCLE
fatal error encountered in SBCL pid 8127(tid 140737278113536):
Trapping to run pending handler while GC in progress.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>

with LDB not responding to input.

Running either test in a loop by itself runs to completion, but running both in a loop as in the following test case tends to break fairly reliably (usually within 5-50 iterations here), either with unresponsive LDB or locking up after apparently trying to start the normal debugger:

(with-test (:name :parallel-defclass)
  (loop for i below 100
     do
       (format t "~%~%starting subtypep-hash-cache-test, pass ~s~%" i)
       (dotimes (i 10)
         (sb-thread:make-thread #'sb-thread::subtypep-hash-cache-test))
       (format t "starting :parallel-defclass, pass ~s...~%" i)
       (progn
         (defclass test-1 () ((a :initform :orig-a)))
         (defclass test-2 () ((b :initform :orig-b)))
         (defclass test-3 (test-1 test-2) ((c :initform :orig-c)))
         (let* ((run t)
                (d1 (sb-thread:make-thread (lambda ()
                                             (loop while run
                                                do (defclass test-1 () ((a :initform :new-a)))
                                                  (write-char #\1)
                                                  (force-output)))
                                           :name "d1"))
                (d2 (sb-thread:make-thread (lambda ()
                                             (loop while run
                                                do (defclass test-2 () ((b :initform :new-b)))
                                                  (write-char #\2)
                                                  (force-output)))
                                           :name "d2"))
                (d3 (sb-thread:make-thread (lambda ()
                                             (loop while run
                                                do (defclass test-3 (test-1 test-2) ((c :initform :new-c)))
                                                  (write-char #\3)
                                                  (force-output)))
                                           :name "d3"))
                (i (sb-thread:make-thread (lambda ()
                                            (loop while run
                                               do (let ((i (make-instance 'test-3)))
                                                    (assert (member (slot-value i 'a) '(:orig-a :new-a)))
                                                    (assert (member (slot-value i 'b) '(:orig-b :new-b)))
                                                    (assert (member (slot-value i 'c) '(:orig-c :new-c))))
                                                 (write-char #\i)
                                                 (force-output)))
                                          :name "i")))
           (format t "~%sleeping!~%")
           (sleep 3.0)
           (format t "~%stopping!~%")
           (setf run nil)
           (mapc (lambda (th)
                   (sb-thread:join-thread th)
                   (format t "~%joined ~S~%" (sb-thread:thread-name th)))
                 (list d1 d2 d3 i))))
       (format t " :parallel-defclass, pass ~s done~%" i)))

Testing on SBCL 1.0.49.79-ba12c5c, x8664 linux.

Possibly related to Bug #308959?

Nikodemus Siivola (nikodemus) wrote :

The real question is, how and why does D2 try to acquire World Lock while holding the GC lock? Or does not it really do that, but that there's a bug in deadlock detection instead?

(The entry into LDB makes me think there's a bug not related to deadlock detection here, though.)

Changed in sbcl:
assignee: nobody → Nikodemus Siivola (nikodemus)
importance: Undecided → High
status: New → In Progress
3b (00003b) wrote :

After a hint from IRC that the test harness interferes with the debuggers, further tests showed that in both cases LDB and the normal debugger were still responsive when run directly or from slime, which allowed narrowing it down further.

As far as I can tell, what is happening is that the deadlock detection code is being confused by threads that are interrupted while waiting for a lock. So in that example, D2 didn't try to acquire World Lock while holding GC lock, it acquired GC lock while waiting for World Lock.

WITH-DEADLOCKS saves/restores the THREAD-WAITING-FOR slot while an interrupted thread waits for a lock, but then restores it once that lock is acquired, making it look like the thread is waiting on the original lock even though the interruption code is still running and can release the inner lock normally.

simplified test case for that situation (also at http://paste.lisp.org/+2NOI )

(let* ((m1 (sb-thread:make-mutex :name "M1"))
       (m2 (sb-thread:make-mutex :name "M2"))
       (t1 (sb-thread:make-thread
            (lambda ()
              (sb-thread:with-mutex (m1)
                (sleep 0.3)
                :ok))
            :name "T1"))
       (t2 (sb-thread:make-thread
            (lambda ()
              (sleep 0.1)
              (sb-thread:with-mutex (m1 :wait-p t)
                (sleep 0.2)
                :ok))
            :name "T2")))
  (sleep 0.2)
  (sb-thread:interrupt-thread t2 (lambda ()
                                   (sb-thread:with-mutex (m2 :wait-p t)
                                     (sleep 0.3))))
  (sleep 0.05)
  (sb-thread:interrupt-thread t1 (lambda ()
                                   (sb-thread:with-mutex (m2 :wait-p t)
                                     (sleep 0.3))))
  ;; both threads should finish without a deadlock or deadlock
  ;; detection error
  (let ((res (list (sb-thread:join-thread t1)
                   (sb-thread:join-thread t2))))
    (assert (equal '(:ok :ok) res))))

Timer.impure.lisp / (TIMER THREADED-STRESS) hits what seems to be a variant of the same problem:

Deadlock cycle detected:

   #1=#<SB-THREAD:THREAD "1" #2=waiting for:
           #<MUTEX "thread interruptions lock" (free)>
         {10029DD141}>

   #<SB-THREAD:THREAD "worker" #2#
        #<MUTEX "thread result lock" owner: #1#>
      {100294D0E1}>

which is apparently caused when the sigalrm-handler runs in a thread waiting on the thread-result-lock of the thread on which the timer function is supposed to run. Test triggers it fairly rarely, looping 1000 times hits it reasonably often, but not always.

See http://ccl.clozure.com/irc-logs/sbcl/2011-08/sbcl-2011.08.13.txt around 01:00 for discussion of a possible fix: Basic idea is to store a stack of thread-waiting-for, and held locks remember the state of the stack when the lock was acquired so the deadlock detection can stop checking when it reaches that point.
(more discussion of the bug in general scattered over preceding 2 days of logs, starting at 17:50 in http://ccl.clozure.com/irc-logs/sbcl/2011-08/sbcl-2011.08.11.txt as well)

3b (00003b) wrote :
Download full text (4.3 KiB)

In the LDB case (backtrace below), the ERROR ends up causing a call to GETHASH, which does a WITHOUT-GCING, which tries to run pending GCs when it finishes, and we lose() since we were already trying to GC when the deadlock check started. Not sure why it only happens some of the time though.

WARNING: DEADLOCK CYCLE DETECTED:

   #<SB-THREAD:THREAD "d2" waiting for:
        #<MUTEX "GC lock"
            owner: #1=#<SB-THREAD:THREAD "i" #2=waiting for: # {1002E50261}>>
      {1002DD0081}>

   #<SB-THREAD:THREAD "i" waiting for:
        #<MUTEX "World Lock"
            owner: #1=#<SB-THREAD:THREAD "d2" #2=waiting for: # {1002DD0081}>>
      {1002E50261}>

END OF CYCLE
fatal error encountered in SBCL pid 19497(tid 140737271953152):
Trapping to run pending handler while GC in progress.

Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb> ba
Backtrace:
   0: Foreign function lose, fp = 0x7ffff319c080, ra = 0x40fc72
   1: Foreign function (null), fp = 0x7ffff319c0b0, ra = 0x412214
   2: Foreign function (null), fp = 0x7ffff319c0f0, ra = 0x4130ca
   3: Foreign function (null), fp = 0x7ffff319c610, ra = 0x7ffff79cab40
   4: SB-THREAD::CALL-WITH-RECURSIVE-SYSTEM-SPINLOCK/WITHOUT-GCING
   5: SB-IMPL::GETHASH3
   6: (COMMON-LISP::FLET SB-THREAD::WITH-RECURSIVE-SYSTEM-SPINLOCK-THUNK)
   7: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-[CALL-WITH-RECURSIVE-SYSTEM-SPINLOCK]245)
   8: SB-DI::MAKE-COMPILED-DEBUG-FUN
   9: (SB-C::HAIRY-ARG-PROCESSOR SB-DI::COMPUTE-CALLING-FRAME)
  10: SB-KERNEL::FIND-CALLER-NAME-AND-FRAME
  11: (SB-C::VARARGS-ENTRY COMMON-LISP::ERROR)
  12: SB-THREAD::CHECK-DEADLOCK
  13: (SB-C::HAIRY-ARG-PROCESSOR SB-THREAD::GET-MUTEX)
  14: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-[CALL-WITH-MUTEX]301)
  15: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-[SUB-GC]27)
  16: (SB-C::VARARGS-ENTRY SB-KERNEL::SUB-GC)
  17: Foreign function call_into_lisp, fp = 0x7ffff319cf10, ra = 0x4220b0
  18: Foreign function maybe_gc, fp = 0x7ffff319cf40, ra = 0x40d374
  19: Foreign function interrupt_handle_pending, fp = 0x7ffff319cf70, ra = 0x412189
  20: Foreign function (null), fp = 0x7ffff319cfb0, ra = 0x4130ca
  21: Foreign function (null), fp = 0x7ffff319d510, ra = 0x7ffff79cab40
  22: (SB-C::VARARGS-ENTRY (SB-PCL::EMF COMMON-LISP::INITIALIZE-INSTANCE))
  23: (SB-C::VARARGS-ENTRY (SB-PCL::FAST-METHOD COMMON-LISP::MAKE-INSTANCE (COMMON-LISP::CLASS)))
  24: (SB-PCL::FAST-METHOD SB-MOP::COMPUTE-EFFECTIVE-SLOT-DEFINITION (SB-PCL::SLOT-CLASS COMMON-LISP::T COMMON-LISP::T))
  25: SB-PCL::STD-COMPUTE-SLOTS
  26: (SB-PCL::FAST-METHOD SB-MOP::COMPUTE-SLOTS KEYWORD::AROUND (COMMON-LISP::STANDARD-CLASS))
  27: (COMMON-LISP::FLET SB-THREAD::WITH-RECURSIVE-LOCK-THUNK)
  28: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-[CALL-WITH-RECURSIVE-LOCK]325)
  29: SB-THREAD::CALL-WITH-RECURSIVE-LOCK
  30: SB-PCL::UPDATE-CLASS
  31: (COMMON-LISP::FLET SB-THREAD::WITH-RECURSIVE-LOCK-THUNK)
  32: (COMMON-LISP::FLET WITHOUT-INTERRUPTS-BODY-[CALL-WITH-RECURSIVE-LOCK]325)
  33: SB-THREAD::CALL-WITH-RECURSIVE-LOCK
  34: SB-PCL::UPDATE-CLASS
  35: (SB-C::VARARGS-ENTRY (SB-PCL::FAST-METHOD COMMON-LISP::SHARED-INITIALIZE KEYWORD::AFTER (SB-PCL::STD-CLASS COMMON-LIS...

Read more...

Nikodemus Siivola (nikodemus) wrote :

I have a fix coming up.

Here's a smaller test-case for the GC issue:

(let* ((stop (+ 100 (get-universal-time)))
         (m1 (sb-thread:make-mutex :name "m1"))
         (t1 (sb-thread:make-thread
              (lambda ()
                (loop until (> (get-universal-time) stop)
                      do (sb-thread:with-mutex (m1)
                           (eval `(make-array 24))))
                :ok)))
         (t2 (sb-thread:make-thread
              (lambda ()
                (loop until (> (get-universal-time) stop)
                      do (sb-thread:with-mutex (m1)
                           (eval `(make-array 24))))
                :ok))))
    (let ((res (list (sb-thread:join-thread t1)
                     (sb-thread:join-thread t2))))
      (assert (equal '(:ok :ok) res))))

Changed in sbcl:
importance: High → Critical
Nikodemus Siivola (nikodemus) wrote :

commit 5abf3b4b94c8c2315777e63729293395dc54992c
Author: Nikodemus Siivola <email address hidden>
Date: Mon Aug 15 14:33:49 2011 +0300

    fix bogus deadlocks from interrupts and GCs

     lp#807475

     Going in despite the freeze since this is a regression that can
     semi-randomly break correct code. *ouch*

     Thanks to Bart Bortta and #sbcl for the analysis.

     Problem 1:

       T1 holds L1

       T2 is waiting for L1

       T2 is interrupted, interrupt handler grabs L2

       T1 starts waiting on L2

       Prior to this patch, when GET-MUTEX in T2's interrupt handler grabbed
       L2 is marked T2 as still waiting for L1 -- which is not true until
       the interrupt handler returns.

     Problem 2:

       T1 holds L1

       T2 is waiting for L1

       GC is triggered in T2 inside GET-MUTEX

       T2 grabs *ALREADY-IN-GC* lock

       GC is triggered in T1, T1 tries to get *ALREADY-IN-GC* lock.

       Prior to this patch, when T1 detects a bogus deadlock as T2 has
       been marked as waiting for L1 -- which is not true until the GC is
       finished and normal execution resumes.

     Problem 3:

       T1 holds L1

       T2 is waiting for L1

       GC is triggered in T2 inside GET-MUTEX

       T2 grabs lock L2 due to a finalizer or an after-gc-hook

       GC is triggered in T1

       T1 tries to grab L2 due to a finalizer, etc.

       Same as problem 2, but with a user-lock and POST-GC instead of
       *ALREADY-IN-GC* and SUB-GC.

     This patch fixes the issue by saving, clearing, and restoring
     the waiting-for mark in

      1) interrupt handlers

      2) SUB-GC

      3) POST-GC

Changed in sbcl:
assignee: Nikodemus Siivola (nikodemus) → nobody
status: In Progress → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers