Comment 2 for bug 807475

3b (00003b) wrote :

After a hint from IRC that the test harness interferes with the debuggers, further tests showed that in both cases LDB and the normal debugger were still responsive when run directly or from SLIME. That allowed narrowing the problem down further.

As far as I can tell, what is happening is that the deadlock detection code is being confused by threads that are interrupted while waiting for a lock. So in that example, D2 didn't try to acquire World Lock while holding GC lock; it acquired GC lock while waiting for World Lock.

WITH-DEADLOCKS saves the THREAD-WAITING-FOR slot when code running in an interrupted thread starts waiting for another lock, but restores it as soon as that lock is acquired. From then on it looks like the thread is waiting on the original lock while holding the inner one, even though the interruption code is still running and can release the inner lock normally.

Simplified test case for that situation (also at http://paste.lisp.org/+2NOI ):

(let* ((m1 (sb-thread:make-mutex :name "M1"))
       (m2 (sb-thread:make-mutex :name "M2"))
       (t1 (sb-thread:make-thread
            (lambda ()
              (sb-thread:with-mutex (m1)
                (sleep 0.3)
                :ok))
            :name "T1"))
       (t2 (sb-thread:make-thread
            (lambda ()
              (sleep 0.1)
              (sb-thread:with-mutex (m1 :wait-p t)
                (sleep 0.2)
                :ok))
            :name "T2")))
  (sleep 0.2)
  (sb-thread:interrupt-thread t2 (lambda ()
                                   (sb-thread:with-mutex (m2 :wait-p t)
                                     (sleep 0.3))))
  (sleep 0.05)
  (sb-thread:interrupt-thread t1 (lambda ()
                                   (sb-thread:with-mutex (m2 :wait-p t)
                                     (sleep 0.3))))
  ;; both threads should finish without a deadlock or deadlock
  ;; detection error
  (let ((res (list (sb-thread:join-thread t1)
                   (sb-thread:join-thread t2))))
    (assert (equal '(:ok :ok) res))))

The (TIMER THREADED-STRESS) test in timer.impure.lisp hits what seems to be a variant of the same problem:

Deadlock cycle detected:

   #1=#<SB-THREAD:THREAD "1" #2=waiting for:
           #<MUTEX "thread interruptions lock" (free)>
         {10029DD141}>

   #<SB-THREAD:THREAD "worker" #2#
        #<MUTEX "thread result lock" owner: #1#>
      {100294D0E1}>

which is apparently caused when the sigalrm-handler runs in a thread that is waiting on the thread-result-lock of the thread on which the timer function is supposed to run. A single run of the test triggers it fairly rarely; looping it 1000 times hits it reasonably often, but not always.

See http://ccl.clozure.com/irc-logs/sbcl/2011-08/sbcl-2011.08.13.txt around 01:00 for discussion of a possible fix. The basic idea is to keep a stack of THREAD-WAITING-FOR values and have each held lock remember the state of that stack at the time it was acquired, so that deadlock detection can stop following a thread's waits when it reaches that point.
(More discussion of the bug in general is scattered over the preceding 2 days of logs as well, starting at 17:50 in http://ccl.clozure.com/irc-logs/sbcl/2011-08/sbcl-2011.08.11.txt )
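
To make the stack idea concrete, here is a rough model of the bookkeeping in portable Common Lisp. It is only a sketch of my reading of that discussion, not SBCL code: it uses no real threads or mutexes, and MLOCK, MTHREAD, BEGIN-WAIT, FINISH-WAIT, BLOCKING-WAIT and DEADLOCK-P are all hypothetical names made up for the illustration. Passing NIL for USE-STACK-P mimics the current single-slot behaviour, T mimics the proposed one.

```lisp
;;; Toy model: each thread keeps a stack of the locks it is waiting for
;;; (innermost first); each lock records how deep that stack was when
;;; its owner acquired it.  All names here are hypothetical.
(defstruct mlock name owner (depth 0))
(defstruct mthread name (waits '()))

(defun begin-wait (thread lock)
  "THREAD starts waiting for LOCK."
  (push lock (mthread-waits thread)))

(defun finish-wait (thread lock)
  "THREAD acquired LOCK; remember how many outer waits were still pending."
  (pop (mthread-waits thread))
  (setf (mlock-owner lock) thread
        (mlock-depth lock) (length (mthread-waits thread))))

(defun blocking-wait (owner lock use-stack-p)
  "Return the lock that keeps OWNER from releasing LOCK, or NIL."
  (let ((waits (mthread-waits owner)))
    (cond ((null waits) nil)
          ;; Every pending wait is older than the acquisition of LOCK, so
          ;; OWNER is running (e.g. in an interruption) and can release it.
          ((and use-stack-p (<= (length waits) (mlock-depth lock))) nil)
          (t (first waits)))))

(defun deadlock-p (self use-stack-p)
  "True if following lock owners from SELF's innermost wait leads back to SELF."
  (let ((seen '())
        (lock (first (mthread-waits self))))
    (loop
      ;; SEEN guards the model against cycles that do not include SELF.
      (when (or (null lock) (member lock seen))
        (return nil))
      (push lock seen)
      (let ((owner (mlock-owner lock)))
        (cond ((null owner) (return nil))
              ((eq owner self) (return t))
              (t (setf lock (blocking-wait owner lock use-stack-p))))))))

(defun interrupted-waiter-scenario ()
  "Replay the M1/M2 test case above; returns the naive and stack-aware verdicts."
  (let ((m1 (make-mlock :name "M1"))
        (m2 (make-mlock :name "M2"))
        (t1 (make-mthread :name "T1"))
        (t2 (make-mthread :name "T2")))
    (begin-wait t1 m1) (finish-wait t1 m1) ; T1 takes M1
    (begin-wait t2 m1)                     ; T2 blocks on M1
    (begin-wait t2 m2) (finish-wait t2 m2) ; T2 interrupted, takes M2
    (begin-wait t1 m2)                     ; T1 interrupted, blocks on M2
    (values (deadlock-p t1 nil) (deadlock-p t1 t))))
```

(interrupted-waiter-scenario) returns T and NIL: with only the restored slot the check sees T1 -> M2 -> T2 -> M1 -> T1 and reports a cycle, while with the recorded stack depth it stops at T2, because T2's wait for M1 is older than its acquisition of M2. A genuine two-lock deadlock, where each thread starts its wait after taking its lock, is still reported in both modes.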