rare lockups/failures on (:SEMAPHORE-NOTIFICATION :WAIT-ON-SEMAPHORE) test

Bug #1038034 reported by 3b on 2012-08-17
16
This bug affects 3 people
Affects Status Importance Assigned to Milestone
SBCL
Undecided
Unassigned

Bug Description

The test (:SEMAPHORE-NOTIFICATION :WAIT-ON-SEMAPHORE) in threads.pure.lisp locks up a few % of the time on 1.0.54, 1.0.57, 1.0.58, 1.0.58.11, on x86-84 linux.

More rarely (maybe 0.2%) it fails with "The assertion (= N (+ K (LENGTH SAFE))) failed.", seen on 1.0.54, 1.0.58, 1.0.58.11

When it locks up, it seems to be consistently trying to JOIN-THREAD thread T99 (in the 'safe' list), and gets stuck waiting on "thread result lock" mutex owned by thread T99, which is still running. Partial stack trace after Ctrl-C on locked test:

16: ("foreign function: interrupt_handle_now")
17: ("foreign function: #x413CCE")
18: ("bogus stack frame")
19: (SB-THREAD::%%WAIT-FOR-MUTEX
     #<unavailable argument>
     #<unavailable argument>
     #<unavailable argument>
     #<unavailable argument>
     #<unavailable argument>
     #<unavailable argument>)
20: (SB-THREAD::%WAIT-FOR-MUTEX
     #<MUTEX "thread result lock" owner: #<THREAD "T99" RUNNING {1008B81273}>>
     #<THREAD "main thread" RUNNING {10036DA053}>
     NIL
     NIL
     NIL
     NIL
     NIL
     NIL)
21: ((FLET #:WITHOUT-INTERRUPTS-BODY-239723 :IN JOIN-THREAD))
22: (JOIN-THREAD #<THREAD "T99" RUNNING {1008B81273}> :DEFAULT NIL :TIMEOUT NIL)

Jan Moringen (scymtym) wrote :

The attached patch removes one source of occasional failures and adds a description to the unit test.

The test previously used thread return values to reason about successful semaphore acquisitions. However, since a subset of the threads are terminated asynchronously via TERMINATE-THREAD, thread return values can be lost at any point. The patch records successful acquisitions by collecting them onto a list which is protected by the semaphore itself (protects against concurrent modification) and WITHOUT-INTERRUPTS (protects against TERMINATE-THREAD).

This patch will not make the test succeed unconditionally since a lost wakeup problem leading to deadlocks, also due to TERMINATE-THREAD, remains. Another patch may address that problem.

Jan Moringen (scymtym) wrote :

Updated patch without debug output (sorry).

tags: added: review tests threads

Confirmed deadlock on clean SBCL 1.1.10.24-9304704 , with and without Jan's patch.
If it happens, the stuck thread looks like this:

Backtrace for: #<SB-THREAD:THREAD "T99" RUNNING {10045CE7C3}>
0: ("bogus stack frame")
1: ((FLET #:WITHOUT-INTERRUPTS-BODY-575 :IN SB-THREAD:CONDITION-WAIT))
2: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE {10043774D3}> #<SB-THREAD:MUTEX (free) {10043774A3}> :TIMEOUT NIL)
3: ((FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD:WAIT-ON-SEMAPHORE))
4: ((FLET #:WITHOUT-INTERRUPTS-BODY-520 :IN SB-THREAD::CALL-WITH-SYSTEM-MUTEX/ALLOW-WITH-INTERRUPTS))
5: (SB-THREAD::CALL-WITH-SYSTEM-MUTEX/ALLOW-WITH-INTERRUPTS #<CLOSURE (FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN SB-THREAD:WAIT-ON-SEMAPHORE) {7FFFDC4FE0CB}> #<SB-THREAD:MUTEX (free) {10043774A3}>)
6: (SB-THREAD:WAIT-ON-SEMAPHORE #S(SB-THREAD:SEMAPHORE :NAME NIL :%COUNT 1 :WAITCOUNT 40 :MUTEX #<SB-THREAD:MUTEX (free) {10043774A3}> :QUEUE #<SB-THREAD:WAITQUEUE {10043774D3}>) :TIMEOUT NIL :NOTIFICATION #S(SB-THREAD:SEMAPHORE-NOTIFICATION :%STATUS NIL))
7: ((FLET #:WITHOUT-INTERRUPTS-BODY-2 :IN #:G804))
8: ((FLET THREAD-TEST::CRITICAL :IN #:G804))
9: ((FLET THREAD-TEST::CRITICAL :IN #:G804)) [external]
10: ((FLET #:WITHOUT-INTERRUPTS-BODY-1120 :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
11: ((FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE))
12: ((FLET #:WITHOUT-INTERRUPTS-BODY-537 :IN SB-THREAD::CALL-WITH-MUTEX))
13: (SB-THREAD::CALL-WITH-MUTEX #<CLOSURE (FLET SB-THREAD::WITH-MUTEX-THUNK :IN SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE) {7FFFDC4FEC7B}> #<SB-THREAD:MUTEX "thread result lock" owner: #<SB-THREAD:THREAD "T99" RUNNING {10045CE7C3}>> NIL T NIL)
14: (SB-THREAD::INITIAL-THREAD-FUNCTION-TRAMPOLINE #<SB-THREAD:THREAD "T99" RUNNING {10045CE7C3}> #S(SB-THREAD:SEMAPHORE :NAME "Thread setup semaphore" :%COUNT 0 :WAITCOUNT 0 :MUTEX #<SB-THREAD:MUTEX (free) {10045CE833}> :QUEUE #<SB-THREAD:WAITQUEUE {10045CE863}>) #<CLOSURE (FLET THREAD-TEST::CRITICAL :IN #:G804) {100437756B}> NIL NIL NIL NIL)
15: ("foreign function: call_into_lisp")
16: ("foreign function: new_thread_trampoline")

Attached a variant of Jan's version of the wait-on-semaphore test which can be run as a script.
$ cd tests
$ while sbcl --script wait-on-semaphore.script.lisp ; do true ; done

Often the problem is triggered within the first 10 runs.

For me, line 4 of these random-states is a pretty reliable trigger:
tests$ sbcl --script wait-on-semaphore.script.lisp 4
::: Running (:SEMAPHORE-NOTIFICATION :WAIT-ON-SEMAPHORE)
.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
[ 0.1s] 33 safe remaining: T35, T37, T39, T41, T43, T45, T47, T49, T51, T53, T55, T57, T59, T61, T63, T65, T67, T69, T71, T73, T75, T77, T79, T81, T83, T85, T87, T89, T91, T93, T95, T97, T99
Backtrace for: #<SB-THREAD:THREAD "T35" RUNNING {10034138E3}>
0: ("bogus stack frame")
1: ((FLET #:WITHOUT-INTERRUPTS-BODY-575 :IN SB-THREAD:CONDITION-WAIT))
2: (SB-THREAD:CONDITION-WAIT #<SB-THREAD:WAITQUEUE {1003373623}> #<SB-THREAD:MUTEX (free) {10033735F3}> :TIMEOUT NIL)
[...]

Now with sleep-times recording & replay, and more debug output.

Download full text (4.1 KiB)

These are some example sleep-times, mostly generated by replaying the random-states posted above.

*** *** *** Example 1:
tests$ sbcl --script wait-on-semaphore.script.lisp s4
::: Running (:SEMAPHORE-NOTIFICATION :WAIT-ON-SEMAPHORE)
.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
[ 0.1s] 48 safe remaining: T5, T7, T9, T11, T13, T15, T17, T19, T21, T23, T25, T27, T29, T31, T33, T35, T37, T39, T41, T43, T45, T47, T49, T51, T53, T55, T57, T59, T61, T63, T65, T67, T69, T71, T73, T75, T77, T79, T81, T83, T85, T87, T89, T91, T93, T95, T97, T99
*** Notes:
(#<THREAD "T1" FINISHED values: 1, 0.028918386 {1003E3B043}> 0 691)
(#<THREAD "T2" ABORTED {1003E3B413}> 1 720)
(#<THREAD "T3" FINISHED values: 2, 0.06313878 {1003E3B7E3}> 1 736)
*** Inc's:
(#<THREAD "T1" FINISHED values: 1, 0.028918386 {1003E3B043}> 0 720)
(#<THREAD "T3" FINISHED values: 2, 0.06313878 {1003E3B7E3}> 1 799)
*** Kills:
(#<THREAD "T2" ABORTED {1003E3B413}> NIL 736 736)
(#<THREAD "T4" ABORTED {1003E3BBB3}> NIL 799 799)
(#<THREAD "T6" ABORTED {1003E3C353}> NIL 875 875)
(#<THREAD "T8" ABORTED {1003E3CAF3}> NIL 929 929)
[...]
Backtrace for: #<SB-THREAD:THREAD "T5" RUNNING {1003E3BF83}>

*** *** *** Example 2:
tests$ sbcl --script wait-on-semaphore.script.lisp s2
::: Running (:SEMAPHORE-NOTIFICATION :WAIT-ON-SEMAPHORE)
.E.E.E.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
[ 0.1s] 42 safe remaining: T17, T19, T21, T23, T25, T27, T29, T31, T33, T35, T37, T39, T41, T43, T45, T47, T49, T51, T53, T55, T57, T59, T61, T63, T65, T67, T69, T71, T73, T75, T77, T79, T81, T83, T85, T87, T89, T91, T93, T95, T97, T99
*** Notes:
(#<THREAD "T1" FINISHED values: 1, 0.024494136 {1003E2B043}> 0 699)
(#<THREAD "T2" FINISHED values: 2, 0.0063797 {1003E2B413}> 1 723)
(#<THREAD "T3" FINISHED values: 3, 0.09456768 {1003E2B7E3}> 2 730)
(#<THREAD "T4" FINISHED values: 4, 0.007060051 {1003E2BBB3}> 3 825)
(#<THREAD "T5" FINISHED values: 5, 0.045601904 {1003E2BF83}> 4 832)
(#<THREAD "T6" FINISHED values: 6, 0.02261814 {1003E2C353}> 5 878)
(#<THREAD "T7" FINISHED values: 7, 0.097944714 {1003E2C723}> 6 901)
(#<THREAD "T9" FINISHED values: 8, 0.049314786 {1003E2CEC3}> 7 999)
(#<THREAD "T11" FINISHED values: 9, 0.009190631 {1003E2D663}> 8 1048)
(#<THREAD "T12" ABORTED {1003E2DA33}> 9 1058)
(#<THREAD "T13" FINISHED values: 10, 0.04782511 {1003E2DE03}> 9 1066)
(#<THREAD "T14" ABORTED {1003E2E1D3}> 10 1114)
(#<THREAD "T15" FINISHED values: 11, 0.042519607 {1003E2E5A3}> 10 1126)
*** Inc's:
(#<THREAD "T1" FINISHED values: 1, 0.024494136 {1003E2B043}> 0 723)
(#<THREAD "T2" FINISHED values: 2, 0.0063797 {1003E2B413}> 1 730)
(#<THREAD "T3" FINISHED values: 3, 0.09456768 {1003E2B7E3}> 2 825)
(#<THREAD "T4" FINISHED values: 4, 0.007060051 {1003E2BBB3}> 3 832)
(#<THREAD "T5" FINISHED values: 5, 0.045601904 {1003E2BF83}> 4 878)
(#<THREAD "T6" FINISHED values: 6, 0.02261814 {1003E2C353}> 5 900)
(#<THREAD "T7" FINISHED values: 7, 0.097944714 {1003E2C723}> 6 999)
(#<THREAD "T9" FINISHED values: 8, 0.049314786 {1003E2CEC3}> 7 1048)
(#<THREAD "T11" FINISHED values: 9, 0.009190631 {1003E2D663}> 8 1058)
(#<THREAD "T13" FINISHED values: 10, 0...

Read more...

Jan Moringen (scymtym) wrote :

The attached patch adds wakeups in case CONDITION-WAIT is async unwound.

This makes CONDITION-WAIT more forgiving when used from user code which is not prepared for async unwinds.

With this patch, Andreas' script did not fail after a few hundred repetitions. Without the patch, the script usually fails within fewer than 50 repetitions.

(Just seeing the above patch now. Will try it out now...)

The attached standalone testcase just uses three threads:
we try to terminate T2 around the time when T1 exists,
this way, T3 is never woken up.

tests$ while sbcl --script wait-on-semaphore.script.lisp ; do true ; done
::: Running (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
.-.
::: Success (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
::: Running (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
.-.
[ 0.1s] 1 safe remaining: T3
[ 0.2s] 0 safe remaining
::: Success (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
::: Running (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
.-.
::: Success (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
::: Running (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
.-.
::: Success (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
::: Running (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
.-.
[ 0.1s] 1 safe remaining: T3
*** Notes:
("T1" 0 71)
*** Inc's:
("T1" 0 81)
*** Kills:
("T2" T 81 81)
Backtrace for: #<THREAD "T3" RUNNING {1003694443}>
0: ("bogus stack frame")
1: ((FLET #:WITHOUT-INTERRUPTS-BODY-575 :IN CONDITION-WAIT))
2: (CONDITION-WAIT #<WAITQUEUE {1003693953}> #<MUTEX (free) {1003693923}> :TIMEOUT NIL)
3: ((FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN WAIT-ON-SEMAPHORE))

Built latest sbcl with above patch ok, with normal test suite ok.
My impression is that the frequency of this bug occurring
has been reduced significantly by the patch. That said,
it still occurs pretty reliably for me within less than 100 runs:

for ((i=0 ; ++i ; )) ; do echo -n $i ; env SBCL_HOME=`pwd`/lib/sbcl ./bin/sbcl --script ~/wait-on-semaphore.script.lisp || break ; done
1::: Running (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
.-.
::: Success (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
[...]
22::: Running (SEMAPHORE-NOTIFICATION WAIT-ON-SEMAPHORE)
.-.
[ 0.1s] 1 safe remaining: T3
*** Notes:
("T1" 0 72)
*** Inc's:
("T1" 0 82)
*** Kills:
("T2" T 82 82)
Backtrace for: #<THREAD "T3" RUNNING {10036B8273}>
0: ("bogus stack frame")
1: ((FLET #:WITHOUT-INTERRUPTS-BODY-575 :IN CONDITION-WAIT))
2: (CONDITION-WAIT #<WAITQUEUE {10036A7713}> #<MUTEX (free) {10036A76E3}> :TIMEOUT NIL)
3: ((FLET SB-THREAD::WITH-SYSTEM-MUTEX-THUNK :IN WAIT-ON-SEMAPHORE))

The problem seems to be that that we're not finishing the :ok case in the cleanup form:

 ...CWAIT3 = 7, CWAIT4 = 7, CLEANUP1 = 7, OK1 = 7, OK2 = 6, INTERRUPTED = 0, WAKEUP = 0, OTHER = 0, CLEANUP2 = 6

(in the successful cases we usually have one of these:
 ...CWAIT3 = 7, CWAIT4 = 7, CLEANUP1 = 7, OK1 = 7, OK2 = 7, INTERRUPTED = 0, WAKEUP = 0, OTHER = 0, CLEANUP2 = 7
 ...CWAIT3 = 7, CWAIT4 = 6, CLEANUP1 = 7, OK1 = 6, OK2 = 6, INTERRUPTED = 1, WAKEUP = 1, OTHER = 0, CLEANUP2 = 7
 )

Maybe we need another unwind-protect to catch terminations hitting during the cleanup forms?

The attached testcase seems to avoid the bug for me so far.

It includes an additional unwind-protect in the :ok mutex-aquisition case,
and calling ensure-wakeup if necessary.

In addition, it uses :wait-p t in the mutex-acquisition during ensure-wakeup.
Without that, I could still see occasional hangs like these:

("T2" :OK :SUCCESS NIL :MUTEX-STATE 1 :OWNER "T1" :ALIVE-P T)
("T2" :OK-DID-NOT-GET-MUTEX :MUTEX-STATE 1 :OWNER "T1" :ALIVE-P T)
 CWAIT1 = 7, CWAIT2 = 7, CWAIT3 = 7, CWAIT4 = 7, CLEANUP1 = 7, OK1 = 7, OK1A = 1, OK1B = 1, OK2 = 6, INTERRUPTED = 0, WAKEUP = 0, OTHER = 0, CLEANUP2 = 6

("T2" :INTERRUPTED :MUTEX-STATE 1 :OWNER "T1" :ALIVE-P T)
("T2" :INTERRUPTED-DID-NOT-GET-MUTEX :MUTEX-STATE 0 :OWNER NIL :ALIVE-P NIL)
 CWAIT1 = 7, CWAIT2 = 7, CWAIT3 = 7, CWAIT4 = 6, CLEANUP1 = 7, OK1 = 6, OK1A = 0, OK1B = 0, OK2 = 6, INTERRUPTED = 1, WAKEUP = 0, OTHER = 0, CLEANUP2 = 7

Jan Moringen (scymtym) wrote :

I did something similar in the attached file and did not see any failures or deadlocks in 22K runs.

I discovered another race which does not seem to be addressed by Andreas' ENSURE-WAKEUP: when %WAIT-FOR-MUTEX succeeds in the :OK case before being interrupted, ENSURE-WAKEUP performs a recursive lock attempt.

Jan, that looks very good to me.

In the :ok (with-unwind-cleanup ...) case, the (sleep .1) is meant just for debugging, isn't it?

There is also alexandria:unwind-protect-case, see http://quickdocs.org/alexandria/api ,
but if it's only used once then adding it to SBCL is probably overkill?

Do you want to finish the patch for commit?

This patch is derived from Jan's attachment, with the (sleep 0.1) in the :ok unwind-cleanup removed.

The attached patch adds a new test which reliably fails for me on current sbcl (1.1.11.39).
It reliably succeeds on sbcl with patched CONDITION-WAIT.

It doesn't touch the existing :wait-on-semaphore test yet; that test should be improved independently:
- by using Jan's patch
- by adding timeouts to prevent hanging
- by initializing the random state before running it (?)
...

I've verified that the previous comment about the new test still holds
when the original blocking of deferrable signals is put back:
          ;; Re-increment on exit if we decremented it.
          (when (semaphore-notification-status note)
- (signal-semaphore sem)))))))
+ (signal-semaphore sem))
+ ;; KLUDGE: Prevent interrupts after this point from
+ ;; unwinding us, so that we can reason about the counts.
+ #+sb-thread
+ (sb-thread::block-deferrable-signals))))))
  ;; Create /parallel/ threads trying to acquire and then release the
  ;; semaphore. Try to asynchronously abort T2 when T1 is exiting.

Paul Khuong (pvk) wrote :

I added the test case in #17 (1f5e30a More reliable test for asynchronous aborts in semaphore operations), but I can't look at the fix yet. FWIW, waiting on a mutex when unwinding doesn't look like a good idea to me.

To post a comment you must log in.