SEMAPHORE-MULTIPLE-WAITERS failure

Bug #332002 reported by Gábor Melis on 2009-02-20
4
Affects Status Importance Assigned to Milestone
SBCL
Undecided
Unassigned

Bug Description

Reported by Robert Brown:

I just compiled SBCL 1.0.25.55 and ran the tests. I'm getting a failure of
test SEMAPHORE-MULTIPLE-WAITERS. My system is a 2 CPU dual core Opteron
running Ubuntu Linux in 64-bit mode. I compiled the same SBCL version on my
netbook, 32-bit Atom running Ubuntu 8.10 and same test passes.

bob

====================

paradicsom [51] sh ./run-tests.sh threads.pure.lisp
/running tests on
'/usr/local/google/local-brown/software/source-trees/sbcl/tests/../src/runtime/sbcl
--core
/usr/local/google/local-brown/software/source-trees/sbcl/tests/../output/sbcl.core
--noinform --no-sysinit --no-userinit --noprint --disable-debugger'
; loading system definition from
;
/usr/local/google/local-brown/software/source-trees/sbcl/contrib/sb-posix/sb-posix.asd
; into #<PACKAGE "ASDF0">
; loading system definition from
;
/usr/local/google/local-brown/software/source-trees/sbcl/contrib/sb-grovel/sb-grovel.asd
; into #<PACKAGE "ASDF1">
; registering #<SYSTEM SB-GROVEL {1002FB15A1}> as SB-GROVEL
; registering #<SYSTEM SB-POSIX {1002F0A581}> as SB-POSIX
; registering #<SYSTEM SB-POSIX-TESTS {10031458F1}> as SB-POSIX-TESTS
STYLE-WARNING: Couldn't grovel for SIGEMT (unknown to the C compiler).
STYLE-WARNING: Couldn't grovel for S_IFWHT (unknown to the C compiler).
STYLE-WARNING: Couldn't grovel for O_NOFOLLOW (unknown to the C compiler).
STYLE-WARNING: Couldn't grovel for O_DIRECTORY (unknown to the C compiler).
STYLE-WARNING: Couldn't grovel for O_DIRECT (unknown to the C compiler).
STYLE-WARNING: Couldn't grovel for O_LARGEFILE (unknown to the C compiler).
// Running pure tests (#<FUNCTION RUN-TESTS::LOAD-TEST>)
// Running
/usr/local/google/local-brown/software/source-trees/sbcl/tests/threads.pure.lisp
::: Running MUTEX-OWNER
::: Success MUTEX-OWNER
::: Running SPINLOCK-OWNER
::: Success SPINLOCK-OWNER
::: Running WITHOUT-INTERRUPTS+CONDITION-WAIT
::: Success WITHOUT-INTERRUPTS+CONDITION-WAIT
::: Running WITHOUT-INTERRUPTS+GET-MUTEX
::: Success WITHOUT-INTERRUPTS+GET-MUTEX
::: Running PARALLEL-FIND-CLASS
::: Success PARALLEL-FIND-CLASS
; in: LAMBDA NIL
; (LET ((THREAD-TEST::W THREAD-TEST::W))
; (SB-THREAD:MAKE-THREAD
; (LAMBDA ()
; (LET (#)
; (DOTIMES # #)))
; :NAME "writer"))
;
; caught STYLE-WARNING:
; The variable W is defined but never used.

; (LET ((THREAD-TEST::R THREAD-TEST::R))
; (SB-THREAD:MAKE-THREAD
; (LAMBDA ()
; (LET (#)
; (DOTIMES # #)))
; :NAME "reader"))
;
; caught STYLE-WARNING:
; The variable R is defined but never used.
;
; compilation unit finished
; caught 2 STYLE-WARNING conditions
::: Running :SEMAPHORE-MULTIPLE-WAITERS
::: UNEXPECTED-FAILURE :SEMAPHORE-MULTIPLE-WAITERS
// Running pure tests (#<FUNCTION RUN-TESTS::CLOAD-TEST>)
// Running impure tests (#<FUNCTION RUN-TESTS::LOAD-TEST>)
// Running impure tests (#<FUNCTION RUN-TESTS::CLOAD-TEST>)
// Running impure tests (#<FUNCTION RUN-TESTS::SH-TEST>)

Finished running tests.
Status:
 Failure: threads.pure.lisp / SEMAPHORE-MULTIPLE-WAITERS
test failed, expected 104 return code, got 1

Gábor Melis (melisgl) wrote :

Further info from Bob:

The Linux kernel version is 2.6.18.5. I believe it is a regression for me
and that SBCL 1.0.24.44 passes the test. Any suggestions for ways to debug
this?

Gábor Melis (melisgl) wrote :

Could you please run the attached file and report the output? Also include the backtrace from the debugger, if any.

Gábor Melis (melisgl) wrote :

Oh, please change the timeout in the above code from 0.1 to 10.

Robert Brown (robert-brown) wrote :
Download full text (33.1 KiB)

I changed the timeout to 10 and changed backtrace to sb-debug:backtrace.

paradicsom <11> cat semaphore-test.lisp
(defparameter *count* 0)

(let ((semaphore (sb-thread:make-semaphore :name "test sem")))
  (labels
      ((make-readers (n i)
         (values
          (loop for r from 0 below n
                collect
                (sb-thread:make-thread
                 (lambda ()
                   (let ((sem semaphore)
                         (*count* 0))
                     (dotimes (s i)
                       (incf *count*)
                       (sb-thread:wait-on-semaphore sem))))
                 :name (format nil "reader ~S" r)))
          (* n i)))
       (make-writers (n readers i)
         (let ((j (* readers i)))
           (multiple-value-bind (k rem) (truncate j n)
             (values
              (let ((writers
                     (loop for w from 0 below n
                           collect
                           (sb-thread:make-thread
                            (lambda ()
                              (let ((sem semaphore)
                                    (*count* 0))
                                (dotimes (s k)
                                  (incf *count*)
                                  (sb-thread:signal-semaphore sem))))
                            :name (format nil "writer ~S" w)))))
                (assert (zerop rem))
                writers)
              (+ rem (* n k))))))
       (test (r w n)
         (format *debug-io* "TEST ~S ~S ~S ~S~%" r w n (get-internal-real-time))
         (force-output t)
         (multiple-value-bind (readers x) (make-readers r n)
           (assert (= (length readers) r))
           (multiple-value-bind (writers y) (make-writers w r n)
             (assert (= (length writers) w))
             (assert (= x y))
             (handler-case
                 (sb-ext:with-timeout 10.0
                   (mapc #'sb-thread:join-thread writers)
                   (mapc #'sb-thread:join-thread readers))
               (sb-ext:timeout ()
                 (mapc (lambda (thread)
                         (format *debug-io* "BACKTRACE FOR ~A~%" thread)
                         (handler-case
                             (sb-thread:interrupt-thread
                              thread
                              (lambda ()
                                (format *debug-io* "Count=~S~%" *count*)
                                (sb-debug:backtrace)))
                           (sb-thread:interrupt-thread-error (c)
                             (declare (ignore c))
                             (format *debug-io*
                                     "Thread exited in the meantime ...~%")))
                         (sleep 0.1))
                       (append readers writers)))))
           (assert (zerop (sb-thread:semaphore-count semaphore)))
           (values))))
    (test 1 1 100)
    (test 2 2 10000)
    (test 4 2 10000)
    (test 4 2 10000)
    (test 10 10 10000)
    (test 10 1 10000)))

paradicsom <12> sbcl
This is SBCL 1.0.25.55, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with...

Robert Brown (robert-brown) wrote :

I ran the test code using sbcl 1.0.24.44. It consistently passes.
The failure rate with sbcl 1.0.25.55 is about 50 percent.

Gábor Melis (melisgl) wrote :

It consistently passes for me.

However, I found a few related issues and made fixes available on the waitqueue branch [2] of my git tree [1].

The test that's added in the "less pessimal waitqueues" commit displays rather erratic behaviour on my system: it finishes in 2s or it hangs for a few seconds here and there and then continues. The strange thing is that it can affect the system to the point where X becomes unresponsive, and even switching between character based virtual terminals is impossible. Then it unwedges and life goes on. Is this similar behaviour to what you observe?

I suspect what I see is kernel issue, mine is 2.6.26. Reading [3] made me add the FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE flags to the relevant calls but they didn't help much, save for measurably reducing contention between mmap and futex.

Could you please test this branch? The last commit (the futex private one) may not work on your kernel, use the one before it.

[1] http://quotenil.com/git/sbcl.git
[2] http://quotenil.com/git/?p=sbcl.git;a=shortlog;h=waitqueue
[3] http://blogs.sun.com/roland/entry/real_time_java_and_futexes

Download full text (4.7 KiB)

I just noticed your email. I'll give it a try later tonight if I have the time.

bob

==========

On Mon, Feb 23, 2009 at 10:20 AM, Gábor Melis <email address hidden> wrote:
> It consistently passes for me.
>
> However, I found a few related issues and made fixes available on the
> waitqueue branch [2] of my git tree [1].
>
> The test that's added in the "less pessimal waitqueues" commit displays
> rather erratic behaviour on my system: it finishes in 2s or it hangs for
> a few seconds here and there and then continues. The strange thing is
> that it can affect the system to the point where X becomes unresponsive,
> and even switching between character based virtual terminals is
> impossible. Then it unwedges and life goes on. Is this similar behaviour
> to what you observe?
>
> I suspect what I see is kernel issue, mine is 2.6.26. Reading [3] made
> me add the FUTEX_WAIT_PRIVATE, FUTEX_WAKE_PRIVATE flags to the relevant
> calls but they didn't help much, save for measurably reducing contention
> between mmap and futex.
>
> Could you please test this branch? The last commit (the futex private
> one) may not work on your kernel, use the one before it.
>
> [1] http://quotenil.com/git/sbcl.git
> [2] http://quotenil.com/git/?p=sbcl.git;a=shortlog;h=waitqueue
> [3] http://blogs.sun.com/roland/entry/real_time_java_and_futexes
>
> --
> SEMAPHORE-MULTIPLE-WAITERS failure
> https://bugs.launchpad.net/bugs/332002
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Steel Bank Common Lisp: New
>
> Bug description:
> Reported by Robert Brown:
>
> I just compiled SBCL 1.0.25.55 and ran the tests. I'm getting a failure of
> test SEMAPHORE-MULTIPLE-WAITERS. My system is a 2 CPU dual core Opteron
> running Ubuntu Linux in 64-bit mode. I compiled the same SBCL version on my
> netbook, 32-bit Atom running Ubuntu 8.10 and same test passes.
>
> bob
>
> ====================
>
> paradicsom [51] sh ./run-tests.sh threads.pure.lisp
> /running tests on
> '/usr/local/google/local-brown/software/source-trees/sbcl/tests/../src/runtime/sbcl
> --core
> /usr/local/google/local-brown/software/source-trees/sbcl/tests/../output/sbcl.core
> --noinform --no-sysinit --no-userinit --noprint --disable-debugger'
> ; loading system definition from
> ;
> /usr/local/google/local-brown/software/source-trees/sbcl/contrib/sb-posix/sb-posix.asd
> ; into #<PACKAGE "ASDF0">
> ; loading system definition from
> ;
> /usr/local/google/local-brown/software/source-trees/sbcl/contrib/sb-grovel/sb-grovel.asd
> ; into #<PACKAGE "ASDF1">
> ; registering #<SYSTEM SB-GROVEL {1002FB15A1}> as SB-GROVEL
> ; registering #<SYSTEM SB-POSIX {1002F0A581}> as SB-POSIX
> ; registering #<SYSTEM SB-POSIX-TESTS {10031458F1}> as SB-POSIX-TESTS
> STYLE-WARNING: Couldn't grovel for SIGEMT (unknown to the C compiler).
> STYLE-WARNING: Couldn't grovel for S_IFWHT (unknown to the C compiler).
> STYLE-WARNING: Couldn't grovel for O_NOFOLLOW (unknown to the C compiler).
> STYLE-WARNING: Couldn't grovel for O_DIRECTORY (unknown to the C compiler).
> STYLE-WARNING: Couldn't grovel for O_DIRECT (unknown to the C compiler).
> STYLE-WARNING: Couldn't grovel for O_LARGEFILE (unknown...

Read more...

Gábor Melis (melisgl) wrote :

I committed the stuff referenced above as 1.0.26.4 - 1.0.26.6. In all likelihood, it won't solve your problems, but it would still be interesting to see how the test added in "less pessimal waitqueues" does on your system.

Robert Brown (robert-brown) wrote :

2009/3/16 Gábor Melis <email address hidden>:
> I committed the stuff referenced above as 1.0.26.4 - 1.0.26.6. In all
> likelihood, it won't solve your problems, but it would still be
> interesting to see how the test added in "less pessimal waitqueues" does
> on your system.

I recently tried the semaphore-test you emailed to me again on
sbcl 1.0.25.56, which was the release I initially complained about
as being broken for me. Surprisingly, although my workstation had
not been rebooted since the last runs and the sbcl binary had not
changed, I had to run the test about 10 times before I saw a
failure. Previously, about half the runs would fail for me.

Anyway, I built sbcl 1.0.26.8 and tried the test with that version.
It seems to work reliably now and generally the test takes a lot
less time to run than it used to take with 1.0.25.56. I have not yet
seen it fail.

bob

Gábor Melis (melisgl) wrote :

On Jueves 19 Marzo 2009, Robert Brown wrote:
> 2009/3/16 Gábor Melis <email address hidden>:
> > I committed the stuff referenced above as 1.0.26.4 - 1.0.26.6. In
> > all likelihood, it won't solve your problems, but it would still be
> > interesting to see how the test added in "less pessimal waitqueues"
> > does on your system.
>
> I recently tried the semaphore-test you emailed to me again on
> sbcl 1.0.25.56, which was the release I initially complained about
> as being broken for me. Surprisingly, although my workstation had
> not been rebooted since the last runs and the sbcl binary had not
> changed, I had to run the test about 10 times before I saw a
> failure. Previously, about half the runs would fail for me.
>
> Anyway, I built sbcl 1.0.26.8 and tried the test with that version.
> It seems to work reliably now and generally the test takes a lot
> less time to run than it used to take with 1.0.25.56. I have not yet
> seen it fail.
>
> bob

I'm relieved to hear that, although I suspect it's not the end of the
story as the test that I added in "less pessimal waitqueues" really
stops my whole machine for long seconds ...

Nikodemus Siivola (nikodemus) wrote :

What's the current status of this?

Gábor Melis (melisgl) wrote :

Well, Bob seems to be content with the changes but I strongly suspect that there is a Linux kernel bug as my system (Linux 2.6.26-1-686 #1 SMP) still wedges/unwedges as mentioned above when running the waitqueue tests.

If the problem persists with more recent kernels then we need a C+pthread test case for the kernel people.

Nikodemus Siivola (nikodemus) wrote :

So, keep as new till you or someone else has a chance to try a more recent kernel?

Gábor Melis (melisgl) wrote :

I guess so ;-). I've tried linux 2.6.30 with sbcl 1.0.25 and 1.0.33 on a two cpu box, plus linux 2.6.27.29 with sbcl 1.0.33 on a 24 cpu box and the test no longer bring the system to its knees. I was reasonably confident that it's a kernel issue; with this - admittedly weak - evidence I see no other option than to close this bug.

Changed in sbcl:
status: New → Invalid
Tobias C. Rittweiler (tcr) wrote :

As we know now, the "less pessimal waitqueues" patch introduced
a possible lost wakeup.

With Nikodemus' patch, this issue should perhaps be revisited?

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers