The advertised usage of condition-wait may never time out

Bug #1760827 reported by Siebe de Vos on 2018-04-03

Bug Description

Code using WAIT-ON-GATE with :TIMEOUT did not return in time when there was a lot of GC activity.

This can be traced back to using CONDITION-WAIT the way its documentation advertises. When CONDITION-WAIT returns because of a spurious interrupt, the original timeout is reused in the next iteration. If interrupts arrive at intervals shorter than TIMEOUT, each wait is restarted before it can expire, so WAIT-ON-GATE may never return.
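The mechanism can be sketched with a plain waitqueue. This is a minimal illustration, not the WAIT-ON-GATE source: *LOCK*, *QUEUE*, *DATA*, NOW-SECONDS and both POP-DATA functions are names made up for the example, and the first function is only modeled on the kind of loop the documentation suggests. The second variant fixes an absolute deadline once and passes only the time that is left to each wait.

```lisp
;; Sketch only: all names defined here are illustrative, not SBCL APIs.
(defvar *lock* (sb-thread:make-mutex :name "data lock"))
(defvar *queue* (sb-thread:make-waitqueue :name "data queue"))
(defvar *data* '())

(defun pop-data/naive (timeout)
  "Every wakeup that finds *DATA* empty starts a fresh wait of TIMEOUT seconds."
  (sb-thread:with-mutex (*lock*)
    (loop until *data*
          do (or (sb-thread:condition-wait *queue* *lock* :timeout timeout)
                 ;; Timed out: leave without touching *DATA*.
                 (return-from pop-data/naive nil)))
    (pop *data*)))

(defun now-seconds ()
  "Current reading of the internal real-time clock, in seconds."
  (/ (get-internal-real-time)
     (float internal-time-units-per-second 1d0)))

(defun pop-data/deadline (timeout)
  "Compute the deadline once, then wait only for the time that remains."
  (let ((deadline (+ (now-seconds) timeout)))
    (sb-thread:with-mutex (*lock*)
      (loop until *data*
            do (let ((remaining (- deadline (now-seconds))))
                 ;; Give up when the deadline has passed or the wait timed out.
                 (unless (and (plusp remaining)
                              (sb-thread:condition-wait
                               *queue* *lock* :timeout remaining))
                   (return-from pop-data/deadline nil))))
      (pop *data*))))
```

Under frequent spurious wakeups, POP-DATA/NAIVE can keep restarting its wait indefinitely, whereas POP-DATA/DEADLINE should give up roughly TIMEOUT seconds after it was called, however often it is woken.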

A simple example, using GC to trigger interrupts:

(require :sb-concurrency)

(defun test-condition-wait (&key (gc-p t) (timeout 1))
  "Expect to be timed out after TIMEOUT. When GC-P is true, generate lots of
interrupts, otherwise do nothing while waiting."
  (let ((gate (sb-concurrency:make-gate))
        (stop nil))
    (flet ((_waiter ()
             ;; We expect to be timed out:
             (sb-concurrency:wait-on-gate gate :timeout timeout)
             (setf stop t)))
      (let ((waiter (sb-thread:make-thread #'_waiter)))
        ;; Do something until stopped by WAITER after TIMEOUT.
        (unwind-protect
            (loop
              (when gc-p (sb-ext:gc))
              (sleep (/ timeout 2))
              (when stop (return)))
          (when (sb-thread:thread-alive-p waiter)
            (sb-thread:terminate-thread waiter)))))))

* (test-condition-wait :gc-p nil)
;; after one second:

* (test-condition-wait)
;; does not return...

This is SBCL 1.4.5, an implementation of ANSI Common Lisp.
Linux si2l 4.4.92-31-default #1 SMP Sun Oct 22 06:56:24 UTC 2017 (1d80e8a) x86_64 x86_64 x86_64 GNU/Linux


Siebe de Vos wrote:

Patch #1 is naive: it duplicates work already done in %CONDITION-WAIT, and it may not solve the problem at every level. For example, the FIXME in %WAIT-FOR-MUTEX could probably lead to a similar issue.

A serious patch would use the remaining-time values computed and returned by %CONDITION-WAIT and the lower-level calls.

Timeouts probably have to be treated like deadlines, with both an absolute and a relative component. Merging the timeout and deadline concepts might simplify some code.
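One way to picture a timeout with both components is sketched below in portable Common Lisp. Everything here is hypothetical (WAIT-DEADLINE, DEADLINE-FROM-TIMEOUT, DEADLINE-REMAINING and WAIT-WITH-DEADLINE are not SBCL functions). The absolute part is stored once, and the relative part is recomputed before every wait, so an early return from the wait function cannot extend the total waiting time.

```lisp
;; Hypothetical names throughout; this is a sketch, not SBCL internals.
(defun now-seconds ()
  "Current reading of the internal real-time clock, in seconds."
  (/ (get-internal-real-time)
     (float internal-time-units-per-second 1d0)))

(defstruct (wait-deadline (:constructor %make-wait-deadline (expires-at)))
  ;; Absolute component: the clock reading at which waiting must stop.
  (expires-at 0d0 :type double-float :read-only t))

(defun deadline-from-timeout (timeout)
  "Turn a relative TIMEOUT in seconds into an absolute deadline."
  (%make-wait-deadline (+ (now-seconds) timeout)))

(defun deadline-remaining (deadline)
  "Relative component: seconds left until DEADLINE, never negative."
  (max 0d0 (- (wait-deadline-expires-at deadline) (now-seconds))))

(defun wait-with-deadline (deadline wait-fn)
  "Call WAIT-FN with the remaining seconds until it returns true or DEADLINE
passes. WAIT-FN may return NIL early, as after a spurious wakeup."
  (loop
    (let ((remaining (deadline-remaining deadline)))
      (when (zerop remaining)
        (return nil))
      (let ((result (funcall wait-fn remaining)))
        (when result
          (return result))))))
```

A caller would create the deadline once and hand WAIT-WITH-DEADLINE a closure that performs a single blocking wait with the given number of seconds. Nested waits could share the same deadline object instead of each restarting its own timeout.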
