The advertised usage of condition-wait may never timeout

Bug #1760827 reported by Siebe de Vos
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
SBCL
New
Undecided
Unassigned

Bug Description

Code using WAIT-ON-GATE with :TIMEOUT did not return in time under situations with lots of GC.

This can be traced back to the use of CONDITION-WAIT according to the documentation. When CONDITION-WAIT returns due to a spurious interrupt, the original timeout is reused in the next iteration. If the period of interrupts is shorter than TIMEOUT, this means that WAIT-ON-GATE may never return.

A simple example, using GC to trigger interrupts:

(require :sb-concurrency)

(defun test-condition-wait (&key (gc-p t) (timeout 1))
  "Expect to be timed out after TIMEOUT. When GC-P is true, generate lots of
interrupts, otherwise do nothing while waiting."
  (let ((gate (sb-concurrency:make-gate))
        (stop nil))
    (flet ((_waiter ()
             ;; We expect to be timed out:
             (sb-concurrency:wait-on-gate gate :timeout timeout)
             (setf stop t)))
      (let ((waiter (sb-thread:make-thread #'_waiter)))
        ;; Do something until stopped by WAITER after TIMEOUT.
        (unwind-protect
            (loop
              (when gc-p (sb-ext:gc))
              (sleep (/ timeout 2))
              (when stop (return)))
          (when (sb-thread:thread-alive-p waiter)
            (sb-thread:terminate-thread waiter)))))))

* (test-condition-wait :gc-p nil)
;; after one second:
NIL

* (test-condition-wait)
;; does not return...

This is SBCL 1.4.5, an implementation of ANSI Common Lisp.
Linux si2l 4.4.92-31-default #1 SMP Sun Oct 22 06:56:24 UTC 2017 (1d80e8a) x86_64 x86_64 x86_64 GNU/Linux

(:64-BIT :64-BIT-REGISTERS :ALIEN-CALLBACKS :ANSI-CL :ASH-RIGHT-VOPS
 :C-STACK-IS-CONTROL-STACK :CALL-SYMBOL :COMMON-LISP :COMPACT-INSTANCE-HEADER
 :COMPARE-AND-SWAP-VOPS :COMPLEX-FLOAT-VOPS :CYCLE-COUNTER :ELF :FLOAT-EQL-VOPS
 :FP-AND-PC-STANDARD-SAVE :GCC-TLS :GENCGC :IEEE-FLOATING-POINT :IMMOBILE-CODE
 :IMMOBILE-SPACE :INLINE-CONSTANTS :INTEGER-EQL-VOP :LARGEFILE :LINKAGE-TABLE
 :LINUX :LITTLE-ENDIAN :MEMORY-BARRIER-VOPS :MULTIPLY-HIGH-VOPS
 :OS-PROVIDES-BLKSIZE-T :OS-PROVIDES-DLADDR :OS-PROVIDES-DLOPEN
 :OS-PROVIDES-GETPROTOBY-R :OS-PROVIDES-POLL :OS-PROVIDES-PUTWC
 :OS-PROVIDES-SUSECONDS-T :PACKAGE-LOCAL-NICKNAMES :RAW-INSTANCE-INIT-VOPS
 :RAW-SIGNED-WORD :RELOCATABLE-HEAP :SB-DOC :SB-EVAL :SB-FUTEX :SB-LDB
 :SB-PACKAGE-LOCKS :SB-SIMD-PACK :SB-SOURCE-LOCATIONS :SB-THREAD :SB-UNICODE
 :SBCL :STACK-ALLOCATABLE-CLOSURES :STACK-ALLOCATABLE-FIXED-OBJECTS
 :STACK-ALLOCATABLE-LISTS :STACK-ALLOCATABLE-VECTORS
 :STACK-GROWS-DOWNWARD-NOT-UPWARD :SYMBOL-INFO-VOPS :UNBIND-N-VOP
 :UNDEFINED-FUN-RESTARTS :UNIX :UNWIND-TO-FRAME-AND-CALL-VOP :X86-64)

Revision history for this message
Siebe de Vos (s.de.vos) wrote :
Revision history for this message
Siebe de Vos (s.de.vos) wrote :

The patch #1 is naive because it is doing some work already done in %CONDITION-WAIT and because it may not be solving some on all levels. For example, the FIXME in %WAIT-FOR-MUTEX could probably lead to a similar issue.

A serious patch will use the remaining time values computed and returned by %CONDITION-WAIT and lower-level calls.

Probably timeouts have to be treated like deadlines, having both an absolute and relative component. Merging the timeout and deadline concepts might simplify some code.

Revision history for this message
guenthert (guenthert) wrote :

   Well, there are of course multiple ways to address this shortcoming. One approach compatible with the already published interface would optionally accept an timeout type (to be specified) argument instead of the timeout value. The former, as long as it isn't a trivial type, could then be modified within the CONDITION-WAIT function. Not exactly pretty, not easy to get the remaining-time value correct and not very FP-like, but possible (at least with approximate remaining-time).

   I'd argue however, that the problem is with the specification as much as with the implementation. In the manual [1] it is stated that these functions are "based on the POSIX condition variable design". Posix' pthread_cond_timedwait() however takes an "abstime" argument, which specifies the timeout as point-in-time (deadline), thereby bypassing the awkward remaining-time problem. I'd like to see that implemented. I'd dare to create a patch, if there's consens that this shall be the way forward and a change of published interface is deemed acceptable.

   Perhaps best might be to add a new function COND-TIMEDWAIT with a deadline argument and deprecate the timeout argument of COND-WAIT.

Revision history for this message
Douglas Katzman (dougk) wrote :

I'm in favor of adding new interfaces and deprecating the old if we can clean things up.
To me that would mean eradicating any interaction with *DEADLINE* which is not a thing in the POSIX model, removing the distinction between ':waitp nil' and ':timeout 0' in grab-mutex, and removing the NOTIFICATION thing from every semaphore operation that currently accepts it.
(And maybe some other things I didn't mention)
The result would be a leaner synchronization API patterned on just the underlying layer (POSIX sync primitives) even though completely reimplemented in Lisp. And it would reconcile the programming model with other languages that don't have any of this old baggage and seem to do just fine. I would mildly oppose doing only a subset of those changes because then we'd be halfway between a crufty design and a clean design, which is even worse than just having 1 design.

That said, I think the documentation is not _requiring_ that you not decrement the timeout. I'm not sure why users perceive that it says that. It's saying "here is a usage pattern". ymmv.

Revision history for this message
guenthert (guenthert) wrote :

> I would mildly oppose doing only a subset of those changes because then we'd be halfway between a crufty design and a clean design, which is even worse than just having 1 design.

Fair enough. I like to help where I can, but I'm afraid the proposed changeset is beyond my expertise.

Meanwhile I wondered whether the original problem cannot be circumvented by using the WITH-DEADLINE macro (which somewhat perplexingly takes a time-periode rather than a point-in-time argument).

Revision history for this message
Siebe de Vos (s.de.vos) wrote :

dougk wrote:
> That said, I think the documentation is not _requiring_ that you not
> decrement the timeout. I'm not sure why users perceive that it says that.
> It's saying "here is a usage pattern". ymmv.

And the coder of wait-on-gate bought in on that usage pattern, which is clearly a bug.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.