Steel Bank Common Lisp

sleep calls of short duration appear to hang on mac osx

Reported by Joe Lobraco on 2010-09-16
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
SBCL
High
Unassigned

Bug Description

When trying to sleep in a thread for a short duration (in my case 1/30) the sleep call appears never to return on OSX 10.6.4. I have tested this in Linux (Fedora 13) and it appears to work fine.

The test case provided below starts an updater thread that iterates over a list of objects once every 1/30 of a second and executes an update function on each. Then during the course of this more objects can be added to the list. The list is protected with a mutex.

What I expect to happen is the following output to repeat:

BEFORE UPDATE
UPDATING: 1
UPDATING: 2
  ...
UPDATING: 999
AFTER UPDATE
DONE SLEEPING

What is actually happening:

BEFORE UPDATE
AFTER UPDATE

Then nothing else. The DONE SLEEPING output never appears indicating that the thread is stuck in that call.

---------------------------------------------------------------------
Configuration:

SBCL version : 1.0.38

uname -a : Darwin 92.sub-75-207-124.myvzw.com 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386

---------------------------------------------------------------------
Test case:

(defvar *object-mutex* (sb-thread:make-mutex :name "object mutex"))
(defvar *objects* ())

(defstruct obj id)

(defun add-object (obj)
  (sb-thread:with-mutex (*object-mutex*)
    (setf *objects* (append *objects* (list obj))) ))

(defun update-object (obj)
  (format t "UPDATING: ~A~%" (obj-id obj))
  (force-output))

(defun update-objects ()
  (do () (nil nil)
    (format t "BEFORE UPDATE~%") (force-output)
    (sb-thread:with-mutex (*object-mutex*)
      (dolist (obj *objects*)
        (update-object obj)))
    (format t "AFTER UPDATE~%") (force-output)
    (sleep 1/30)
    (format t "DONE SLEEPING~%") (force-output))
  (format t "EXITING THREAD~%") (force-output))

(sb-thread:make-thread #'update-objects :name "object-updater")

(dotimes (x 1000)
  (add-object (make-obj :id x)))

Nikodemus Siivola (nikodemus) wrote :

I cannot reproduce this on Leopard. Can someone on Snow Leopard chime in?

Nikodemus Siivola (nikodemus) wrote :

(Using current CVS HEAD.)

Cyrus Harmon (ch-launchpad) wrote :

I don't see this on snow leopard with 1.0.42.43. Joe, can you try this with a more recent SBCL?

Joe Lobraco (jlobraco) wrote :

Cyrus,

I compiled 1.0.42 from the source distribution, and my test case works as expected with this version on snow lepoard.

So apparently it was an issue that has been fixed in a later release.

My original version (1.0.38) was not built from source, rather it was installed as a binary from macports. I'm not sure if that would have had anything to do with the issue I was encountering, but I though I should mention it.

I'm not sure what the procedure for this issue would be now, since it appears to work properly in 1.0.42, should this issue now be closed?

Thanks,
Joe

Joe Lobraco (jlobraco) wrote :

OK, It looks like I may have jumped the gun a little on this one. Initially, I ran my build of 1.0.42 directly from the source directory using "sh run-sbcl.sh" from a different user account than I normally do. Under these conditions the test case worked fine.

However, I since have installed my new build of 1.0.24 into my system and started using it with my regular user account and the problem has reappeared. I then tried from a different user account and it worked there.

So, after investigating I have determined that my main user account has an .sbclrc file that "requires" asdf and asdf-install.

After some experimenting I have determined that the problem manifests itself only after I (require 'asdf-install) either in the .sbclrc or directly in the sbcl repl.

Basically, if I start sbcl with the --no-userinit option (and don't manually load asdf-install) the test works.

I am using the asdf-install that comes packaged with sbcl.

For reference here is my .sbclrc file in its entirety:

(require 'asdf)
(require 'asdf-install)
(setf asdf-install:*LOCATIONS* `((,(truename #P"/Users/jlobraco/cl/site/") ,(truename #P"/Users/jlobraco/cl/systems/") "Default Location")))
(push (truename #P"/Users/jlobraco/cl/systems/") asdf:*central-registry*)

Joe

Nikodemus Siivola (nikodemus) wrote :

Good job narrowing things down!

However, I still cannot reproduce this on Leopard.

Do you have anything in ~/.asdf-install?

Does this occur if you do

 sh run-sbcl.sh --no-userinit

followed by

 (require :asdf-install)

and your original test-case?

Joe Lobraco (jlobraco) wrote :

Nikodemus,

1) I do not have anything is ~/.asdf-install (that file does not even exist for me).

2) I ran sh run-sbcl.sh --no-userinit

then at the REPL I ran

(require :asdf-install)

then I ran my test case

(load "/Users/jlobraco/lisp/sleep-test.lisp")

and it failed.

I have added the output from my terminal below.

Thanks,
Joe

purple:~ jlobraco $ sh /Users/jlobraco/src/sbcl-1.0.42/run-sbcl.sh --no-userinit
(running SBCL from: /Users/jlobraco/src/sbcl-1.0.42)
This is SBCL 1.0.42, an implementation of ANSI Common Lisp.
More information about SBCL is available at <http://www.sbcl.org/>.

SBCL is free software, provided as is, with absolutely no warranty.
It is mostly in the public domain; some portions are provided under
BSD-style licenses. See the CREDITS and COPYING files in the
distribution for more information.
* (require :asdf-install)

; loading system definition from
; /Users/jlobraco/src/sbcl-1.0.42/contrib/sb-bsd-sockets/sb-bsd-sockets.asd
; into #<PACKAGE "ASDF0">
; registering #<SYSTEM SB-BSD-SOCKETS> as SB-BSD-SOCKETS
; registering #<SYSTEM SB-BSD-SOCKETS-TESTS> as SB-BSD-SOCKETS-TESTS
; loading system definition from
; /Users/jlobraco/src/sbcl-1.0.42/contrib/sb-posix/sb-posix.asd into
; #<PACKAGE "ASDF0">
; registering #<SYSTEM SB-POSIX> as SB-POSIX
; registering #<SYSTEM SB-POSIX-TESTS> as SB-POSIX-TESTS
("ASDF-INSTALL" "SB-POSIX" "SB-BSD-SOCKETS" "SB-GROVEL" "ASDF")
* (load "/Users/jlobraco/lisp/sleep-test.lisp")
BEFORE UPDATE
AFTER UPDATE
T
* (quit)
purple:~ jlobraco $

Joe Lobraco (jlobraco) wrote :
Download full text (3.3 KiB)

I am still experiencing this problem even if I do not require asdf-install, however the problem is much more intermittent when I don't.

I have narrowed it down further. It appears that this is not necessarily a problem with sbcl, rather it has to do with how the nanosleep() system call is behaving on darwin.

In sbcl, the sleep call calls the nanosleep lisp function, which in-turn calls the nanosleep system call (via int-syscall). Here is the sbcl source for reference:

(defun nanosleep (secs nsecs)
  (with-alien ((req (struct timespec))
               (rem (struct timespec)))
    (setf (slot req 'tv-sec) secs)
    (setf (slot req 'tv-nsec) nsecs)
    (loop while (eql sb!unix:eintr
                     (nth-value 1
                                (int-syscall ("nanosleep" (* (struct timespec))
                                                          (* (struct timespec)))
                                             (addr req) (addr rem))))
       do (rotatef req rem))))

The nanosleep system call is supposed to suspend the thread for the time specified and return 0 on success. If the nanosleep system call was interrupted, EINTR is returned and the second parameter (rem) will be filled with the remaining time left. The lisp code above correctly checks for this interrupted state and reschedules another nanosleep system call for the remaining time.

What is happening, however, is that in my failure case, EINTR is returned and the the rem variable has 4294967295 as it's tv-sec value. This is then swapped into the req variable and the nanosleep call is made again, which will sleep for a very long time, making it look like the thread has hung.

Interestingly enough, 4294967295 is the value of -1 if cast to a 32-bit signed integer.

After seeing this, I dug around google to see if I can find anything about why the nanosleep system call would return EINTR and have -1 as the seconds value of the remaining time structure.

What I found out was that this means that the nanosleep call was delayed and slept LONGER than the requested time. Typically, this happens when the requested sleep time is of short duration. (This came from a darwin-kernel mailing list archive http://osdir.com/ml/darwin-kernel/2010-03/msg00007.html).

None of this is documented in the nanosleep man page, so it remains to be seen if this is a bug with the darwin kernel or a feature.

To work around this I have a pretty bad hack I made to the nanosleep lisp function. Basically, I am checking the return value of the remainder after the call to the nanosleep system call, and if it is 4294967295 (i.e. -1) I break out of the loop.

(defun nanosleep (secs nsecs)
  (with-alien ((req (struct timespec))
               (rem (struct timespec)))
    (setf (slot req 'tv-sec) secs)
    (setf (slot req 'tv-nsec) nsecs)
    (loop while (and (eql sb!unix:eintr
                          (nth-value 1
                                     (int-syscall ("nanosleep" (* (struct timespec))
                                                               (* (struct timespec)))
                                                  (addr req) (addr rem))))
                     (not (eql 4294967295 (slot rem 'tv-s...

Read more...

Nikodemus Siivola (nikodemus) wrote :

Thank you! Excellent work.

The only thing this really needs is to use (logand -1 (1- (expt 2 sb-vm:n-word-bits))) instead of hardcoded 4294967295.

Changed in sbcl:
importance: Undecided → High
status: New → Triaged
Nikodemus Siivola (nikodemus) wrote :

I take that back. Darwin is doing something _really_ funny here, since the number is (unsigned)-1 but tv-secs is a long.

The following version tries to make sure the amount to sleep doesn't go up, and should do the same job.

#!+darwin isn't strictly necessary here, though. I wonder if we should do the check everywhere.

(defun nanosleep (secs nsecs)
  (with-alien ((req (struct timespec))
               (rem (struct timespec)))
    (setf (slot req 'tv-sec) secs)
    (setf (slot req 'tv-nsec) nsecs)
    (loop while (and (eql sb!unix:eintr
                          (nth-value 1
                                     (int-syscall ("nanosleep" (* (struct timespec))
                                                               (* (struct timespec)))
                                                  (addr req) (addr rem))))
                     #!+darwin
                     ;; On Darwin, if nanosleep sleeps _longer_ than the
                     ;; requested time, the call may return with EINT and
                     ;; (unsigned)-1 in the seconds of the remainder timespec,
                     ;; which would cause us to enter nanosleep again for ~136
                     ;; years... So check that the remainder time is actually
                     ;; decreasing.
                     ;; See: http://osdir.com/ml/darwin-kernel/2010-03/msg00007.html
                     (let ((rem-sec (slot rem 'tv-sec))
                           (rem-nsec (slot rem 'tv-nsec)))
                       (when (or (> secs rem-sec)
                                 (and (= secs rem-sec) (>= nsecs rem-nsec)))
                            (setf secs rem-sec
                                  nsecs rem-nsec)
                            t)))
          do (rotatef req rem))))

Nikodemus Siivola (nikodemus) wrote :

Test case:

(assert
 (handler-case
     (sb-ext:with-timeout 10
       (let (to)
         (handler-bind ((sb-ext:timeout (lambda (c)
                                          (unless to
                                            (setf to t)
                                            (sleep 2)
                                            (continue c)))))
           (sb-ext:with-timeout 0.1 (sleep 1) t))))
   (sb-ext:timeout ()
     nil)))

Nikodemus Siivola (nikodemus) wrote :

In SBCL 1.0.42.50.

Changed in sbcl:
status: Triaged → Fix Committed
Changed in sbcl:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers