Comment 8 for bug 640516

Joe Lobraco (jlobraco) wrote :

I am still experiencing this problem even when I do not require asdf-install, although it is much more intermittent in that case.

I have narrowed it down further. It appears that this is not necessarily a problem with sbcl; rather, it has to do with how the nanosleep() system call behaves on darwin.

In sbcl, the sleep function calls the nanosleep lisp function, which in turn calls the nanosleep system call (via int-syscall). Here is the sbcl source for reference:

(defun nanosleep (secs nsecs)
  (with-alien ((req (struct timespec))
               (rem (struct timespec)))
    (setf (slot req 'tv-sec) secs)
    (setf (slot req 'tv-nsec) nsecs)
    (loop while (eql sb!unix:eintr
                     (nth-value 1
                                (int-syscall ("nanosleep" (* (struct timespec))
                                                          (* (struct timespec)))
                                             (addr req) (addr rem))))
       do (rotatef req rem))))

The nanosleep system call is supposed to suspend the thread for the time specified and return 0 on success. If the call is interrupted by a signal, it fails with EINTR and the second parameter (rem) is filled with the time remaining. The lisp code above correctly checks for this interrupted state and issues another nanosleep system call for the remaining time.

What is happening in my failure case, however, is that EINTR is returned and the rem variable has 4294967295 as its tv-sec value. This is then swapped into the req variable and the nanosleep call is made again, which will sleep for a very long time, making it look like the thread has hung.

Interestingly enough, 4294967295 (2^32 - 1) is what -1 looks like when its 32-bit representation is read as an unsigned integer.
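
A quick way to check that arithmetic at the REPL is sketched below. It is plain Common Lisp and does not touch the alien interface; the helper names as-unsigned-32 and as-signed-32 are made up for illustration and are not part of sbcl.

```lisp
;; Hypothetical helpers (not part of sbcl) for reinterpreting 32-bit values.
(defun as-unsigned-32 (n)
  "Return the low 32 bits of the integer N as a non-negative integer."
  (ldb (byte 32 0) n))

(defun as-signed-32 (n)
  "Interpret the low 32 bits of N as a two's-complement signed integer."
  (let ((u (ldb (byte 32 0) n)))
    (if (logbitp 31 u)          ; sign bit set, so the value is negative
        (- u (expt 2 32))
        u)))

(as-unsigned-32 -1)        ; => 4294967295
(as-signed-32 4294967295)  ; => -1
```

So the value seen in tv-sec is consistent with the kernel having written -1 there.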

After seeing this, I dug around on Google to see if I could find anything about why the nanosleep system call would return EINTR and have -1 as the seconds value of the remaining time structure.

What I found was that this means the nanosleep call was delayed and slept LONGER than the requested time. Typically, this happens when the requested sleep time is short. (This came from a darwin-kernel mailing list archive: http://osdir.com/ml/darwin-kernel/2010-03/msg00007.html).

None of this is documented in the nanosleep man page, so it remains to be seen whether this is a bug in the darwin kernel or a feature.

To work around this, I made a pretty bad hack to the nanosleep lisp function. Basically, I check the tv-sec value of the remainder after the nanosleep system call returns, and if it is 4294967295 (i.e. -1) I break out of the loop.

(defun nanosleep (secs nsecs)
  (with-alien ((req (struct timespec))
               (rem (struct timespec)))
    (setf (slot req 'tv-sec) secs)
    (setf (slot req 'tv-nsec) nsecs)
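    ;; Retry on EINTR, but stop if the remainder's tv-sec reads back as
    ;; 4294967295 (i.e. -1); sleeping again for that long looks like a hang.
    (loop while (and (eql sb!unix:eintr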
                          (nth-value 1
                                     (int-syscall ("nanosleep" (* (struct timespec))
                                                               (* (struct timespec)))
                                                  (addr req) (addr rem))))
                     (not (eql 4294967295 (slot rem 'tv-sec))))
       do (rotatef req rem))))

I have tested this with my test case and it appears to work.
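
A slightly more general form of the check might be to refuse any remainder that is negative or longer than the time that was requested, rather than testing for one magic number. The following is only a sketch of that idea in plain Common Lisp: remainder-usable-p is a hypothetical helper, not part of sbcl, and it has not been tried inside the nanosleep loop.

```lisp
;; Sketch only: REMAINDER-USABLE-P is a hypothetical helper, not part of sbcl.
;; Arguments are the requested time and the remainder reported by the kernel,
;; each as whole seconds plus nanoseconds.
(defun remainder-usable-p (req-sec req-nsec rem-sec rem-nsec)
  "True if the remainder is non-negative and no longer than the request."
  (and (>= rem-sec 0)
       (>= rem-nsec 0)
       (or (< rem-sec req-sec)
           (and (= rem-sec req-sec)
                (<= rem-nsec req-nsec)))))

;; A -1 that reads back as an unsigned 32-bit value is rejected:
(remainder-usable-p 0 500000000 4294967295 0)  ; => NIL
;; So is a -1 that reads back as a signed value:
(remainder-usable-p 0 500000000 -1 0)          ; => NIL
;; An ordinary remainder after an interrupted 2 second sleep is accepted:
(remainder-usable-p 2 0 1 250000000)           ; => T
```

In the loop, this would take the place of the comparison against 4294967295, called with the tv-sec and tv-nsec slots of req and rem. The assumption is that a genuine remainder can never exceed the request, so it should not matter whether tv-sec is read back as signed or unsigned.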

If anyone has a better solution, I would very much appreciate hearing it.

Thanks,
Joe