Need non-blocking way to know if child process failed

Bug #185601 reported by Derick Eddington
Affects: Ikarus Scheme
Status: Fix Committed
Importance: High
Assigned to: Abdulaziz Ghuloum

Bug Description

Currently, there's no non-blocking way to test if a child process succeeded or not. The only way is to try reading from the err port to see what string it returns, but this might block forever, and then waitpid returns weird values because waitpid(2) packs various info in the status.

Ikarus Scheme version 0.0.2patched+ (revision 1366, build 2008-01-23)
Copyright (c) 2006-2008 Abdulaziz Ghuloum

> (define-values (pid to from err) (process "asdfasdf"))
> (define errt (transcoded-port err (native-transcoder)))
> (get-string-all errt) ;; But I don't know if this will block
"failed to exec asdfasdf: No such file or directory\n"
> (waitpid pid)
65280
> (waitpid pid)
-141090988
> (waitpid pid)
-141091004
> (waitpid pid)
-141091020
> (waitpid pid)
-141091036
> (waitpid pid)
-141091052
> (waitpid pid)
-141091068
> (waitpid pid)
-141091084
>
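For reference, the 65280 above is consistent with an exit code of 255 stored in the high byte of the status word (255 * 256 = 65280), which is how common implementations lay it out. The changing negative numbers are probably an uninitialized status left behind when waitpid(2) fails on a child that was already reaped. Portable code unpacks the status with the WIFEXITED/WEXITSTATUS macros and checks waitpid's return value first. A minimal C sketch of that decoding follows; spawn_exiting and wait_decoded are hypothetical names used only for illustration, not Ikarus functions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits at once with `code`. */
pid_t spawn_exiting(int code) {
    pid_t pid = fork();
    if (pid == 0)
        _exit(code);             /* child */
    return pid;                  /* parent: child's pid, or -1 if fork failed */
}

/* Wait for `pid` (blocking) and unpack the status word.
   Returns  1 and stores the exit code if the child exited normally,
            0 if it ended some other way (e.g. killed by a signal),
           -1 if waitpid itself failed (e.g. ECHILD: already reaped). */
int wait_decoded(pid_t pid, int *exit_code) {
    int status = 0;
    assert(exit_code != NULL);
    if (waitpid(pid, &status, 0) != pid)
        return -1;               /* status is not meaningful in this case */
    if (!WIFEXITED(status))
        return 0;
    *exit_code = WEXITSTATUS(status);
    return 1;
}
```

Calling wait_decoded twice on the same pid reports a failure the second time instead of handing back a garbage status.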

To be able to test if executing the process succeeded, I made the following changes:

[d@eep:~/ikarus.dev]-> bzr diff
=== modified file 'scheme/ikarus.posix.ss'
--- scheme/ikarus.posix.ss 2007-12-15 13:22:49 +0000
+++ scheme/ikarus.posix.ss 2008-01-24 11:35:04 +0000
@@ -37,10 +37,15 @@
           [else (parent-proc pid)]))))

   (define waitpid
-    (lambda (pid)
-      (unless (fixnum? pid)
-        (die 'waitpid "not a fixnum" pid))
-      (foreign-call "ikrt_waitpid" pid)))
+    (case-lambda
+      [(pid)
+       (waitpid pid #t)]
+      [(pid block?)
+       (unless (fixnum? pid)
+         (die 'waitpid "not a fixnum" pid))
+       (unless (boolean? block?)
+         (die 'waitpid "not a boolean" block?))
+       (foreign-call "ikrt_waitpid" pid block?)]))

   (define system
     (lambda (x)

=== modified file 'src/ikarus-process.c'
--- src/ikarus-process.c 2008-01-02 02:08:07 +0000
+++ src/ikarus-process.c 2008-01-24 11:45:08 +0000
@@ -74,8 +74,15 @@
 }

 ikptr
-ikrt_waitpid(ikptr pid, ikpcb* pcb){
+ikrt_waitpid(ikptr pid, ikptr block, ikpcb* pcb){
   int status;
-  waitpid(unfix(pid), &status, 0);
-  return fix(status);
+  int options = 0;
+  int r;
+  if (block == false_object)
+    options = WNOHANG;
+  r = waitpid(unfix(pid), &status, options);
+  if (r > 0 && r == unfix(pid) && WIFEXITED(status))
+    return fix(WEXITSTATUS(status));
+  else
+    return false_object;
 }

Maybe this is a good way to accomplish this. I disclaim all copyright to this patch. Do what thou wilt with it :)

With this patch, I can do:

Ikarus Scheme version 0.0.2patched+ (revision 1366, build 2008-01-24)
Copyright (c) 2006-2008 Abdulaziz Ghuloum

> (define-values (pid to from err) (process "asdfasdf"))
> (waitpid pid #f) ;; Don't block
255 ;; Oh no, the forked process already exited
> (waitpid pid) ;; Just to demonstrate
#f ;; means this waitpid test did not show the process exited normally (it was already tested)
> (define errt (transcoded-port err (native-transcoder)))
> (get-string-all errt)
"failed to exec asdfasdf: No such file or directory\n"
>

And:

Ikarus Scheme version 0.0.2patched+ (revision 1366, build 2008-01-24)
Copyright (c) 2006-2008 Abdulaziz Ghuloum

> (define-values (pid to from err) (process "cat"))
> (waitpid pid #f) ;; Don't block
#f ;; It's alive
> (define tot (transcoded-port to (native-transcoder)))
> (define fromt (transcoded-port from (native-transcoder)))
> (put-string tot "blah")
> (flush-output-port tot)
> (get-string-n fromt 4)
"blah"
> (close-port tot)
> (close-port fromt)
> (close-port err)
> (waitpid pid) ;; old blocking behavior
0
>

If my patch isn't the right thing, I request some non-blocking way to test if the executed process died or not.

Revision history for this message
Abdulaziz Ghuloum (aghuloum) wrote :

I agree this needs to be fixed. The solution that you posted has a problem with a race condition since by the time you do (waitpid pid #f), the child may not have gotten a chance to exec, fail, and exit. I'll have to check with Stevens on the best way to do this.

Changed in ikarus:
assignee: nobody → aghuloum
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Abdulaziz Ghuloum (aghuloum) wrote :

Question: if process returned nonblocking ports, or, if we have process-nonblocking to go with the rest of the nonblocking procedures, would that kind of solve this problem? (I still haven't found a solution)

Revision history for this message
Derick Eddington (derick-eddington) wrote : Re: [Bug 185601] Need to be able to test if process failed

On Mon, 2008-03-24 at 04:10 +0000, Abdulaziz Ghuloum wrote:
> Question: if process returned nonblocking ports, or, if we have
> process-nonblocking to go with the rest of the nonblocking procedures,
> would that kind of solve this problem? (I still haven't found a solution)

But if you exec a process with which you do not attempt any I/O (i.e.,
because the program isn't designed to do any), there's still no way to
know if it succeeded or failed.

I'll start trying to learn what other run-time systems have done about
this.

Revision history for this message
Abdulaziz Ghuloum (aghuloum) wrote :

On Mar 31, 2008, at 1:43 AM, Derick Eddington wrote:

> But if you exec a process with which you do not attempt any I/O (i.e.,
> because the program isn't designed to do any), there's still no way to
> know if it succeeded or failed.

Well, if the program is designed not to do any IO, then reading from the
output-port and error-port that process returns should both return
#!eof:

 > (let-values ([(pid to-p from-p err-p) (process "true")])
     (list (get-u8 from-p) (get-u8 err-p)))
(#!eof #!eof)

[I know you brought this just as an example. The problem is still
there.]

> I'll start trying to learn what other run-time systems have done about
> this.

The issue here is that I'm doing "fork", which succeeds, then I do "exec"
within the child process, which may fail. At any point after the fork,
the child and parent processes are running independently, so, the parent
cannot ask if the exec failed because the child may not have gotten a
chance to exec and fail yet!

> I'll start trying to learn what other run-time systems have done about
> this.

Please do let me know if you find a solution (or a system that has found
a solution).
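
One commonly used idiom for this fork/exec gap, which is not proposed anywhere in this thread and is shown here only as a hedged sketch, is a pipe whose write end is marked close-on-exec. If exec succeeds, the kernel closes the write end and the parent reads end-of-file; if exec fails, the child writes its errno into the pipe before exiting. The parent's read does block, but only until the child has either exec'ed or failed to. The name spawn_checked and the exit code 127 are assumptions made for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] (searched on PATH).
   Returns the child's pid if exec succeeded, or -1 with *exec_errno set
   to the reason if pipe, fork or exec failed. */
pid_t spawn_checked(char *const argv[], int *exec_errno) {
    int fds[2];
    int err = 0;
    ssize_t n;
    pid_t pid;

    assert(argv != NULL && argv[0] != NULL && exec_errno != NULL);
    *exec_errno = 0;

    if (pipe(fds) < 0) { *exec_errno = errno; return -1; }
    /* The write end vanishes automatically on a successful exec. */
    if (fcntl(fds[1], F_SETFD, FD_CLOEXEC) < 0) {
        *exec_errno = errno;
        close(fds[0]); close(fds[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        *exec_errno = errno;
        close(fds[0]); close(fds[1]);
        return -1;
    }
    if (pid == 0) {                      /* child */
        close(fds[0]);
        execvp(argv[0], argv);
        err = errno;                     /* only reached if exec failed */
        if (write(fds[1], &err, sizeof err) < 0) { /* nothing more to do */ }
        _exit(127);
    }

    close(fds[1]);                       /* parent keeps only the read end */
    do {
        n = read(fds[0], &err, sizeof err);
    } while (n < 0 && errno == EINTR);
    close(fds[0]);

    if (n == 0)
        return pid;                      /* EOF: exec succeeded */

    /* exec failed (or the read broke): reap the child, report the error */
    while (waitpid(pid, NULL, 0) < 0 && errno == EINTR) { }
    *exec_errno = (n == (ssize_t)sizeof err) ? err : EIO;
    return -1;
}
```

With something like this, a failed exec is reported to the caller directly, without inspecting the error port or polling waitpid.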

Revision history for this message
Derick Eddington (derick-eddington) wrote :

On Mon, 2008-03-31 at 09:53 +0000, Abdulaziz Ghuloum wrote:
> On Mar 31, 2008, at 1:43 AM, Derick Eddington wrote:
>
> > But if you exec a process with which you do not attempt any I/O (i.e.,
> > because the program isn't designed to do any), there's still no way to
> > know if it succeeded or failed.
>
> Well, if the program is designed not to do any IO, then reading from the
> output-port and error-port that process returns should both return
> #!eof

Maybe the parent Scheme program is a shell / general launcher and
doesn't know what the child will do.

> Please do let me know if you find a solution (or a system that has
> found a solution).

Will do, but might be a while. This must have been solved more than
once before you'd think, no? I remember hearing a brief description of
this specific problem years ago. It does seem like POSIX does not give
one a reliable non-blocking non-race-problem way to know, unless you
listen for SIGCHLD with the POSIX signal handlers facility, which I've
been guessing is not compatible / desirable with Ikarus's architecture.
Is that the case? Should we exclude solutions using a SIGCHLD signal
handler?

Revision history for this message
Abdulaziz Ghuloum (aghuloum) wrote :

On Mar 31, 2008, at 11:48 PM, Derick Eddington wrote:

> This must have been solved more than once before you'd think, no?

I would think.

> I remember hearing a brief description of
> this specific problem years ago. It does seem like POSIX does not give
> one a reliable non-blocking non-race-problem way to know, unless you
> listen for SIGCHLD with the POSIX signal handlers facility, which I've
> been guessing is not compatible / desirable with Ikarus's architecture.
> Is that the case? Should we exclude solutions using a SIGCHLD signal
> handler?

Not necessarily. If this is the only way to solve it, then, we have
to solve it. But I don't see (off the top of my head) how catching
SIGCHLD would solve it. Usually you catch SIGCHLD so that it
collects the dead children if you don't want to wait on them
yourself. I don't see how it can be used to "test if process
failed". Maybe my way of doing process (fork then exec) is not the
right way but I don't know what is.

Revision history for this message
Derick Eddington (derick-eddington) wrote :

On Tue, 2008-04-01 at 05:58 +0000, Abdulaziz Ghuloum wrote:
> Not necessarily. If this is the only way to solve it, then, we have
> to solve it. But I don't see (off the top of my head) how catching
> SIGCHLD would solve it. Usually you catch SIGCHLD so that it
> collects the dead children if you don't want to wait on them
> yourself.

Right. My memory is vague about SIGCHLD. I meant more generally, I
wonder if POSIX signals aren't the only way to test if the process
failed? There might be another signal or there might be an additional
code that goes along with SIGCHLD. I just Googled "fork exec child race
condition" which seems to yield a number of promising links I'll be
reading (maybe slowly).

BTW, I think a process-nonblocking which returns nonblocking ports would
be useful anyways. One of my longer-term goals of being into programming
is to use/make actors-like systems which use event-loops in different
processes communicating and async I/O is needed/good for that.

Revision history for this message
Derick Eddington (derick-eddington) wrote :

It looks like the only solution is to use a SIGCHLD signal handler. Not
to "test" (sorry) but to be notified when a specific process has died.
An idea: register a procedure for a child and have that procedure called
when the SIGCHLD telling of that child's death is delivered to the
signal handler (you'd use SA_SIGINFO with sa_sigaction to install a
signal handler that would be given a siginfo_t telling what child PID
died); without screwing up ikarus's stack or run-time of course. Using
an alternate signal stack (via sigaltstack and SA_ONSTACK) might be
noteworthy. Ah, it looks like ikarus already does use sa_sigaction and
an alternate stack for SIGINT, but the handler doesn't call back into
Scheme.
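
A minimal C sketch of the handler installation described above, under assumptions: the names on_sigchld, install_sigchld_handler and last_reported_child are hypothetical, the handler only records the PID in a flag-like variable rather than calling into Scheme, and the alternate signal stack is left out. Standard signals are not queued, so if several children die close together si_pid may name only one of them.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Written only by the handler; read by ordinary code. */
static volatile sig_atomic_t last_dead_pid = 0;

/* Three-argument handler form selected by SA_SIGINFO. */
static void on_sigchld(int signo, siginfo_t *info, void *uap) {
    (void)signo;
    (void)uap;
    if (info != NULL)
        last_dead_pid = (sig_atomic_t)info->si_pid;  /* which child changed state */
}

/* Install the handler; returns 0 on success, -1 on error. */
int install_sigchld_handler(void) {
    struct sigaction sa;
    /* Assumption of this sketch: a pid fits in sig_atomic_t. */
    assert(sizeof(sig_atomic_t) >= sizeof(pid_t));
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigchld;
    sigemptyset(&sa.sa_mask);
    /* SA_NOCLDSTOP: only report termination, not stop/continue.
       SA_RESTART: restart interrupted syscalls where the system allows it. */
    sa.sa_flags = SA_SIGINFO | SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* PID most recently reported by the handler, or 0 if none yet. */
pid_t last_reported_child(void) {
    return (pid_t)last_dead_pid;
}
```

The child still has to be reaped with waitpid afterwards; the handler only says that something happened and to whom.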

Would this also be possible: If the callback procedure returns, the
continuation of the program from where it was at when the signal handler
was called is resumed, but the callback procedure could possibly not
return as its way of dealing with the death (hahaha).

Revision history for this message
Abdulaziz Ghuloum (aghuloum) wrote : Re: [Bug 185601] Re: Need to be able to know if child process failed

On Apr 2, 2008, at 2:18 AM, Derick Eddington wrote:

> It looks like the only solution is to use a SIGCHLD signal handler. Not
> to "test" (sorry) but to be notified when a specific process has died.
> An idea: register a procedure for a child and have that procedure called
> when the SIGCHLD telling of that child's death is delivered to the
> signal handler (you'd use SA_SIGINFO with sa_sigaction to install a
> signal handler that would be given a siginfo_t telling what child PID
> died); without screwing up ikarus's stack or run-time of course. Using
> an alternate signal stack (via sigaltstack and SA_ONSTACK) might be
> noteworthy. Ah, it looks like ikarus already does use sa_sigaction and
> an alternate stack for SIGINT, but the handler doesn't call back into
> Scheme.

Exactly. In the signal handler, you're pretty much helpless because
you don't even know whether you're in the Scheme code, in the GC, in
GMP, in some system call (read, write, select, ...) or just in the
middle of a cons that did not initialize its car or cdr fields.

So, for SIGINT, all that Ikarus does right now is set two fields in
the pcb record:

void handler(int signo, siginfo_t* info, void* uap){
   the_pcb->engine_counter = -1;
   the_pcb->interrupted = 1;
}

and that's it. In the Scheme code, on entry to every procedure, the
value of engine_counter is decremented and, if negative, the engine
handler is called, which resets the counter, then checks and resets
the interrupted flag and either calls the interrupt handler (which
raises an interrupted continuable condition and returns) or the
timeout handler, which just returns (iirc).

So, calling into Scheme from the signal handler is just not possible.

So, you add another field (say pcb->child_died) and from the handler,
you set the engine_counter to be -1 and the child_died flag to be 1.
In Scheme, the engine handler would have to check for this flag now,
and if set, calls waitpid to reap the dead child and collect the
info, and stash it somewhere (hash table of some sort) to be
retrieved at a later time so that you know if your child has exited
or not and what the exit status was.

I'm just thinking out loud here, so, I don't know if any of this
would work. I don't know off the top of my head which of these calls
are interruptable/restartable, what happens if multiple children die
at the same time, or when one child dies while you're collecting
another.
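
A rough C sketch of the flag-and-reap idea above, under assumptions: reap_dead_children stands in for what the engine handler would do, a small fixed array stands in for the hash table, and all names are hypothetical. Looping on waitpid with WNOHANG until nothing is left is one way to cope with several children dying at once, since a single SIGCHLD may stand for more than one death.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_RECORDS 64                       /* arbitrary bound for this sketch */

struct exit_record { pid_t pid; int status; };

static volatile sig_atomic_t child_died = 0; /* set by the signal handler */
static struct exit_record records[MAX_RECORDS];
static int record_count = 0;

/* The handler does nothing but raise the flag. */
static void on_sigchld(int signo) { (void)signo; child_died = 1; }

int install_flag_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Called from ordinary (non-signal) context. Returns how many children
   were reaped by this call. */
int reap_dead_children(void) {
    int reaped = 0;
    if (!child_died)
        return 0;
    child_died = 0;          /* clear first, so a later SIGCHLD is not lost */
    for (;;) {
        int status = 0;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid <= 0)
            break;           /* 0: none exited yet; -1: no children or error */
        if (record_count < MAX_RECORDS) {
            records[record_count].pid = pid;
            records[record_count].status = status;
            record_count++;
        }
        reaped++;
    }
    return reaped;
}

/* Returns 1 and stores the raw wait status if `pid` was reaped, else 0. */
int lookup_status(pid_t pid, int *status) {
    assert(status != NULL);
    for (int i = 0; i < record_count; i++) {
        if (records[i].pid == pid) {
            *status = records[i].status;
            return 1;
        }
    }
    return 0;
}
```

As the comment itself points out, this tells you that a child has died and how, not that a child which has not died yet is going to succeed.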

But all of this does not answer the question: how to know if a child
process failed. The fact that you did not get a sigchild does not
mean that the process did not fail. All it means is that it did not
fail *yet* and might fail any time now. (I just read in waitpid(2)
that you can pass a WNOHANG option to waitpid so that it doesn't
hang, but that too does not answer the question.)

Let me repeat the problem statement: You want the call to (process
"foo") to return the usual values if the process is started, or raise
an exception if that process was not started for whatever reason.
Right? If so, then all this business with interrupts/waitpid/etc
does not give that behavior, and I don't know how to do it.

[...]

Revision history for this message
Derick Eddington (derick-eddington) wrote :

On Wed, Apr 2, 2008 at 3:20 AM, Abdulaziz Ghuloum
<email address hidden> wrote:
> So, you add another field (say pcb->child_died) and from the handler,
> you set the engine_counter to be -1 and the child_died flag to be 1.
> In Scheme, the engine handler would have to check for this flag now,
> and if set, calls waitpid to reap the dead child and collect the
> info, and stash it somewhere (hash table of some sort) to be
> retrieved at a later time so that you know if your child has exited
> or not and what the exit status was.
>
> I'm just thinking out loud here, so, I don't know if any of this
> would work. I don't know off the top of my head which of these calls
> are interruptable/restartable, what happens if multiple children die
> at the same time, or when one child dies while you're collecting
> another.

You can block/delay delivery of signal(s) when in such sections you
don't want interrupted. Maybe the child_died flag can also be a
counter of the number of un-reaped dead kiddies (in case the SIGCHLD
sig-handler is called more than once before the engine handler) and
when the engine handler deals with this it does waitpid as many times
as the counter says and then the infos for all the dead children are
collected. A hashtable associating PIDs to Scheme callback procedures
is used to find the appropriate callbacks. I suppose signal(s) need
to be unblocked before calling the callbacks, because who knows what a
callback will want to do.

> But all of this does not answer the question: how to know if a child
> process failed. The fact that you did not get a sigchild does not
> mean that the process did not fail. All it means is that it did not
> fail *yet* and might fail any time now.

But you do get to know after failure/death happens. Which, because of
the race condition, seems to be the only non-blocking way for the
parent to find out; unless every parent explicitly repeatedly tries a
non-blocking waitpid every so often to try to find out, which seems
annoying.

> (I just read in waitpid(2)
> that you can pass a WNOHANG option to waitpid so that it doesn't
> hang, but that too does not answer the question.)

Right. I used WNOHANG in my initial post, when I forgot about the
race condition.

> Let me repeat the problem statement: You want the call to (process
> "foo") to return the usual values if the process is started, or raise
> an exception if that process was not started for whatever reason.
> Right? If so, then all this business with interrupts/waitpid/etc
> does not give that behavior, and I don't know how to do it.

Sorry for not being clear. I think we need any non-blocking
(race-free, not explicitly scheduling checks) way for the parent to
know a child died and which one.

> > Would this also be possible: If the callback procedure returns, the
> > continuation of the program from where it was at when the signal handler
> > was called is resumed, but the callback procedure could possibly not
> > return as its way of dealing with the death (hahaha).
>
> That's fine. We do that all the time.

Cool. I hope this can be made to work so we can have a reliable and
easy way for the pare...


Revision history for this message
Abdulaziz Ghuloum (aghuloum) wrote :

Does the new waitpid (rev 1516) resolve this issue?

Revision history for this message
Derick Eddington (derick-eddington) wrote :

On Fri, 2008-06-13 at 12:51 +0000, Abdulaziz Ghuloum wrote:
> Does the new waitpid (rev 1516) resolve this issue?

I want to say yes. I think for most programs that want to be non-blocking / multiplexing, it's adequate, because these types of programs usually have an event loop driving them and they can periodically do non-blocking waitpid checks to see if a child exited or was killed. For programs which expect to do I/O with a child, I/O failures will happen if the child died unexpectedly and I/O is attempted with it (in this case, I've seen #!eof returned for reads, and broken pipe exception for writes) and the parent can detect these failures and do a non-blocking waitpid to check if the child died and to reclaim the zombie.

However, assume a hypothetical non-blocking program which for some reason can't periodically do non-blocking waitpid checks, which doesn't do I/O with children and so can't rely on I/O failures to indicate child death, and which must know if a child succeeded or died abnormally. This program would need to be notified via an event/callback when the child has died so it can check if it succeeded or not. Unfortunately, I don't have a clearer idea of whether this type of program is real, or whether it's simply the wrong approach and should be designed differently.

I have actually already created a patch for Ikarus which implements callback notifications of child death (using a C signal handler for SIGCHLD which handles it similarly to SIGINT). However, having the process accept SIGCHLD means nearly every system-call might be interrupted by the delivery of a SIGCHLD and return errno==EINTR, and this requires checking for EINTR for every syscall (except specified-duration sleep, which I don't think there's a solution to) and redoing the syscall. I also implemented, in Scheme, these EINTR checks and syscall redos. Along with this, I tried to implement solutions for the corner cases with the engine_counter and $do-event when other signals are delivered before the handling of the first one has completed. As you told me, these corner cases also apply to engine preemption, so my ideas might be useful for this issue with engines. But I'm no longer sure allowing SIGCHLD and dealing with EINTR is worth it or necessary.
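
The check-and-redo pattern described above was implemented in Scheme in the patch mentioned; the C sketch below only illustrates the same idea and is not that patch. The names read_retry, waitpid_retry and install_interrupting_handler are hypothetical. The handler is deliberately installed without SA_RESTART, so that blocked system calls can fail with EINTR when a SIGCHLD arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void on_sigchld(int signo) { (void)signo; }

/* Install a SIGCHLD handler WITHOUT SA_RESTART: a blocked syscall may
   then return -1 with errno == EINTR when a child dies. */
int install_interrupting_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* read(2), redone for as long as it is merely interrupted by a signal. */
ssize_t read_retry(int fd, void *buf, size_t count) {
    ssize_t n;
    assert(buf != NULL);
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* waitpid(2) with the same treatment. */
pid_t waitpid_retry(pid_t pid, int *status, int options) {
    pid_t r;
    do {
        r = waitpid(pid, status, options);
    } while (r < 0 && errno == EINTR);
    return r;
}
```

Every other interruptible call would need the same kind of wrapper, which is the cost being weighed in the comment above.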

So, I'll close this bug report, and I or someone else should reopen it if we need child death notification callbacks.

Changed in ikarus:
status: Confirmed → Fix Committed
Revision history for this message
Derick Eddington (derick-eddington) wrote :

On Sat, 2008-06-14 at 02:33 +0000, Derick Eddington wrote:
On Fri, 2008-06-13 at 12:51 +0000, Abdulaziz Ghuloum wrote:
> However, having the
> process accept SIGCHLD means nearly every system-call might be
> interrupted by the delivery of a SIGCHLD and return errno==EINTR, and
> this requires checking for EINTR for every syscall (except specified-
> duration sleep, which I don't think there's a solution to) and redoing
> the syscall.

TMI, I realized there's an obvious solution to the problem of specified-duration sleep being interrupted by delivery of a SIGCHLD: block signals before sleeping and unblock them after. In the modifications I mentioned above, I already implemented functions to block and unblock signals, so doing so for sleep would be easy.
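
A small C sketch of that block/sleep/unblock sequence, with assumptions: it blocks only SIGCHLD (the comment says "signals" in general, and other signals could still cut the sleep short here), it uses nanosleep, and the names are hypothetical. A SIGCHLD that arrives during the sleep stays pending and is delivered once the old mask is restored.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static void on_sigchld(int signo) { (void)signo; }

/* A handler must be installed for SIGCHLD to interrupt anything at all;
   its default disposition is to be ignored. */
int install_sigchld_noop_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Sleep for `duration` with SIGCHLD blocked, then restore the old mask.
   Returns nanosleep's result (0 if the full duration elapsed). */
int sleep_blocking_sigchld(const struct timespec *duration) {
    sigset_t block, old;
    int r, saved_errno;

    assert(duration != NULL);
    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    if (sigprocmask(SIG_BLOCK, &block, &old) < 0)
        return -1;

    r = nanosleep(duration, NULL);
    saved_errno = errno;

    /* Any SIGCHLD that arrived meanwhile is delivered about here. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    errno = saved_errno;
    return r;
}
```

An alternative that avoids touching the signal mask would be to call nanosleep again with the remaining time whenever it fails with EINTR.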

Revision history for this message
Derick Eddington (derick-eddington) wrote :

Oops, that line saying "Abdulaziz Ghuloum wrote:" is out of context, I forgot to delete that. The quote is all of me.

Changed in ikarus:
milestone: none → 0.0.4