Comment 10 for bug 185601

Derick Eddington (derick-eddington) wrote : Re: [Bug 185601] Re: Need to be able to know if child process failed

On Wed, Apr 2, 2008 at 3:20 AM, Abdulaziz Ghuloum
<email address hidden> wrote:
> So, you add another field (say pcb->child_died) and from the handler,
> you set the engine_counter to be -1 and the child_died flag to be 1.
> In Scheme, the engine handler would have to check for this flag now,
> and if set, calls waitpid to reap the dead child and collect the
> info, and stash it somewhere (hash table of some sort) to be
> retrieved at a later time so that you know if your child has exited
> or not and what the exit status was.
>
> I'm just thinking out loud here, so, I don't know if any of this
> would work. I don't know off the top of my head which of these calls
> are interruptible/restartable, what happens if multiple children die
> at the same time, or when one child dies while you're collecting
> another.

You can block (delay delivery of) signals around the sections you
don't want interrupted. Maybe the child_died flag could also be a
counter of the number of un-reaped dead children (in case the SIGCHLD
handler is called more than once before the engine handler runs).
When the engine handler deals with this, it calls waitpid as many
times as the counter says and collects the info for all the dead
children. A hashtable associating PIDs with Scheme callback procedures
is then used to find the appropriate callbacks. I suppose the signals
need to be unblocked before calling the callbacks, because who knows
what a callback will want to do.
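
To make the idea above concrete, here is a rough C sketch. It is not
Ikarus code: child_died, struct dead_child, on_sigchld,
install_sigchld_handler and reap_children are all made-up names for
illustration. One deliberate difference from the counter idea: as far
as I know, standard signals are not queued, so several children dying
close together can produce a single SIGCHLD and a counter could
undercount. The sketch therefore keeps a plain flag and loops on
waitpid with WNOHANG until nothing more is ready.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the pcb->child_died field discussed above. */
static volatile sig_atomic_t child_died = 0;

/* Hypothetical record of one reaped child. */
struct dead_child {
    pid_t pid;
    int status; /* raw status from waitpid; decode with WIFEXITED etc. */
};

/* SIGCHLD handler: only sets the flag; the real work happens later. */
static void on_sigchld(int signo) {
    (void)signo;
    child_died = 1;
}

/* Install the handler. Returns 0 on success, -1 on error. */
static int install_sigchld_handler(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART; /* restart interrupted system calls where possible */
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Reap up to max dead children into out[] without blocking.
   SIGCHLD is blocked only while collecting; the previous signal mask
   is restored before returning, so the caller can then run callbacks
   with signals deliverable again. Returns the number reaped. */
static int reap_children(struct dead_child *out, int max) {
    sigset_t block, old;
    int n = 0;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &old);

    child_died = 0;
    while (n < max) {
        int status;
        pid_t pid = waitpid(-1, &status, WNOHANG);
        if (pid > 0) {
            out[n].pid = pid;
            out[n].status = status;
            n++;
        } else if (pid == -1 && errno == EINTR) {
            continue; /* interrupted; try again */
        } else {
            break; /* 0: nothing ready yet; -1: e.g. no children left */
        }
    }
    if (n == max)
        child_died = 1; /* there may be more; look again next time */

    sigprocmask(SIG_SETMASK, &old, NULL);
    return n;
}
```

The caller would check child_died at a safe point (the engine handler
in our case), call reap_children, and only then look up each returned
PID in the callback table and invoke the callback.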

> But all of this does not answer the question: how to know if a child
> process failed. The fact that you did not get a SIGCHLD does not
> mean that the process did not fail. All it means is that it did not
> fail *yet* and might fail any time now.

But you do get to know after the failure/death happens, which,
because of the race condition, seems to be the only non-blocking way
for the parent to find out, unless every parent explicitly retries a
non-blocking waitpid every so often, which seems annoying.

> (I just read in waitpid(2)
> that you can pass a WNOHANG option to waitpid so that it doesn't
> hang, but that too does not answer the question.)

Right. I used WNOHANG in my initial post, when I forgot about the
race condition.
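
For reference, a small C sketch of what a WNOHANG check can and cannot
tell you. poll_child is a made-up helper name, not anything in Ikarus.
A return of 0 only means the child has not terminated yet; it says
nothing about whether it is about to fail, which is the race.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: non-blocking check on one child.
   Returns 1 and fills *status if the child has terminated,
   0 if it is still running (it may still fail later),
   -1 on error (for example, pid is not our child). */
static int poll_child(pid_t pid, int *status) {
    pid_t r;
    do {
        r = waitpid(pid, status, WNOHANG);
    } while (r == -1 && errno == EINTR); /* retry if interrupted */

    if (r == pid)
        return 1;
    if (r == 0)
        return 0;
    return -1;
}
```

Calling this right after fork/exec will usually return 0 even when
the exec is going to fail, because the child may not have run yet.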

> Let me repeat the problem statement: You want the call to (process
> "foo") to return the usual values if the process is started, or raise
> an exception if that process was not started for whatever reason.
> Right? If so, then all this business with interrupts/waitpid/etc
> does not give that behavior, and I don't know how to do it.

Sorry for not being clear. I think we need some non-blocking way
(race-free, with no explicitly scheduled checks) for the parent to
know that a child died, and which one.

> > Would this also be possible: If the callback procedure returns, the
> > continuation of the program from where it was at when the signal
> > handler was called is resumed, but the callback procedure could
> > possibly not return as its way of dealing with the death (hahaha).
>
> That's fine. We do that all the time.

Cool. I hope this can be made to work so we can have a reliable and
easy way for the parent to be notified and decide on any arbitrary
action to take (in the callback procedure).

--Derick