Comment 15 for bug 688541

Revision history for this message
Michael Biebl (mbiebl) wrote : Re: [Bug 688541] Re: race condition on shutdown (leads to corrupted fs)

2010/12/16 Clint Byrum <email address hidden>:
>
> /etc/init.d/sendsigs has this code:
>
>
>        # Upstart jobs have their own "stop on" clauses that sends
>        # SIGTERM/SIGKILL just like this, so if they're still running,
>        # they're supposed to be
>        for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
>                OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
>        done
>
>
> It uses this to determine which pids not to kill because, presumably, upstart should be managing them.
>
> However, this code is flawed. killall5 will kill the children of all of
> these if they are multi process daemons or scripts running things.

This observation is correct. On the other hand, isn't this exactly
what the sendsigs script is for: clean up any remaining, stray
processes which have not been stopped by its corresponding sysv init
script or upstart job (or have been e.g. started by the user)?

But I guess you are right, we should first stop all upstart jobs, give
them time to finish stopping, and then let sendsigs clean up anything
remaining afterwards.

> However, this technique can actually be used to determine if there are
> still jobs that are supposed to be stopped, but haven't finished
> stopping yet. Since they should be listed as stop/(pre-stop|post-
> stop|killed), we can determine exactly which pids we expect to go away.
> Since upstart has its own idea of how long to wait before it kills
> these, we should actually wait indefinitely.
>
> I'm attaching a debdiff that solves the race as far as I can tell,
> though I think it needs a good long look, since it could mean shutdowns
> hang for a long time waiting (I'm especially curious if the pre-stop
> /post-stop's are subject to kill timeout)

This code is still racy, afaics. What about upstart jobs, which are
not stopped by "stop on runlevel [016]"? They could receive their stop
signal at a point when your loop has already been run.

If you don't want to change existing jobs, we probably have to pick up
Ante's suggestion, and do the following in sendsigs:

1) run a for loop to wait for *all* running upstart jobs to stop.
upstart jobs which need to keep running past sendsigs (e.g. plymouth)
need to signal that using a similar mechanism like the killall5
sendsigs.d omit interface. I'd at least give upstart jobs 60secs time
to stop, so big databases etc have enough time to cleanly shutdown
2.) run a for loop and send SIGTERM all remaining processes, but do
*not* add upstart pids to $OMITPIDS
3.) send a final SIGKILL if any processes are left.

Regarding 1.), it would be nice to have a native C implementation in
upstart, instead of running initctl, grep and sleep manually.

--
Why is it that all of the instruments seeking intelligent life in the
universe are pointed away from Earth?