Comment 12 for bug 688541

Clint Byrum (clint-fewbar) wrote :

So I've done some more thinking about this, and I had a bit of an aha! moment.

While we *should* in fact stop using 'stop on runlevel [016]' or 'stop on runlevel [!2345]', I think we can solve this without touching all of those jobs.

/etc/init.d/sendsigs has this code:

        # Upstart jobs have their own "stop on" clauses that sends
        # SIGTERM/SIGKILL just like this, so if they're still running,
        # they're supposed to be
        for pid in $(initctl list | sed -n -e "/process [0-9]/s/.*process //p"); do
                OMITPIDS="${OMITPIDS:+$OMITPIDS }-o $pid"
        done

It uses this to determine which pids not to kill because, presumably, upstart should be managing them.

However, this code is flawed. Only the pids that initctl reports are omitted, so killall5 will still kill the children of these processes if they are multi-process daemons or scripts that run other programs. This would only be solved by walking through /proc looking for processes that have these as parent pids, and then doing the same again with each newly found pid.
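To illustrate the /proc walk described above, here is a rough sketch in POSIX sh. It is not part of sendsigs or of any patch on this bug. The function names children_of and descendants_of and the PROC_ROOT variable are made up for the example, and it assumes the Linux /proc/PID/status layout with a "PPid:" line.

```shell
# Hypothetical helper, not taken from sendsigs.
# PROC_ROOT is an illustrative knob so the walk can be pointed at a test tree.
PROC_ROOT=${PROC_ROOT:-/proc}

# children_of PID: print the direct children of PID, one per line,
# by comparing the PPid field of every process under $PROC_ROOT.
children_of() {
    for status in "$PROC_ROOT"/[0-9]*/status; do
        # A process may exit while we scan; ignore unreadable entries.
        ppid=$(sed -n 's/^PPid:[[:space:]]*//p' "$status" 2>/dev/null)
        if [ "$ppid" = "$1" ]; then
            pid=${status%/status}
            echo "${pid##*/}"
        fi
    done
}

# descendants_of PID: print every child, grandchild, and so on of PID.
# The child list is expanded before the loop body runs, so the recursive
# call cannot disturb it even without "local" variables.
descendants_of() {
    for child in $(children_of "$1"); do
        echo "$child"
        descendants_of "$child"
    done
}
```

Each pid printed by descendants_of could then be appended to OMITPIDS in the same way as the parent pid. The walk is racy by nature, since processes can fork between the scan and the kill.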

The same technique can, however, be used to determine whether there are still jobs that are supposed to be stopped but haven't finished stopping yet. They should be listed as stop/(pre-stop|post-stop|killed), so we can determine exactly which pids we expect to go away. Because upstart has its own idea of how long to wait before it kills these, we should actually wait indefinitely.
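As a rough sketch of that idea (not the attached debdiff), the existing sed expression can be narrowed to jobs whose goal is "stop". The function names stopping_pids and wait_for_stopping_jobs are made up here. The sketch assumes that `initctl list` prints lines of the form "job stop/killed, process 1234", and it only matches lines that carry both the state and the pid.

```shell
# Hypothetical helpers, not taken from sendsigs or the debdiff.

# stopping_pids: read `initctl list`-style output on stdin and print the
# pid of every job whose goal is "stop" but which still has a process.
stopping_pids() {
    sed -n -e '/ stop\/[a-z-]*, process [0-9]/s/.*process //p'
}

# wait_for_stopping_jobs: poll until no such job is left.
# There is deliberately no timeout, which leaves it to upstart's own
# kill timeout to end stragglers.
wait_for_stopping_jobs() {
    while [ -n "$(initctl list | stopping_pids)" ]; do
        sleep 1
    done
}
```

Calling wait_for_stopping_jobs before killall5 runs would keep sendsigs from racing with jobs that are still in the middle of stopping, at the cost of a possible hang if one of them never exits.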

I'm attaching a debdiff that solves the race as far as I can tell, though I think it needs a good long look, since it could mean shutdowns hang for a long time waiting. (I'm especially curious whether the pre-stop/post-stop scripts are subject to the kill timeout.)