Ubuntu

init: restart command fails to restart main process when pre-stop stanza exists

Reported by Clint Byrum on 2011-01-17
106
This bug affects 16 people
Affects Status Importance Assigned to Milestone
upstart
Medium
Unassigned
upstart (Ubuntu)
Medium
James Hunt

Bug Description

While testing the portmap daemon's recent changes in Ubuntu which cause statd to properly follow it on stop/start, I tried to restart portmap, only to find that it was not in fact restarted.

This happens with any job that has a pre-stop. The reason is that in job_restart(), the code does this:

    job_change_goal (job, JOB_STOP);
    job_change_goal (job, JOB_START);

job_change_goal will return as soon as the *first* state change has been completed. If there is no pre-stop, that is the change from JOB_RUNNING to JOB_KILLED, which does dutifully kill the main process.

However, if there is a pre-stop, the pre-stop is run, but then job_change_state returns to job_change_goal, which then returns to job_restart, which then changes the goal back to start, which makes the new job_next_state() one that will bypass the change to JOB_KILLED.

This was found on upstart v0.6.7-3 in Ubuntu, but also exists in the code in the current trunk.

TEST CASE

1. create job file /etc/init/test-restart-prestop.conf with this content:
# test-restart-prestop

pre-stop exec /bin/true

exec /bin/sleep 3600
2. run "start test-restart-prestop" -- record pid of job
3. immediately run "restart test-restart-prestop" -- if bug is present, pid remains the same when it should be a new pid.

Clint Byrum (clint-fewbar) wrote :

On further investigation, this does *not* affect jobs that block the 'stopping' event. So I believe this may be related more to the way job_process_run treates pre-stop. According to comments in init/job_process.c:

        /* We don't change the state if we're in post-start and there's
         * a post-start process running, or if we're in pre-stop and
         * there's a pre-stop process running; we wait for those to
         * finish instead.
         */
        if ((job->state == JOB_POST_START)
            && job->class->process[PROCESS_POST_START]
            && (job->pid[PROCESS_POST_START] > 0)) {
            state = FALSE;
        } else if ((job->state == JOB_PRE_STOP)
            && job->class->process[PROCESS_PRE_STOP]
            && (job->pid[PROCESS_PRE_STOP] > 0)) {
            state = FALSE;
        }

Later on, if (!state) calls job_change_goal(job, JOB_RESPAWN) ..

Digging into job_next_state a bit, I found that a goal of JOB_RESPAWN causes job_next_state to change the goal only if state is JOB_POST_START or JOB_PRE_STOP. JOB_POST_START changes the goal to JOB_START, which makes sense. But JOB_PRE_STOP changes the goal to JOB_START as well! This doesn't make much sense to me, and I thought the obvious fix would be to change it to be JOB_STOP, however this caused test_job to fail here:

...with post-start job and a goal of respawn
BAD: wrong value for job->goal, expected 1 got 0
 at tests/test_job.c:4129 (test_next_state).
/bin/bash: line 5: 29784 Aborted ${dir}$tst
FAIL: test_job

This was confusing until I realized that the test code is wrong:

    /* Check that the next state if we're respawning after a pre-stop
     * job is stopping with the goal changed.
     */
    TEST_FEATURE ("with post-start job and a goal of respawn");
    job->goal = JOB_RESPAWN;
    job->state = JOB_PRE_STOP;

    TEST_EQ (job_next_state (job), JOB_STOPPING);
    TEST_EQ (job->goal, JOB_START);

That should read 'with pre-stop job and a goal of respawn'.

So, this brings to the edge case that I think is causing this problem, and my perceived fix.

If you want a running job to be restarted, it should move from wherever it was, through stopping->pre-stop->killed->post-stop->starting-> ...

However, this doesn't happen. Instead the goal is changed to STOP, but job_change_goal() does not wait for that to happen, which its docblock would suggest.

Instead, it returns with the state set to JOB_PRE_STOP ..

This is my proposed fix.. though I'm not entirely confident this is the right way to handle it, and I haven't had a chance to test this yet.

=== modified file 'init/job.c'
--- init/job.c 2010-12-14 15:32:41 +0000
+++ init/job.c 2011-01-17 07:56:24 +0000
@@ -1265,6 +1265,12 @@
   nih_list_add (&job->blocking, &blocked->entry);

  job_change_goal (job, JOB_STOP);
+ if (job->state == JOB_PRE_STOP)
+ {
+ /* hack: should not happen but does */
+ job->goal = JOB_RESPAWN;
+ job_change_goal (job, JOB_STOP);
+ }
  job_change_goal (job, JOB_START);

  if (! wait)

scm (scm) on 2011-02-24
tags: added: glucid lucid
scm (scm) wrote :

I can confirm I see this behavior in upstart (0.6.5-8) on lucid. On several of our custom tasks, if I change pre-stop to post-stop, then restart behaves normally. I will see if I can craft an upstart job to mimic this behavior using stock tools.

Yeah there are quite a few bugs related to pre-stop scripts, they don't
really work like any other, so there are definite gaps in the coverage

On Thu, Feb 24, 2011 at 8:00 AM, scm <email address hidden> wrote:

> I can confirm I see this behavior in upstart (0.6.5-8) on lucid. On
> several of our custom tasks, if I change pre-stop to post-stop, then
> restart behaves normally. I will see if I can craft an upstart job to
> mimic this behavior using stock tools.
>
> --
> You received this bug notification because you are a member of Upstart
> Developers, which is subscribed to upstart .
> https://bugs.launchpad.net/bugs/703800
>
> Title:
> restart command fails to restart main process when pre-stop stanza
> exists
>
> Status in Upstart:
> New
> Status in “upstart” package in Ubuntu:
> New
>
> Bug description:
> While testing the portmap daemon's recent changes in Ubuntu which
> cause statd to properly follow it on stop/start, I tried to restart
> portmap, only to find that it was not in fact restarted.
>
> This happens with any job that has a pre-stop. The reason is that in
> job_restart(), the code does this:
>
>
> job_change_goal (job, JOB_STOP);
> job_change_goal (job, JOB_START);
>
> job_change_goal will return as soon as the *first* state change has
> been completed. If there is no pre-stop, that is the change from
> JOB_RUNNING to JOB_KILLED, which does dutifully kill the main process.
>
> However, if there is a pre-stop, the pre-stop is run, but then
> job_change_state returns to job_change_goal, which then returns to
> job_restart, which then changes the goal back to start, which makes
> the new job_next_state() one that will bypass the change to
> JOB_KILLED.
>
> This was found on upstart v0.6.7-3 in Ubuntu, but also exists in the
> code in the current trunk.
>
> TEST CASE
>
> 1. create job file /etc/init/test-restart-prestop.conf with this content:
> # test-restart-prestop
>
> pre-stop exec /bin/true
>
> exec /bin/sleep 3600
> 2. run "start test-restart-prestop" -- record pid of job
> 3. immediately run "restart test-restart-prestop" -- if bug is present,
> pid remains the same when it should be a new pid.
>

Having found another instance of this problem, the powernap package in Ubuntu, I am marking this as Confirmed in Upstart, and Triaged in Ubuntu.

Changed in upstart:
status: New → Confirmed
Changed in upstart (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
status: Confirmed → Triaged
summary: - restart command fails to restart main process when pre-stop stanza
+ init: restart command fails to restart main process when pre-stop stanza
exists
Changed in upstart:
status: Confirmed → Triaged
importance: Undecided → Medium
Clint Byrum (clint-fewbar) wrote :

FYI, xinetd in Ubuntu is also affected by this.

tags: added: testcase
maxaon (maxaon) on 2011-10-16
Changed in upstart (Ubuntu):
status: Triaged → Incomplete
status: Incomplete → Confirmed
David Weber (wb-munzinger) wrote :

Any timeline to fix this? It's still in Precise...

Changed in upstart (Ubuntu):
status: Confirmed → Triaged
Steve Langasek (vorlon) on 2012-06-01
Changed in upstart (Ubuntu):
assignee: nobody → James Hunt (jamesodhunt)
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Related blueprints