init: restart command fails to restart main process when pre-stop stanza exists

Bug #703800 reported by Clint Byrum
142
This bug affects 28 people
Affects Status Importance Assigned to Milestone
upstart
Triaged
Medium
Unassigned
upstart (Ubuntu)
Triaged
Medium
James Hunt

Bug Description

While testing the portmap daemon's recent changes in Ubuntu which cause statd to properly follow it on stop/start, I tried to restart portmap, only to find that it was not in fact restarted.

This happens with any job that has a pre-stop. The reason is that in job_restart(), the code does this:

    job_change_goal (job, JOB_STOP);
    job_change_goal (job, JOB_START);

job_change_goal will return as soon as the *first* state change has been completed. If there is no pre-stop, that is the change from JOB_RUNNING to JOB_KILLED, which does dutifully kill the main process.

However, if there is a pre-stop, the pre-stop is run, but then job_change_state returns to job_change_goal, which then returns to job_restart, which then changes the goal back to start, which makes the new job_next_state() one that will bypass the change to JOB_KILLED.

This was found on upstart v0.6.7-3 in Ubuntu, but also exists in the code in the current trunk.

TEST CASE

1. create job file /etc/init/test-restart-prestop.conf with this content:
# test-restart-prestop

pre-stop exec /bin/true

exec /bin/sleep 3600
2. run "start test-restart-prestop" -- record pid of job
3. immediately run "restart test-restart-prestop" -- if bug is present, pid remains the same when it should be a new pid.

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

On further investigation, this does *not* affect jobs that block the 'stopping' event. So I believe this may be related more to the way job_process_run treates pre-stop. According to comments in init/job_process.c:

        /* We don't change the state if we're in post-start and there's
         * a post-start process running, or if we're in pre-stop and
         * there's a pre-stop process running; we wait for those to
         * finish instead.
         */
        if ((job->state == JOB_POST_START)
            && job->class->process[PROCESS_POST_START]
            && (job->pid[PROCESS_POST_START] > 0)) {
            state = FALSE;
        } else if ((job->state == JOB_PRE_STOP)
            && job->class->process[PROCESS_PRE_STOP]
            && (job->pid[PROCESS_PRE_STOP] > 0)) {
            state = FALSE;
        }

Later on, if (!state) calls job_change_goal(job, JOB_RESPAWN) ..

Digging into job_next_state a bit, I found that a goal of JOB_RESPAWN causes job_next_state to change the goal only if state is JOB_POST_START or JOB_PRE_STOP. JOB_POST_START changes the goal to JOB_START, which makes sense. But JOB_PRE_STOP changes the goal to JOB_START as well! This doesn't make much sense to me, and I thought the obvious fix would be to change it to be JOB_STOP, however this caused test_job to fail here:

...with post-start job and a goal of respawn
BAD: wrong value for job->goal, expected 1 got 0
 at tests/test_job.c:4129 (test_next_state).
/bin/bash: line 5: 29784 Aborted ${dir}$tst
FAIL: test_job

This was confusing until I realized that the test code is wrong:

    /* Check that the next state if we're respawning after a pre-stop
     * job is stopping with the goal changed.
     */
    TEST_FEATURE ("with post-start job and a goal of respawn");
    job->goal = JOB_RESPAWN;
    job->state = JOB_PRE_STOP;

    TEST_EQ (job_next_state (job), JOB_STOPPING);
    TEST_EQ (job->goal, JOB_START);

That should read 'with pre-stop job and a goal of respawn'.

So, this brings to the edge case that I think is causing this problem, and my perceived fix.

If you want a running job to be restarted, it should move from wherever it was, through stopping->pre-stop->killed->post-stop->starting-> ...

However, this doesn't happen. Instead the goal is changed to STOP, but job_change_goal() does not wait for that to happen, which its docblock would suggest.

Instead, it returns with the state set to JOB_PRE_STOP ..

This is my proposed fix.. though I'm not entirely confident this is the right way to handle it, and I haven't had a chance to test this yet.

=== modified file 'init/job.c'
--- init/job.c 2010-12-14 15:32:41 +0000
+++ init/job.c 2011-01-17 07:56:24 +0000
@@ -1265,6 +1265,12 @@
   nih_list_add (&job->blocking, &blocked->entry);

  job_change_goal (job, JOB_STOP);
+ if (job->state == JOB_PRE_STOP)
+ {
+ /* hack: should not happen but does */
+ job->goal = JOB_RESPAWN;
+ job_change_goal (job, JOB_STOP);
+ }
  job_change_goal (job, JOB_START);

  if (! wait)

scm (scm)
tags: added: glucid lucid
Revision history for this message
scm (scm) wrote :

I can confirm I see this behavior in upstart (0.6.5-8) on lucid. On several of our custom tasks, if I change pre-stop to post-stop, then restart behaves normally. I will see if I can craft an upstart job to mimic this behavior using stock tools.

Revision history for this message
Scott James Remnant (scott) wrote : Re: [Bug 703800] Re: restart command fails to restart main process when pre-stop stanza exists

Yeah there are quite a few bugs related to pre-stop scripts, they don't
really work like any other, so there are definite gaps in the coverage

On Thu, Feb 24, 2011 at 8:00 AM, scm <email address hidden> wrote:

> I can confirm I see this behavior in upstart (0.6.5-8) on lucid. On
> several of our custom tasks, if I change pre-stop to post-stop, then
> restart behaves normally. I will see if I can craft an upstart job to
> mimic this behavior using stock tools.
>
> --
> You received this bug notification because you are a member of Upstart
> Developers, which is subscribed to upstart .
> https://bugs.launchpad.net/bugs/703800
>
> Title:
> restart command fails to restart main process when pre-stop stanza
> exists
>
> Status in Upstart:
> New
> Status in “upstart” package in Ubuntu:
> New
>
> Bug description:
> While testing the portmap daemon's recent changes in Ubuntu which
> cause statd to properly follow it on stop/start, I tried to restart
> portmap, only to find that it was not in fact restarted.
>
> This happens with any job that has a pre-stop. The reason is that in
> job_restart(), the code does this:
>
>
> job_change_goal (job, JOB_STOP);
> job_change_goal (job, JOB_START);
>
> job_change_goal will return as soon as the *first* state change has
> been completed. If there is no pre-stop, that is the change from
> JOB_RUNNING to JOB_KILLED, which does dutifully kill the main process.
>
> However, if there is a pre-stop, the pre-stop is run, but then
> job_change_state returns to job_change_goal, which then returns to
> job_restart, which then changes the goal back to start, which makes
> the new job_next_state() one that will bypass the change to
> JOB_KILLED.
>
> This was found on upstart v0.6.7-3 in Ubuntu, but also exists in the
> code in the current trunk.
>
> TEST CASE
>
> 1. create job file /etc/init/test-restart-prestop.conf with this content:
> # test-restart-prestop
>
> pre-stop exec /bin/true
>
> exec /bin/sleep 3600
> 2. run "start test-restart-prestop" -- record pid of job
> 3. immediately run "restart test-restart-prestop" -- if bug is present,
> pid remains the same when it should be a new pid.
>

Revision history for this message
Clint Byrum (clint-fewbar) wrote : Re: restart command fails to restart main process when pre-stop stanza exists

Having found another instance of this problem, the powernap package in Ubuntu, I am marking this as Confirmed in Upstart, and Triaged in Ubuntu.

Changed in upstart:
status: New → Confirmed
Changed in upstart (Ubuntu):
status: New → Confirmed
importance: Undecided → Medium
status: Confirmed → Triaged
summary: - restart command fails to restart main process when pre-stop stanza
+ init: restart command fails to restart main process when pre-stop stanza
exists
Changed in upstart:
status: Confirmed → Triaged
importance: Undecided → Medium
Revision history for this message
Clint Byrum (clint-fewbar) wrote :

FYI, xinetd in Ubuntu is also affected by this.

tags: added: testcase
maxaon (maxaon)
Changed in upstart (Ubuntu):
status: Triaged → Incomplete
status: Incomplete → Confirmed
Revision history for this message
David Weber (wb-munzinger) wrote :

Any timeline to fix this? It's still in Precise...

Changed in upstart (Ubuntu):
status: Confirmed → Triaged
Steve Langasek (vorlon)
Changed in upstart (Ubuntu):
assignee: nobody → James Hunt (jamesodhunt)
Revision history for this message
Patrick Hemmer (78luphr0rnk2nuqimstywepozxn9kl19tqh0tx66b5dki1xxsh5mkz9gl-launchpad-a811i2i3ytqlsztthjth0svbccw8inm65tmkqp9sarr553jq53in4xm1m) wrote :

It is now 2014. Any chance on getting this fixed?

Revision history for this message
Mike Power (mpower) wrote :

Is there a workaround to get pre-stop and restart to work together? I want to use pre-stop to leave a pid file behind should the process die. I can't do that with a post-stop because there is no way to tell if the post-stop is being called because a stop was issued or if the respawn limit was reached.

Revision history for this message
Ákos (axos88) wrote :

Guys, it has been 5(!!!) years, since this bug has been reported. Any chance for a fix?

Please raise severity to major, to reflect that this is a MAJOR issue with a CORE component in ubuntu, that affects a lot of services that would use pre-stop

Workaround-ish: service xxx restart works, because it invokes stop and then start.

Revision history for this message
Patrick Hemmer (78luphr0rnk2nuqimstywepozxn9kl19tqh0tx66b5dki1xxsh5mkz9gl-launchpad-a811i2i3ytqlsztthjth0svbccw8inm65tmkqp9sarr553jq53in4xm1m) wrote :

Given the age of this bug and the fact that the switch is being made to systemd, I doubt this will ever be fixed.

Revision history for this message
Paul Draper (paulddraper) wrote :

How is systemd relevant?

Precise is still supported. Trusty will be supported until 2019.

Neither of those have systemd.

dino99 (9d9)
tags: added: trusty
removed: glucid lucid
Revision history for this message
Patrick Hemmer (78luphr0rnk2nuqimstywepozxn9kl19tqh0tx66b5dki1xxsh5mkz9gl-launchpad-a811i2i3ytqlsztthjth0svbccw8inm65tmkqp9sarr553jq53in4xm1m) wrote :

> How is systemd relevant?

Because it replaced upstart, and support for discontinued software tends to get much less attention. If the issue was known for years and wasn't fixed while the software still had a future, it became extremely unlikely to be fixed once it didn't have a future.

> Precise is still supported. Trusty will be supported until 2019.

It's now 2021, and Trusty is in it's extended security maintenance phase, of which this issue certainly does not qualify for. So it seems the assessment was accurate.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Related blueprints

Remote bug watches

Bug watches keep track of this bug in other bug trackers.