init: possible issue with blocked events?
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| upstart | Invalid | Undecided | Unassigned | |
| 0.3 | Incomplete | Medium | Unassigned | |
Bug Description
I've found 3 problems and fixed (patched, more like) 2 of them. My question is: has this been found and fixed already in later versions?
Details:
1) The test below makes init dump core and restart, triggered by an assert in job.c:
service
respawn
script
exit 0
end script
post-start script
sleep 4
end script
As far as I could see JOB_RUNNING was legitimate, so I updated the assert condition to allow it, and respawning worked.
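Roughly, the change looks like this (an illustrative sketch, not the exact 0.3.8 job.c code):

/* Sketch only - the real assert sits in a larger switch over the
 * process type. The post-start process can be reaped after the main
 * process has already respawned, so the job is legitimately back in
 * JOB_RUNNING by then. */
case PROCESS_POST_START:
        /* was: nih_assert (job->state == JOB_POST_START); */
        nih_assert ((job->state == JOB_POST_START)
                    || (job->state == JOB_RUNNING));
        break;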
2) after the change above, stopping this job produced another core. This time PROCESS_POST_START got JOB_WAITING.
I patched this as well by allowing JOB_WAITING and setting state and stop to FALSE in this case.
I'm not sure the second fix is a good one.
I tried to compare against 0.5, but it's been re-written in C++, so there's no simple diff.
Alex Nekrasov (ennnot) wrote (#2):
this is my second attempt at fixing it in 0.3.8. The first fix was bad.
I added checks to job_next_state() to handle respawning and post-start/pre-stop processes as you suggested.
I also made jobs forget their "blocked" status when advanced to WAITING. Not totally sure about this one.
In main.c, I made it reboot on crash rather than restart init.
As a result, my tests passed, but I still got a new core from init reported just today. The problem is, these things are getting harder to reproduce.
Alex Nekrasov (ennnot) wrote (#3):
and the main.c
Alex Nekrasov (ennnot) wrote (#4):
here's a test that let me catch the main/pre-stop race, though it usually needs to run for a while before it triggers anything:
#!/bin/sh
# Start and stop the test2 job in a loop until init dumps core.
while true
do
    # A core file in /cores means init crashed: save the log and stop.
    if ls /cores/core*
    then
        echo cores!
        cp /var/log/messages ${HOME}
        exit 1
    fi
    initctl status test2
    if ! initctl start test2
    then
        echo start failed
        exit 1
    fi
    initctl status test2
    if ! initctl stop test2
    then
        echo stop failed
        exit 1
    fi
done
Could you supply your proposed changes as a diff/patch rather than new copies of the source files?
A similar bug exists in 0.5, bug #381048.
This is caused by the respawn handling being done before the check for not transitioning a state. It's very possible that this is just a duplicate, and the bug has existed that long.
I'm not clear on why you've modified job_next_state() or why you've introduced checks for JOB_WAITING anywhere - could you explain those? To me it looks like the job_child_reaper() change is sufficient, which is the proposed fix for bug #381048.
Changed in upstart:
milestone: none → 0.3.10
Alex Nekrasov (ennnot) wrote (#7):
as far as I understand, the job_child_reaper() fix addresses the respawn problem only. There's a general problem of advancing job state beyond POST_START when post-start is still running, etc. As far as I could tell, the idea is to remain in PRE/POST_START or PRE/POST_STOP until those processes have finished, since we don't kill them. Is that so?
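To make that rule concrete, here's a tiny standalone model of the invariant (my own sketch, not Upstart's job.c):

/* Standalone model of the invariant: don't advance past POST_START
 * while the post-start process is still alive. Illustration only. */
#include <stdio.h>
#include <stdbool.h>

typedef enum {
        JOB_WAITING, JOB_STARTING, JOB_PRE_START, JOB_RUNNING,
        JOB_POST_START, JOB_PRE_STOP, JOB_STOPPING
} JobState;

typedef struct {
        JobState state;
        bool     main_alive;
        bool     post_start_alive;
} Job;

/* Called when the main process is reaped. */
static void main_exited (Job *job)
{
        job->main_alive = false;
        if ((job->state == JOB_POST_START) && job->post_start_alive) {
                /* Hold the state until post-start finishes. */
                printf ("main gone, post-start still running: hold state\n");
                return;
        }
        job->state = JOB_STOPPING;
}

/* Called when the post-start process is reaped. */
static void post_start_exited (Job *job)
{
        job->post_start_alive = false;
        if (job->state == JOB_POST_START)
                job->state = job->main_alive ? JOB_RUNNING : JOB_STOPPING;
}

int main (void)
{
        Job job = { JOB_POST_START, true, true };

        main_exited (&job);       /* main dies first, as in my test job */
        post_start_exited (&job); /* only now may the transition happen */
        printf ("final state: %d (JOB_STOPPING = %d)\n", job.state, JOB_STOPPING);
        return 0;
}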
job_child_reaper() is one place where job state is advanced. The other is job_handle_event().
The clearing of the "blocked" flag when the job goes to JOB_WAITING was mainly due to my lack of understanding of what DELETED means. I was getting my jobs advanced to JOB_DELETED; in my environment we don't delete jobs, so the fix works for me, but I wouldn't propose it as a permanent fix. If the other places are fixed, it should never come to that situation.
I'll do the diff/patch thing next time. I'm not yet used to working with open source projects.
Scott James Remnant (Canonical) (canonical-scott) wrote (#8): Re: [Bug 387216] Re: 0.3 assert if post-start process ends after main process
On Mon, 2009-06-15 at 15:48 +0000, Alex Nekrasov wrote:
> as far as I understand job_child_reaper() fix addresses the respawn
> problem only. There's a general problem of advancing job state beyond
> POST_START when post-start is still running, etc. As far as I could
> tell, the idea is to remain in PRE/POST_START or PRE/POST_STOP until
> those processes have finished, since we don't kill them. Is that so?
>
That's absolutely correct, yes.
> job_child_reaper() is one place where job state is advanced. The other
> is job_handle_event()
>
job_handle_event() will not advance the state of a job that is blocked
on the event being handled.
job->blocked will only be non-NULL if the job is in one of the blocking
states, that is JOB_STARTING and JOB_STOPPING.
Since the only thing that will advance the state beyond these, when
job->blocked has been set, is the event-finished handling (which sets
job->blocked back to NULL again), I'm not sure that you're correct.
Could you provide a test case?
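To illustrate the invariant being described, a minimal model (again a sketch, not the actual job.c code):

/* Model of the job->blocked rule: blocked is non-NULL only in the
 * blocking states, and only the event-finished path clears it and
 * so permits the state to advance. Illustration only. */
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

typedef enum { JOB_WAITING, JOB_STARTING, JOB_RUNNING, JOB_STOPPING } JobState;

typedef struct Event Event;     /* stand-in for Upstart's Event type */

typedef struct {
        JobState  state;
        Event    *blocked;      /* non-NULL only in JOB_STARTING/JOB_STOPPING */
} Job;

/* The only place blocked is cleared. */
static void event_finished (Job *job)
{
        job->blocked = NULL;
}

/* Advancing out of a blocking state requires blocked == NULL. */
static void job_advance (Job *job, JobState next)
{
        if ((job->state == JOB_STARTING) || (job->state == JOB_STOPPING))
                assert (job->blocked == NULL);
        job->state = next;
}

int main (void)
{
        Job job = { JOB_STARTING, (Event *)1 };  /* fake non-NULL blocking event */

        event_finished (&job);            /* event done: blocked -> NULL */
        job_advance (&job, JOB_RUNNING);  /* now the advance is legal */
        printf ("state: %d\n", job.state);
        return 0;
}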
Pure code review:
job.c:970-974 job->blocked should never be non-NULL in the JOB_WAITING
state, so this seems unnecessary.
job.c:1044- I'm not sure I follow the logic in this code. In
particular, the assertion that we should stop if there's no main
process is wrong. It's better to not try and change the state at all.
job.c:1751-1757 this can never be reached; the case asserts that the
job state is only JOB_POST_STOP, so checking for a different value is
meaningless.
Scott
--
Scott James Remnant
<email address hidden>
summary:
- 0.3 assert if post-start process ends after main process
+ 0.3 possible issue with blocked events?
Scott James Remnant (Canonical) (canonical-scott) wrote (#9): Re: 0.3 assert if post-start process ends after main process
The respawn/post-start issue is being tracked at bug #381048;
we'll continue the discussion about job->blocked here.
Thanks
Changed in upstart:
milestone: 0.3.10 → 0.3-later
Scott James Remnant (Canonical) (canonical-scott) wrote (#10): Re: 0.3 possible issue with blocked events?
Did you mean "without checking for a live pre-start", or did you mean "post-start/pre-stop"?
The pre-start script is not run until the JOB_PRE_START state, which happens only after job_handle_event() has run.
Changed in upstart:
milestone: 0.3-later → none
importance: Undecided → Medium
status: New → Incomplete
(status mess to move to 0.3)
Changed in upstart:
importance: Medium → Undecided
status: Incomplete → Invalid
summary:
- 0.3 possible issue with blocked events?
+ init: possible issue with blocked events?
Alex Nekrasov (ennnot) wrote (#12):
give me a couple of days. I'll need to revert to the original 0.3.8 and run the test, then I can give you a trace.
Scott James Remnant (Canonical) (canonical-scott) wrote (#13): Re: [Bug 387216] Re: init: possible issue with blocked events?
On Thu, 2009-06-18 at 15:42 +0000, Alex Nekrasov wrote:
> give me a couple days. I'll need to revert to the original 0.3.8 and run
> the test, then I can give you a trace
>
Thanks!
Scott
--
Scott James Remnant
<email address hidden>
Err, sorry, wrong bug :-)
Any luck re-running that test?
Alex Nekrasov (ennnot) wrote (#16):
sorry, been on vacation. But no matter, here we go.
The test2 job tries to create a race between the main process and pre-stop. As you see, there's no respawn flag. The test.sh script starts and stops test2 until it detects a failure, which it does by looking for a core file in /cores.
I ran test.sh for quite some time on an otherwise idle machine without failures. Then I added load to the machine in the hope of altering the timing, and within several minutes got init to dump core.
Attached are the test, the core file, the init binary, and /var/log/messages filtered by "spa init:" (spa is our hostname).
This is the original 0.3.8 init binary, i.e. without my changes.
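Since the attachment itself isn't reproduced in this thread, the test2 job would have looked something like this (a guess at its contents, not the real file):

# Hypothetical reconstruction of the attached test2 job. A short-lived
# main process plus a pre-stop script sets up the main/pre-stop race;
# note there is no respawn flag.
service
script
sleep 1
end script
pre-stop script
sleep 1
end script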
On Thu, 2009-07-09 at 11:35 +0000, Alex Nekrasov wrote:
> The test2 job tries to create a race between the main process and pre-
> stop. As you see, there's no respawn flag. The test.sh script starts
> and stops test2 until it detects a failure, which it does by looking
> for a core file in /cores.
>
Can you try the same with 0.6.0?
Scott
--
Scott James Remnant
<email address hidden>
Alex Nekrasov (ennnot) wrote (#18):
not easy. AFAIK 0.6 uses different/incompatible job definitions - I mean the basic runlevels, stuff like /etc/event.d, etc. Is that so?
On Mon, Jul 13, 2009 at 6:25 PM, Scott James Remnant <email address hidden> wrote:
> On Thu, 2009-07-09 at 11:35 +0000, Alex Nekrasov wrote:
>
>> The test2 job tries to create a race between the main process and pre-
>> stop. As you see, there's no respawn flag. The test.sh script starts
>> and stops test2 until it detects a failure, which it does by looking
>> for a core file in /cores.
>>
> Can you try the same with 0.6.0?
>
> Scott
> --
> Scott James Remnant
> <email address hidden>
>
On Mon, 2009-07-13 at 18:11 +0000, Alex Nekrasov wrote:
> not easy. AFAIK 0.6 uses different/incompatible job definitions - I
> mean the basic runlevels, stuff like /etc/event.d, etc. Is that so?
>
They're largely compatible, aside from a few minor differences. The
main one is that:
start on foo
start on bar
start on baz
should be changed to:
start on foo or bar or baz
Scott
--
Scott James Remnant
<email address hidden>
Tracked this one down and marked it as a duplicate of the more recent bug, which has the reason for the crash.
The bug is that the running process exiting transitions the state - it should stay in POST_START if the post-start script is still running, and then transition directly to stopping after.
0.5 should be checked to see if it's still got the same bug.
If possible, we should backport a patch to 0.3, since it's still in active use.
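A sketch of the shape of that fix (illustrative only; apart from job_next_state(), the helper names here are invented rather than taken from job.c):

/* In the main-process reaper: decide whether the state may transition
 * at all *before* considering respawn. Sketch, not the actual diff.
 * post_start_alive(), job_respawn() and job_change_state() are
 * assumed/hypothetical names. */
if ((job->state == JOB_POST_START) && post_start_alive (job)) {
        /* Post-start still running: stay in JOB_POST_START; when the
         * post-start process is reaped, go straight to stopping. */
        return;
}

if (job->respawn)
        job_respawn (job);
else
        job_change_state (job, job_next_state (job));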