Deserialising a job with the attribute "kill_timer" and "kill_process"="PROCESS_MAIN" results in abort
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
upstart (Ubuntu) | New | Undecided | Unassigned |
Bug Description
Upstart sometimes aborts on a stateful re-execution
triggered by "telinit u":
job.c:1977: Assertion failed in job_deserialise: job->kill_process
Caught abort, core dumped
init:job.c:1977: Assertion failed in job_deserialise: job->kill_process
[ 69.668199] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000600
The attached file (sessions.json) is a salvaged dump of the Upstart state
that triggers the assertion failure; the problem evidently occurs while
processing the following piece:
[...]
"name": "",
"path": "\/com\
"goal": "JOB_STOP",
"state": "JOB_KILLED",
[...]
"due": 245
},
[...]
The issue was observed with the package ubuntu-1.12.1 (Ubuntu 14.04)
and is caused by the following code:
[init/job.c]
1954 json_kill_timer = json_object_
1955
1956 if (json_kill_timer) {
[...]
1973 nih_local NihTimer *kill_timer = job_deserialise
1974 if (! kill_timer)
1975 goto error;
1976
1977 nih_assert (job->kill_
1978 job_process_
1979 kill_timer-
1980 job_process_
1981 }
The assertion (job->kill_process) fails in job_deserialise()
if the deserialised job has an associated kill timer and
its kill_process field is PROCESS_MAIN: PROCESS_MAIN has the
value 0, so the truth test in nih_assert() treats it as false.
It seems the issue might still affect the trunk as well:
there are no similar checks in job_process_kill()
or job_serialise(), so if the Upstart state is serialised
after job_process_kill() but before the job's kill timer fires,
the resulting state representation cannot be restored,
since job->kill_timer is non-NULL and job->kill_process
(as set by the job_process_ operation) is not PROCESS_INVALID.
Probably the assertion in question should read
(job->kill_process != PROCESS_INVALID)
if job_process_
Unfortunately, the issue is extremely difficult to reproduce,
so further diagnostics would be hard to collect, and adding
instrumentation might change the timing enough to mask the race
that triggers the issue.