Deserialising a job with a "kill_timer" attribute and "kill_process"="PROCESS_MAIN" results in an abort

Bug #1514609 reported by regmka
This bug affects 1 person
Affects           Status   Importance   Assigned to   Milestone
upstart (Ubuntu)  New      Undecided    Unassigned

Bug Description

Upstart sometimes aborts on a stateful re-execution
triggered by "telinit u":

job.c:1977: Assertion failed in job_deserialise: job->kill_process
Caught abort, core dumped
init:job.c:1977: Assertion failed in job_deserialise: job->kill_process
[ 69.668199] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000600

The attached file (sessions.json) is a salvaged dump of the Upstart state
that triggers the assertion failure; the problem evidently occurs while
processing the following piece:

[...]
          "name": "",
          "path": "\/com\/ubuntu\/Upstart\/jobs\/ureadahead\/_",
          "goal": "JOB_STOP",
          "state": "JOB_KILLED",
[...]
          "kill_timer": {
            "timeout": 180,
            "due": 245
          },
          "kill_process": "PROCESS_MAIN",
[...]

The issue was observed with the upstart package, version 1.12.1 (Ubuntu 14.04),
and is caused by the following code:

[init/job.c]

1954         json_kill_timer = json_object_object_get (json, "kill_timer");
1955
1956         if (json_kill_timer) {
[...]
1973                 nih_local NihTimer *kill_timer = job_deserialise_kill_timer (json_kill_timer);
1974                 if (! kill_timer)
1975                         goto error;
1976
1977                 nih_assert (job->kill_process);
1978                 job_process_set_kill_timer (job, job->kill_process,
1979                                             kill_timer->timeout);
1980                 job_process_adj_kill_timer (job, kill_timer->due);
1981         }

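The assertion (job->kill_process) fails in the routine job_deserialise()
if the deserialised job has an associated kill timer and
the field kill_process == PROCESS_MAIN. Presumably PROCESS_MAIN has the
numeric value 0, so nih_assert() treats a perfectly valid process type
as false.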
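
To illustrate why a bare truth test misbehaves here, the following is a
minimal sketch, not Upstart code. The enum is a hypothetical stand-in for
the real process type, and the values -1 for the invalid marker and 0 for
the main process are assumptions that this report does not confirm.

```c
#include <stdbool.h>

/* Hypothetical stand-in for Upstart's process type enum.
 * ASSUMPTION: the invalid marker is -1 and the main process is 0. */
typedef enum {
    SKETCH_PROCESS_INVALID = -1,
    SKETCH_PROCESS_MAIN    = 0,
    SKETCH_PROCESS_PRE_START
} SketchProcessType;

/* Mirrors the shape of the failing assertion: the value itself is
 * used as the condition, so only a non-zero value passes. */
static bool
sketch_passes_truth_test (SketchProcessType process)
{
    return process ? true : false;
}
```

Under these assumptions sketch_passes_truth_test (SKETCH_PROCESS_MAIN)
returns false, while SKETCH_PROCESS_INVALID, being non-zero, passes.
A truth test would therefore reject the valid value and accept the
invalid one.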

The issue might still affect trunk as well: neither job_process_kill()
nor job_serialise() performs a similar check. If the Upstart state is
serialised after job_process_kill() has run but before the job's kill
timer fires, the resulting state representation cannot be restored:
job->kill_timer is non-NULL and job->kill_process is not PROCESS_INVALID
(both as a result of job_process_set_kill_timer()), yet job->kill_process
may be PROCESS_MAIN, which trips the assertion.

Probably the assertion in question should read

 (job->kill_process != PROCESS_INVALID)

if job_process_set_kill_timer() is assumed to operate correctly.
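
The difference between the two checks can be sketched with a small
self-contained model. Everything below is hypothetical: the Model* names,
the struct layout and the enum values (-1 for invalid, 0 for main) are
assumptions made for illustration and are not taken from the Upstart source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* ASSUMPTION: invalid marker is -1, main process is 0. */
typedef enum {
    MODEL_PROCESS_INVALID = -1,
    MODEL_PROCESS_MAIN    = 0
} ModelProcessType;

/* Hypothetical reduced job state; not Upstart's Job structure. */
typedef struct {
    bool             has_kill_timer; /* stands in for job->kill_timer != NULL */
    int              timeout;
    ModelProcessType kill_process;
} ModelJob;

/* Models the effect the report attributes to job_process_set_kill_timer():
 * the timer is armed and kill_process records the process being killed. */
static void
model_set_kill_timer (ModelJob *job, ModelProcessType process, int timeout)
{
    assert (job != NULL);
    job->has_kill_timer = true;
    job->timeout        = timeout;
    job->kill_process   = process;
}

/* Restores a snapshot; returns false where the real code would abort. */
static bool
model_restore (const ModelJob *snapshot, ModelJob *out, bool use_proposed_check)
{
    assert (snapshot != NULL && out != NULL);
    if (snapshot->has_kill_timer) {
        bool ok = use_proposed_check
            ? (snapshot->kill_process != MODEL_PROCESS_INVALID)
            : (snapshot->kill_process != 0);
        if (! ok)
            return false;
    }
    *out = *snapshot;
    return true;
}
```

In this model, a snapshot taken right after
model_set_kill_timer (&job, MODEL_PROCESS_MAIN, 180) fails to restore with
the truth test and restores with the explicit comparison against the
invalid marker.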

Unfortunately the issue is extremely difficult to reproduce,
so further diagnostics are hard to collect; adding instrumentation
might also change the timing and hide the race that triggers the issue.
