The supervisor should kill all running tasks in the event of a critical job failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenQuake Engine | Fix Released | High | Lars Butler |
Bug Description
At present, if oq-engine encounters an error in a task, the job is marked as 'failed' and the calculation is aborted. Tasks still waiting in the queue are aborted as soon as they are taken out of the queue. However, tasks that are already executing at that point continue to run needlessly.
The goal of this change is to revoke all in-queue or running tasks immediately once a failure is detected.
We should be able to use Celery's built-in `revoke` functionality to accomplish this: http://
To get the currently executing tasks, we can use Celery's `inspect` API: http://
From this, we can get the `active` tasks, then filter tasks by the job_id we're concerned with (in case multiple jobs are running concurrently).
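The inspect-then-revoke approach described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the actual fix: `active_task_ids` and `revoke_job_tasks` are hypothetical names, `control` stands for a Celery control object (for example `app.control`), and the sketch assumes that each task receives the job id as a `job_id` keyword argument and that `inspect().active()` reports `kwargs` as a dict. Some Celery versions report arguments as strings instead, in which case the filter would need to change.

```python
def active_task_ids(active, belongs_to_job):
    """Collect the ids of active tasks accepted by `belongs_to_job`.

    `active` is the mapping returned by Celery's `inspect().active()`:
    {worker_name: [task_info_dict, ...]}. It is None when no worker replies.
    """
    task_ids = []
    for tasks in (active or {}).values():
        for info in tasks:
            if belongs_to_job(info):
                task_ids.append(info["id"])
    return task_ids


def revoke_job_tasks(control, job_id):
    """Revoke every active task of `job_id`; return the revoked task ids."""

    def belongs(info):
        # Assumption: the job id is passed to tasks as a `job_id` keyword
        # argument and is reported here as part of a dict.
        kwargs = info.get("kwargs")
        return isinstance(kwargs, dict) and kwargs.get("job_id") == job_id

    task_ids = active_task_ids(control.inspect().active(), belongs)
    for task_id in task_ids:
        # terminate=True asks the worker to kill the process that is
        # currently executing the task, not only to skip it if still queued.
        control.revoke(task_id, terminate=True)
    return task_ids
```

With a real Celery application this would be called as `revoke_job_tasks(app.control, job_id)` from the supervisor once a failure is detected. Filtering on the job id keeps tasks of other concurrently running jobs untouched.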
NOTE: Using `revoke` to kill tasks in this way will produce error messages like the following.
[2013-05-21 14:18:05,105: ERROR/MainProcess] Task openquake.
WorkerLostError: Worker exited prematurely.
This doesn't seem to cause any issues.
Changed in oq-engine:
milestone: none → 1.0.0
assignee: nobody → Lars Butler (lars-butler)
importance: Undecided → High
status: New → Confirmed
Changed in oq-engine:
status: Confirmed → In Progress
description: updated
description: updated
Changed in oq-engine:
status: In Progress → Fix Committed
Changed in oq-engine:
status: Fix Committed → Fix Released
Pull request: https://github.com/gem/oq-engine/pull/1210