The supervisor should kill all running tasks in the event of a critical job failure

Bug #1180271 reported by Lars Butler
Affects | Status | Importance | Assigned to | Milestone
OpenQuake Engine | Fix Released | High | Lars Butler |

Bug Description

At present, if oq-engine encounters an error in a task, the job is marked as 'failed' and the calculation is aborted. Any tasks still waiting in the queue are aborted as soon as they are taken out of the queue. However, any tasks which are already executing at that point continue to run needlessly.

The goal of this change is to revoke all in-queue or running tasks immediately once a failure is detected.

We should be able to use Celery's built-in `revoke` functionality to accomplish this: http://docs.celeryproject.org/en/latest/userguide/workers.html#revoking-tasks

To get the currently executing tasks, we can use the Celery `inspect` API: http://docs.celeryproject.org/en/latest/reference/celery.app.control.html?highlight=inspect#celery.app.control.Inspect

From this, we can get the `active` tasks, then filter tasks by the job_id we're concerned with (in case multiple jobs are running concurrently).
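
A rough sketch of the approach described above, not the actual oq-engine fix. It assumes `control` is a Celery control object (e.g. `app.control`) offering `inspect().active()` and `revoke(task_id, terminate=True)`. The helper names `task_ids_for_job` and `revoke_job_tasks`, and the `get_job_id` callable, are hypothetical: how a job id is recovered from a task's reported `args`/`kwargs` depends on how the tasks are declared and on the Celery version (some versions report them as repr strings rather than structures).

```python
def task_ids_for_job(active, job_id, get_job_id):
    """Collect ids of active tasks that belong to `job_id`.

    `active` has the shape returned by Celery's `inspect().active()`:
    a dict mapping worker name -> list of task-info dicts, or None if
    no worker replied. `get_job_id` is a hypothetical callable that
    extracts the job id from one task-info dict.
    """
    ids = []
    for tasks in (active or {}).values():
        for task in tasks or []:
            if get_job_id(task) == job_id:
                ids.append(task["id"])
    return ids


def revoke_job_tasks(control, job_id, get_job_id):
    """Revoke every currently executing task of the given job."""
    active = control.inspect().active()
    ids = task_ids_for_job(active, job_id, get_job_id)
    for task_id in ids:
        # terminate=True asks the worker to kill the process that is
        # already running the task, not just to skip it when dequeued.
        control.revoke(task_id, terminate=True)
    return ids
```

If the job id happened to be passed as a keyword argument and reported as a dict, a call might look like `revoke_job_tasks(app.control, job_id, lambda t: t["kwargs"].get("job_id"))`, where `app` is the Celery application. Filtering by job id is what keeps concurrently running jobs untouched.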

NOTE: Using `revoke` to kill running tasks in this way produces error messages like the following:

[2013-05-21 14:18:05,105: ERROR/MainProcess] Task openquake.engine.calculators.hazard.classical.core.hazard_curves[a8b57bed-e724-405e-ae6c-ae0c99608aeb] raised exception: WorkerLostError('Worker exited prematurely.',)
WorkerLostError: Worker exited prematurely.

This doesn't seem to cause any issues.

Changed in oq-engine:
milestone: none → 1.0.0
assignee: nobody → Lars Butler (lars-butler)
importance: Undecided → High
status: New → Confirmed
Changed in oq-engine:
status: Confirmed → In Progress
description: updated
description: updated
Lars Butler (lars-butler) wrote :
Michele Simionato (michele-simionato) wrote :
Changed in oq-engine:
status: In Progress → Fix Committed
Changed in oq-engine:
status: Fix Committed → Fix Released