Comment 6 for bug 1867642

Maximiliano Bertacchini (maxiberta) wrote :

Thanks all for your feedback. The plan:

- With the addition of the STARTED state, I think it is extremely unlikely for PENDING tasks to get lost. Let's focus on lost STARTED tasks, then.

- ClickPackageUpload.scan_task and .review_task are plain string ids that can be used to retrieve the task via:
  - celery.result.AsyncResult(task_id) from the results backend (DB as per django-celery-results), or
  - celery_app.control.inspect() from running workers.
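A rough, untested sketch of both lookup paths follows; `celery_app` stands for the configured Celery application, and `lookup_task` is just an illustrative name:

    from celery.result import AsyncResult

    def lookup_task(celery_app, task_id):
        # Path 1: the results backend. django-celery-results keeps task state
        # in the DB, so this works even if no worker remembers the task.
        result = AsyncResult(task_id, app=celery_app)
        print(result.state)  # PENDING / STARTED / SUCCESS / ...

        # Path 2: ask the running workers. This only covers workers reachable
        # from this app, hence the partial view noted in the next point.
        inspector = celery_app.control.inspect()
        active = inspector.active() or {}      # {worker: [task_info, ...]}
        reserved = inspector.reserved() or {}
        for worker, tasks in list(active.items()) + list(reserved.items()):
            for info in tasks:
                if info.get("id") == task_id:
                    return worker, info
        return None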

- As queues/workers are partitioned by SCA release, each worker only has a partial view of running tasks. Thus, lost tasks have to be retrieved from the results backend.

- Cron job to run on the active leader node every 1h (sketched after this list):
  - Get all STARTED scan/review tasks from the results backend that are older than the task timeout (20 minutes), but no older than e.g. 24h to prevent infinite retry loops (this requires updating django-celery-results to the latest version, which added a date_created field to the TaskResult model [0]).
  - The task status can be updated to a custom state such as 'LOST' for easier tracing; this should be safe and won't affect Celery internals, as the task *is* effectively lost.
  - Fire a new task of the same type with the same args/kwargs as the retrieved task.
  - Update the scan_task/review_task field in the respective upload.
  - Profit.
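Roughly, the cron job could look like the sketch below. This assumes result_extended is enabled so that task_name/task_args/task_kwargs are recorded, and a JSON serializer; TASK_TIMEOUT, MAX_AGE and update_upload_task_id are illustrative names, not existing code:

    import json
    from datetime import timedelta

    from django.utils import timezone
    from django_celery_results.models import TaskResult

    TASK_TIMEOUT = timedelta(minutes=20)  # matches the task timeout above
    MAX_AGE = timedelta(hours=24)         # don't retry forever

    def update_upload_task_id(old_task_id, new_task_id):
        # Hypothetical helper: find the upload whose scan_task/review_task
        # equals old_task_id and point it at new_task_id.
        ...

    def recover_lost_tasks(celery_app):
        now = timezone.now()
        stale = TaskResult.objects.filter(
            status="STARTED",
            date_created__lt=now - TASK_TIMEOUT,   # older than the timeout
            date_created__gte=now - MAX_AGE,       # but not ancient
        )
        for task in stale:
            # Mark the old record with a custom state for easier tracing.
            task.status = "LOST"
            task.save(update_fields=["status"])

            # Re-fire the same task with the original args/kwargs. The exact
            # deserialization depends on the serializer and library version.
            args = json.loads(task.task_args or "[]")
            kwargs = json.loads(task.task_kwargs or "{}")
            new_result = celery_app.send_task(
                task.task_name, args=args, kwargs=kwargs)

            # Point the upload at the replacement task.
            update_upload_task_id(task.task_id, new_result.id)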

Does this make sense?

[0] https://github.com/celery/django-celery-results/pull/111