Comment 4 for bug 1867642

Maximiliano Bertacchini (maxiberta) wrote :

> IIRC we already have relatively good retry behavior for failed tasks (task
> failed within a living worker); could we simply mark STARTED tasks older
> than a certain age (maybe the maximum amount of timeout * retries we
> currently have plus a small cushion) as FAILED and letting our retry logic
> do its thing? (I'm wildly speculating, might not work at all - I don't know
> if celery retries actually look at that status). So instead of tasks timing
> out internally, this is an external watchdog of sorts.

AFAICT, task status is read-only from our point of view and cannot be manipulated from outside the task itself (the task can call self.retry(), or raise Reject(requeue=True), for example). With the task effectively lost, all we can do is fire a new task with the same arguments (i.e. just the upload id in this case) and update the upload's review_task/scan_task field accordingly.
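
Something along these lines, for example (a minimal sketch only; scan_upload, Upload and the field names are placeholders for whatever we actually have, not the real code):

    # Minimal sketch; all names below are hypothetical.
    from reviews.models import Upload      # hypothetical model
    from reviews.tasks import scan_upload  # hypothetical Celery task

    def refire_scan(upload_id):
        """Fire a fresh scan task for an upload and record the new task id."""
        upload = Upload.objects.get(id=upload_id)
        result = scan_upload.delay(upload.id)   # same arguments as the lost task
        upload.scan_task = result.id            # point the upload at the new task
        upload.save(update_fields=["scan_task"])
        return result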

> If not, a cron that forcibly requeues STARTED tasks that are more than 24h
> old (or 12h? or 6h? tasks typically should process quickly) or so might
> suffice. The concern here is that if something is causing tasks to pile up
> and we're requeueing them we might make things worse.

I think we can target STARTED tasks that are older than the task timeout (currently 20m in staging), but no older than maybe 24h, so we don't keep requeueing the same stuck upload forever.

Of course, my assumption here is that, whatever the real cause of the lost tasks, they all end up in STARTED, and that PENDING tasks are not really affected.
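
A rough sketch of what such a watchdog could look like, run periodically from cron or celery beat; it assumes the Upload row records when the scan task was queued (scan_queued_at), and all names are hypothetical, not our actual schema:

    # Rough watchdog sketch; all model/task/field names are hypothetical.
    from datetime import timedelta

    from celery.result import AsyncResult
    from django.utils import timezone

    from reviews.models import Upload      # hypothetical model
    from reviews.tasks import scan_upload  # hypothetical Celery task

    TASK_TIMEOUT = timedelta(minutes=20)   # current staging task timeout
    MAX_AGE = timedelta(hours=24)          # give up instead of looping forever

    def requeue_stuck_scans():
        now = timezone.now()
        stuck = Upload.objects.filter(
            scan_queued_at__lt=now - TASK_TIMEOUT,  # older than the task timeout...
            scan_queued_at__gt=now - MAX_AGE,       # ...but not hopelessly old
        )
        for upload in stuck:
            if AsyncResult(upload.scan_task).status == "STARTED":
                result = scan_upload.delay(upload.id)  # re-fire with the same args
                upload.scan_task = result.id
                upload.scan_queued_at = now
                upload.save(update_fields=["scan_task", "scan_queued_at"])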

Adding the STARTED status sounds like a good starting point for further debugging. Maybe Daniel can confirm how often we've seen this issue before?