Comment 1 for bug 1867642

Maximiliano Bertacchini (maxiberta) wrote:

Do we know how these tasks/workers are lost, i.e. what's that $something that leaves tasks in limbo? Any documented case or logs? I guess our ELK's two-week retention doesn't help...

By default, Celery tasks go straight from PENDING to SUCCESS/FAILURE; thus a PENDING task might either still be sitting in the queue or already be getting worked on.

I believe setting `task_track_started=True` [0] would help us diagnose the issue by distinguishing between PENDING stuck tasks (e.g. overloaded workers, but they will eventually be processed) and STARTED stuck tasks (expected to be in progress on some worker). We could then inspect the workers at runtime [1] to get the list of task ids currently in progress and compare it with the set of tasks in STARTED status (see the sketch below). Otherwise, we'd additionally need to query RabbitMQ, which would make things more complicated and racy, as mentioned above.
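A minimal sketch of what that comparison could look like, assuming a hypothetical `app` Celery instance (the app name and broker URL are placeholders, as is the `find_stuck_started_tasks` helper); `task_track_started` and `app.control.inspect().active()` are the documented Celery setting and inspection API:

```python
from celery import Celery
from celery.result import AsyncResult

# Hypothetical Celery app; the broker URL is a placeholder.
app = Celery("store", broker="amqp://guest@localhost//")

# Report STARTED once a worker picks a task up, instead of leaving it PENDING.
app.conf.task_track_started = True


def find_stuck_started_tasks(candidate_ids):
    """Return task ids that claim to be STARTED but aren't active on any worker."""
    inspector = app.control.inspect()
    active = inspector.active() or {}  # {worker_name: [task_info_dict, ...]}
    active_ids = {task["id"] for tasks in active.values() for task in tasks}
    return [
        task_id
        for task_id in candidate_ids
        if AsyncResult(task_id, app=app).state == "STARTED"
        and task_id not in active_ids
    ]
```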

Also, note that both PackageScanTask and PackageReviewTask use `acks_late=True`. As per the Celery docs: "Even if task_acks_late is enabled, the worker will acknowledge tasks when the worker process executing them abruptly exits or is signaled (e.g., KILL/INT, etc)", which would explain why, even with this flag, these tasks are not being retried (assuming the worker was killed). Setting `task_reject_on_worker_lost=True` allows the message to be re-queued instead, so that the task will be executed again, either by the same worker or by another one. This should probably help but, on the other hand, enabling it can cause message loops [2].
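For reference, a hedged sketch of how these two options combine on a single task, reusing the hypothetical `app` from above (the task name and body are placeholders, not the real PackageScanTask); `acks_late` and `reject_on_worker_lost` are the per-task equivalents of the global `task_acks_late` / `task_reject_on_worker_lost` settings:

```python
@app.task(
    bind=True,
    acks_late=True,              # only ack the message after the task finishes...
    reject_on_worker_lost=True,  # ...and re-queue it if the worker process is killed
)
def package_scan(self, package_id):
    # Placeholder body; the real scan logic would go here.
    # Caveat: with reject_on_worker_lost, a task that itself crashes the worker
    # (e.g. OOM) gets redelivered indefinitely unless something breaks the loop.
    ...
```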

Alternatively, a cron job could prevent tasks from looping forever by only re-queuing tasks created within, e.g., the last 24h; a rough sketch follows.
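This assumes some record of stuck tasks with a creation timestamp; the `stuck_tasks` objects and the `requeue` callable below are hypothetical stand-ins for whatever the cron job would actually use:

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)


def requeue_stuck_tasks(stuck_tasks, requeue):
    """Re-queue stuck tasks, but only those created within the last 24h.

    `stuck_tasks` is any iterable of objects with `task_id` and a
    timezone-aware `created_at`; `requeue` is whatever re-submits the task.
    Older tasks are skipped, so a persistently failing task can't loop forever.
    """
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    for task in stuck_tasks:
        if task.created_at >= cutoff:
            requeue(task.task_id)
```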

[0] https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-track-started
[1] http://docs.celeryproject.org/en/latest/userguide/workers.html#inspecting-workers
[2] https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-reject-on-worker-lost