The Lost Task
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Snap Store Server | Fix Released | Undecided | Maximiliano Bertacchini |
Bug Description
When celery workers go away or otherwise drop tasks on the floor, the rest of the system/flow expects those tasks to be completed eventually, which never happens, causing things to get stuck.
This is about behaviors not covered by our existing timeout/retry logic, which is by now pretty good at retrying slow or failed tasks but doesn't cover tasks that simply disappear.
For example, if a snap's review task is enqueued but then $something happens and the task is never processed, "task xxxxxxx waiting for execution" is the current status, but said task is neither complete, in the queue, nor being processed at that time. Subsequent uploads/releases of the snap will then fail because the in-progress task holds the per-snap queue.
The proposal is to periodically scan for tasks which should be in progress (this is known and stored somewhere, since we show the task id, for example, on a snap's page). Ideally, if a task is neither in progress nor in the queue waiting to be picked up, it could simply be rescheduled.
Alternatively, instead of looking at in-progress or in-queue tasks, just scan the database for tasks that have been marked as "in progress" for more than a conservative time (4-6 hours) and blindly reschedule them. This would be less racy and still means that snaps get auto-unwedged without manual intervention, but the waiting time might be a bit long.
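A rough sketch of that second approach (the model, field names and `reschedule` helper are placeholders, not the actual store schema):

```python
from datetime import datetime, timedelta, timezone

STUCK_AFTER = timedelta(hours=6)  # the conservative threshold mentioned above

def find_stuck(tasks, now=None):
    """Return task records still marked 'in progress' past the threshold.

    `tasks` is assumed to be an iterable of records exposing `status` and
    `started_at` (timezone-aware) attributes; these names are hypothetical.
    """
    now = now or datetime.now(timezone.utc)
    return [t for t in tasks
            if t.status == "in progress" and now - t.started_at > STUCK_AFTER]

def requeue_stuck(tasks, reschedule):
    """Blindly reschedule every stuck task.

    `reschedule` stands in for whatever helper re-enqueues the underlying
    Celery task for a given record.
    """
    for task in find_stuck(tasks):
        reschedule(task)
```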
Changed in snapstore:
assignee: nobody → Maximiliano Bertacchini (maxiberta)
status: New → In Progress

Changed in snapstore:
status: In Progress → Fix Committed

Changed in snapstore:
status: Fix Committed → Fix Released
Do we know how these tasks/workers are lost, i.e. what's that $something that leaves tasks in limbo? Any documented case or logs? I guess our ELK's two-week retention doesn't help...
By default, Celery tasks go from PENDING to SUCCESS/FAILURE; thus a pending task might be either in queue or being worked on.
I believe setting `task_track_started=True` [0] would help us diagnose the issue by distinguishing between PENDING stuck tasks (e.g. overloaded workers, but the task will eventually be processed) and STARTED stuck tasks (expected to be currently being processed in one worker). Then, we can inspect workers at runtime [1] to get the list of task ids currently in progress and compare with those in STARTED status. Otherwise, we'd additionally need to query rabbitmq, which would make things more complicated and racy, as mentioned above.
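Something along these lines, using Celery's documented `task_track_started` setting and the remote-control `inspect()` API; the cross-check against our own records is only illustrative:

```python
from celery import Celery

app = Celery("snapstore")           # broker/result backend config omitted
app.conf.task_track_started = True  # tasks report STARTED instead of staying PENDING

def started_but_not_running(started_task_ids):
    """Given the task ids we recorded as STARTED (e.g. in our DB), return
    those that no worker reports as currently active, i.e. the presumably
    lost ones."""
    active = app.control.inspect().active() or {}  # {worker: [task dicts]}
    running = {t["id"] for tasks in active.values() for t in tasks}
    return set(started_task_ids) - running
```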
Also, note that both PackageScanTask and PackageReviewTask use `acks_late=True`. As per the Celery docs: "Even if task_acks_late is enabled, the worker will acknowledge tasks when the worker process executing them abruptly exits or is signaled (e.g., KILL/INT, etc)", which would explain why, even with this flag, these tasks are not being retried (assuming the worker was killed). Setting `task_reject_on_worker_lost=True` allows the message to be re-queued instead, so the task will be executed again by the same worker or another worker. This should probably help but, on the other hand, enabling it can cause message loops [2].
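For reference, the task-level options look like this (the task body is just a stand-in for PackageReviewTask):

```python
from celery import Celery

app = Celery("snapstore")

@app.task(acks_late=True, reject_on_worker_lost=True)
def package_review(upload_id):
    """Stand-in for PackageReviewTask. With reject_on_worker_lost=True the
    message is re-queued when the worker process dies mid-task, so the same
    or another worker picks it up again. The flip side is the message-loop
    risk mentioned above: a task that reliably kills its worker will be
    retried forever unless something else caps it.
    """
    ...
```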
Alternatively, a cron job could prevent infinitely looping tasks by only re-queuing tasks created within, e.g., the last 24 hours.
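A minimal sketch of that guard, assuming the job can see each task's creation timestamp (the field name is hypothetical):

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=24)

def should_requeue(task, now=None):
    """Only re-queue tasks created within the last 24 hours, so a task that
    keeps killing its worker cannot loop forever. `task.created_at` is a
    hypothetical timezone-aware creation timestamp."""
    now = now or datetime.now(timezone.utc)
    return now - task.created_at <= MAX_AGE
```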
[0] https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-track-started
[1] https://docs.celeryproject.org/en/latest/userguide/workers.html#inspecting-workers
[2] https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-reject-on-worker-lost