The Lost Task

Bug #1867642 reported by Daniel Manrique on 2020-03-16
This bug affects 1 person
Affects: Snap Store
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

When celery workers go away or otherwise drop tasks on the floor, the rest of the system/flow expects those tasks to be completed eventually, which never happens, causing things to get stuck.

This is about behaviors not covered by our existing timeout/retry logic, which is nowadays pretty good at retrying slow or failed tasks, but doesn't cover tasks that just disappear.

For example, if a snap's review task is enqueued but then $something happens and the task is never processed, "task xxxxxxx waiting for execution" remains the reported status, but said task is neither complete, in the queue, nor being processed at that time. Then subsequent uploads/releases of the snap will fail because the in-progress task holds the per-snap queue.

The proposal is to periodically scan for tasks which should be in progress (this is known and stored somewhere, because we show the task id, for example on a snap's page). Ideally, if the task is not in progress or in the queue waiting to be picked up, it could just be rescheduled/requeued. This could be racy but provides the fastest recovery, as ideally the immediate next scan after a busted-worker event would fix/rerun all the wedged tasks.

Alternatively, instead of looking at in-progress or in-queue tasks, just scan the database for tasks that have been marked as "in progress" for more than a conservative time (4-6 hours) and blindly reschedule them. This is less racy and still means that snaps will get auto-unwedged without manual intervention, but the waiting time might be a bit long.

Maximiliano Bertacchini wrote :

Do we know how these tasks/workers are lost, i.e. what's that $something that leaves tasks in limbo? Any documented case or logs? I guess our ELK's 2w retention doesn't help...

By default, Celery tasks go from PENDING to SUCCESS/FAILURE; thus a pending task might be either in queue or being worked on.

I believe setting `task_track_started=True` [0] would help us diagnose the issue by discerning between PENDING stuck tasks (e.g. overloaded workers, but will eventually be processed) and STARTED stuck tasks (expected to be currently being processed in one worker). Then, we can inspect workers at runtime [1] to get the list of task ids currently in progress and compare with those in STARTED status. Otherwise, we'd additionally need to query rabbitmq, which would make things more complicated and racy, as mentioned above.
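A minimal sketch of that comparison step, with the celery calls stubbed out (`find_lost_tasks` and its parameter names are hypothetical; `active_by_worker` mimics the mapping returned by `celery_app.control.inspect().active()`, i.e. worker name -> list of task dicts):

```python
def find_lost_tasks(started_ids, active_by_worker):
    """Return task ids that the results backend marks STARTED but that
    no worker reports as currently running.

    `started_ids`: iterable of task ids in STARTED status.
    `active_by_worker`: mapping of worker name -> list of task dicts
    (each with an "id" key), or None if no workers replied to inspect().
    """
    running = {
        task["id"]
        for tasks in (active_by_worker or {}).values()
        for task in tasks
    }
    return set(started_ids) - running
```

Anything in the returned set would be a candidate lost task; with no workers replying at all, every STARTED task is a candidate, which is itself a useful signal.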

Also, note that both PackageScanTask and PackageReviewTask use `acks_late=True`. As per the Celery docs: "Even if task_acks_late is enabled, the worker will acknowledge tasks when the worker process executing them abruptly exits or is signaled (e.g., KILL/INT, etc)", which would explain why even with this flag, these tasks are not being retried (assuming worker was killed). Setting `task_reject_on_worker_lost=True` allows the message to be re-queued instead, so that the task will execute again by the same worker, or another worker. This should probably help but, otoh, enabling this can cause message loops [2].
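Taken together, the settings discussed in this comment would look roughly like this in the Celery configuration (a sketch; whether to actually enable `task_reject_on_worker_lost` is exactly the message-loop trade-off noted above):

```python
# celeryconfig.py (sketch of the settings discussed above)
task_track_started = True           # surface the STARTED state for diagnosis
task_acks_late = True               # already set for PackageScanTask/PackageReviewTask
task_reject_on_worker_lost = True   # requeue the message if the worker dies mid-task;
                                    # beware: can cause message loops [2]
```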

Alternatively, a cron job could prevent infinite task loops by only re-queueing tasks created within, e.g., the last 24h.

[0] https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-track-started
[1] http://docs.celeryproject.org/en/latest/userguide/workers.html#inspecting-workers
[2] https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-reject-on-worker-lost

Daniel Manrique (roadmr) wrote :

On Mon, Mar 30, 2020 at 5:31 PM Maximiliano Bertacchini <
<email address hidden>> wrote:

> Do we know how these tasks/workers are lost, ie. what's that $something
> that leaves tasks in limbo? Any documented case or logs? I guess our
> ELK's 2w retention doesn't help...
>

Not with certainty, no. Sometimes these are only reported days or weeks
after the fact (people can be patient!). But I did manage to catch logs for
one task in time, before they were rotated. All I saw on ELK was the log for
the task's start, and then nothing else.

I can empiric-anecdotally say that they typically map to network or compute
node outages and/or rollouts, so my theory is that it happens when the
celery process/worker crashes or goes away entirely. I imagine from your
description below that the task is being "worked on" and then the worker
dies - in this case the task might be pending but we have no record of what
happened to the worker, and from dashboard's point of view it's still
"waiting for execution" (true but useless :).

>
> By default, Celery tasks go from PENDING to SUCCESS/FAILURE; thus a
> pending task might be either in queue or being worked on.
>
> I believe setting `task_track_started=True` [0] would help us diagnose
> the issue by discerning between PENDING stuck tasks (eg. overloaded
> workers, but will eventually be processed) and STARTED stuck tasks
> (expected to be currently being processed in one worker). Then, we can
> inspect workers in runtime [1] to get the list of task ids currently in
> progress and compare with those in STARTED status. Otherwise, we'd
> additionally need to query rabbitmq which would make things more
> complicated and racy, as mentioned above.
>
> Also, note that both PackageScanTask and PackageReviewTask use
> `acks_late=True`. As per the Celery docs: "Even if task_acks_late is
> enabled, the worker will acknowledge tasks when the worker process
> executing them abruptly exits or is signaled (e.g., KILL/INT, etc)",
> which would explain why even with this flag, these tasks are not being
> retried (assuming worker was killed). Setting
> `task_reject_on_worker_lost=True` allows the message to be re-queued
> instead, so that the task will execute again by the same worker, or
> another worker. This should probably help but, otoh, enabling this can
> cause message loops [2].
>
> Alternatively, a cron job could prevent infinite looping tasks by only
> re-queing tasks created within eg. the last 24h.
>

IIRC we already have relatively good retry behavior for failed tasks (task
failed within a living worker); could we simply mark STARTED tasks older
than a certain age (maybe the maximum amount of timeout * retries we
currently have plus a small cushion) as FAILED and letting our retry logic
do its thing? (I'm wildly speculating, might not work at all - I don't know
if celery retries actually look at that status). So instead of tasks timing
out internally, this is an external watchdog of sorts.

If not, a cron that forcibly requeues STARTED tasks that are more than 24h
old (or 12h? or 6h? tasks typically should process quickly) or so might
suffice. The concern here is that if something is causing tasks to pile up
and we're requeueing them we might make things worse.


Matias Bordese (matiasb) wrote :

Having the STARTED status could help in debugging the issue, and it sounds like a simple, non-breaking update, so +1 to that.

On the other hand, it would be nice to hunt down a particular case to investigate, and check logs to gather as much information as possible to understand the real issue. I guess whether it's worth investing time in a workaround without knowing the underlying problem depends on how often we have these and how "undebuggable" this is.

> IIRC we already have relatively good retry behavior for failed tasks (task
> failed within a living worker); could we simply mark STARTED tasks older
> than a certain age (maybe the maximum amount of timeout * retries we
> currently have plus a small cushion) as FAILED and letting our retry logic
> do its thing? (I'm wildly speculating, might not work at all - I don't know
> if celery retries actually look at that status). So instead of tasks timing
> out internally, this is an external watchdog of sorts.

Afaict, task status is read-only from our point of view and cannot be manipulated from outside the task itself (the task can call self.retry(), or raise Reject(requeue=True), for example). With the task effectively lost, what we can do is fire a new task with the same arguments (i.e. just the upload id in this case) and update the upload's review_task/scan_task field accordingly.
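That "fire a new task and record its id" step could be sketched as follows (`requeue_lost_task`, the `send_task` parameter, and the `upload` attributes are hypothetical stand-ins; `send_task` plays the role of `celery_app.send_task`, which returns a result object carrying the new task id):

```python
def requeue_lost_task(upload, task_name, send_task):
    """Fire a replacement task and record its id on the upload.

    `send_task` stands in for celery_app.send_task and must return an
    object with an `id` attribute; `upload` is assumed to have `id` and
    `scan_task` fields (hypothetical stand-ins for the real model).
    """
    result = send_task(task_name, args=(upload.id,))
    upload.scan_task = result.id
    return upload.scan_task
```

The same shape would apply to review tasks via the upload's review_task field.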

> If not, a cron that forcibly requeues STARTED tasks that are more than 24h
> old (or 12h? or 6h? tasks typically should process quickly) or so might
> suffice. The concern here is that if something is causing tasks to pile up
> and we're requeueing them we might make things worse.

I think we can target STARTED tasks that are older than the task timeout, which in staging is currently 20m, but no older than maybe 24h to prevent infinite loops.

Of course, my assumption here is that, whatever the real cause of lost tasks, these are all in STARTED; and PENDING tasks are not really affected.

Adding the STARTED status sounds like a good starting point for further debugging. Maybe Daniel can confirm how often we've seen this issue before?

Daniel Manrique (roadmr) wrote :

On Tue, Mar 31, 2020 at 4:50 PM Matias Bordese <email address hidden>
wrote:

> Having the STARTED status could help debugging the issue, and it sounds
> like a simple and non-breaking update, so +1 to that.
>
> On the other hand, it would be nice to hunt for a particular case to
> investigate and check logs to gather as much information as possible to
> understand the real issue. I guess the worth in investing time in a
> work-around without knowing the problem behind depends on how often do
> we have these and how "undebuggable" this is.
>

This happens frequently when the network goes bonkers or compute nodes are
lost; because we don't have a lot of visibility we typically only learn
about this when people report it.

One thing we could do *now* is look at uploads that are in pending states
and check whether there's a corresponding task for them. If not, we can
drill down in logs to try to find further causes, but as I said, the
typical real issue is "the worker went away", which by definition we can't
really prevent :/ so just a way to deal with those failures gracefully
should help.


--
- Daniel

Thanks all for your feedback. The plan:

- With the addition of the STARTED state, I think it is extremely unlikely for PENDING tasks to get lost. Let's focus on lost STARTED tasks, then.

- ClickPackageUpload.scan_task and .review_task are plain string ids that can be used to retrieve the task via:
  - celery.result.AsyncResult(task_id) from the results backend (DB as per django-celery-results), or
  - celery_app.control.inspect() from running workers.

- As queues/workers are partitioned by sca release, workers only contain a partial view of running tasks. Thus, tasks have to be retrieved from the results backend.

- Cron job to run on the active leader node every 1h.
  - Get all STARTED scan/review tasks from the results backend that are older than the task timeout (20 minutes), but no older than e.g. 24h to prevent infinite loops (this needs updating django-celery-results to the latest version, which added a date_created field to the TaskResult model [0]).
  - The task status can be updated to some custom state such as 'LOST' for easier tracing; this seems to be safe and won't affect celery internals as the task *is* effectively lost.
  - Fire a new task of the same type with the same args/kwargs as the retrieved task.
  - Update the scan_task/review_task field in the respective upload.
  - Profit.
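The selection step of that cron job could be sketched like this (plain dicts stand in for django-celery-results TaskResult rows; `requeue_candidates` and the constants are hypothetical names, with the 20m/24h bounds taken from the plan above):

```python
from datetime import datetime, timedelta

TASK_TIMEOUT = timedelta(minutes=20)  # current staging task timeout
MAX_AGE = timedelta(hours=24)         # ignore older tasks to prevent loops

def requeue_candidates(task_rows, now):
    """Return STARTED tasks older than the task timeout but within MAX_AGE.

    Each row mimics a TaskResult with "status" and "date_created" keys.
    """
    picked = []
    for row in task_rows:
        age = now - row["date_created"]
        if row["status"] == "STARTED" and TASK_TIMEOUT < age <= MAX_AGE:
            picked.append(row)
    return picked
```

Each picked row would then be marked 'LOST', re-fired with the same args/kwargs, and have its upload's scan_task/review_task field updated, as listed above.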

Does this make sense?

[0] https://github.com/celery/django-celery-results/pull/111
