Comment 2 for bug 1867642

Daniel Manrique (roadmr) wrote : Re: [Bug 1867642] Re: The Lost Task

On Mon, Mar 30, 2020 at 5:31 PM Maximiliano Bertacchini <email address hidden> wrote:

> Do we know how these tasks/workers are lost, ie. what's that $something
> that leaves tasks in limbo? Any documented case or logs? I guess our
> ELK's 2w retention doesn't help...
>

Not with certainty, no. Sometimes these are only reported days or weeks
after the fact (people can be patient!). But I did manage to catch the logs
for one task in time, before they were rotated: all I saw in ELK was the log
line for the task's start, and then nothing else.

Empirically (if anecdotally), these incidents typically map to network or
compute node outages and/or rollouts, so my theory is that it happens when
the celery process/worker crashes or goes away entirely. From your
description below, I imagine the task is being "worked on" and then the
worker dies; in that case the task presumably stays PENDING, we have no
record of what happened to the worker, and from the dashboard's point of
view it's still "waiting for execution" (true but useless :).

>
> By default, Celery tasks go from PENDING to SUCCESS/FAILURE; thus a
> pending task might be either in queue or being worked on.
>
> I believe setting `task_track_started=True` [0] would help us diagnose
> the issue by discerning between PENDING stuck tasks (eg. overloaded
> workers, but will eventually be processed) and STARTED stuck tasks
> (expected to be currently being processed in one worker). Then, we can
> inspect workers at runtime [1] to get the list of task ids currently in
> progress and compare with those in STARTED status. Otherwise, we'd
> additionally need to query rabbitmq which would make things more
> complicated and racy, as mentioned above.
>
> Also, note that both PackageScanTask and PackageReviewTask use
> `acks_late=True`. As per the Celery docs: "Even if task_acks_late is
> enabled, the worker will acknowledge tasks when the worker process
> executing them abruptly exits or is signaled (e.g., KILL/INT, etc)",
> which would explain why even with this flag, these tasks are not being
> retried (assuming worker was killed). Setting
> `task_reject_on_worker_lost=True` allows the message to be re-queued
> instead, so that the task will execute again by the same worker, or
> another worker. This should probably help but, otoh, enabling this can
> cause message loops [2].
>
> Alternatively, a cron job could prevent infinitely looping tasks by only
> re-queuing tasks created within e.g. the last 24h.
>
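
For my own reference, a rough sketch of the combination you're describing
(the app object, broker/backend, and task below are placeholders, not our
actual code):

    # Sketch only: track STARTED, ack late, and put the message back on the
    # queue if the worker process executing the task is killed.
    from celery import Celery

    app = Celery("example", broker="memory://", backend="cache+memory://")
    app.conf.task_track_started = True  # adds the PENDING -> STARTED step

    @app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
    def package_review(self, upload_id):
        ...  # placeholder body

    # Cross-check: which task ids do live workers claim to be running?
    active = app.control.inspect().active() or {}
    in_flight = {t["id"] for tasks in active.values() for t in tasks}
    # Any task stored as STARTED whose id is not in `in_flight` is a
    # candidate "lost" task.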

IIRC we already have relatively good retry behavior for failed tasks (a task
that failed within a living worker); could we simply mark STARTED tasks older
than a certain age (maybe the maximum timeout * retries we currently allow,
plus a small cushion) as FAILED and let our retry logic do its thing? (I'm
wildly speculating; this might not work at all, since I don't know whether
celery retries actually look at that status.) So instead of tasks timing out
internally, this would be an external watchdog of sorts.
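
Very roughly, something like this (every name below is invented; I don't
know offhand what shape our task-tracking records actually have):

    # Hypothetical watchdog: anything STARTED for longer than our worst-case
    # runtime is assumed lost and handed back to the normal retry path.
    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    STALE_AFTER = timedelta(hours=6)  # ~ timeout * retries, plus a cushion

    @dataclass
    class TaskRecord:  # stand-in for however we actually persist task state
        task_id: str
        status: str
        started_at: datetime

    def find_stale_started(records, now=None):
        now = now or datetime.now(timezone.utc)
        return [r for r in records
                if r.status == "STARTED" and now - r.started_at > STALE_AFTER]

    # Whatever this returns would be flipped to FAILED (or re-queued) so the
    # existing retry machinery takes over instead of waiting forever.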

If not, a cron job that forcibly requeues STARTED tasks older than 24h (or
12h? or 6h? tasks should typically process quickly) might suffice. The
concern is that if something is causing tasks to pile up, requeueing them
could make things worse.
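
The creation-age cap you mention would at least bound that; again purely
illustrative (app.send_task and the stored name/args are assumptions about
what we'd need to persist in order to re-issue a task):

    # Hypothetical re-queue helper with the "created in the last 24h" cap so
    # a genuinely broken task can't loop (or pile up) forever.
    from datetime import datetime, timedelta, timezone

    MAX_REQUEUE_AGE = timedelta(hours=24)

    def maybe_requeue(app, task_name, args, created_at):
        age = datetime.now(timezone.utc) - created_at
        if age > MAX_REQUEUE_AGE:
            return False  # too old; probably needs a human to look at it
        app.send_task(task_name, args=args)  # fresh message on the queue
        return True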

Thanks for looking into this!

>
> [0]
> https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-track-started
> [1]
> http://docs.celeryproject.org/en/latest/userguide/workers.html#inspecting-workers
> [2]
> https://docs.celeryproject.org/en/stable/userguide/configuration.html#task-reject-on-worker-lost
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1867642
>
> Title:
> The Lost Task
>
> Status in Snap Store:
> New
>
> Bug description:
> When celery workers go away or otherwise drop tasks on the floor, the
> rest of the system/flow expects those tasks to be completed
> eventually, which never happens, causing things to get stuck.
>
> This is about behaviors not covered by our existing timeout/retry
> logic, which is nowadays pretty good at retrying slow or failed tasks,
> but doesn't cover tasks that just disappear.
>
> For example, if a snap's review task is enqueued but then $something
> happens and the task is never processed, "task xxxxxxx waiting for
> execution" is the current status, but said task is neither complete,
> in the queue, nor being processed at that time. Then subsequent
> uploads/releases of the snap will fail because the in-progress task
> holds the per-snap queue.
>
> The proposal is to periodically scan for tasks which should be in
> progress (this is known and stored somewhere because we show, for
> example in a snap's page, the task id). Ideally if the task is not in
> progress or in the queue waiting to be picked up, it could be just
> rescheduled/requeued. This could be racy but provides the fastest
> recovery, as ideally the immediate next scan after a busted worker
> event would fix/rerun all the wedged tasks.
>
> Alternatively, instead of looking at in-progress or in-queue tasks,
> just scan the database for tasks that are marked as "in progress" for
> more than a conservative time (4-6 hours) and just blindly reschedule
> them. This could be less racy and still means that snaps will get
> auto-unwedged without manual intervention, but the waiting time might
> be a bit long.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/snapstore/+bug/1867642/+subscriptions
>

--
- Daniel