Comment 5 for bug 1867642

Daniel Manrique (roadmr) wrote : Re: [Bug 1867642] Re: The Lost Task

On Tue, Mar 31, 2020 at 4:50 PM Matias Bordese <email address hidden>
wrote:

> Having the STARTED status could help debugging the issue, and it sounds
> like a simple and non-breaking update, so +1 to that.
>
> On the other hand, it would be nice to find a particular case to
> investigate and check its logs, gathering as much information as
> possible to understand the real issue. I guess whether it is worth
> investing time in a workaround without knowing the underlying problem
> depends on how often we have these and how "undebuggable" this is.
>

This happens frequently when the network goes bonkers or compute nodes are
lost; because we don't have a lot of visibility we typically only learn
about this when people report it.

One thing we could do *now* is look at uploads that are in pending states
and check whether there's a corresponding task for them. If not, we can
drill down in logs to try to find further causes, but as I said, the
typical real issue is "the worker went away", which by definition we can't
really prevent :/ so just a way to deal with those failures gracefully
should help.
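
A rough sketch of that check, just to make the idea concrete. Everything
named here is an assumption for illustration (the PendingUpload record,
its fields, and find_uploads_without_task are hypothetical, not the
store's actual schema or code). It assumes we can collect the set of task
ids that are currently queued or held by a worker from somewhere; with
Celery, the inspect API (active/reserved/scheduled) could probably supply
the worker side, though it would not see messages still sitting in the
broker.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class PendingUpload:
    # Hypothetical record; field names are illustrative only.
    upload_id: str
    task_id: Optional[str]


def find_uploads_without_task(
    pending_uploads: Iterable[PendingUpload],
    known_task_ids: Set[str],
) -> List[PendingUpload]:
    """Return pending uploads whose task is neither queued nor on a worker.

    known_task_ids is assumed to be gathered elsewhere (queue contents plus
    whatever the workers report as active or reserved).
    """
    return [
        upload
        for upload in pending_uploads
        if upload.task_id is None or upload.task_id not in known_task_ids
    ]


if __name__ == "__main__":
    uploads = [
        PendingUpload("upload-1", "task-aaa"),
        PendingUpload("upload-2", "task-bbb"),
        PendingUpload("upload-3", None),
    ]
    # Only task-aaa is still visible to the queue/workers in this example.
    for lost in find_uploads_without_task(uploads, {"task-aaa"}):
        print(lost.upload_id, lost.task_id)
```

The uploads this returns are the ones worth drilling into logs for, and
would also be the candidates for a graceful requeue.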

>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1867642
>
> Title:
> The Lost Task
>
> Status in Snap Store:
> New
>
> Bug description:
> When celery workers go away or otherwise drop tasks on the floor, the
> rest of the system/flow expects those tasks to be completed
> eventually, which never happens, causing things to get stuck.
>
> This is about behaviors not covered by our existing timeout/retry
> logic, which is nowadays pretty good at retrying slow or failed tasks,
> but doesn't cover tasks that just disappear.
>
> For example, if a snap's review task is enqueued but then $something
> happens and the task is never processed, "task xxxxxxx waiting for
> execution" is the current status, but said task is neither complete,
> in the queue, nor being processed at that time. Then subsequent
> uploads/releases of the snap will fail because the in-progress task
> holds the per-snap queue.
>
> The proposal is to periodically scan for tasks which should be in
> progress (this is known and stored somewhere because we show, for
> example in a snap's page, the task id). Ideally if the task is not in
> progress or in the queue waiting to be picked up, it could be just
> rescheduled/requeued. This could be racy but provides the fastest
> recovery, as ideally the immediate next scan after a busted worker
> event would fix/rerun all the wedged tasks.
>
> Alternatively, instead of looking at in-progress or in-queue tasks,
> just scan the database for tasks that are marked as "in progress" for
> more than a conservative time (4-6 hours) and just blindly reschedule
> them. This could be less racy and still means that snaps will get
> auto-unwedged without manual intervention, but the waiting time might
> be a bit long.
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/snapstore/+bug/1867642/+subscriptions
>
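
For the second alternative in the description (blindly rescheduling
anything marked "in progress" for too long), the selection step could
look roughly like the sketch below. It assumes we have, for each task id,
the time it was marked in progress; the function name, the mapping, and
the 6-hour default are illustrative assumptions taken from the 4-6 hour
range mentioned above, and the actual rescheduling is left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def find_stale_task_ids(
    in_progress_since: Dict[str, datetime],
    now: datetime,
    max_age: timedelta = timedelta(hours=6),
) -> List[str]:
    """Return ids of tasks marked in progress for longer than max_age.

    in_progress_since maps a task id to the (timezone-aware) time it was
    marked in progress; how that is read from the database is not shown.
    """
    return sorted(
        task_id
        for task_id, since in in_progress_since.items()
        if now - since > max_age
    )


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    tasks = {
        "task-old": now - timedelta(hours=7),
        "task-recent": now - timedelta(minutes=20),
    }
    # Only the task that has been "in progress" for over 6 hours is selected.
    print(find_stale_task_ids(tasks, now))
```

A conservative max_age keeps this from racing with tasks that are merely
slow, at the cost of the longer wait the description mentions.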

--
- Daniel