Snap Store Server

Bug #1867642
Comment #5

Daniel Manrique (roadmr) wrote on 2020-04-01: Re: [Bug 1867642] Re: The Lost Task

On Tue, Mar 31, 2020 at 4:50 PM Matias Bordese <email address hidden>
wrote:

> Having the STARTED status could help debugging the issue, and it sounds
> like a simple and non-breaking update, so +1 to that.
>
> On the other hand, it would be nice to hunt for a particular case to
> investigate and check logs to gather as much information as possible to
> understand the real issue. I guess whether it's worth investing time in a
> workaround without knowing the underlying problem depends on how often
> we have these and how "undebuggable" this is.
>

This happens frequently when the network goes bonkers or compute nodes are
lost; because we don't have a lot of visibility we typically only learn
about this when people report it.

One thing we could do *now* is look at uploads that are in pending states
and check whether there's a corresponding task for them. If not, we can
drill down in the logs to try to find further causes. But as I said, the
typical real issue is "the worker went away", which by definition we can't
really prevent :/ so just having a way to deal with those failures
gracefully should help.
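
A minimal sketch of that check, to make the idea concrete. Everything here
is hypothetical: `PendingUpload`, `find_orphaned_uploads` and the sample ids
are made-up names for illustration, not the store's actual models. The set
of live task ids is assumed to be gathered elsewhere (for instance from the
broker queue plus what Celery's `inspect().active()` and
`inspect().reserved()` report), which is the part that can be racy.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class PendingUpload:
    """Hypothetical record: an upload still waiting on a task."""
    upload_id: int
    task_id: str


def find_orphaned_uploads(
    pending: Iterable[PendingUpload],
    live_task_ids: Set[str],
) -> List[PendingUpload]:
    """Return pending uploads whose task is neither queued nor running."""
    return [u for u in pending if u.task_id not in live_task_ids]


# Illustrative data only.
pending = [
    PendingUpload(upload_id=1, task_id="aaa"),
    PendingUpload(upload_id=2, task_id="bbb"),
    PendingUpload(upload_id=3, task_id="ccc"),
]
# Assumption: ids of tasks currently queued, reserved or executing.
live_task_ids = {"aaa", "ccc"}

orphans = find_orphaned_uploads(pending, live_task_ids)
print([u.upload_id for u in orphans])
```

Uploads that come back from this are the ones worth chasing in the logs, and
later the candidates for automatic requeueing.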

>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1867642
>
> Title:
> The Lost Task
>
> Status in Snap Store:
> New
>
> Bug description:
> When celery workers go away or otherwise drop tasks on the floor, the
> rest of the system/flow expects those tasks to be completed
> eventually, which never happens, causing things to get stuck.
>
> This is about behaviors not covered by our existing timeout/retry
> logic, which is nowadays pretty good at retrying slow or failed tasks,
> but doesn't cover tasks that just disappear.
>
> For example, if a snap's review task is enqueued but then $something
> happens and the task is never processed, "task xxxxxxx waiting for
> execution" is the current status, but said task is neither complete,
> in the queue, nor being processed at that time. Then subsequent
> uploads/releases of the snap will fail because the in-progress task
> holds the per-snap queue.
>
> The proposal is to periodically scan for tasks which should be in
> progress (this is known and stored somewhere because we show, for
> example in a snap's page, the task id). Ideally if the task is not in
> progress or in the queue waiting to be picked up, it could be just
> rescheduled/requeued. This could be racy but provides the fastest
> recovery, as ideally the immediate next scan after a busted worker
> event would fix/rerun all the wedged tasks.
>
> Alternatively, instead of looking at in-progress or in-queue tasks,
> just scan the database for tasks that are marked as "in progress" for
> more than a conservative time (4-6 hours) and just blindly reschedule
> them. This could be less racy and still means that snaps will get
> auto-unwedged without manual intervention, but the waiting time might
> be a bit long.
>
>
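
For reference, the time-based alternative from the bug description quoted
above could be sketched roughly like this. The function name, the mapping of
task ids to "marked in progress" timestamps, and the 6 hour cutoff (the upper
end of the 4-6 hours mentioned) are all assumptions for illustration; where
those timestamps actually live in the database is not covered here.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumed conservative cutoff; the bug description suggests 4-6 hours.
STALE_AFTER = timedelta(hours=6)


def find_stale_tasks(
    in_progress_since: Dict[str, datetime],
    now: datetime,
    stale_after: timedelta = STALE_AFTER,
) -> List[str]:
    """Return ids of tasks marked in progress for longer than the cutoff."""
    return sorted(
        task_id
        for task_id, started in in_progress_since.items()
        if now - started > stale_after
    )


# Illustrative data only.
now = datetime(2020, 4, 1, 12, 0)
in_progress_since = {
    "aaa": now - timedelta(hours=7),     # past the cutoff
    "bbb": now - timedelta(minutes=30),  # recent, leave alone
    "ccc": now - timedelta(hours=6),     # exactly at the cutoff, not yet stale
}

stale = find_stale_tasks(in_progress_since, now)
print(stale)
```

Whatever this returns would then be rescheduled blindly, so the cutoff has to
be comfortably longer than the slowest legitimate task.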

--
- Daniel
