No-op snap takes 1.5 min to build, 6.5 minutes to publish

Bug #1689282 reported by Evan on 2017-05-08
This bug affects 1 person
Affects: Launchpad itself
Importance: High
Assigned to: Colin Watson

Bug Description

Publishing takes an oddly long time (6.5 minutes), even for the smallest snap. Shouldn't this take seconds at most?

See https://github.com/canonical-ols/build.snapcraft.io/issues/717#issuecomment-299431648 for an initial assessment of the issue.

Colin Watson (cjwatson) wrote :

Just copying my explanation here so that it isn't dependent on an external site:

The background for this is that publishing involves multiple steps: we have to do the equivalent of snapcraft push and then the equivalent of snapcraft release, and between those we have to wait for the store to finish scanning the upload, which is an asynchronous job and takes an undetermined amount of time. Rather than polling and thus taking up a worker slot in Launchpad for that undetermined amount of time, we retry the job later with a one-minute delay up to a maximum of 20 times.
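
As a rough illustration of the retry-instead-of-poll flow described above, here is a minimal Python sketch. It is not Launchpad's actual code: the `store` object, its `push`, `check_scan` and `release` methods, the `ScanPending` exception and the `job` dictionary are all hypothetical stand-ins. Only the one-minute delay and the limit of 20 retries come from the explanation above.

```python
from datetime import timedelta

RETRY_DELAY = timedelta(minutes=1)  # one-minute delay, as described above
MAX_RETRIES = 20                    # maximum number of retries, as described above


class ScanPending(Exception):
    """Hypothetical: raised while the store is still scanning the upload."""


def run_attempt(job, store):
    """Run one attempt of the upload job.

    Returns the delay to wait before the next attempt, or None once the
    snap has been released. The worker slot is freed between attempts
    instead of being held while the store scans the upload.
    """
    # Step 1: the equivalent of "snapcraft push", done only once.
    if job["upload_id"] is None:
        job["upload_id"] = store.push(job["snap"])

    # Step 2: ask whether the asynchronous scan has finished.
    try:
        store.check_scan(job["upload_id"])
    except ScanPending:
        if job["attempts"] >= MAX_RETRIES:
            raise  # give up: the scan never completed
        job["attempts"] += 1
        return RETRY_DELAY  # reschedule rather than poll in place

    # Step 3: the equivalent of "snapcraft release".
    store.release(job["upload_id"], job["channels"])
    return None
```

With this shape, a scan that finishes a few seconds after the push still costs at least one full retry delay, which is why how promptly each retry gets picked up matters so much.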

There are various things that we could look at to reduce the latency of this process (they aren't all mutually exclusive, and we may not know the best strategy until we do some more analysis):

 * Unlike the initial job, retries don't seem to be handled by celery for some reason, but instead are picked up by the fallback cron job some time later. This is the source of most of the unnecessary delay, and is probably just a simple bug somewhere. Assuming that the store scans the upload reasonably promptly, we could cut the typical delay for small snaps down to a little over a minute by getting celery to pick up the retries.
 * We could consider having the job poll for a short time after it pushes the snap, which would cut out almost all the extra latency in the case that the store manages to scan it immediately. This may be a good idea, but probably only if the store typically does in fact manage quick scans; otherwise we'd be tying up workers for longer and degrading overall system performance.
 * We could try some kind of exponential backoff approach, so that the first retries happen more quickly.
 * We could look at having the store tell us when it's done by way of a webhook. This seems like the most elegant approach, but it's also a lot of work in that it requires implementing webhook sending in the store and webhook receiving in Launchpad.
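
To make the exponential backoff option concrete, the following Python sketch compares a fixed one-minute schedule with a backoff schedule. The starting delay of 5 seconds, the doubling factor and the 60-second cap are assumptions chosen purely for illustration; nothing in this report specifies them. The function names are hypothetical too.

```python
def fixed_delays(retries=20, delay=60):
    """Delay in seconds before each retry with a constant delay."""
    return [delay] * retries


def backoff_delays(retries=20, first=5, factor=2, cap=60):
    """Delay in seconds before each retry, growing by `factor` up to `cap`.

    The defaults for `first`, `factor` and `cap` are illustrative
    assumptions, not values taken from Launchpad.
    """
    delays = []
    current = first
    for _ in range(retries):
        delays.append(min(current, cap))
        current *= factor
    return delays


def time_until_retry(delays, n):
    """Seconds elapsed from the first attempt until the n-th retry starts."""
    return sum(delays[:n])
```

Under these assumed values, `backoff_delays(retries=6)` yields `[5, 10, 20, 40, 60, 60]`, so the first retry happens after 5 seconds and the third after 35 seconds, whereas the fixed schedule does not make its first retry until 60 seconds have passed. The cap keeps later retries from being spaced further apart than they are with the fixed schedule.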

tags: added: lp-snappy performance
Changed in launchpad:
status: New → Triaged
importance: Undecided → High
Colin Watson (cjwatson) on 2017-06-29
Changed in launchpad:
status: Triaged → In Progress
assignee: nobody → Colin Watson (cjwatson)
Launchpad QA Bot (lpqabot) wrote :
tags: added: qa-needstesting
Changed in launchpad:
status: In Progress → Fix Committed
Colin Watson (cjwatson) wrote :

For the record, I went for options 1 and 3 from my previous comment (i.e. fix the bug that caused retries not to be handled by celery, and perform the first few retries more quickly when we're just polling the status endpoint). That gets the "Store upload in progress" stage down to 15 seconds plus change for small snaps.

tags: added: qa-ok
removed: qa-needstesting
Colin Watson (cjwatson) on 2017-07-26
Changed in launchpad:
status: Fix Committed → Fix Released