Comment 9 for bug 1644758

James Tait (jamestait) wrote :

While we still haven't got to the bottom of this, we have uncovered some details that seem to be converging on a root cause.

There are a number of requests around the time in question. Several PUT requests succeeded. Several POST requests failed with HTTP 504 (Gateway Timeout) errors. Each PUT request contains a single Package resource, while each POST request contains a full list of all revisions of the package (588 revisions at the time of writing), making the POST payload ~1.8 MB in size. Timestamps in the logs show that at one point there were three concurrent POST requests attempting to update the same package.

When an update task fails, the celery worker schedules it to be retried. It is therefore entirely plausible that we end up with interleaved updates carrying potentially conflicting content. We still need to dig into the publishing task to verify the exact circumstances under which PUT and POST are used, and how the payload for those requests is built: lazily at task execution time, or eagerly at task request time. Then we'll have a better idea of how to proceed.
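
To illustrate why the payload-building question matters, here is a toy model of the suspected interleaving. It is not the real publishing task: Store, Index, request_eager, request_lazy and simulate are all hypothetical names, and the POST is assumed to replace the full revision list. It only shows that a retried task carrying a payload captured at request time can overwrite newer content, while a payload built at execution time cannot.

```python
# Toy model only: none of these names come from the real publishing code.

class Store:
    """Stand-in for the database holding a package's revisions."""
    def __init__(self):
        self.revisions = []

    def add_revision(self, rev):
        self.revisions.append(rev)


class Index:
    """Stand-in for the remote index; assumes a POST replaces the full list."""
    def __init__(self):
        self.revisions = []

    def post(self, revisions):
        self.revisions = list(revisions)


def request_eager(store):
    # Payload is snapshotted at task request time and reused on every retry.
    payload = list(store.revisions)
    return lambda index: index.post(payload)


def request_lazy(store):
    # Payload is rebuilt from current state whenever the task executes.
    return lambda index: index.post(store.revisions)


def simulate(request):
    store, index = Store(), Index()
    store.add_revision(1)
    first = request(store)   # imagine this attempt times out and is retried later
    store.add_revision(2)
    second = request(store)
    second(index)            # the newer task succeeds first
    first(index)             # the retry of the older task lands last
    return index.revisions


eager_result = simulate(request_eager)
lazy_result = simulate(request_lazy)
print(eager_result)  # [1]
print(lazy_result)   # [1, 2]
```

In the eager case the late retry rolls the index back to a stale revision list; in the lazy case the retry is harmless because it re-reads current state. Which of the two the real task does is exactly what still needs verifying.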

I'm adding software-center-agent to the affected projects, as we may end up modifying the way packages are indexed. Another potential course of action is to tweak the index operation in click-package-index so that it does not refresh after each document insertion. This should improve the performance of the batch update, but we need to consider the impact on result currency. Now that we store a history of all prior revisions, that may be less of an issue than when we only stored a single revision per package.
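
A rough sketch of the refresh trade-off, using an in-memory toy rather than the actual index backend. ToyIndex and both helper functions are hypothetical, and the model assumes that inserted documents only become searchable after a refresh. It counts refreshes for a 588-document batch and shows the currency cost of deferring them.

```python
# Toy model only: this does not use or imitate any real index client API.

class ToyIndex:
    def __init__(self):
        self._pending = {}   # inserted but not yet searchable
        self._visible = {}   # searchable documents
        self.refresh_count = 0

    def insert(self, doc_id, doc, refresh=False):
        self._pending[doc_id] = doc
        if refresh:
            self.refresh()

    def refresh(self):
        # Make all pending documents searchable; assumed to be the costly step.
        self._visible.update(self._pending)
        self._pending.clear()
        self.refresh_count += 1

    def search(self, doc_id):
        return self._visible.get(doc_id)


def index_with_refresh_each(index, docs):
    for doc_id, doc in docs.items():
        index.insert(doc_id, doc, refresh=True)


def index_with_single_refresh(index, docs):
    for doc_id, doc in docs.items():
        index.insert(doc_id, doc)
    index.refresh()


# 588 revisions, matching the count mentioned earlier in this comment.
docs = {f"pkg-r{n}": {"revision": n} for n in range(1, 589)}

per_doc = ToyIndex()
index_with_refresh_each(per_doc, docs)

batched = ToyIndex()
index_with_single_refresh(batched, docs)

print(per_doc.refresh_count, batched.refresh_count)  # 588 1

# Currency cost: until the refresh runs, a newly inserted document is invisible.
stale = ToyIndex()
stale.insert("pkg-r1", {"revision": 1})
print(stale.search("pkg-r1"))  # None
```

Both strategies end with the same searchable documents; the difference is one refresh instead of 588, paid for with a window during which searches miss the newest revisions.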