Different revisions of a snap returned by consecutive queries for a specific channel and arch

Bug #1644758 reported by Federico Gimenez
This bug affects 1 person
Affects: Software Center Agent
Status: Fix Released
Importance: Critical
Assigned to: Facundo Batista
Milestone: (none)

Bug Description

- Rationale -
We have started using spread-cron [1] for triggering test executions in response to changes in external resources.

We are watching the core snap revisions in the store on the different channels so that we know whether all goes well after a snap promotion. This is how we query the store for revision changes on the candidate channel:

curl -s -H \"X-Ubuntu-Architecture: amd64\" https://search.apps.ubuntu.com/api/v1/package/core/candidate | sed -e 's|.*\"revision\": \(.*\), \"status\".*|\1|'

- Expected behavior -

After a snap update on the candidate channel, the new revision should always be returned until another revision is uploaded.

- Actual behavior -

We updated core on candidate yesterday to rev 548. A little later we detected rev 394 on the same channel, then 548 again, then 394 again, and finally back to 548 about 45 min later. The complete sequence can be seen at [2].

As a side note, rev 394 is the revision available on the stable channel, and the test results from when the flip-flop was detected match the behavior of the snap on that channel. Also, at the time this happened a promotion for another arch (i386) was being made. We are also experiencing a lot of timeouts while promoting snaps.

Thanks,

[1] https://github.com/snapcore/spread-cron
[2] https://travis-ci.org/snapcore/spread-cron/builds

Revision history for this message
James Tait (jamestait) wrote :

The logs aren't turning up anything of any particular interest. What might help is to see the full request and response from the cronjob, including HTTP headers. It *smells* like a caching issue, although it's interesting that the observed revision flip-flopped between the latest revision on the candidate channel and that on the stable channel.

I did notice a metadata request (snap refresh) that showed a delta download available on the candidate channel for the core snap (the same request also asked for updates to test-snapd-tools in the edge channel, but no delta was available); it corresponds to line 742 of https://travis-ci.org/snapcore/spread-cron/builds/178708453. However, the client didn't request deltas, so that shouldn't make any difference.

Changed in click-package-index:
status: New → Incomplete
Revision history for this message
Federico Gimenez (fgimenez) wrote :

Hi James, thanks for your response.

> What might help is to see the full request and response from the cronjob, including HTTP headers.

We weren't storing the headers, sorry; I'll try to modify the logic to keep the debugging info.
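
Something along these lines should be enough to keep them (illustrative only; the actual job may end up doing it differently), dumping the received headers next to the response body:

# Illustrative: store the response headers alongside the body.
curl -s -D headers.log -o response.json \
    -H "X-Ubuntu-Architecture: amd64" \
    https://search.apps.ubuntu.com/api/v1/package/core/candidate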

> It *smells* like a caching issue, although it's interesting that the observed revision flip-flopped between the latest revision on the candidate channel and that on the stable channel.

I've set up a test job with full verbosity on, storing the sent and received headers. It uses a promotion-test snap: it starts polling with a revision already uploaded to candidate and keeps polling after a promotion from beta to candidate via the web UI. All seems to be working fine, but I'm getting these headers in the response:

Cache-Control: max-age=5
...
X-Cache: MISS from juju-prod-ols-cpi-machine-6
X-Cache-Lookup: MISS from juju-prod-ols-cpi-machine-6:3128

Please correct me if I'm wrong, but it seems that the cache is not being hit in this case? I've tried the same query with the core snap on candidate and get the same cache-related headers; this is the complete query:

$ curl -s -v -H "X-Ubuntu-Architecture: amd64" https://search.apps.ubuntu.com/api/v1/package/core/candidate
* Trying 162.213.33.200...
* Connected to search.apps.ubuntu.com (162.213.33.200) port 443 (#0)
* found 173 certificates in /etc/ssl/certs/ca-certificates.crt
* found 704 certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification OK
* server certificate status verification SKIPPED
* common name: search.apps.ubuntu.com (matched)
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: C=GB,L=London,O=Canonical Group Ltd,CN=search.apps.ubuntu.com
* start date: Mon, 30 May 2016 00:00:00 GMT
* expire date: Wed, 21 Jun 2017 12:00:00 GMT
* issuer: C=US,O=DigiCert Inc,CN=DigiCert SHA2 Secure Server CA
* compression: NULL
* ALPN, server did not agree to a protocol
> GET /api/v1/package/core/candidate HTTP/1.1
> Host: search.apps.ubuntu.com
> User-Agent: curl/7.50.1
> Accept: */*
> X-Ubuntu-Architecture: amd64
>
< HTTP/1.1 200 OK
< Date: Mon, 28 Nov 2016 10:03:00 GMT
< Server: gunicorn/19.3.0
< X-Suggested-Currency: EUR
< Access-Control-Expose-Headers: X-Suggested-Currency
< Expires: Mon, 28 Nov 2016 10:03:05 GMT
< Vary: Accept,Accept-Language,Authorization,X-Device-Authorization,X-GeoIP-Country-Code,X-Ubuntu-Architecture,X-Ubuntu-Confinement,X-Ubuntu-Device-Channel,X-Ubuntu-Frameworks,X-Ubuntu-Slots,X-Ubuntu-Release,X-Ubuntu-Series,X-Ubuntu-Store
< X-Bzr-Revision-Number: 532
< Cache-Control: max-age=5
< Access-Control-Allow-Origin: *
< Access-Control-Allow-Methods: GET
< Content-Type: application/hal+json
< Content-Length: 2260
< X-Cache: MISS from juju-prod-ols-cpi-machine-7
< X-Cache-Lookup: MISS from juju-prod-ols-cpi-machine-7:3128
< Via: 1.1 juju-prod-ols-cpi-machine-7 (squid/3.3.8)
< Strict-Transport-Security: max-age=2592000
<
{...}

If the cache is not working in this case, what could be the cause of the flip-flop behavior we experienced?

Thanks!

Changed in click-package-index:
status: Incomplete → Confirmed
Revision history for this message
Federico Gimenez (fgimenez) wrote :

For completeness, this is the list of commits that were made during the flip-flop [1] (our tool creates a commit for each change in the revision of the core snap for a specific architecture and channel; more on this here [2]).

These are the logs of the test executions:

First change detected, 21:50 (394 -> 548): http://paste.ubuntu.com/23553496/ - successful because 548 was available
Second change detected, 22:15 (548 -> 394): http://paste.ubuntu.com/23553528/ - failure because 394 was available
Third change detected, 22:25 (394 -> 548): http://paste.ubuntu.com/23553507/ - successful because 548 was available
Fourth change detected, 23:20 (548 -> 394): http://paste.ubuntu.com/23553513/ - failure because 394 was available
Fifth (last) change detected, 00:05 (394 -> 548): http://paste.ubuntu.com/23553517/ - failure because of a TLS handshake error

[1] https://github.com/snapcore/spread-cron/commits/snapd-core-candidate
[2] https://github.com/snapcore/spread-cron

Revision history for this message
Bret Barker (noise) wrote :

Not likely the cause of the issue, but the snaps/details endpoint should be used; the v1/package endpoint is for Clicks.

https://search.apps.ubuntu.com/docs/#get--api-v1-snaps-details

Revision history for this message
James Tait (jamestait) wrote :

To expand on that a little, the correct command is:

curl -s -v -H "X-Ubuntu-Architecture: amd64" -H 'X-Ubuntu-Series: 16' https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=candidate

This issues a different request on the back-end, filtering on the package_name field rather than package_id or alias.raw. As Bret says, this is unlikely to be the cause of the problem, but it takes one variable out of the equation.

I'll raise a separate bug for the snaps search endpoint returning embedded click URLs.

Changed in click-package-index:
importance: Undecided → High
Revision history for this message
James Tait (jamestait) wrote :

Bug #1645831 has been raised for the incorrect URLs in snap search results.

Revision history for this message
Federico Gimenez (fgimenez) wrote :

Hey Bret and James, thanks for your comments.

> Not likely the cause of the issue, but the snap/details endpoint should be used, the v1/package endpoint is for Clicks.

Sure, we'll update the checker to use the right URL. In any case, this is only used for triggering the tests; the results give us evidence that, while revision 394 was being reported by that endpoint, that revision of the core snap was actually available and was retrieved with "snap install".
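
For reference, the updated check would probably end up looking something like this (a sketch only; it assumes jq and that the details response keeps a top-level "revision" field):

curl -s -H "X-Ubuntu-Architecture: amd64" -H "X-Ubuntu-Series: 16" \
    "https://search.apps.ubuntu.com/api/v1/snaps/details/core?channel=candidate" | \
    jq -r .revision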

Revision history for this message
Michael Vogt (mvo) wrote :

The details for the core builds are at https://code.launchpad.net/~snappy-dev/+snap/core. The last build was 9h ago: https://launchpadlibrarian.net/295482671/buildlog_snap_ubuntu_xenial_amd64_core_BUILDING.txt.gz (30 Nov 04:15:36 ntpdate[1713]: adjust time server 10.211.37.1 offset 0.006129 sec)

Revision history for this message
James Tait (jamestait) wrote :

While we still haven't got to the bottom of this, we have uncovered some details that seem to be converging on a root cause.

There are a number of requests around the time in question. Several PUT requests succeeded. Several POST requests failed with HTTP 504 (Gateway Timeout) errors. The PUT requests contain a single Package resource, while the POST requests contain a full list of all revisions of the package - 588 revisions at the time of writing, making the POST payload ~1.8MB in size. Timestamps in the logs show that at one point there were three concurrent POST requests attempting to update the same package.

When the update tasks fail, the celery worker will schedule them to be retried. Thus it's entirely plausible that we end up with an interleaving of updates with potentially conflicting content. We still need to dig into the publishing task to verify the exact circumstances under which PUT and POST are used and the specifics of how the payload for those requests is built - lazily at task execution time or actively at task request time - and then we'll have a better idea of how to proceed.

I'm adding software-center-agent to the affected projects, as we may end up modifying the way packages are indexed. Another potential course of action is to tweak the way the index operation is performed in click-package-index so that it does not refresh after each document insertion; this should improve the performance of the batch update, but we also need to be careful to consider the impact on result currency. Now that we store a history of all prior revisions, this may be less of an issue than when we only stored a single revision per package.
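
As a rough illustration of that last idea (hypothetical only: it assumes an Elasticsearch-like backend, and the index name and payload file below are made up), the batch update would send all documents in one bulk request with no per-document refresh and then refresh once at the end:

# Hypothetical sketch: index the whole batch without refreshing per document
# (package-revisions.ndjson would hold the bulk action/source line pairs).
curl -s -XPOST 'http://localhost:9200/_bulk' \
     -H 'Content-Type: application/x-ndjson' \
     --data-binary @package-revisions.ndjson
# Make the batch visible to searches with a single explicit refresh.
curl -s -XPOST 'http://localhost:9200/package-index/_refresh'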

Revision history for this message
Federico Gimenez (fgimenez) wrote :

We had another flip-flop this morning with the upload of the core snap to edge, this time using the snaps/details endpoint for checking the snap revisions. These are the detected changes from the git log:

commit 60c014ff543412c2cf93627b2cfd87d563bcefe1
Author: Snappy Developers <email address hidden>
Date: Fri Dec 2 06:50:22 2016 +0000

    New OS snap in edge channel (598)

commit 23e710420c41c50f124c008c9127c874fd7bc554
Author: Snappy Developers <email address hidden>
Date: Fri Dec 2 06:45:22 2016 +0000

    New OS snap in edge channel (590)

commit 9159ba4fbaa90569e28498bb77423ae39ca4dd86
Author: Snappy Developers <email address hidden>
Date: Fri Dec 2 05:40:22 2016 +0000

    New OS snap in edge channel (598)

commit 52722cea75b5f67d3ef7b60b560e9bdd201e3d04
Author: Snappy Developers <email address hidden>
Date: Fri Dec 2 05:35:22 2016 +0000

    New OS snap in edge channel (590)

commit 32c0b28622502679dd5a2f7e7d125a43d54d8be9
Author: Snappy Developers <email address hidden>
Date: Fri Dec 2 05:05:22 2016 +0000

    New OS snap in edge channel (598)

commit db58391bb86ee3d30a825a4b8735e895cfaf70af
Author: Snappy Developers <email address hidden>
Date: Fri Dec 2 05:00:22 2016 +0000

    New OS snap in edge channel (590)

commit d4527791c8f20526b24a2c55db117f0b82efe742
Author: Snappy Developers <email address hidden>
Date: Fri Dec 2 04:40:22 2016 +0000

    New OS snap in edge channel (598)

We also enabled detailed logging of the headers and responses received; I've attached the relevant time frame.

Please let us know if we can provide any further info.

Revision history for this message
Federico Gimenez (fgimenez) wrote :
Changed in software-center-agent:
status: New → Confirmed
assignee: nobody → Facundo Batista (facundo)
importance: Undecided → Critical
Revision history for this message
James Tait (jamestait) wrote :

Thanks for the additional information. I've used it to collate some logs to try and understand the timeline, but I'm hitting some challenges getting a complete information set. What is abundantly clear, though, is that the bulk package updates (POST requests) are all failing, while the single revision updates (PUT requests) are all succeeding.

I took this information and ran some tests on my local dev setup. A bulk update with 500 revisions consistently takes upwards of 50s. On a hunch, I removed the explicit refresh from the call that updates the backend, but it made no discernible difference. So on the clickindex side, that's one potential remedial action seemingly ruled out.

I also noted from the OOPS reports that this problem is more widespread than just the core snap. I haven't dug into all of the affected packages yet, but the smallest number of revisions for an affected snap so far is 255 for unity8-session.

There are various mitigation measures being considered on both sides of this interface to work around the problem in the short term, but longer term there are changes already in progress that will eliminate the problem entirely.

Changed in click-package-index:
assignee: nobody → James Tait (jamestait)
James Tait (jamestait)
Changed in click-package-index:
status: Confirmed → Fix Committed
James Tait (jamestait)
Changed in click-package-index:
status: Fix Committed → Fix Released
assignee: James Tait (jamestait) → nobody
Changed in software-center-agent:
status: Confirmed → Fix Released
Revision history for this message
Michael Vogt (mvo) wrote :

Can we get a bit more info about the fix?

Revision history for this message
William Grant (wgrant) wrote :

I'll only go into mildly gory details, but I can provide more if you're really
interested. There were three key problems that combined to result in the
long-term Store inconsistencies:

 - SCA is pushing too much data to CPI on each release. This isn't quite fixed
   yet, but branches are in progress as part of the current channel rework.

 - CPI was processing the large volume of data very slowly. James has
   thoroughly fixed that by optimising our database access patterns, and I've
   got some more proposed changes to make it even faster.

 - Misconfigured timeouts and automatic retries in the Store stack caused SCA
   to believe that the slow CPI requests had actually failed, and so they were
   retried repeatedly for up to two hours. We've fixed the timeouts so a slow
   request just appears slow, not failed. Additionally, we've revised the
   retry algorithms to wait a shorter interval between attempts and give up
   more quickly, limiting the window of inconsistency even if requests do still
   somehow fail.

The core snap was particularly badly affected because it has a lot of
revisions and regular, near-simultaneous pushes and releases across several
architectures. The large revision count caused CPI pushes to time out
frequently, and the retry bug caused pushes that consistently timed out to be
retried repeatedly over a multi-hour window. Pushes from multiple
architectures often ran into the same situation within just a couple of
minutes of each other, so their successive retries clobbered each other's
data, resulting in the observed flip-flopping behaviour. Eventually each job
either succeeded or gave up, usually leaving things in a correct steady state
-- just two hours later than intended.
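
To illustrate the bounded-retry idea (not our actual code; the endpoint, payload
and numbers below are made up), the principle is simply a short retry schedule
with an overall deadline, as curl's own options can demonstrate:

# Illustration only: a few quick retries with an overall cap, so a push that
# keeps failing gives up after minutes rather than being retried for hours.
curl -s --max-time 60 \
     --retry 3 --retry-delay 10 --retry-max-time 120 \
     -X POST https://store-internal.example/package-index/update \
     --data-binary @payload.json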

We believe that we've addressed the CPI slowness and SCA retry issues, so the
known correctness bugs are fixed. The further near-term SCA changes are just an
optimisation, so do let us know if you see any inconsistencies from now on.

This investigation has also revealed some other less critical problems in our
inter-service communications, logging and metrics. I'm investigating how best
to address those.

Revision history for this message
Michael Vogt (mvo) wrote :

Thanks a bunch William and James!
