Britney flops back and forth between Running and Approved a lot

Bug #1537866 reported by Robert Bruce Park
This bug affects 1 person
Affects   Status        Importance  Assigned to   Milestone
Bileto    Fix Released  Critical    Martin Pitt
britney   Fix Released  High        Martin Pitt

Bug Description

It seems as though any silo which would run a unity8 autopkgtest often flips back and forth between Running and Approved, eg:

https://requests.ci-train.ubuntu.com/#/ticket/872 (click 'Show Audit Logs')

2016-01-25 10:35:40 -0800 (britney-bot) britney_signoff: Running
2016-01-24 18:39:23 -0800 (britney-bot) britney_signoff: Approved
2016-01-24 17:07:26 -0800 (britney-bot) britney_signoff: Running
2016-01-23 01:58:31 -0800 (britney-bot) britney_signoff: Approved
2016-01-22 16:16:51 -0800 (britney-bot) britney_signoff: Running
2016-01-22 14:46:48 -0800 (britney-bot) britney_signoff: Approved
2016-01-22 13:11:09 -0800 (britney-bot) britney_signoff: Running
2016-01-21 13:50:24 -0800 (britney-bot) britney_signoff: Approved
2016-01-21 11:16:16 -0800 (britney-bot) britney_signoff: Running
2016-01-21 09:18:31 -0800 (britney-bot) britney_signoff: Approved
2016-01-21 08:03:20 -0800 (britney-bot) britney_signoff: Running
2016-01-19 17:30:56 -0800 (britney-bot) britney_signoff: Approved
2016-01-19 07:18:21 -0800 (britney-bot) britney_signoff: Running
2016-01-18 23:46:16 -0800 (britney-bot) britney_signoff: Approved
2016-01-18 20:47:04 -0800 (britney-bot) britney_signoff: Running
2016-01-18 20:07:36 -0800 (britney-bot) britney_signoff: Approved
2016-01-18 18:45:21 -0800 (britney-bot) britney_signoff: Running
2016-01-18 11:51:37 -0800 (britney-bot) britney_signoff: Approved
2016-01-18 08:41:17 -0800 (britney-bot) britney_signoff: Running
2016-01-18 07:28:07 -0800 (pete-woods) lander_signoff: Approved
2016-01-14 07:47:46 -0800 (ci-train-bot) Successfully built

Looking at the timestamps, that is a period of 8 days during which it continuously re-ran the tests without any new package builds landing in the silo.

I would guess that the autopkgtest infrastructure is not correctly isolating silo builds and, eg, when one silo triggers a unity8 autopkgtest, all unity8-related silos get re-run.

This is probably quite wasteful of autopkgtest resources, and it would be good if a silo's autopkgtest results were not affected by tests run in other silos.

Changed in britney:
assignee: nobody → Martin Pitt (pitti)
Changed in bileto:
assignee: nobody → Robert Bruce Park (robru)
Revision history for this message
Robert Bruce Park (robru) wrote :

Currently happening with these three tickets:

12:37:15 <queuebot> boiko artmello tiagosh, https://requests.ci-train.ubuntu.com/#/ticket/696 QA Signoff: Ready
12:37:15 <queuebot> dobey, https://requests.ci-train.ubuntu.com/#/ticket/910 QA Signoff: Ready
12:37:16 <queuebot> pete-woods, https://requests.ci-train.ubuntu.com/#/ticket/872 QA Signoff: Ready
13:17:57 <queuebot> boiko artmello tiagosh, https://requests.ci-train.ubuntu.com/#/ticket/696 QA Signoff: Required
13:17:58 <queuebot> dobey, https://requests.ci-train.ubuntu.com/#/ticket/910 QA Signoff: Required
13:17:58 <queuebot> pete-woods, https://requests.ci-train.ubuntu.com/#/ticket/872 QA Signoff: Required

And when I look through the excuses files, it's always unity8 in xenial on armhf that is "running"

Revision history for this message
Robert Bruce Park (robru) wrote :

Also it appears in running.shtml that it really is re-running unity8 tests against these three silos, but I'm not sure why:

https://archive.is/s4pKu#selection-2609.0-2847.11

Revision history for this message
Martin Pitt (pitti) wrote :

As discussed by email (we should have moved here right away), this needs the britney log files to see what actually happens. Robert and I don't have an off-hand idea what's wrong here.

Changed in britney:
status: New → Incomplete
Revision history for this message
Robert Bruce Park (robru) wrote :

Just a note: because the unity8 test that keeps being retried is an ALWAYSFAIL, I worked around this by making bileto ignore RUNNING-ALWAYSFAIL and only report RUNNING as Running. In order to reproduce the issue you'll need to edit iterate.py and replace 'RUNNING\n' with 'RUNNING'.
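(Editor's sketch of the workaround described above, as a hypothetical helper — the function name and verdict handling are illustrative, not bileto's actual iterate.py code. The idea: a test that has never passed, RUNNING-ALWAYSFAIL, cannot regress, so it no longer holds the ticket in the Running state.)

```python
# Hypothetical sketch of the bileto workaround: only a plain RUNNING
# verdict keeps a ticket in the "Running" state; RUNNING-ALWAYSFAIL
# (a test that has never passed, so it cannot regress) is ignored.
def signoff_state(excuse_verdicts):
    """Map a list of autopkgtest verdicts to a britney_signoff state."""
    if any(v == 'RUNNING' for v in excuse_verdicts):
        return 'Running'
    if any(v.startswith('REGRESSION') for v in excuse_verdicts):
        return 'Failed'
    return 'Approved'

# Before the workaround, RUNNING-ALWAYSFAIL also counted as Running,
# which made the redundant unity8 retries visible as constant flip-flops.
```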

Changed in bileto:
assignee: Robert Bruce Park (robru) → Martin Pitt (pitti)
Revision history for this message
Martin Pitt (pitti) wrote :

We don't have current logs for the tickets mentioned above, but it looks like https://requests.ci-train.ubuntu.com/#/ticket/634 currently exhibits this.

https://requests.ci-train.ubuntu.com/static/britney/log_20160203_130001.txt has 634 as "Approved" and https://requests.ci-train.ubuntu.com/static/britney/log_20160203_132501.txt flips it back to "Running".

Martin Pitt (pitti)
Changed in britney:
status: Incomplete → In Progress
Revision history for this message
Martin Pitt (pitti) wrote :

I just caught it in the act, and current https://requests.ci-train.ubuntu.com/static/britney/xenial/landing-038/excuses.html indeed flips unity8 back to running, and I see the request on http://autopkgtest.ubuntu.com/running.shtml#pkg-unity8.

https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-ci-train-ppa-service-landing-038?format=plain has plenty of unity8 results; the run on 20160202 (yesterday) happened because webbrowser-app actually did get updated yesterday, so that one is justified. But there is not a single unity8 result on any arch from today's requests.

Yesterday's armhf result timed out with exit code 20, thus not counted as a tmpfail: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-ci-train-ppa-service-landing-038/xenial/armhf/u/unity8/20160202_201435@/log.gz

But the corresponding i386 result succeeded and thus should satisfy britney's request: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-ci-train-ppa-service-landing-038/xenial/i386/u/unity8/20160202_195540@/result.tar

Revision history for this message
Martin Pitt (pitti) wrote :

These are the relevant log entries for three consecutive britney runs for xenial landing-038 wrt. unity8:

log_20160203_145501.txt:
Read pending requested tests from /tmp/britney_data/landing-038-xenial/autopkgtest/pending.json: {'webbrowser-app/0.23+16.04.20160114-0ubuntu1': {'qtcreator-plugin-ubuntu': ['amd64']}, 'webbrowser-app/0.23+16.04.20160202-0ubuntu1': {'qtcreator-plugin-ubuntu': ['amd64']}}
Fetched test result for unity8/8.11+16.04.20160129-0ubuntu1/amd64 20160202_194925@ (triggers: ['webbrowser-app/0.23+16.04.20160202-0ubuntu1']): pass
> does not match any pending request for unity8/amd64
[.. NO "Requesting unity8 autopkgtest" ]

log_20160203_152501.txt:
Read pending requested tests from /tmp/britney_data/landing-038-xenial/autopkgtest/pending.json: {'webbrowser-app/0.23+16.04.20160114-0ubuntu1': {'qtcreator-plugin-ubuntu': ['amd64']}}
[.. NO "Fetched test result..." for unity8, just for qtcreator-plugin-ubuntu/i386]
Requesting unity8 autopkgtest on i386 to verify webbrowser-app/0.23+16.04.20160202-0ubuntu1

log_20160203_155501.txt:
Read pending requested tests from /tmp/britney_data/landing-038-xenial/autopkgtest/pending.json: {'webbrowser-app/0.23+16.04.20160114-0ubuntu1': {'qtcreator-plugin-ubuntu': ['amd64']}, 'webbrowser-app/0.23+16.04.20160202-0ubuntu1': {'unity8': ['amd64', 'armhf', 'i386']}}
[.. NO "Fetched test result..." for unity8, just for qtcreator-plugin-ubuntu/i386]
[.. NO "Requesting unity8 autopkgtest" ]

So in the 145501 run britney apparently fetches and considers the already present result for that trigger, and correctly does not request it. In the 152501 run it somehow does not fetch/see that result, only the one for qtcreator, and consequently requests it. The next run, 155501 (and the subsequent ones), recognizes that the test is already pending and thus doesn't re-request it, but it still fails to fetch the logs.

log_20160203_172501.txt is the next one with a significant change as the test finally ran and the result is available:
Read pending requested tests from /tmp/britney_data/landing-038-xenial/autopkgtest/pending.json: {'webbrowser-app/0.23+16.04.20160114-0ubuntu1': {'qtcreator-plugin-ubuntu': ['amd64']}, 'webbrowser-app/0.23+16.04.20160202-0ubuntu1': {'unity8': ['amd64', 'armhf', 'i386']}}
Fetched test result for unity8/8.11+16.04.20160129-0ubuntu1/amd64 20160203_173231@ (triggers: ['webbrowser-app/0.23+16.04.20160202-0ubuntu1']): pass
-> matches pending request unity8/amd64 for trigger webbrowser-app/0.23+16.04.20160202-0ubuntu1
Test unity8/armhf for webbrowser-app/0.23+16.04.20160202-0ubuntu1 is already pending, not queueing
[.. NO "Requesting unity8 autopkgtest" ]

log_20160203_180001.txt then again looks similar to 145501.

Revision history for this message
Robert Bruce Park (robru) wrote :

Distressingly, the staging PPA I set up has failed to reproduce this issue:

https://requests.ci-train.ubuntu.com/#/ticket/634
2016-02-03 18:10:47 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 17:35:22 -0800 (britney-bot) britney_signoff: Running
2016-02-03 11:14:44 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 07:52:30 -0800 (britney-bot) britney_signoff: Running
2016-02-03 07:22:31 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 06:52:52 -0800 (britney-bot) britney_signoff: Running
2016-02-03 06:22:37 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 05:52:47 -0800 (britney-bot) britney_signoff: Running
2016-02-02 12:10:18 -0800 (britney-bot) britney_signoff: Approved
2016-02-02 10:49:53 -0800 (britney-bot) britney_signoff: Running
2016-02-02 10:13:03 -0800 (abreu-alexandre) lander_signoff: Approved

https://requests.ci-train.staging.ubuntu.com/#/ticket/6
2016-02-03 13:20:27 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 11:32:40 -0800 (britney-bot) britney_signoff: Running
2016-02-03 11:12:44 -0800 (robru) lander_signoff: Approved

(note in the timestamps that ticket 634 has a flip flop newer than ticket 6's most recent approval)

Changed in bileto:
importance: Undecided → Critical
Revision history for this message
Martin Pitt (pitti) wrote :

> 2016-02-03 11:32:40 -0800 (britney-bot) britney_signoff: Running

This is interesting -- in all the logs around that time not a single unity8 test got requested -- but this one was for *vivid* landing-038. And with the improved logging there is now also a reason why:

E: [Thu Feb 4 01:13:30 2016] - Failure to fetch swift results from https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-vivid-ci-train-ppa-service-landing-038?delimiter=%40&prefix=vivid%2Famd64%2Fq%2Fqtcreator-plugin-ubuntu%2F: HTTP Error 502: Proxy Error
I: [Thu Feb 4 01:13:30 2016] - qtcreator-plugin-ubuntu/amd64 triggered by webbrowser-app/0.23+15.04.20160202-0ubuntu1: still no result after fetching again
I: [Thu Feb 4 01:13:30 2016] - Requesting qtcreator-plugin-ubuntu autopkgtest on amd64 to verify webbrowser-app/0.23+15.04.20160202-0ubuntu1

(in https://requests.ci-train.ubuntu.com/static/britney/log_20160204_010502.txt)

This would definitively explain the evidence so far: As we don't cache previous results, such temporary network glitches are currently interpreted as "no results", and if that happens at a time when a test is not already marked as pending (because we previously saw a result), this leads to re-running the test as we don't have a result in the cache or on swift, and nothing in pending. It also explains why this wasn't reproducible locally or in staging, as such random proxy errors don't happen there -- on staging (PS4.5) there is no proxy required to talk to swift, as that's in PS4.5 as well.

Fix for britney: Fail hard for such "proxy errors", and have the next run retry again.

Fix for bileto: ensure that talking to proposed-migration's swift does NOT use $http_proxy (i. e. squid.internal) -- you should be able to talk to it directly. The error message might of course also be from some transparent proxy or from swift itself, but I haven't seen it yet.

There is still some inexplicable bit left, though: yesterday afternoon's flip/flop around 15:52 UTC was not such a proxy failure; that should have shown up in the logs even without the additional logging from yesterday evening. There is no "Failure to" for either vivid or xenial landing-038. There are a few error messages with "HTTP Error 401: Unauthorized", but that happens if you poll a container which hasn't been created yet, i. e. for new silos which didn't get any test result yet. This *should* count as "no result". But fixing the above should get us one step further already.
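(Editor's sketch of the britney fix proposed above — illustrative, not the code of the actual revision: a transient error such as HTTP 502 must abort the run so the next britney run retries, instead of being read as "no results yet" and triggering a spurious re-request; a 401 on a container that was never created remains the one legitimate "no results" case. The function name is hypothetical.)

```python
import urllib.error


def classify_fetch_error(exc):
    """Decide how to treat an HTTP error when fetching swift results.

    Returns 'no-results' for 401 (the silo's container was never
    created, so there genuinely are no results yet); re-raises anything
    else (e.g. a 502 proxy error) so the whole britney run fails hard
    and the next run retries, rather than re-requesting the test.
    """
    if isinstance(exc, urllib.error.HTTPError) and exc.code == 401:
        return 'no-results'
    raise exc
```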

Revision history for this message
Martin Pitt (pitti) wrote :

Pushed http://bazaar.launchpad.net/~ubuntu-release/britney/britney2-ubuntu/revision/570 which now considers such errors as fatal instead of thinking "oh, no results yet, let me trigger a run".

Changed in britney:
status: In Progress → Fix Committed
importance: Undecided → High
Revision history for this message
Robert Bruce Park (robru) wrote :

Argh! Just saw another flip:

https://requests.ci-train.ubuntu.com/#/ticket/634
2016-02-05 16:20:01 -0800 (britney-bot) britney_signoff: Failed
2016-02-05 15:30:06 -0800 (britney-bot) britney_signoff: Approved
2016-02-04 11:22:29 -0800 (britney-bot) britney_signoff: Failed
2016-02-04 10:41:42 -0800 (britney-bot) britney_signoff: Running
2016-02-04 06:55:04 -0800 (britney-bot) britney_signoff: Failed
2016-02-04 06:15:28 -0800 (britney-bot) britney_signoff: Running
2016-02-03 18:10:47 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 17:35:22 -0800 (britney-bot) britney_signoff: Running
2016-02-03 11:14:44 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 07:52:30 -0800 (britney-bot) britney_signoff: Running
2016-02-03 07:22:31 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 06:52:52 -0800 (britney-bot) britney_signoff: Running
2016-02-03 06:22:37 -0800 (britney-bot) britney_signoff: Approved
2016-02-03 05:52:47 -0800 (britney-bot) britney_signoff: Running
2016-02-02 12:10:18 -0800 (britney-bot) britney_signoff: Approved
2016-02-02 10:49:53 -0800 (britney-bot) Updated autopkgtest
2016-02-02 10:49:53 -0800 (britney-bot) britney_signoff: Running
2016-02-02 10:13:03 -0800 (abreu-alexandre) lander_signoff: Approved

Revision history for this message
Robert Bruce Park (robru) wrote :

If people were triggering retries on the failure, I'd expect Failed -> Running -> Failed but Failed -> Approved -> Failed is quite strange.

Martin Pitt (pitti)
Changed in britney:
status: Fix Committed → In Progress
Revision history for this message
Martin Pitt (pitti) wrote :

Finally the more detailed logging has shown something interesting. A successful run looks like this:

I: [Thu Feb 4 17:57:54 2016] - fetch_swift_results: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-ci-train-staging-area-landing-021?delimiter=%40&prefix=xenial%2Famd64%2Fu%2Funity8%2F&marker=xenial%2Famd64%2Fu%2Funity8%2F20160203_151827%40 result 200
I: [Thu Feb 4 17:57:54 2016] - Fetched test result for unity8/8.11+16.04.20160129-0ubuntu1/amd64 20160203_202530@ (triggers: ['webbrowser-app/0.23+16.04.20160202-0ubuntu1']): pass

whereas the next run looks like this:

I: [Thu Feb 4 18:08:00 2016] - fetch_swift_results: https://objectstorage.prodstack4-5.canonical.com/v1/AUTH_77e2ada1e7a84929a74ba3b87153c0ac/autopkgtest-xenial-ci-train-staging-area-landing-021?delimiter=%40&marker=xenial%2Famd64%2Fu%2Funity8%2F20160204_143755%40&prefix=xenial%2Famd64%2Fu%2Funity8%2F result 204
I: [Thu Feb 4 18:08:00 2016] - unity8/amd64 triggered by webbrowser-app/0.23+16.04.20160202-0ubuntu1: still no result after fetching again
I: [Thu Feb 4 18:08:00 2016] - Requesting unity8 autopkgtest on amd64 to verify webbrowser-app/0.23+16.04.20160202-0ubuntu1

The previous one used marker "20160203_151827" as the most recently seen timestamp, while the latter uses "20160204_143755". This is always the most recent result for that package in results.cache -- however, that contains the last results for *Ubuntu*, not for that silo. So whenever there is a new result for package P in Ubuntu, all silos will see that new result, and not see any earlier results of P in their swift container, and thus re-request the test.
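(Editor's note: the failure mechanism can be shown with a minimal model of a swift-style listing, where `marker` means "only return objects lexicographically after this name". This is a deliberately simplified sketch, not britney's fetch code.)

```python
# Minimal model of a swift container listing with a marker: only
# objects sorted after `marker` are returned. If the marker comes from
# the shared results.cache (newest *Ubuntu* result), it can be newer
# than everything in the silo's own container, so the listing comes
# back empty (HTTP 204) and britney wrongly re-requests the test.
def list_new_results(container_results, marker):
    """Return result timestamps newer than `marker` (None = from start)."""
    return [r for r in container_results if marker is None or r > marker]


# The silo container holds one result from Feb 3; the shared cache's
# newest Ubuntu result is from Feb 4.
silo = ['20160203_202530']
assert list_new_results(silo, None) == ['20160203_202530']   # per-silo marker: result seen
assert list_new_results(silo, '20160204_143755') == []       # shared-cache marker: "no result", re-request
```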

Revision history for this message
Martin Pitt (pitti) wrote :

So essentially, in this ADT_SHARED_RESULTS_CACHE mode we must not trust timestamps in results.cache at all, and merely use that to determine REGRESSION vs. ALWAYSFAILED.

Changed in bileto:
status: New → Triaged
Revision history for this message
Martin Pitt (pitti) wrote :

Fixed in http://bazaar.launchpad.net/~ubuntu-release/britney/britney2-ubuntu/revision/573 . As far as I know, britney auto-updates in bileto, so setting that to "fix committed" for now. Let's watch this for a few more days and confirm that it doesn't happen any more.

Changed in britney:
status: In Progress → Fix Released
Changed in bileto:
status: Triaged → Fix Committed
Revision history for this message
Robert Bruce Park (robru) wrote :

Yes bileto pulls in britney hourly.

I've also changed bileto back so that it doesn't wait around for RUNNING-ALWAYSFAIL (as unity8 is), so even if this isn't fixed it may be harder or rarer to notice (e.g. it's possible it'll re-request tests redundantly, but we'll no longer see the flip/flop in the logs unless the test being requested has the potential for a real regression).

Changed in bileto:
status: Fix Committed → Fix Released