make it easier to reset baseline for autopkgtests that regress in release

Bug #1700668 reported by Steve Langasek on 2017-06-26
This bug affects 2 people
Affects: britney
Importance: Undecided
Assigned to: Unassigned

Bug Description

Currently, if an autopkgtest has ever passed for a particular package on a given architecture, any future failures of that autopkgtest are treated as a regression.
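
To make the current rule concrete, here is a minimal sketch of it. The function name and the status strings are illustrative assumptions, not britney's actual code; the only thing taken from the description above is that a single historical pass turns every later failure into a regression.

```python
from typing import Iterable


def classify(current_passed: bool, history: Iterable[bool]) -> str:
    """Classify a test result under the 'ever passed' rule.

    `history` holds the pass/fail outcome of every earlier run of this
    test for the same package and architecture (hypothetical layout).
    """
    if current_passed:
        return "PASS"
    # Any past success, however old, makes this failure a regression.
    if any(history):
        return "REGRESSION"
    return "ALWAYSFAIL"
```

Under this rule, `classify(False, [True, False, False])` is a regression even though the two most recent runs already failed, which is the situation this bug is about.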

In some cases, an autopkgtest will have succeeded in the past, but now fails even in the release pocket. This can happen for a variety of reasons: changes to the autopkgtest infrastructure, newer versions of undeclared test dependencies making it into the release, or newer versions of indirect dependencies making it into the release.

When I see a regression blocking a package in proposed-migration, I will often retry the test against the release pocket alone to confirm that this is actually a regression in the release. If it fails, I will add a force-badtest hint for the confirmed-bad test in release.

It occurs to me that the second step may be redundant, if we instead reset the baseline pass/fail of the test to the most recent run of the current release version of the test. This would reduce the need for the release team to spend so much time managing hints for failing autopkgtests.

- if the package is confirmed to fail in release, it will not block other packages from migrating.
- if a new version of the package is uploaded, and the test still fails, the new version will also be allowed to migrate without the need for bumping hints.
- if a package test is flaky, we don't have to hammer the retry button repeatedly for each reverse-dependency (though we can, if we care about confirming that the test suite *can* pass, or if this is easier because the testsuite usually passes); we can retry for the release pocket and reset the baseline.
- but if a flaky test has passed in the most recent version, the default is that it will still gate, so we don't have to worry about this causing us to blindly accept regressions.
- if a test has regressed due to a regression in the infrastructure which will later be fixed, no special handling is required to re-enable gating when the test starts passing again in the same package version (i.e. no inaccurate 'badtest' hint to be removed).
- release team members no longer have to choose between badtesting a single version (which requires active management when a new version lands in -proposed) and hinting all versions (which leaves us blind to actual regressions).
- indeed, the release team can be hands-off as a whole, since anyone with autopkgtest retry privs can do this.
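
The proposal above could be sketched roughly as follows. This is only an illustration under stated assumptions: the `Run` record, its fields, and both function names are hypothetical, and the choice to keep gating when no release-only run exists is my assumption rather than something settled in this bug.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Run:
    version: str        # version of the package whose test was run
    release_only: bool  # True if run against the release pocket alone
    passed: bool


def baseline(runs: Sequence[Run], release_version: str) -> Optional[bool]:
    """Result of the most recent release-only run of the release version.

    `runs` is ordered oldest first. Returns None if no such run exists.
    """
    for run in reversed(runs):
        if run.release_only and run.version == release_version:
            return run.passed
    return None


def blocks_migration(candidate_passed: bool,
                     runs: Sequence[Run],
                     release_version: str) -> bool:
    """Decide whether a failing test should hold back a candidate."""
    if candidate_passed:
        return False
    # Only a confirmed failure in release lifts the gate. A passing or
    # unknown baseline keeps gating (the unknown case is an assumption).
    return baseline(runs, release_version) is not False
```

With this shape, retrying the test against the release pocket is all it takes to reset the baseline: a fresh failure there stops the test from blocking, and a later pass on the same version re-enables gating without any hint to remove.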

Steve Langasek (vorlon) wrote :

As a concrete example of this, the latest perl upload triggered test failures on 34 reverse-dependencies, all of which were reproducible with the packages in the artful release pocket - at least on amd64, which is the only architecture I took the time to analyze, and that alone took me upwards of an hour.

Having a mechanism to reset the baseline would make better use of release team time than this manual, highly error-prone, and incomplete analysis of build failures.

Iain Lane (laney) wrote :

I think this might be a dupe of bug #1688516 - please mark it if so.

I do think that 'package got broken in release' shouldn't be an automatic force-badtest. We need some level of pressure on people to fix tests, and the fact that a breakage managed to slip through proposed-migration isn't by itself a good enough reason, to me, to let a package pass forevermore.

Jeremy Bicha (jbicha) wrote :

Here's another example of a test that only passed once (well, twice):

https://autopkgtest.ubuntu.com//packages/ostree/zesty/amd64

Steve Langasek (vorlon) wrote :

We've discussed this further on IRC and I don't think we've converged on a consensus just yet, but I'll lay out my position here for the bug log:

A regression in an autopkgtest in the release pocket means our gate has failed. We absolutely should care about that, and analyze why it happened in order to try to prevent it in the future - with the understanding that we will never have 100% success (due to things like kernels in LTS releases not gating on devel userspace, etc.).

*When* the gate fails, we should acknowledge that this is the new (slightly worse) status quo for the release pocket, and move forward. We should not penalize packages in -proposed for these unrelated regressions; we should not penalize developers who are managing transitions in -proposed by making it critical path that they resolve these regressions; we should not penalize the release team by making them manage override hints for these regressions. Time spent managing override hints is time *not* spent making Ubuntu better; and while time spent fixing the test regressions is improving the quality of Ubuntu (either now or for the future), tying the fixing of autopkgtests to proposed-migration when those test failures do not represent a regression between the release pocket and -proposed is an artificial prioritization of that work, and IMHO not in the spirit of the gate as designed.

AIUI there is a range of practices today across the release team regarding these problems. Some release team members will tend to add 'skiptest' hints when the failure rate for important packages is 'good enough' without necessarily analyzing the individual failures to understand if they are true regressions. Some will dig in and 'badtest' those packages that they confirm have regressed in the release. Some will go further and try to resolve the regressions, even if they've already landed in release. That we have this range of practices today tells me that the p-m ratchet is already not very effective at driving fixes of those in-release test regressions. I therefore think we should solve that elsewhere, and implement a policy for p-m that requires minimal release team management.

This should eliminate the need for the majority of force-badtest / force-skiptest hints by the release team, and actually allow us to be more *strict* about their use than we have been.

On Wed, Jun 28, 2017 at 11:43:49PM -0000, Steve Langasek wrote:
> We've discussed this further on IRC and I don't think we've converged on
> a consensus just yet, but I'll lay out my position here for the bug log.

Thanks for writing this. I just want to say that I'm away for a week and
I'll respond to this when I get back. FWIW, I still have concerns
about this approach and I'll try to lay those out on my return.

(I'll also then review the britney branches if you still need me to.)

--
Iain Lane [ <email address hidden> ]
Debian Developer [ <email address hidden> ]
Ubuntu Developer [ <email address hidden> ]

Simon Quigley (tsimonq2) wrote :

Bump; this is still often a pain to deal with as an uploader who does transitions but isn't a member of the Release Team, especially when KDE packages with flaky autopkgtests are involved (which is another issue).

Paul Gevers (paul-climbing) wrote :

As I am close to enabling autopkgtest results in Debian (albeit in a different form: not as a gate, but as an age shifter) this is very relevant for me. In Debian it will happen far more often than in Ubuntu that a regression moves into testing and remains there. Thus, for Debian I believe I'll have to work along the lines of Steve's reasoning.

My idea is that check_ever_passed considers the version of the package in Debian testing/the release pocket, instead of *all* the results in the past. Unless we come up with something better in this bug, I'll probably implement something along those lines.

Paul Gevers (paul-climbing) wrote :

The patch in my previous message is incomplete. I think it will work alright when a package migrates while having a regression, but the code is insufficient for regressions caused by dependencies migrating. I'll think about it.

Paul Gevers (paul-climbing) wrote :

Debian is now (since yesterday) using my approach of baselining. That is, Debian checks against a baseline that, for PASS, looks at all tests except one's own from unstable. A FAIL test triggered by "migration-reference/0" will reset the baseline for testing to FAIL.

(In Debian, we take all results from the last 7 days into account, so this doesn't work perfectly yet if the regression happened in the last 7 days, but it is good enough for now.)
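
As a rough illustration of the FAIL side of that baseline, here is a minimal sketch. Only the trigger name "migration-reference/0" comes from the comment above; the data layout, function names, and example triggers are hypothetical, and the PASS side and the 7-day window are deliberately left out.

```python
from typing import List, Tuple

# Trigger name taken from the comment above.
REFERENCE_TRIGGER = "migration-reference/0"


def baseline_failed(results: List[Tuple[str, bool]]) -> bool:
    """True if the newest reference run failed.

    `results` is a list of (trigger, passed) pairs, oldest first
    (hypothetical layout). With no reference run, the baseline is
    treated as not failed.
    """
    for trigger, passed in reversed(results):
        if trigger == REFERENCE_TRIGGER:
            return not passed
    return False


def is_regression(candidate_passed: bool,
                  results: List[Tuple[str, bool]]) -> bool:
    """A failure is a regression only if the baseline has not failed."""
    return (not candidate_passed) and not baseline_failed(results)
```

For example, with `[("somepkg/1.0-1", True), ("migration-reference/0", False)]` a new failure is not counted as a regression, while without the reference entry it is.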
