Clarify when it's worth creating an elastic-recheck query

Bug #1408259 reported by James Polley
This bug affects 1 person
Affects                         Status       Importance  Assigned to    Milestone
OpenStack Core Infrastructure   In Progress  Undecided   James Polley
OpenStack-Gate                  New          Undecided   Unassigned

Bug Description

In https://review.openstack.org/#/c/141043/, patch set 3 failed the check-tripleo-ironic-overcloud-precise-nonha check.

Jenkins adds a note suggesting that http://docs.openstack.org/infra/manual/developers.html#automated-testing has instructions on how to proceed.

Those instructions tell the reader:

    If a nice message from Elastic Recheck didn’t show up in your change when Jenkins failed, and you’ve identified a bug to recheck against, help out by writing an elastic-recheck query for the bug.

In this case, a nice message didn't show up, so I filed https://review.openstack.org/#/c/144090, which was eventually abandoned because (A) the bug was fixed long before the new query was reviewed, and (B) it seems that the affected projects aren't even tracked by elastic-recheck, so the query would have done nothing.

It's probably too difficult to have Jenkins vary what it posts, but I think it would help if the docs were clearer about when it's worth filing an elastic-recheck query. Presumably there's something the reader could check to see whether the targeted project is tracked by elastic-recheck, and it sounds like it's usually not worth filing a query if the bug is expected to be fixed within a few days.

Matt Riedemann (mriedem) wrote:

James, I was looking for the same thing this week: how can we tell whether e-r is even going to report on a failure in a given job?

This is a good place to start:

http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/elastic_recheck/elasticRecheck.py#n202

Which is called from here:

http://git.openstack.org/cgit/openstack-infra/elastic-recheck/tree/elastic_recheck/elasticRecheck.py#n309

Those show that, for e-r to report on it, the failure has to be in a voting job in an OpenStack project.
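As a rough illustration of that filter, here is a minimal Python sketch. It is not the code in elasticRecheck.py: the "Build failed" comment line format, the "(non-voting)" marker, the "openstack/" prefix used as the OpenStack-project test, the example job names and log URLs, and the function name failed_voting_jobs are all assumptions made for illustration.

```python
import re

# Assumed (hypothetical) format of one job line in a Jenkins "Build failed"
# comment, e.g.:
#   - gate-heat-pep8 http://logs.example.org/a : FAILURE in 2m 01s (non-voting)
JOB_LINE = re.compile(r"^- (?P<job>[\w.-]+) (?P<url>\S+) : (?P<result>[A-Z_]+)")


def failed_voting_jobs(project, comment):
    """Return {job_name: log_url} for failed voting jobs e-r might classify.

    The "openstack/" prefix check is an assumed stand-in for the real
    "is this an OpenStack project" test.
    """
    if not project.startswith("openstack/"):
        return {}
    failed = {}
    for line in comment.splitlines():
        if "(non-voting)" in line:
            continue  # non-voting failures don't block merging, so skip them
        match = JOB_LINE.match(line.strip())
        if match and match.group("result") == "FAILURE":
            failed[match.group("job")] = match.group("url")
    return failed


# Hypothetical comment on a Heat change that also runs the TripleO job.
comment = """Build failed.

- gate-heat-pep8 http://logs.example.org/a : SUCCESS in 2m 01s
- check-tripleo-ironic-overcloud-precise-nonha http://logs.example.org/b : FAILURE in 1h 05m
- check-heat-example-job http://logs.example.org/c : FAILURE in 9m 12s (non-voting)
"""
print(failed_voting_jobs("openstack/heat", comment))
# {'check-tripleo-ironic-overcloud-precise-nonha': 'http://logs.example.org/b'}
```

Under these assumptions, only the voting TripleO failure survives the filter, which matches the point that check-tripleo-ironic-overcloud-precise-nonha is the kind of job e-r would comment on.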

So my mistake in reviewing https://review.openstack.org/#/c/144090 was that I didn't realize check-tripleo-ironic-overcloud-precise-nonha was a voting job, so we could/should have taken the query.

When I reviewed the change, I saw hits in logstash but none in the gate queue, and we could tell the bug was already fixed. Normally an elastic-recheck query for a fixed bug is only useful for reducing the share of uncategorized failures reported here:

http://status.openstack.org/elastic-recheck/data/uncategorized.html

That page only counts failures of jobs in the gate queue.
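To make the consequence concrete, here is a minimal sketch of a gate-only uncategorized rate. It assumes each failure is a record carrying a build_queue value (as logstash records have) plus an optional matched bug; the Failure type, the uncategorized_gate_rate function, the gate-example-job name and bug number 12345 are hypothetical, and this is not elastic-recheck's actual computation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Failure:
    build_name: str
    build_queue: str            # e.g. "check" or "gate"
    bug: Optional[int] = None   # bug matched by an e-r query, if any


def uncategorized_gate_rate(failures):
    """Fraction of gate-queue failures that no e-r query has matched."""
    gate = [f for f in failures if f.build_queue == "gate"]
    if not gate:
        return 0.0
    return sum(1 for f in gate if f.bug is None) / len(gate)


failures = [
    # Uncategorized, but check queue only: ignored by the gate-only rate.
    Failure("check-tripleo-ironic-overcloud-precise-nonha", "check"),
    Failure("gate-example-job", "gate", bug=12345),  # hypothetical bug number
    Failure("gate-example-job", "gate"),
]
print(uncategorized_gate_rate(failures))  # 0.5
```

Writing a query that matches the check-queue failure would leave the result at 0.5, which is why a fixed, check-only bug adds little to this page.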

When I search for that failing job in the gate queue, I don't get any hits:

http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9uYW1lOlwiY2hlY2stdHJpcGxlby1pcm9uaWMtb3ZlcmNsb3VkLXByZWNpc2Utbm9uaGFcIiBBTkQgYnVpbGRfcXVldWU6XCJnYXRlXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MjA2NTA4NDg5MjZ9
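The part of that logstash link after "#" appears to be base64-encoded JSON holding the search state, so the exact query can be read back out of it or a new link built from a query. A minimal Python sketch under that assumption (the two function names are hypothetical helpers, not part of any logstash or elastic-recheck tool):

```python
import base64
import json


def decode_logstash_fragment(url):
    """Decode the base64 JSON state after '#' in a logstash link."""
    fragment = url.split("#", 1)[1]
    return json.loads(base64.b64decode(fragment))


def encode_logstash_fragment(state):
    """Inverse of decode_logstash_fragment: compact JSON, then base64."""
    raw = json.dumps(state, separators=(",", ":")).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


url = ("http://logstash.openstack.org/#eyJzZWFyY2giOiJidWlsZF9uYW1lOlwiY2hlY2st"
       "dHJpcGxlby1pcm9uaWMtb3ZlcmNsb3VkLXByZWNpc2Utbm9uaGFcIiBBTkQgYnVpbGRfcXVl"
       "dWU6XCJnYXRlXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgw"
       "MCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3Rh"
       "bXAiOjE0MjA2NTA4NDg5MjZ9")
state = decode_logstash_fragment(url)
print(state["search"])
# build_name:"check-tripleo-ironic-overcloud-precise-nonha" AND build_queue:"gate"
print(state["timeframe"])  # 604800, presumably seconds (7 days)
```

The decoded search is just a build_name filter combined with build_queue:"gate", so the same check can be repeated for any job by editing the search string and re-encoding it.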

And it doesn't show up on the uncategorized bugs page, so I figured it wasn't worth tracking since anything that hit it had already rechecked.

So a few things:

1. Does check-tripleo-ironic-overcloud-precise-nonha run in the gate queue? I don't see a gate-tripleo-ironic-overcloud-precise-nonha job so I'm assuming it doesn't.

2. It's a voting job, so that's OK. I wish logstash queries had a field telling us whether a job is voting, the way the build_queue field tells us check or gate (or experimental, for that matter).

3. If the bug is fixed and isn't in the gate queue, it's not on the uncategorized bugs list so there isn't a huge reason to add an e-r query for it.
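The three points above amount to a rough rule of thumb, sketched here in Python. This is one reading of this thread, not an official elastic-recheck policy; the function name and its parameters are hypothetical.

```python
def worth_adding_query(*, in_openstack_project, job_is_voting,
                       runs_in_gate_queue, bug_is_fixed):
    """Rough rule of thumb from this thread (hypothetical, not official)."""
    if not (in_openstack_project and job_is_voting):
        return False  # e-r would never report on this failure anyway
    if runs_in_gate_queue:
        return True   # gate failures count on the uncategorized page
    # Check-queue failures never reach the uncategorized page, so a query
    # mainly helps people identify the bug while it is still open.
    return not bug_is_fixed


# The situation in this bug at review time: voting, check queue only, fixed.
print(worth_adding_query(in_openstack_project=True, job_is_voting=True,
                         runs_in_gate_queue=False, bug_is_fixed=True))
# False
```

Treating the check-only, still-open case as "worth it" is a judgment call; the thread only states that the fixed, check-only case isn't.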

--

Regarding solutions/next steps, it should be possible to implement #2 but would probably require direction from the infra team, e.g. clarkb or sdague.

For the rest of this, we could probably simply update the elastic-recheck readme since that has information on writing queries:

http://docs.openstack.org/infra/elastic-recheck/readme.html

That doesn't mention anything today about voting vs non-voting jobs, nor does it mention the uncategorized bugs page and how a fixed bug that's only hit in the check queue probably isn't worth classifying. If you want to take a crack at pushing a change to the e-r readme, I'd gladly review it to make this clearer.

tags: added: documentation
James Polley (tchaypo) wrote:

Sorry that it's taken me so long to respond to this. Most of the delay has been caused by me wanting to take the time to write a careful response: I'm very impressed by the elastic-recheck system, and I'm very grateful to the people who implemented it, run it, and have made it so easy to add rules. I'm aware that anything I'm about to be unhappy about is, at worst, a minor docbug on a system that is, on the whole, very successful. I'm worried that I'm going to sound snarky, so I've delayed responding until I had time to write very carefully and try to make sure that I convey the right meaning with my words...

but some of the delay is because I had a wonderful response half-written and then I hit command-q instead of command-w and lost it all. Yay me!

So this is the rushed re-typing of what I can remember of my first response. I'm sorry if it comes across as me being grumpy or unhappy or upset - that's not how I feel, but I have a sense of humour that often interferes with my attempts to talk calmly about things.

"Normally an elastic-recheck query for a fixed bug" - it wasn't fixed at the time I filed the review that would have added the rule. I think I'd identified the problem and had a potential fix up for review - but it wasn't fixed. I got the impression that the elastic-recheck rule would be processed with some priority, so I thought it likely that the rule would land before the fix did - but this turned out to be false.

The existing docs suggest to me that this process should be followed for all bugs with no exceptions, but Matt's response here suggests that in reality we only expect queries to be added for long-running issues with no immediately obvious fixes. Perhaps it's worth adding a note somewhere (perhaps http://docs.openstack.org/infra/manual/developers.html#automated-testing) explaining that it's only worth adding a query for bugs that have no immediately obvious fixes?

"Does check-tripleo-ironic-overcloud-precise-nonha run in the gate queue?" I don't believe it does; certainly it doesn't on the tripleo-incubator project. I'm not sure why that's relevant, though: http://docs.openstack.org/infra/manual/developers.html#automated-testing says "If a change fails tests in Jenkins, please follow the steps below:". If only gate tests matter, perhaps that should say "If a change fails gate tests in Jenkins, please follow the steps below:"?

"If the bug is fixed and isn't in the gate queue, it's not on the uncategorized bugs list so there isn't a huge reason to add an e-r query for it." There was a reason for me. We had many changes across many projects (not just TripleO projects: https://review.openstack.org/#/c/141043/ was a change on Heat, but it ran the TripleO checks, for instance) that had failed this check. I was under the impression that landing an elastic-recheck query would trigger an automatic recheck for all of those changes, which would be simpler (for me) than finding them all and requesting rechecks manually.

But in hindsight I'm not sure why I thought that - it seems more logical that the new elastic-recheck rule would be only applied on new changes as they failed, rather than retroactively being applied to old ch...


Tom Fifield (fifieldt)
affects: openstack-manuals → openstack-ci
OpenStack Infra (hudson-openstack) wrote: Fix proposed to infra-manual (master)

Fix proposed to branch: master
Review: https://review.openstack.org/151221

Changed in openstack-ci:
assignee: nobody → James Polley (tchaypo)
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix merged to infra-manual (master)

Reviewed: https://review.openstack.org/151221
Committed: https://git.openstack.org/cgit/openstack-infra/infra-manual/commit/?id=1bbd1dc7d970f255fa2bd8fd8ac1cea0434bde29
Submitter: Jenkins
Branch: master

commit 1bbd1dc7d970f255fa2bd8fd8ac1cea0434bde29
Author: James Polley <email address hidden>
Date: Mon Jan 26 16:06:23 2015 +0000

    Add clarifications about when to file an ER query

    This update aims to clarify that although we want queries for all bugs
    that affect the gate queues, we aren't as interested in rechecks for
    bugs that don't affect gate queues.

    Change-Id: Ic29d45869fee1eecae9646193e0b7cc78004d91d
    Partial-bug: 1408259
