oops reports grouping (exception type and exception value) does not collate reports where the value is not really relevant

Bug #461269 reported by Diogo Matsubara
This bug affects 2 people
Affects: python-oops-tools
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Currently, oops reports are grouped by exception type and exception value in oops summaries. They should be grouped by the oops signature value instead.

Changed in oops-tools:
status: New → Triaged
importance: Undecided → High
Ursula Junque (ursinha) wrote :

This has the side effect of making the oops-bug link useless for timeouts, along with pretty much the entire timeouts section. Because different timeouts are grouped into the same infestation, it's impossible to know how many distinct problems we have to fix.

Example: we had lots of timeouts like OOPS-1662A1099 (see attachments). A bug search on Launchpad finds bug 424671. But the oops itself is linked to bug 607879, which is a timeout in the +participation page, because the two looked similar to oops-tools even though they are not related at all.

Gary Poster (gary) wrote :

As I discussed with Diogo, I am not convinced that this will be an improvement. We would like better OOPS grouping (into "infestations"), but this proposal simply makes the grouping more granular along a particular axis. In discussion, Diogo had no data analysis to indicate that this would in fact result in better grouping. It is another heuristic, and good heuristics are a matter of statistics, and we don't have any statistics.

I think working on OOPS grouping will involve two components. The first will be to try to identify better "first guess" heuristics by doing some OOPS analyses. The second will be to provide a way to dynamically teach the oops tools to group things differently when the heuristic, inevitably, fails. That might be ways to link OOPS signatures into a single group, Bayesian approaches, regexes, or something else.

One idea: we come up with a number of axes for signatures, like exception type, exception value, pageid, normalized full traceback, and so on. Each axis has a unique weight. An infestation is identified, by default, by only a couple of axes (perhaps exception type and normalized exception value, as it is now). It links to the collected values in the other axes that it contains. When you want to teach the oops tools that an oops should be grouped differently, you can change the axes used for identifying an infestation, and you can specify one or more matching values. There is a constraint that infestation signature rules cannot overlap (no two can match the exact same signature). When you get a new OOPS, you find the infestations that match on the different axes. The infestation with the most matches wins. If two infestations have the same number of matches, the one with the heaviest differing axis wins. It's an idea; maybe not a good one. It's pretty manual, for one thing, and would require a non-trivial amount of work.
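
To make the matching rule concrete, here is a minimal sketch in Python. It is not oops-tools code. The axis weights, the `InfestationRule` class, the function names, and the pageid values are all hypothetical. It also assumes one particular reading of the idea: a rule is only a candidate when every axis it names matches the incoming OOPS.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical weights, one per axis. Each weight must be unique so that
# ties can always be broken.
AXIS_WEIGHTS: Dict[str, int] = {
    "exception_type": 50,
    "exception_value": 40,
    "pageid": 30,
    "url": 20,
    "http_method": 10,
}


@dataclass
class InfestationRule:
    """An infestation identified by required values on a few axes."""
    name: str
    criteria: Dict[str, str]  # axis name -> value that must match


def check_no_overlap(rules: List[InfestationRule]) -> None:
    """Enforce the constraint that no two rules use the exact same signature."""
    seen = set()
    for rule in rules:
        signature = tuple(sorted(rule.criteria.items()))
        if signature in seen:
            raise ValueError(f"duplicate signature in rule {rule.name!r}")
        seen.add(signature)


def rank(rule: InfestationRule) -> Tuple[int, Tuple[int, ...]]:
    """More matched axes wins; ties go to the heaviest differing axis.

    With unique weights, comparing the weights sorted heaviest-first is
    equivalent to comparing the heaviest axis the two rules do not share.
    """
    weights = sorted((AXIS_WEIGHTS[axis] for axis in rule.criteria), reverse=True)
    return (len(weights), tuple(weights))


def find_infestation(
    oops: Dict[str, str], rules: List[InfestationRule]
) -> Optional[InfestationRule]:
    """Return the best-matching rule for an OOPS, or None if nothing matches."""
    candidates = [
        rule
        for rule in rules
        if all(oops.get(axis) == value for axis, value in rule.criteria.items())
    ]
    return max(candidates, key=rank) if candidates else None


# Hypothetical rules and pageid values, for illustration only.
RULES = [
    InfestationRule("generic-timeout", {"exception_type": "TimeoutError"}),
    InfestationRule(
        "participation-timeout",
        {"exception_type": "TimeoutError", "pageid": "Person:+participation"},
    ),
    InfestationRule(
        "post-timeout",
        {"exception_type": "TimeoutError", "http_method": "POST"},
    ),
]
check_no_overlap(RULES)
```

In this sketch, a timeout on the hypothetical `Person:+participation` pageid sent with `POST` matches all three rules. The two-axis rules beat the generic one, and `participation-timeout` beats `post-timeout` because pageid is weighted above http_method.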

Anyway, this is not a clear-cut problem in my mind.

Robert Collins (lifeless) wrote : Re: [Bug 461269] Re: oops reports should be grouped by oops signature not exception type and exception value

I agree it's not a clear-cut problem, and I've looked at the code a
little - it's non-trivial.

What might be a related useful step is the ability to break a wrong
infestation and supply a specific bug.

However, all that said: this is a pressing problem. Perhaps the most
important thing is to stop TimeOuts grouping together for no good
reason.

Gary Poster (gary) wrote : Re: oops reports should be grouped by oops signature not exception type and exception value

Diogo, Ursula and I talked about this bug today.

- Non-timeout exceptions are (almost?) always just fine now. The problem is primarily timeout exceptions, and primarily improper grouping, as Robert describes it in comment 3 above.

- Ursula will be performing an analysis of some live OOPS data to see if merely adding pageid as a discriminator to infestations will divide timeout exceptions properly, without causing problems for the non-timeout exceptions. If this analysis reveals that the change would be a win, the code change should be small.

- We hope that the previous bullet point will lead to a sufficient solution. We do not want to pursue solutions like the one I described in comment 2. However, if we did, looking at the oops model, these are the "axes" that could be used:
  * pageid
  * url (vhost/path)
  * prefix / appinstance (group of prefixes)
  * http_method
  * user_agent
  * most_expensive_statement
  * is_bot
  * is_local_referrer
  * classification (used in our report)
  * exception value
  * exception type

- We also touched on how to identify infestations for teams better (bug 592355). I'll add notes there as well. However, part of this discussion included adding the ability to change the bug associated with an infestation, which is not possible through the UI now, and is close(?) to what Robert described in the second paragraph of comment 3.
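
The pageid analysis mentioned above could start from something like the following sketch, which groups the same reports under the current key and under the proposed key and compares the results. The field names follow the axes listed above, but `group_reports` and the sample reports are invented for illustration and are not real OOPS data.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

CURRENT_AXES = ("exception_type", "exception_value")
PROPOSED_AXES = CURRENT_AXES + ("pageid",)


def group_reports(
    reports: List[Dict[str, str]], axes: Sequence[str]
) -> Dict[Tuple[str, ...], List[str]]:
    """Group OOPS ids by the tuple of values found on the given axes."""
    groups = defaultdict(list)
    for report in reports:
        key = tuple(report.get(axis, "") for axis in axes)
        groups[key].append(report["oops_id"])
    return dict(groups)


# Invented sample data: three timeouts on two different pages, plus two
# non-timeout errors that already group correctly.
SAMPLE_REPORTS = [
    {"oops_id": "OOPS-1", "exception_type": "TimeoutError",
     "exception_value": "request timed out", "pageid": "Person:+participation"},
    {"oops_id": "OOPS-2", "exception_type": "TimeoutError",
     "exception_value": "request timed out", "pageid": "BugTask:+index"},
    {"oops_id": "OOPS-3", "exception_type": "TimeoutError",
     "exception_value": "request timed out", "pageid": "Person:+participation"},
    {"oops_id": "OOPS-4", "exception_type": "KeyError",
     "exception_value": "'name'", "pageid": "Person:+index"},
    {"oops_id": "OOPS-5", "exception_type": "KeyError",
     "exception_value": "'name'", "pageid": "Person:+index"},
]

current = group_reports(SAMPLE_REPORTS, CURRENT_AXES)
proposed = group_reports(SAMPLE_REPORTS, PROPOSED_AXES)

print(f"current grouping:  {len(current)} infestations")
print(f"proposed grouping: {len(proposed)} infestations")
```

On this invented sample the timeouts split by page while the `KeyError` reports stay together. Whether real data behaves the same way is exactly what the analysis would need to show.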

affects: oops-tools → python-oops-tools
summary: - oops reports should be grouped by oops signature not exception type and
- exception value
+ oops reports grouping (exception type and exception value) does not
+ collate reports where the value is not really relevant