Error rate calculations inappropriately include Autopilot/LRT errors

Bug #1324455 reported by Matthew Paul Thomas on 2014-05-29
This bug affects 1 person
Assigned to: Brian Murray

Bug Description

When Autopilot or LRT triggers a crash, the error is submitted to just as it would be if the error was experienced by a machine in normal use.

The error is then, as usual, included in the error rate calculations and the occurrence counts. But both of these are problems.

The calculated error rate is naturally understood (or would be, if the axis was labelled) as the number of errors per day experienced by a machine in normal use. But the point of automated tests is that they encounter errors much more quickly than normal use will. This is an example of Campbell's Law: reducing the error rate is a good thing, but we could reduce the measured error rate by not running the automated tests any more, which would be a bad thing.

Including fuzzer errors in the occurrence counts is also a bad thing, because it may lead to poor prioritization of fixes. For example, imagine there was a crash whenever you switched apps less than half a second after revealing the Launcher. Humans would seldom encounter errors like this, but a fuzzer often would, and would submit it many times. So it would rank highly in the occurrences table, misleading developers into thinking that it was more important than errors humans are encountering more often.

If the automated error reports are a drop in the bucket, neither of these things matters, so this bug can be marked Won't Fix.

Otherwise, either Autopilot and LRT should override Apport's usual behavior, and report bugs directly to Launchpad rather than submitting errors; or Errors should ignore Autopilot/LRT reports when calculating error rates and counting occurrences. The latter would be more complicated, but would have the advantage that the automated tools might sometimes provide the only evidence that a bug remains unfixed in a new package version.

Matthew Paul Thomas (mpt) wrote :

Brian points out that the same problem affects Autopilot. It's reporting errors to that inflate the measured error rate.

description: updated
summary: - Error rate calculations inappropriately include fuzzer errors
+ Error rate calculations inappropriately include Autopilot/LRT errors
Evan (ev) wrote :

Brian and I agreed that automated testing systems should change their CRASH_DB_IDENTIFIER to start with "testing" such that we can filter them out server side from incrementing counters.

Evan (ev) wrote :

For the sake of consistency, we'll say that it needs to start with deadbeef
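The server-side filtering described above can be sketched as follows. This is only an illustration in shell; the actual server code is not shown in this report, and the function names `is_test_identifier` and `count_real_reports` are hypothetical.

```shell
# Hypothetical helper: succeeds when the identifier belongs to an
# automated testing system, i.e. when it starts with "deadbeef".
is_test_identifier() {
    case "$1" in
        deadbeef*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical counter: counts only the reports whose identifier does
# not carry the testing prefix, so test systems never increment it.
count_real_reports() {
    count=0
    for id in "$@"; do
        if ! is_test_identifier "$id"; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Called as `count_real_reports deadbeef0123 abc123`, the sketch counts only the second identifier.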

Any advice on how we do that?

Brian Murray (brian-murray) wrote :

It can be set in either /etc/init/whoopsie.conf or as an environment variable when whoopsie is started.
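A minimal sketch of the first option, assuming the job file accepts an `env CRASH_DB_IDENTIFIER=...` stanza appended at its end. The helper name `set_crash_db_identifier` is hypothetical, and it is safest to try it on a copy of the file first.

```shell
# Hypothetical helper: write the identifier into an upstart job file.
# Usage: set_crash_db_identifier <job-file> <identifier>
set_crash_db_identifier() {
    conf_file=$1
    identifier=$2
    # Remove any previous CRASH_DB_IDENTIFIER line so only one remains.
    sed '/CRASH_DB_IDENTIFIER/d' "$conf_file" > "$conf_file.new" &&
        mv "$conf_file.new" "$conf_file"
    # Append the new setting as an env stanza.
    printf 'env CRASH_DB_IDENTIFIER=%s\n' "$identifier" >> "$conf_file"
}
```

whoopsie has to be restarted (or the machine rebooted) before the new value takes effect.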

no longer affects: autopilot
Chris Gagnon (chris.gagnon) wrote :

I'll update my crash ids to start with deadbeef

affects: errors → daisy
Changed in daisy:
status: New → In Progress
assignee: nobody → Brian Murray (brian-murray)
importance: Undecided → High

On Thu, Aug 21, 2014 at 11:49:03PM -0000, Thomi Richards wrote:
> ** No longer affects: autopilot

Does that mean you've set up autopilot to use the suggested
CRASH_DB_IDENTIFIER or does it mean something else?

Brian Murray

Chris Gagnon (chris.gagnon) wrote :

I've moved my crash identifiers back to start without deadbeef until comment 8 can be fixed.

This is the code I use to set the id

exec_with_adb "sed -i '/CRASH_DB_IDENTIFIER/d' /etc/init/whoopsie.conf"
exec_with_adb "sed -i '/env CRASH_DB_URL=https:\/\/ env CRASH_DB_IDENTIFIER=$CRASH_ID' /etc/init/whoopsie.conf"
exec_with_adb "reboot"

Brian Murray (brian-murray) wrote :

On Fri, Aug 29, 2014 at 01:16:48PM -0000, Chris Gagnon wrote:
> I've moved my crash identifiers back to start without deadbeef until
> comment 8 can be fixed.
> This is the code I use to set the id
> exec_with_adb "sed -i '/CRASH_DB_IDENTIFIER/d' /etc/init/whoopsie.conf"
> exec_with_adb "sed -i '/env CRASH_DB_URL=https:\/\/ env CRASH_DB_IDENTIFIER=$CRASH_ID' /etc/init/whoopsie.conf"
> exec_with_adb "reboot"

Um, where does $CRASH_ID get set?

Brian Murray

Brian Murray (brian-murray) wrote :

The changes have been deployed on the daisy frontends and the retracers now.

Chris Gagnon (chris.gagnon) wrote :

the id gets set earlier in the script:

if [ $test_to_run == "lrt.test_random_gestures" ]; then

if [ $test_to_run == "lrt.test_switch" ]; then
echo $CRASH_ID

if [ $test_to_run == "lrt.test_ap_core_apps" ]; then

I'll try again with the string starting with deadbeef

Chris Gagnon (chris.gagnon) wrote :

Changing the string to start with deadbeef causes the system identifier to be dropped from the report like in comment #8 again.

Brian Murray (brian-murray) wrote :

Ah, it's because you've prepended 'deadbeef' to the crash id, making it too long. You need to replace the first 8 characters with 'deadbeef' instead.
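The difference between prepending and replacing can be sketched like this. The helper name `mark_as_test_id` is hypothetical, and the sketch assumes the original identifier is at least 8 characters long.

```shell
# Hypothetical helper: mark an identifier as coming from automated
# testing by overwriting its first 8 characters with "deadbeef", so
# the overall length stays the same.
mark_as_test_id() {
    original_id=$1
    # ${var#????????} strips the first 8 characters (POSIX expansion).
    printf 'deadbeef%s\n' "${original_id#????????}"
}
```

Prepending would instead produce `deadbeef$original_id`, which is 8 characters longer than the original identifier.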

Chris Gagnon (chris.gagnon) wrote :

This has been working now that the id is not too long.

Changed in lrt:
status: New → Fix Released
Changed in daisy:
status: In Progress → Fix Released
Steve Langasek (vorlon) wrote :

FWIW I disagree with the change that was made for this bug. The net effect is that bugs that were being discovered automatically, and might happen quite frequently under test, are now hidden from view of the developers - and yet internally, developers are still being asked to fix the bugs found by automated tests.

Every crash that's found in autotesting is a real crash. Particularly while the real userbase of the phone is small, it's important to surface all of these crashes even if they've only ever been seen in the lab. If the crashes seen in the lab are skewing the statistics, there's one sure-fire way to correct this: drive the number of crashes in the lab down to zero!

Also, while the automated tests may skew the crash counts overall, one place where they shouldn't be skewing is on the per-image / per-rootfs counts - because each combination is usually only tested once, or a small number of times. So including automated tests in these counts will provide a much better indicator of image quality than omitting them.

Matthew Paul Thomas (mpt) wrote :

Steve, no-one disputes that auto-testing crashes are real crashes. But the purpose of any defect tracker is to help developers make best use of their time, and driving "the number of crashes in the lab down to zero" is not necessarily the best use of their time. Imagine that crash A is triggered by humans once a day on average, but by LRT once an hour on average, while crash B is hourly for humans and daily for LRT. If an engineer has time to fix one of those for a particular release, and the tracker leads them to fix A instead of B, the tracker has failed.
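The arithmetic behind the example can be worked through as a rough sketch, assuming one human-operated machine and one LRT machine that both run around the clock (an assumption not stated in the comment):

```shell
# Occurrences per day, assuming "hourly" means 24 times a day.
hours_per_day=24

a_human=1                 # crash A: once a day for humans
a_lrt=$hours_per_day      # crash A: once an hour for LRT
b_human=$hours_per_day    # crash B: once an hour for humans
b_lrt=1                   # crash B: once a day for LRT

a_total=$((a_human + a_lrt))
b_total=$((b_human + b_lrt))

echo "A: total=$a_total human-only=$a_human"
echo "B: total=$b_total human-only=$b_human"
```

Under these assumptions the combined counts tie at 25 per day, so the table cannot tell A from B, while the human-only counts show B occurring 24 times as often as A.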
