"Errors/day" wrongly depends on how many hours Ubuntu is used

Bug #1046269 reported by Matthew Paul Thomas
This bug affects 2 people

Bug Description

Currently, the graph for "Average errors per day" is per calendar day. This is almost certainly why the graph dips every weekend: apparently people who use Ubuntu on weekdays do so for longer periods than those who use it on weekends.

Measuring by calendar day has some subjective value: "Ubuntu crashed for me twice yesterday" is more memorable than "Ubuntu crashed for me about once every three hours yesterday".

However, there are big problems with measuring things this way. It means, for example, that if people used Ubuntu for shorter periods, the error rate would go down. Conversely, if the proportion of Ubuntu use in workplaces went up -- so people were using it for longer periods on average -- the errors/day rate would go up, even if nothing about the code changed.
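To make the problem concrete, here is a toy model (not based on real crash data) in which the code quality is held fixed as a constant rate of about one error every three hours of use; that rate and the function names are assumptions for illustration. The per-calendar-day figure changes with how long the machine is used, while the per-24-hours-of-use figure does not.

```python
# Assumed for illustration: code quality is fixed, expressed as a constant
# crash rate of about one error every three hours of use.
ERRORS_PER_HOUR = 1 / 3


def errors_per_calendar_day(hours_used: float) -> float:
    """Expected errors on a calendar day when the machine runs `hours_used` hours."""
    return ERRORS_PER_HOUR * hours_used


def errors_per_24h_of_use(hours_used: float) -> float:
    """Expected errors per 24 hours of running time, for the same day."""
    return errors_per_calendar_day(hours_used) * 24 / hours_used


for hours in (3, 9):  # e.g. a short weekend session vs. a long workday
    print(hours, errors_per_calendar_day(hours), errors_per_24h_of_use(hours))
```

The 9-hour day produces three times as many errors per calendar day as the 3-hour day, yet both come out at 8 errors per 24 hours of use, because nothing about the code changed.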

To fix this, we should change "errors/day" to "errors per hour of use", or even "errors per 24 hours of use".

To do that, we will probably need to know -- for each error report submitted -- how long Ubuntu had been running when the error occurred. (This might be as simple as including the output of the "uptime" command in every error report.)
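On Linux, the time since boot is also available from /proc/uptime, whose first field is the number of seconds since boot (the second is idle time). A minimal parsing sketch, assuming a report would carry the raw contents of that file; the function name is hypothetical and this is not taken from whoopsie.

```python
def hours_since_boot(proc_uptime_text: str) -> float:
    """Parse the contents of Linux's /proc/uptime into hours since boot.

    The file holds two numbers: seconds since boot, then idle time.
    Only the first is used here.
    """
    seconds = float(proc_uptime_text.split()[0])
    return seconds / 3600
```

On a running system, `hours_since_boot(open("/proc/uptime").read())` gives the live value.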

We may be able to extrapolate, from the distribution of times, to guess how long an Ubuntu session continues after the last error is submitted. If not, we might also need to know the total duration of the *previous* Ubuntu session, now that it's finished, so we can divide the number of errors that machine reported in *that* session by the length of the session.
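The fallback of dividing a finished session's errors by its length could look like the sketch below. The `Session` record is hypothetical; it assumes the client could report a completed session's total duration together with the number of errors it sent during that session.

```python
from dataclasses import dataclass


@dataclass
class Session:
    """Hypothetical record of one finished boot-to-shutdown session."""
    duration_hours: float  # total length of the session
    errors: int            # error reports sent during the session


def session_rate_per_24h(previous: Session) -> float:
    """Errors per 24 hours of use for a completed session."""
    if previous.duration_hours <= 0:
        raise ValueError("session duration must be positive")
    return previous.errors * 24 / previous.duration_hours
```

For example, an 8-hour session that produced 2 error reports works out to 6 errors per 24 hours of use.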

description: updated
Alistair Buxton (a-j-buxton) wrote:

I've looked at that graph a few times and it has never made much sense to me, but this explanation makes it a bit clearer.

What effect would using uptime have for people who never reboot their system, like me?

Alistair Buxton (a-j-buxton) wrote:

Also, I assume you can't get data from machines that never crash, so fixing bugs can actually make the overall rate increase when those machines stop reporting. That is a form of selection bias.

Matthew Paul Thomas (mpt) wrote:

A machine that never reboots would have been running all of the past 24 hours, so it would count as one full day of use. A machine that had been running for 6 of the past 24 hours would count as only 6/24 of a day of use, so its error count would be divided by 6/24 rather than by 1 -- regardless of how many reboots were required to reach that total of 6 hours. So just knowing uptime since the most recent reboot wouldn't be enough; we need to know uptime over the past day.
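A minimal sketch of "uptime over the past day", assuming each machine could report the boot and shutdown times of its recent sessions (a hypothetical data source; the report does not say how this would be collected). Each machine's running time in the last 24 hours, as a fraction of a day, goes into the denominator, so reboots don't matter, only the total hours.

```python
def hours_running_in_window(sessions, window_start, window_end):
    """Total hours the machine was running inside [window_start, window_end].

    `sessions` is a list of (boot, shutdown) times in hours on one clock;
    a session still in progress can use window_end as its shutdown time.
    Sessions that started before the window are clipped to it.
    """
    total = 0.0
    for boot, shutdown in sessions:
        overlap = min(shutdown, window_end) - max(boot, window_start)
        if overlap > 0:
            total += overlap
    return total


def pooled_errors_per_day_of_use(machines, now):
    """Pooled errors per 24 hours of use over the day ending at `now`.

    `machines` is a list of (errors_in_past_24h, sessions) pairs. Each
    machine contributes its running time, as a fraction of a day, to the
    denominator, so a machine up for 6 of the past 24 hours counts as 6/24.
    """
    total_errors = 0
    machine_days = 0.0
    for errors, sessions in machines:
        total_errors += errors
        machine_days += hours_running_in_window(sessions, now - 24, now) / 24
    if machine_days == 0:
        raise ValueError("no running time in the window")
    return total_errors / machine_days
```

With a never-rebooting machine reporting 2 errors and a machine that ran 6 hours across two boots reporting 1, the pooled rate is 3 / 1.25 = 2.4 errors per day of use.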

That we can't get data from machines that never crash is indeed a difficulty. Currently we assume that "all machines that would report an error if they had one" is roughly equal to "all machines that have reported any errors in the past 90 days". So there would be a selection bias if the error rate ever declined to anywhere near one error per machine every 90 days, and/or if errors were distributed so unevenly among machines that a substantial proportion of reporting machines wouldn't have reported in the past 90 days. Fixing bug 1077122 should fix that. A bigger problem at the moment is a sudden increase in the number of reporting machines, which is bug 1069827.
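The 90-day assumption and its selection bias can be illustrated with a small sketch. The machine IDs and report days are made up, and the day arithmetic is an assumed simplification of how the real denominator is computed.

```python
def active_machines(last_report_day, today, window=90):
    """Count machines whose most recent error report is within `window` days."""
    return sum(1 for day in last_report_day.values() if 0 <= today - day < window)


def errors_per_active_machine(errors_today, last_report_day, today, window=90):
    """Errors on `today` divided by the number of recently reporting machines."""
    n = active_machines(last_report_day, today, window)
    if n == 0:
        raise ValueError("no machines reported within the window")
    return errors_today / n


# Hypothetical machines: "d" is still in use but has not crashed for 100 days.
last_report = {"a": 200, "b": 200, "c": 190, "d": 100}
print(errors_per_active_machine(2, last_report, today=200))  # counts a, b, c only
```

Because machine "d" has been quiet for longer than the window, it drops out of the denominator and the apparent rate rises from 2/4 to 2/3, even though "d" is exactly the kind of machine a bug fix would produce.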

Evan (ev)
Changed in errors:
importance: Undecided → Low
status: New → Confirmed
Changed in whoopsie:
status: New → Invalid