Machine weighted at 100% 89 days after last report, 0% 90 days after
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Daisy | Confirmed | High | Unassigned |
Errors | Confirmed | High | Unassigned |
Bug Description
When calculating the daily error rate, we count a machine as 100% of a machine for 89 days after its most recent report. Then 90 days after, we suddenly count it as 0% of a machine. That's ... a little bit coarse.
We might have enough data now to calculate the probability P, t days after a machine's last report, that the machine is still active, meaning that it could still report an error (on any Ubuntu version) in the future. I don't know what this curve will look like, but I predict it will look like exponential decay:
P = e ^ (–λ t)
Regardless of whether it is exponential decay, we should weight each machine by the historical probability that it is still active. We should drop its weight to zero for an Ubuntu version only if we get a report from the same machine running a different Ubuntu version. (For performance reasons, we may also want to drop a machine's weight to zero once it falls below 0.01 or so.)
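As a rough illustration of the weighting described above, here is a minimal sketch that assumes the curve really is exponential decay. The decay constant `DECAY_RATE` is hypothetical (the report gives no value for λ; it would have to be fitted from historical data), and `machine_weight` is a name invented for this example. The 0.01 cutoff is the one suggested in the description.

```python
import math

# Hypothetical decay constant: the report does not give a value for lambda.
# Chosen here only so that the weight halves every 30 days.
DECAY_RATE = math.log(2) / 30

# Cutoff suggested in the description ("0.01 or so").
MIN_WEIGHT = 0.01


def machine_weight(days_since_last_report, decay_rate=DECAY_RATE,
                   min_weight=MIN_WEIGHT):
    """Return the weight of a machine t days after its last report."""
    if days_since_last_report < 0:
        raise ValueError("days_since_last_report must be non-negative")
    weight = math.exp(-decay_rate * days_since_last_report)
    # Treat very small weights as zero, for performance.
    return weight if weight >= min_weight else 0.0
```

With this shape, the weight at 89 days and at 90 days differs only slightly, instead of jumping from 100% to 0%. If the measured curve turns out not to be exponential, the same function could look the weight up from a table of historical probabilities instead.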
Each time we get another report from a machine, we may also want to back-weight it to 1.0 for all the days since its previous report. But that would be expensive, and I'm not sure it's actually correct. If it was never turned on for most of those days, we may have been right to down-weight it.
Changed in daisy:
importance: Undecided → Medium
status: New → Confirmed
Changed in errors:
importance: Undecided → High
status: New → Confirmed
Changed in daisy:
importance: Medium → High
My initial thoughts follow:
A system with the identifier a378 reports an error for Ubuntu 12.04. We attempt
to find the first time this system reported an error for Ubuntu 12.04 by
looking it up in the FirstError Column Family (new). If it cannot be found, we
know that this error is the first one for this system in Ubuntu 12.04.
firsterror = pycassa.ColumnFamily(pool, 'FirstError')
try:
    first_error_date = firsterror.get('Ubuntu 12.04', columns=['a378'])
    first_error_date = first_error_date.values()[0]
except NotFoundException:
    first_error_date = datetime.datetime.today()
    firsterror.insert('Ubuntu 12.04', {'a378': first_error_date})
FirstError
+--------------+----------+
| Ubuntu 12.04 | a378     |
|              +----------+
|              | 20130310 |
+--------------+----------+
We then add a new column to the ErrorsByRelease column family (new) in the row for
the composite of the Ubuntu release and current date. The name of this column
is a version 1 UUID (TimeUUID), which makes it unique to the row. The value of
the column is the date of the first time we saw an error from this system for
this Ubuntu release. The combination of this column name and value signifies an
error for Ubuntu 12.04 for the current calendar day:
today = datetime.datetime.today()
errorsbyrelease = pycassa.ColumnFamily(pool, 'ErrorsByRelease')
errorsbyrelease.insert(('Ubuntu 12.04', today), {uuid.uuid1(): first_error_date})
ErrorsByRelease
+--------------------------+--------------------------------------+
| (Ubuntu 12.04, 20130312) | 09e59900-88ec-11e2-967a-080027b22898 |
|                          +--------------------------------------+
|                          | 20130310                             |
+--------------------------+--------------------------------------+
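To make the two-step write easier to follow without a Cassandra cluster, here is a small in-memory sketch of the same idea. Plain dicts stand in for the FirstError and ErrorsByRelease column families, and `record_error` is a hypothetical helper named for this example only; it is not part of pycassa or of the proposed implementation.

```python
import datetime
import uuid

# Plain dicts standing in for the two column families,
# shaped as {row_key: {column_name: value}}.
first_error = {}
errors_by_release = {}


def record_error(release, system_id, today):
    """Record one error report and return the system's first-error date."""
    # Look up the first date this system reported an error for the release,
    # storing today's date if there is no entry yet.
    first_date = first_error.setdefault(release, {}).setdefault(system_id, today)
    # One column per error, named by a version 1 UUID, in the (release, date) row.
    row = errors_by_release.setdefault((release, today), {})
    row[uuid.uuid1()] = first_date
    return first_date


# The same system reports on two different days.
first = record_error('Ubuntu 12.04', 'a378', datetime.date(2013, 3, 10))
second = record_error('Ubuntu 12.04', 'a378', datetime.date(2013, 3, 12))
```

Both calls return 10 March 2013, so the column written for 12 March carries the system's first-error date as its value, as in the table above.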
(Since the column names are TimeUUIDs, we don't technically have to include the
date in the row key. We can just get a column slice of the days that we're
looking for. However, 2 billion errors (the column limit) could potentially
occur within a single release, so we use the date to shard the data and prevent
running out of space in the row.)
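Because the rows are sharded by date, reading a window of days means visiting one row key per day. The sketch below generates those keys. It assumes the date part of the key is formatted as YYYYMMDD, as in the table above; `row_keys` and its `days` parameter are hypothetical names used only for illustration.

```python
import datetime


def row_keys(release, last_day, days=90):
    """Yield one (release, YYYYMMDD) row key per day, oldest first,
    for the `days` days ending at last_day."""
    for offset in range(days - 1, -1, -1):
        day = last_day - datetime.timedelta(days=offset)
        yield (release, day.strftime('%Y%m%d'))


# Row keys for the three days ending on 12 March 2013.
keys = list(row_keys('Ubuntu 12.04', datetime.date(2013, 3, 12), days=3))
```

Each key can then be used to fetch that day's columns, so no single row has to hold every error for the release.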
Now, when we want to calculate the average errors per calendar day for Ubuntu
12.04, we take a range of dates (say, 90) leading up to yesterday and iterate
over them. For each date, we get each column for the row that matches the
composite of Ubuntu 12.04 and that date. Each one of these columns represents
an error for that date. For each one, we take the value, which is the date on
which the reporting system first reported an error for Ubuntu 12.04. We take
the difference in days between yesterday and that first error date and divide
it by 90. If the difference is greater than 90 days, we use 1 for the output
value. We add this value into a running total, which will be the
weighted number of errors for that calendar day, to be divided by the number of
unique systems over a 90 day period for that calendar day:
total = 0
one_day = datetime.timedelta(days=1)
yesterday = datetime.datetime.today() - one_day
working_date = yesterday - datetime.timedelta(days=90)
while working_date <= yesterday:
    for error in errorsbyrelease.xget(...