Machine weighted at 100% 89 days after last report, 0% 90 days after

Bug #1077122 reported by Matthew Paul Thomas on 2012-11-09
This bug affects 1 person
Affects  Importance  Assigned to
Daisy    High        Unassigned
Errors   High        Unassigned

Bug Description

When calculating the daily error rate, we count a machine as 100% of a machine for 89 days after its most recent report. Then 90 days after, we suddenly count it as 0% of a machine. That's ... a little bit coarse.

We might have enough data now to calculate the probability P that, t days after a machine's last report, it is still active and will ever report an error (on any Ubuntu version) in the future. I don't know what this curve will look like, but I predict it will look like exponential decay:

    P = e ^ (–λ t)

Regardless of whether it is exponential decay, we should weight each machine by the historical probability that it is still active. We should drop its weight to zero for an Ubuntu version only if we get a report from the same machine running a different Ubuntu version. (For performance reasons, we may also want to drop a machine's weight to zero once it falls below 0.01 or so.)
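
A minimal sketch of the proposed weighting, assuming the curve really is exponential. The decay rate `LAMBDA` is a made-up placeholder (it would have to be fitted from the historical data), and the function names are hypothetical, not part of Daisy or Errors. The step function is included only to contrast with the current behaviour.

```python
import math

# Assumed decay rate per day; a placeholder, not a measured value.
LAMBDA = 0.03
# Below this weight a machine is treated as inactive (the "0.01 or so" cutoff).
MIN_WEIGHT = 0.01


def step_weight(days_since_report):
    """Current behaviour: full weight for 89 days, then nothing."""
    return 1.0 if days_since_report < 90 else 0.0


def decay_weight(days_since_report, lam=LAMBDA, min_weight=MIN_WEIGHT):
    """Proposed behaviour: P = e^(-lambda * t), cut to zero once negligible."""
    weight = math.exp(-lam * days_since_report)
    return weight if weight >= min_weight else 0.0


def weighted_machine_count(days_since_report_per_machine, weight=decay_weight):
    """Sum of weights: the denominator of the daily error rate."""
    return sum(weight(t) for t in days_since_report_per_machine)
```

With something like this, the daily error rate would be the day's error count divided by `weighted_machine_count(...)`, so a machine fades out of the denominator gradually instead of vanishing on day 90.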

Each time we get another report from a machine, we may also want to back-weight it to 1.0 for all the days since its previous report. But that would be expensive, and I'm not sure it's actually correct. If it was never turned on for most of those days, we may have been right to down-weight it.

Evan (ev) on 2012-11-28
Changed in daisy:
importance: Undecided → Medium
status: New → Confirmed
Changed in errors:
importance: Undecided → High
status: New → Confirmed
Changed in daisy:
importance: Medium → High
Evan (ev) wrote :

My initial thoughts follow:

A system with the identifier a378 reports an error for Ubuntu 12.04. We attempt
to find the first time this system reported an error for Ubuntu 12.04 by
looking it up in the FirstError Column Family (new). If it cannot be found, we
know that this error is the first one for this system in Ubuntu 12.04.

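import datetime
import pycassa
from pycassa import NotFoundException

# `pool` is an existing pycassa ConnectionPool.
firsterror = pycassa.ColumnFamily(pool, 'FirstError')

try:
    # Row key is the release; the column name is the system identifier.
    first_error_date = firsterror.get('Ubuntu 12.04', columns=['a378'])
    first_error_date = first_error_date.values()[0]
except NotFoundException:
    # First error from this system for this release: record today's date.
    first_error_date = datetime.datetime.today()
    firsterror.insert('Ubuntu 12.04', {'a378': first_error_date})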

FirstError
+--------------+----------+
| Ubuntu 12.04 | a378     |
|              +----------+
|              | 20130310 |
+--------------+----------+

We then add a new column to the ErrorsByRelease column family (new) in the row for
the composite of the Ubuntu release and current date. The name of this column
is a version 1 UUID (TimeUUID), which makes it unique to the row. The value of
the column is the date of the first time we saw an error from this system for
this Ubuntu release. The combination of this column name and value signifies an
error for Ubuntu 12.04 for the current calendar day:

import uuid

today = datetime.datetime.today()
errorsbyrelease = pycassa.ColumnFamily(pool, 'ErrorsByRelease')
# Row key: (release, date); column name: TimeUUID; value: first error date.
errorsbyrelease.insert(('Ubuntu 12.04', today), {uuid.uuid1(): first_error_date})

ErrorsByRelease
+--------------------------+--------------------------------------+
| (Ubuntu 12.04, 20130312) | 09e59900-88ec-11e2-967a-080027b22898 |
|                          +--------------------------------------+
|                          | 20130310                             |
+--------------------------+--------------------------------------+

(Since the column names are TimeUUIDs, we don't technically have to include the
date in the row key. We can just get a column slice of the days that we're
looking for. However, 2 billion errors (the column limit) could potentially
occur within a single release, so we use the date to shard the data and prevent
running out of space in the row.)

Now, when we want to calculate the average errors per calendar day for Ubuntu
12.04, we take a range of dates (say, 90) leading up to yesterday and iterate
over them. For each date, we get every column in the row that matches the
composite of Ubuntu 12.04 and that date. Each of these columns represents
an error for that date. For each one, we take the value, which is the date on
which the reporting system first reported an error for Ubuntu 12.04. We take
the difference in days between yesterday and the first error date and divide
it by 90. If the difference is greater than 90 days, we use 1 for the output
value. We add this value to a running total, which will be the weighted number
of errors for that calendar day, to be divided by the number of unique systems
over a 90 day period for that calendar day:

total = 0
one_day = datetime.timedelta(days=1)
yesterday = datetime.datetime.today() - one_day
working_date = yesterday - datetime.timedelta(days=90)

while working_date <= yesterday:
    for error in errorsbyrelease.xget(...
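
Since the loop above is cut off, here is a stand-alone sketch of just the per-error weighting described in the preceding paragraph, with no Cassandra access. The function name `ramp_weight`, the clamp for dates in the future, and the sample dates are all invented for illustration; this is not the code that was actually deployed.

```python
import datetime

WINDOW_DAYS = 90


def ramp_weight(first_error_date, yesterday, window=WINDOW_DAYS):
    """Weight one error by how long its system has been known.

    The weight grows linearly from 0 to 1 over `window` days since the
    system's first error for this release, and is capped at 1 afterwards.
    Negative differences (a first error dated after `yesterday`) are
    clamped to 0; that guard is an assumption of this sketch.
    """
    days_known = (yesterday - first_error_date).days
    return min(max(days_known, 0) / float(window), 1.0)


yesterday = datetime.date(2013, 3, 12)
first_error_dates = [
    datetime.date(2013, 3, 10),   # first seen 2 days before
    datetime.date(2012, 12, 12),  # first seen 90 days before
    datetime.date(2012, 1, 1),    # first seen more than 90 days before
]
# Running total of weighted errors for one calendar day.
total = sum(ramp_weight(d, yesterday) for d in first_error_dates)
```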


Evan (ev) wrote :

Now that FirstError and ErrorsByRelease are populated and continuously being written to, I've made a first pass at using the data:

http://jsfiddle.net/Ra4xT/1/

Matthew Paul Thomas (mpt) wrote :

This bug has nothing to do with when a machine reported its first error, or what series it reports errors from, so FirstError and ErrorsByRelease wouldn't help. Perhaps you have this confused with bug 1069827? :-)
