Comment 1 for bug 1077122

Evan (ev) wrote:

My initial thoughts follow:

A system with the identifier a378 reports an error for Ubuntu 12.04. We attempt
to find the first time this system reported an error for Ubuntu 12.04 by
looking it up in the new FirstError column family. If no such record exists, we
know that this error is the first one for this system on Ubuntu 12.04.

import datetime

import pycassa
from pycassa import NotFoundException

# pool is an existing pycassa.ConnectionPool connected to the keyspace.
firsterror = pycassa.ColumnFamily(pool, 'FirstError')

try:
    # The row key is the release; the column name is the system identifier.
    first_error_date = firsterror.get('Ubuntu 12.04', columns=['a378'])
    first_error_date = first_error_date.values()[0]
except NotFoundException:
    # This is the first error we have seen from this system for this release.
    first_error_date = datetime.datetime.today()
    firsterror.insert('Ubuntu 12.04', {'a378': first_error_date})

FirstError
+--------------+----------+
| Ubuntu 12.04 | a378     |
|              +----------+
|              | 20130310 |
+--------------+----------+

We then add a column to the new ErrorsByRelease column family, in the row keyed
by the composite of the Ubuntu release and the current date. The name of this
column is a version 1 UUID (TimeUUID), which makes it unique within the row.
The value of the column is the date of the first time we saw an error from this
system for this Ubuntu release. Together, the column name and value signify an
error for Ubuntu 12.04 on the current calendar day:

import uuid

today = datetime.datetime.today()
errorsbyrelease = pycassa.ColumnFamily(pool, 'ErrorsByRelease')
# TimeUUID column name; the value is this system's first error date for the release.
errorsbyrelease.insert(('Ubuntu 12.04', today), {uuid.uuid1(): first_error_date})

ErrorsByRelease
+--------------------------+--------------------------------------+
| (Ubuntu 12.04, 20130312) | 09e59900-88ec-11e2-967a-080027b22898 |
|                          +--------------------------------------+
|                          | 20130310                             |
+--------------------------+--------------------------------------+
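
For reference, the two new column families might be created along these lines.
This is a rough sketch using pycassa's SystemManager; the keyspace name
('crashdb'), the server address, and the validation classes are assumptions
rather than settled schema:

from pycassa.system_manager import SystemManager, UTF8_TYPE, TIME_UUID_TYPE
from pycassa.types import CompositeType, UTF8Type, DateType

sys_mgr = SystemManager('localhost:9160')

# FirstError: release -> {system identifier: date of first error}
sys_mgr.create_column_family('crashdb', 'FirstError',
                             key_validation_class=UTF8_TYPE,
                             comparator_type=UTF8_TYPE,
                             default_validation_class=DateType())

# ErrorsByRelease: (release, day) -> {TimeUUID: date of first error}
sys_mgr.create_column_family('crashdb', 'ErrorsByRelease',
                             key_validation_class=CompositeType(UTF8Type(), DateType()),
                             comparator_type=TIME_UUID_TYPE,
                             default_validation_class=DateType())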

(Since the column names are TimeUUIDs, we don't technically have to include the
date in the row key: we could keep a single row per release and take a column
slice of the days we're looking for, as sketched below. However, 2 billion
columns is the per-row limit, and a single release could plausibly see that many
errors, so we use the date to shard the data and avoid running out of room in
the row.)
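
For illustration, that single-row alternative would look roughly like the
following. This is only a sketch: it assumes one row per release with TimeUUID
column names, which is exactly the layout the date sharding avoids.

from pycassa.util import convert_time_to_uuid

day_start = datetime.datetime(2013, 3, 12)
day_end = day_start + datetime.timedelta(days=1)

# Slice one calendar day's worth of TimeUUID columns out of a single
# per-release row.
errors = errorsbyrelease.get('Ubuntu 12.04',
                             column_start=convert_time_to_uuid(day_start, lowest_val=True),
                             column_finish=convert_time_to_uuid(day_end, lowest_val=False),
                             column_count=10000)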

Now, when we want to calculate the average errors per calendar day for Ubuntu
12.04, we take a range of dates (say, the 90 days leading up to yesterday) and
iterate over them. For each date, we fetch every column in the row keyed by the
composite of Ubuntu 12.04 and that date; each of these columns represents an
error reported on that date. For each column we take its value, which is the
date on which the reporting system first reported an error for Ubuntu 12.04.
We take the difference in days between yesterday and that first error date, cap
it at 90, and divide by 90, so a system we have known about for 90 days or more
contributes 1. We add this value into a running total, which will be the
weighted number of errors for that calendar day, to be divided by the number of
unique systems seen over a 90 day period for that calendar day:

total = 0
one_day = datetime.timedelta(days=1)
yesterday = datetime.datetime.today() - one_day
working_date = yesterday - datetime.timedelta(days=90)

while working_date <= yesterday:
    # Each column in this row is one error reported on working_date; its
    # value is the date of that system's first error for this release.
    for error in errorsbyrelease.xget(('Ubuntu 12.04', working_date)):
        first_error_date = error[1]
        day_difference = (yesterday - first_error_date).days
        # Weight the error by how long we have known about the system,
        # capped at 1 (note the float division).
        total += min(day_difference, 90) / 90.0
    working_date += one_day
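
The division by unique systems isn't shown above. As a rough sketch, assuming a
hypothetical unique_systems() helper that returns the number of unique systems
seen in the 90 days leading up to the given date (the column family backing that
count isn't covered here):

# unique_systems() is a hypothetical helper returning the number of unique
# systems that reported errors in the 90 days leading up to the given date.
systems = unique_systems('Ubuntu 12.04', yesterday)
average_errors = total / float(systems) if systems else 0.0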

We can back-populate data into these two new column families by iterating over
the OOPS column family, taking the SystemIdentifier and DistroRelease columns
with include_timestamp set to True:

oops = pycassa.ColumnFamily(pool, 'OOPS')

for key, data in oops.get_range(columns=['SystemIdentifier', 'DistroRelease'], include_timestamp=True):
    try:
        # We cannot use the SystemIdentifier timestamp, as that column may have
        # been added long after the original report was written.
        release, timestamp = data['DistroRelease']
        identifier = data['SystemIdentifier'][0]
    except KeyError:
        continue

    # Cassandra column timestamps are microseconds since the epoch.
    timestamp = datetime.datetime.fromtimestamp(timestamp / 1000000)
    try:
        first_error_date = firsterror.get(release, columns=[identifier]).values()[0]
    except NotFoundException:
        first_error_date = None
    # Only write if this report predates what we have already recorded.
    if not first_error_date or timestamp < first_error_date:
        firsterror.insert(release, {identifier: timestamp})

This will be slow, as we need to look up a column from FirstError for each row
in OOPS, and random reads are not particularly quick in Cassandra. We can
mitigate this somewhat by maintaining a cache of identifier-to-first-error-date
mappings, limiting its size by only writing to it while len(cache) < n.
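
A minimal sketch of that bounded cache, usable in place of the direct
firsterror.get() call in the loop above (the limit n and the no-eviction
behaviour are assumptions; a real implementation might prefer an LRU):

cache = {}
CACHE_LIMIT = 1000000  # n: an assumed bound on the cache size

def cached_first_error_date(release, identifier):
    # Check the in-memory cache before issuing a random read to Cassandra.
    key = (release, identifier)
    if key in cache:
        return cache[key]
    try:
        value = firsterror.get(release, columns=[identifier]).values()[0]
    except NotFoundException:
        value = None
    # Only grow the cache while it is under the size limit; this sketch
    # never evicts entries.
    if len(cache) < CACHE_LIMIT:
        cache[key] = value
    return value

After a firsterror.insert(), the corresponding cache entry would also need
updating so that later OOPS rows see the earlier date.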