My initial thoughts follow.

A system with the identifier a378 reports an error for Ubuntu 12.04. We attempt to find the first time this system reported an error for Ubuntu 12.04 by looking it up in the FirstError column family (new). If it cannot be found, we know that this error is the first one for this system on Ubuntu 12.04:

    import datetime

    import pycassa
    from pycassa import NotFoundException

    # 'pool' is assumed to be an existing pycassa.ConnectionPool.
    firsterror = pycassa.ColumnFamily(pool, 'FirstError')
    try:
        first_error_date = firsterror.get('Ubuntu 12.04', columns=['a378'])
        first_error_date = first_error_date.values()[0]
    except NotFoundException:
        first_error_date = datetime.datetime.today()
        firsterror.insert('Ubuntu 12.04', {'a378': first_error_date})

    FirstError
    +--------------+----------+
    | Ubuntu 12.04 | a378     |
    |              +----------+
    |              | 20130310 |
    +--------------+----------+

We then add a new column to the ErrorsByRelease column family (new) in the row keyed by the composite of the Ubuntu release and the current date. The name of this column is a version 1 UUID (TimeUUID), which makes it unique within the row. The value of the column is the date we first saw an error from this system for this Ubuntu release. The combination of this column name and value signifies an error for Ubuntu 12.04 on the current calendar day:

    import uuid

    today = datetime.datetime.today()
    errorsbyrelease = pycassa.ColumnFamily(pool, 'ErrorsByRelease')
    errorsbyrelease.insert(('Ubuntu 12.04', today),
                           {uuid.uuid1(): first_error_date})

    ErrorsByRelease
    +--------------------------+--------------------------------------+
    | (Ubuntu 12.04, 20130312) | 09e59900-88ec-11e2-967a-080027b22898 |
    |                          +--------------------------------------+
    |                          | 20130310                             |
    +--------------------------+--------------------------------------+

(Since the column names are TimeUUIDs, we don't technically have to include the date in the row key; we could just take a column slice covering the days we're interested in. However, 2 billion errors (the per-row column limit) could potentially occur within a single release, so we use the date to shard the data and avoid running out of space in the row.)

Now, when we want to calculate the average errors per calendar day for Ubuntu 12.04, we take a range of dates leading up to yesterday (say, 90 days) and iterate over them. For each date, we get every column in the row keyed by the composite of Ubuntu 12.04 and that date. Each of these columns represents a single error on that date. For each one, we take the value, which is the date on which the system that reported this error first reported an error for Ubuntu 12.04. We take the difference in days between yesterday and that first error date and divide it by 90; if the difference is greater than 90 days, we cap the value at 1.
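To make the weighting concrete, here is a minimal sketch of it as a standalone function (the name error_weight and the window parameter are mine, for illustration only):

    def error_weight(first_error_date, yesterday, window=90):
        # A system first seen 30 days before yesterday contributes
        # 30/90 (about 0.33); anything at or beyond the window is
        # capped at 1.0, and a system first seen yesterday contributes 0.
        days = (yesterday - first_error_date).days
        return min(days, window) / float(window)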
We add this value into a running total, which becomes the weighted number of errors for that calendar day, to be divided by the number of unique systems seen over the 90-day period ending on that calendar day:

    total = 0.0
    one_day = datetime.timedelta(days=1)
    yesterday = datetime.datetime.today() - one_day
    working_date = yesterday - datetime.timedelta(days=90)
    while working_date <= yesterday:
        for error in errorsbyrelease.xget(('Ubuntu 12.04', working_date)):
            first_error_date = error[1]
            day_difference = (yesterday - first_error_date).days
            # Use float division; integer division would truncate every
            # weight below 1.0 down to zero.
            total += min(day_difference, 90) / 90.0
        working_date += one_day

We can back-populate data into these two new column families by iterating over the OOPS column family, taking the SystemIdentifier and DistroRelease columns with include_timestamp set to True:

    for key, data in oops.get_range(columns=['SystemIdentifier',
                                             'DistroRelease'],
                                    include_timestamp=True):
        try:
            # We cannot use the SystemIdentifier timestamp, as that column
            # may have been added long after the original report was written.
            release, timestamp = data['DistroRelease']
            identifier = data['SystemIdentifier'][0]
        except KeyError:
            continue
        # Cassandra column timestamps are microseconds since the epoch.
        timestamp = datetime.datetime.fromtimestamp(timestamp / 1000000)
        try:
            first_error_date = firsterror.get(
                release, columns=[identifier]).values()[0]
        except NotFoundException:
            first_error_date = None
        if not first_error_date or timestamp < first_error_date:
            firsterror.insert(release, {identifier: timestamp})

This will be slow, as we need to look up a column from FirstError for each row in OOPS, and random reads are not particularly quick in Cassandra. We can mitigate this somewhat by maintaining a cache that maps identifiers to first error dates. We can bound the size of this cache by only writing to it while len(cache) < n.
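As a sketch of that mitigation (the cache layout, the helper name, and the limit value are illustrative assumptions, not part of the design above):

    # Illustrative only: a dict mapping (release, identifier) to the first
    # error date, bounded at CACHE_LIMIT entries so it cannot grow without
    # limit during the back-population run.
    CACHE_LIMIT = 1000000  # the "n" from the text; the value is a tuning choice
    cache = {}

    def cached_first_error(release, identifier):
        key = (release, identifier)
        if key in cache:
            return cache[key]
        try:
            value = firsterror.get(release, columns=[identifier]).values()[0]
        except NotFoundException:
            value = None
        if len(cache) < CACHE_LIMIT:
            cache[key] = value
        return value

Note that when the loop writes an earlier timestamp into FirstError, it must also update the corresponding cache entry, or later rows for the same system will compare against a stale date.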