here is how I would tackle it:
- Implement TPUT as an efficient way of munging together the scores for a number of buckets.
- Pick some time range where you can feasibly scan and process in-memory - e.g. 10K submissions, which at 1M/day would be about 15 minutes' worth.
- Have a background task chunk things up into buckets half that size - e.g. ~7 minutes' worth - and when there are 100 (or some other number, determined by testing) buckets of that size, move to a larger bucket size, eventually settling onto 24-hour buckets.
- each bucket stores signature:count-in-bucket pairs
- each bucket is written by a single process (no races) and is a sorted column family, so you can do range queries within the bucket.
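The rollup step above could look roughly like this - a minimal sketch, assuming each bucket is just an in-memory mapping of signature to count (names like `rollup` and `merge_factor` are illustrative, not an existing API):

```python
from collections import Counter

def rollup(small_buckets, merge_factor=2):
    """Merge runs of adjacent small buckets into larger ones by
    summing per-signature counts. The background task would call
    this once ~100 buckets of the current size have accumulated."""
    merged = []
    for i in range(0, len(small_buckets), merge_factor):
        big = Counter()
        for bucket in small_buckets[i:i + merge_factor]:
            big.update(bucket)  # sums counts for shared signatures
        merged.append(dict(big))
    return merged
```

Because buckets cover adjacent time ranges and are written by a single process, merging neighbours preserves the time-ordering of the bucket sequence.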
Generate top-k over an arbitrary range of buckets by applying TPUT to aggregate values from the buckets without paging in all signatures from all buckets: you'll process the 'now' data by hand.
The http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf paper is good - TPUT looks like it is trivially adaptable to our needs.
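To make the aggregation step concrete, here is a minimal sketch of three-phase uniform threshold (TPUT) with each time bucket playing the role of a node, assuming buckets are dicts of signature -> count; this is an illustrative toy, not the paper's exact pseudocode:

```python
from collections import defaultdict

def tput_topk(buckets, k):
    """TPUT top-k across m buckets without paging in every signature.
    Phase 1 fetches each bucket's local top-k; phase 2 pulls all items
    above a uniform threshold and prunes by upper bound; phase 3 does
    exact lookups only for the surviving candidates."""
    m = len(buckets)

    # Phase 1: local top-k from each bucket; k-th partial sum is a
    # lower bound (tau1) on the true k-th total.
    partial = defaultdict(int)
    for b in buckets:
        for sig, cnt in sorted(b.items(), key=lambda kv: -kv[1])[:k]:
            partial[sig] += cnt
    tau1 = sorted(partial.values(), reverse=True)[:k][-1] if len(partial) >= k else 0

    # Phase 2: any signature totalling >= tau1 must exceed tau1/m in
    # at least one bucket, so a uniform threshold T = tau1/m is safe.
    T = tau1 / m if m else 0
    partial2 = defaultdict(int)
    seen_in = defaultdict(set)
    for i, b in enumerate(buckets):
        for sig, cnt in b.items():
            if cnt >= T:
                partial2[sig] += cnt
                seen_in[sig].add(i)
    tau2 = sorted(partial2.values(), reverse=True)[:k][-1] if len(partial2) >= k else 0
    # Unseen buckets contribute < T each, bounding each total from above.
    candidates = {sig for sig, s in partial2.items()
                  if s + T * (m - len(seen_in[sig])) >= tau2}

    # Phase 3: exact totals for the (small) candidate set only.
    exact = {sig: sum(b.get(sig, 0) for b in buckets) for sig in candidates}
    return sorted(exact.items(), key=lambda kv: -kv[1])[:k]
```

The point is that phases 1 and 2 touch only high-count signatures per bucket, which maps well onto sorted column families where the per-bucket counts can be range-scanned in descending order.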