Comment 2 for bug 1052954

Robert Collins (lifeless) wrote :

The http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf paper is good - TPUT looks like it is trivially adaptable to our needs.

here is how I would tackle it:
 - Implement TPUT as an efficient way of munging together the scores for a number of buckets.
 - Pick some time range you can feasibly scan and process in memory - e.g. 10K submissions, which at 1M/day would be about 15 minutes' worth.
 - Have a background task chunk things up into buckets half that size - e.g. 7 minutes' worth - and when there are 100 (or some other number, determined by testing) buckets of that size, roll up to a larger bucket size, eventually settling on 24-hour buckets.
 - each bucket stores signature:count-in-bucket entries
 - buckets are written to by one process, no race, and are sorted column families so you can do range queries within the bucket.
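The roll-up step above could be sketched roughly as follows. This is only an illustration, assuming buckets are plain signature → count maps held in memory; the fan-in of 100 is the number mentioned above and would be tuned by testing, and `rollup` is a hypothetical name, not an existing function:

```python
from collections import Counter

ROLLUP_FANIN = 100  # merge when this many small buckets accumulate; tune by testing

def rollup(small_buckets):
    """Merge several signature -> count buckets into one coarser bucket.

    Counts for the same signature are summed, so e.g. merging the
    7-minute buckets for a day yields one 24-hour bucket.
    """
    merged = Counter()
    for bucket in small_buckets:
        merged.update(bucket)  # Counter.update adds counts, not replaces
    return merged
```

Because only the single background task writes buckets, the merge needs no locking; readers only ever see fully written buckets.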

Generate top-k over an arbitrary range of buckets by applying TPUT to aggregate values from the buckets without paging in all signatures from all buckets; you'll process the 'now' data (submissions not yet rolled into a bucket) by hand.
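To make the TPUT idea concrete, here is a minimal in-memory sketch of its three phases over a set of buckets, treating each bucket as one "node" in the paper's terms. The function name and bucket representation (plain dicts) are illustrative assumptions, not an existing API; a real implementation would issue range queries against the column families instead of iterating dicts:

```python
from collections import Counter
from heapq import nlargest

def tput_topk(buckets, k):
    """Exact top-k signatures across buckets via the three TPUT phases.

    buckets: list of mappings signature -> count-in-bucket.
    Only phase 3 touches exact counts, and only for surviving candidates,
    so most signatures in most buckets are never paged in.
    """
    n = len(buckets)

    # Phase 1: each bucket reports its local top-k. The k-th highest
    # partial sum tau1 lower-bounds the true k-th total; any signature
    # scoring below T = tau1/n everywhere cannot reach tau1 in total.
    partial = Counter()
    for b in buckets:
        for sig, cnt in nlargest(k, b.items(), key=lambda kv: kv[1]):
            partial[sig] += cnt
    tau1 = nlargest(k, partial.values())[-1] if len(partial) >= k else 0
    T = tau1 / n

    # Phase 2: pull every signature with local count >= T, recompute
    # partial sums, and prune by upper bound: a bucket that did not
    # report a signature can contribute at most T to it.
    partial = Counter()
    seen = [set() for _ in buckets]
    for i, b in enumerate(buckets):
        for sig, cnt in b.items():
            if cnt >= T:
                partial[sig] += cnt
                seen[i].add(sig)
    tau2 = nlargest(k, partial.values())[-1] if len(partial) >= k else 0
    candidates = [
        sig for sig, s in partial.items()
        if s + T * sum(1 for i in range(n) if sig not in seen[i]) >= tau2
    ]

    # Phase 3: fetch exact counts for the surviving candidates only.
    exact = {sig: sum(b.get(sig, 0) for b in buckets) for sig in candidates}
    return nlargest(k, exact.items(), key=lambda kv: kv[1])
```

The 'now' data that hasn't been bucketed yet could simply be folded in as one extra pseudo-bucket before running the phases.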