here is how I would tackle it:
- Implement TPUT as an efficient way of munging together the scores for a number of buckets.
- Pick some time range where you can feasibly scan and process in-memory - e.g. 10K submissions, which at 1M/day would be about 15 minutes' worth.
- Have a background task chunk things up into buckets half that size - e.g. ~7 minutes' worth - and when there are 100 (or some other number, determined by testing) buckets of that size, move to a larger bucket size, eventually settling onto 24-hour buckets.
- each bucket stores signature:count-in-bucket pairs
- each bucket is written by a single process (no races) and is a sorted column family, so you can do range queries within the bucket.
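The rollup step above could look roughly like this - a minimal sketch, assuming each bucket is just an in-memory mapping of signature to count (names like `rollup` and `merge_factor` are illustrative, not an existing API):

```python
from collections import Counter

def rollup(small_buckets, merge_factor=2):
    """Merge runs of adjacent small buckets into larger ones by
    summing per-signature counts. The background task would call
    this once ~100 buckets of the current size have accumulated."""
    merged = []
    for i in range(0, len(small_buckets), merge_factor):
        big = Counter()
        for bucket in small_buckets[i:i + merge_factor]:
            big.update(bucket)  # sums counts for shared signatures
        merged.append(dict(big))
    return merged
```

Because buckets cover adjacent time ranges and are written by a single process, merging neighbours preserves the time-ordering of the bucket sequence.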
Generate top-k over an arbitrary range of buckets by applying TPUT to aggregate values from the buckets without paging in all signatures from all buckets: you'll process the 'now' data by hand.
The http://www.cs.ucsb.edu/research/tech_reports/reports/2005-14.pdf paper is good - TPUT looks like it is trivially adaptable to our needs.
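To make the aggregation step concrete, here is a minimal sketch of three-phase uniform threshold (TPUT) with each time bucket playing the role of a node, assuming buckets are dicts of signature -> count; this is an illustrative toy, not the paper's exact pseudocode:

```python
from collections import defaultdict

def tput_topk(buckets, k):
    """TPUT top-k across m buckets without paging in every signature.
    Phase 1 fetches each bucket's local top-k; phase 2 pulls all items
    above a uniform threshold and prunes by upper bound; phase 3 does
    exact lookups only for the surviving candidates."""
    m = len(buckets)

    # Phase 1: local top-k from each bucket; k-th partial sum is a
    # lower bound (tau1) on the true k-th total.
    partial = defaultdict(int)
    for b in buckets:
        for sig, cnt in sorted(b.items(), key=lambda kv: -kv[1])[:k]:
            partial[sig] += cnt
    tau1 = sorted(partial.values(), reverse=True)[:k][-1] if len(partial) >= k else 0

    # Phase 2: any signature totalling >= tau1 must exceed tau1/m in
    # at least one bucket, so a uniform threshold T = tau1/m is safe.
    T = tau1 / m if m else 0
    partial2 = defaultdict(int)
    seen_in = defaultdict(set)
    for i, b in enumerate(buckets):
        for sig, cnt in b.items():
            if cnt >= T:
                partial2[sig] += cnt
                seen_in[sig].add(i)
    tau2 = sorted(partial2.values(), reverse=True)[:k][-1] if len(partial2) >= k else 0
    # Unseen buckets contribute < T each, bounding each total from above.
    candidates = {sig for sig, s in partial2.items()
                  if s + T * (m - len(seen_in[sig])) >= tau2}

    # Phase 3: exact totals for the (small) candidate set only.
    exact = {sig: sum(b.get(sig, 0) for b in buckets) for sig in candidates}
    return sorted(exact.items(), key=lambda kv: -kv[1])[:k]
```

The point is that phases 1 and 2 touch only high-count signatures per bucket, which maps well onto sorted column families where the per-bucket counts can be range-scanned in descending order.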