Comment 4 for bug 1091067

Evan (ev) wrote :

Yes. Any time you see "Retried 1 times", that's a counter increment that failed. By default, Cassandra does not retry counters, since incrementing a counter is not an idempotent operation.

We should resolve this by doing three things:

1. Never retry counters. The result is unpredictable and can over-count substantially whenever the Cassandra nodes come under heavy load (such as during a compaction).
2. Catch the exceptions and pass on them rather than retrying. We have pycassa wired to statsd, and it should be producing graphs for retries at http://graphite.engineering.canonical.com. If it's not, we need to implement that.
3. Any time we care about the accuracy of a counter, we should pair it with a column family that stores timeuuids (like the oops identifiers) or something else unique in a wide row, plus a cron job that counts the wide row and repairs the counter. See https://bugs.launchpad.net/daisy/+bug/1152206 for more details on this.
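
To make the three points above concrete, here is a rough sketch using an in-memory stand-in rather than pycassa. Everything in it (`FakeStore`, `record_event`, `repair_counter`, the `outcome` parameter) is hypothetical and made up for illustration; it is not the daisy code. It assumes the repair step can only add a delta, since Cassandra counters are incremented or decremented rather than set.

```python
import uuid


class FakeStore:
    """Hypothetical stand-in for a counter column family plus a wide row."""

    def __init__(self):
        self.counters = {}   # key -> int, plays the role of the counter
        self.wide_rows = {}  # key -> set of unique column names (timeuuids)

    def add(self, key, amount=1, outcome="ok"):
        # "lost": the write never reached a replica.
        if outcome == "lost":
            raise TimeoutError("write never applied")
        self.counters[key] = self.counters.get(key, 0) + amount
        # "applied_but_timed_out": the write landed but the ack did not,
        # which is the case where a retry would double count.
        if outcome == "applied_but_timed_out":
            raise TimeoutError("write applied, acknowledgement lost")

    def insert(self, key, column):
        # Idempotent: inserting the same column twice leaves one column.
        self.wide_rows.setdefault(key, set()).add(column)


def record_event(store, key, event_id, outcome="ok"):
    """Write the unique column, then increment once with no retry."""
    store.insert(key, event_id)
    try:
        store.add(key, 1, outcome=outcome)
    except TimeoutError:
        # Points 1 and 2: do not retry, just pass. Real code would
        # presumably bump a statsd metric here.
        pass


def repair_counter(store, key):
    """Point 3, the cron job: recount the wide row and apply the difference."""
    actual = len(store.wide_rows.get(key, ()))
    delta = actual - store.counters.get(key, 0)
    if delta:
        store.add(key, delta)
    return actual


store = FakeStore()
record_event(store, "bucket", uuid.uuid1(), outcome="ok")
record_event(store, "bucket", uuid.uuid1(), outcome="lost")
before_repair = store.counters["bucket"]
after_repair = repair_counter(store, "bucket")
```

Here the counter has drifted below the true count after the lost increment, and the repair pass brings it back in line with the two columns in the wide row. The same pass would correct an over-count, because the delta can be negative.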

eBay covered this approach a while back:

http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/