pycassa.pool.MaximumRetryException: Retried 1 times. Last failure was TTransportException: TSocket read 0 bytes

Bug #1091067 reported by Haw Loeung
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Daisy
Confirmed
High
Unassigned

Bug Description

Hi,

So another cause of retracers existing seems to be related to stability issues with one of the cassandra nodes (jumbee). Whenever it stops accepting connections, this happens:

2012-12-17 02:48:09,683:INFO:root:Decompressing to /srv/daisy.ubuntu.com/cores/dbdeb754-46dc-11e2-a584-e4115b0f8a4a.core
2012-12-17 02:48:10,794:INFO:root:Writing back to Cassandra
2012-12-17 02:48:10,806:INFO:pycassa.pool:Connection 47028496 (91.189.89.250:9160) in pool 47028112 failed: TSocket read 0 bytes
Traceback (most recent call last):
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 517, in <module>
    main()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 514, in main
    retracer.listen()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 146, in listen
    self.run_forever(channel, self.callback, queue=retrace)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 159, in run_forever
    channel.wait()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 117, in dispatch_method
    return amqp_method(self, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 2060, in _basic_deliver
    func(msg)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 372, in callback
    self.update_retrace_stats(release, day_key, retracing_time, True)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 193, in update_retrace_stats
    self.retrace_stats_fam.add(day_key, release + status)
  File "/usr/lib/pymodules/python2.7/pycassa/columnfamily.py", line 1025, in add
    allow_retries=self._allow_retries)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 577, in execute
    return getattr(conn, f)(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 146, in new_f
    (self._retry_count, exc.__class__.__name__, exc))
pycassa.pool.MaximumRetryException: Retried 1 times. Last failure was TTransportException: TSocket read 0 bytes

Any chance you could take a look into this?

Thanks,

Haw

Revision history for this message
Haw Loeung (hloeung) wrote :

From what I understand, it will only retry (max_retries) if it fails due to an TimedOutException or UnavailableException. In this case, it's failing due to TTransportException.

tags: added: canonical-webops-eng
Revision history for this message
Haw Loeung (hloeung) wrote :

We're also seeing this on the frontends. e.g. https://errors.ubuntu.com/oops-local/2013-01-18/03064.daisy.ubuntu.com7 :

Oops-Id: OOPS-2575daisy.ubuntu.com7
Exception-Type: MaximumRetryException
Exception-Value: Retried 1 times. Last failure was TTransportException: TSocket read 0 bytes
Date: 2013-01-18T00:51:04.569353+00:00
URL: http://daisy.ubuntu.com/541702cb678126520941d0f689f03060293e2d3c31bcd87f5bc2a4366c10967f61687c5bba302c026cb8973a835034258c30f27de59b758377743578a7d430c3

Any chance you could take a look?

Thanks,

Haw

Changed in daisy:
importance: Undecided → High
Revision history for this message
Brian Murray (brian-murray) wrote :

Looking at the errors from 2013-02-28 there are 163 occurrences of this TTransportException. It seems to happen when trying to update the counters column family for either the release: source_package counter or the MissingSAS counter.

This also happened before the branch adding MissingSAS and the release:source_package counters were added to daisy, if we look earlier in 2013-02 we can see it happening with KernelCrash.

2013-02-01/69917.daisy.ubuntu.com9: counters_fam.add('KernelCrash', day_key)
2013-02-01/73433.daisy.ubuntu.com11: counters_fam.add('KernelCrash', day_key)
2013-02-05/85220.daisy.ubuntu.com25: counters_fam.add('KernelCrash', day_key)
2013-02-05/52849.daisy.ubuntu.com38: counters_fam.add('KernelCrash', day_key)
...

I'd imagine it happens less frequently for kernel crashes since there are less of those.

Haw Loeung (hloeung)
Changed in daisy:
status: New → Confirmed
Revision history for this message
Evan (ev) wrote :

Yes. Anytime you see "Retried 1 times", that's a counter increment that failed. By default, Cassandra does not retry counters, since it is not an idempotent operation.

We should resolve this by doing three things:

1. Never retry counters. The result is unpredictable and can lead to over-counting by a lot whenever the Cassandra nodes come under heavy load (like during a compaction).
2. Catch the exceptions and pass on them. We have pycassa wired to statsd, and it should be producing graphs for retries at http://graphite.engineering.canonical.com. If it's not, we need to implement that.
3. Anytime we care about the accuracy of a counter, it should be matched with a column family that uses timeuuids (like the oops identifiers) or something else unique in a wide row. This should be matched with a cron job to count the wide row and repair the counter. See https://bugs.launchpad.net/daisy/+bug/1152206 for more details on this.

Ebay covered this approach while back:

http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.