Daisy

pycassa.pool.MaximumRetryException: Retried 1 times. Last failure was TTransportException: TSocket read 0 bytes

Bug #1091067 reported by Haw Loeung on 2012-12-17

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Daisy	Confirmed	High	Unassigned

Bug Description

Hi,

So another cause of retracers existing seems to be related to stability issues with one of the cassandra nodes (jumbee). Whenever it stops accepting connections, this happens:

2012-12-17 02:48:09,683:INFO:root:Decompressing to /srv/daisy.ubuntu.com/cores/dbdeb754-46dc-11e2-a584-e4115b0f8a4a.core
2012-12-17 02:48:10,794:INFO:root:Writing back to Cassandra
2012-12-17 02:48:10,806:INFO:pycassa.pool:Connection 47028496 (91.189.89.250:9160) in pool 47028112 failed: TSocket read 0 bytes
Traceback (most recent call last):
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 517, in <module>
    main()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 514, in main
    retracer.listen()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 146, in listen
    self.run_forever(channel, self.callback, queue=retrace)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 159, in run_forever
    channel.wait()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 117, in dispatch_method
    return amqp_method(self, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 2060, in _basic_deliver
    func(msg)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 372, in callback
    self.update_retrace_stats(release, day_key, retracing_time, True)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 193, in update_retrace_stats
    self.retrace_stats_fam.add(day_key, release + status)
  File "/usr/lib/pymodules/python2.7/pycassa/columnfamily.py", line 1025, in add
    allow_retries=self._allow_retries)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 577, in execute
    return getattr(conn, f)(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 146, in new_f
    (self._retry_count, exc.__class__.__name__, exc))
pycassa.pool.MaximumRetryException: Retried 1 times. Last failure was TTransportException: TSocket read 0 bytes

Any chance you could take a look into this?

Thanks,

Haw

Tags:

Revision history for this message

Haw Loeung (hloeung) wrote on 2012-12-17:

From what I understand, it will only retry (max_retries) if it fails due to an TimedOutException or UnavailableException. In this case, it's failing due to TTransportException.

tags:

added: canonical-webops-eng

Revision history for this message

Haw Loeung (hloeung) wrote on 2013-01-18:

We're also seeing this on the frontends. e.g. https://errors.ubuntu.com/oops-local/2013-01-18/03064.daisy.ubuntu.com7 :

Oops-Id: OOPS-2575daisy.ubuntu.com7
Exception-Type: MaximumRetryException
Exception-Value: Retried 1 times. Last failure was TTransportException: TSocket read 0 bytes
Date: 2013-01-18T00:51:04.569353+00:00
URL: http://daisy.ubuntu.com/541702cb678126520941d0f689f03060293e2d3c31bcd87f5bc2a4366c10967f61687c5bba302c026cb8973a835034258c30f27de59b758377743578a7d430c3

Any chance you could take a look?

Thanks,

Haw

Brian Murray (brian-murray) on 2013-03-05

Changed in daisy:
importance:	Undecided → High

Revision history for this message

Brian Murray (brian-murray) wrote on 2013-03-05:

Looking at the errors from 2013-02-28 there are 163 occurrences of this TTransportException. It seems to happen when trying to update the counters column family for either the release: source_package counter or the MissingSAS counter.

This also happened before the branch adding MissingSAS and the release:source_package counters were added to daisy, if we look earlier in 2013-02 we can see it happening with KernelCrash.

2013-02-01/69917.daisy.ubuntu.com9: counters_fam.add('KernelCrash', day_key)
2013-02-01/73433.daisy.ubuntu.com11: counters_fam.add('KernelCrash', day_key)
2013-02-05/85220.daisy.ubuntu.com25: counters_fam.add('KernelCrash', day_key)
2013-02-05/52849.daisy.ubuntu.com38: counters_fam.add('KernelCrash', day_key)
...

I'd imagine it happens less frequently for kernel crashes since there are less of those.

Haw Loeung (hloeung) on 2013-03-05

Changed in daisy:
status:	New → Confirmed

Revision history for this message

Evan (ev) wrote on 2013-03-29:

Yes. Anytime you see "Retried 1 times", that's a counter increment that failed. By default, Cassandra does not retry counters, since it is not an idempotent operation.

We should resolve this by doing three things:

1. Never retry counters. The result is unpredictable and can lead to over-counting by a lot whenever the Cassandra nodes come under heavy load (like during a compaction).
2. Catch the exceptions and pass on them. We have pycassa wired to statsd, and it should be producing graphs for retries at http://graphite.engineering.canonical.com. If it's not, we need to implement that.
3. Anytime we care about the accuracy of a counter, it should be matched with a column family that uses timeuuids (like the oops identifiers) or something else unique in a wide row. This should be matched with a cron job to count the wide row and repair the counter. See https://bugs.launchpad.net/daisy/+bug/1152206 for more details on this.

Ebay covered this approach while back:

http://www.ebaytechblog.com/2012/08/14/cassandra-data-modeling-best-practices-part-2/

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.