Ubuntu error tracker backend (daisy)

i386 retracers crashed

Reported by Haw Loeung on 2012-08-15
This bug affects 1 person
Affects: Daisy
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi,

We've just had two i386 retracers crash. The supervisor error logs showed the following:

i386-0:

Traceback (most recent call last):
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 364, in <module>
    main()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 361, in main
    retracer.listen()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 97, in listen
    self.run_forever(channel, self.callback, queue=retrace)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 110, in run_forever
    channel.wait()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 117, in dispatch_method
    return amqp_method(self, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 2060, in _basic_deliver
    func(msg)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 296, in callback
    self.update_retrace_stats(release, day_key, retracing_time, False)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 140, in update_retrace_stats
    self.retrace_stats_fam.add(day_key, release + status)
  File "/usr/lib/pymodules/python2.7/pycassa/columnfamily.py", line 945, in add
    allow_retries=self._allow_retries)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 544, in execute
    return getattr(conn, f)(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 127, in new_f
    self._pool._notify_on_failure(exc, server=self.server, connection=self)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 694, in _notify_on_failure
    l.connection_failed(dic)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy-rev-114/metrics.py", line 28, in connection_failed
    get_metrics().increment(name)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy-rev-114/metrics.py", line 19, in get_metrics
    connection = UdpStatsDClient(host=configuration.statsd_host,
AttributeError: 'module' object has no attribute 'statsd_host'
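
For what it's worth, that AttributeError just means the deployed local_config module has no statsd_host setting, so the Cassandra connection-failure callback itself blows up. A minimal, hypothetical sketch of a defensive lookup (names are illustrative, not the actual daisy code):

# Hypothetical sketch: read the statsd settings defensively so a missing
# statsd_host disables metrics instead of raising AttributeError inside
# the connection-failure handler.
def get_metrics(configuration, client_factory):
    """configuration: the deployed local_config module.
    client_factory: e.g. the UdpStatsDClient used by metrics.py above."""
    host = getattr(configuration, 'statsd_host', None)
    port = getattr(configuration, 'statsd_port', 8125)
    if host is None:
        return None  # metrics disabled rather than crashing the retracer
    return client_factory(host=host, port=port)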

i386-2:

WARNING: /usr/lib/python2.7/dist-packages/openravepy/_openravepy_0_6/convexdecompositionpy.so is needed, but cannot be mapped to a package
WARNING: /usr/lib/python2.7/dist-packages/openravepy/_openravepy_0_6/openravepy_int.so is needed, but cannot be mapped to a package
WARNING: Cannot find package which ships ExecutablePath
ERROR: ExecutablePath /srv/daisy.ubuntu.com/var/Ubuntu 12.04/cache-hBpjM4/sandbox/usr/bin/openrave0.6.py does not exist (report specified package openrave0.6-dp-python 0.6.6.6-ubuntu1~precise1 [origin: LP-PPA-openrave-release])
WARNING: package skype does not exist, ignoring
WARNING: /usr/lib/i386-linux-gnu/libORBitCosNaming-2.so.0.1.0 is needed, but cannot be mapped to a package
WARNING: /usr/lib/i386-linux-gnu/libORBit-2.so.0.1.0 is needed, but cannot be mapped to a package
WARNING: Cannot find package which ships ExecutablePath
ERROR: ExecutablePath /srv/daisy.ubuntu.com/var/Ubuntu 12.04/cache-hBpjM4/sandbox/usr/bin/skype does not exist (report specified package skype 4.0.0.8-1)
Traceback (most recent call last):
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 364, in <module>
    main()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 361, in main
    retracer.listen()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 97, in listen
    self.run_forever(channel, self.callback, queue=retrace)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 110, in run_forever
    channel.wait()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 117, in dispatch_method
    return amqp_method(self, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 2060, in _basic_deliver
    func(msg)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 323, in callback
    utils.bucket(self.oops_config, oops_id, crash_signature, vals)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy-rev-114/utils.py", line 44, in bucket
    oopses.bucket(oops_config, oops_id, crash_signature, fields)
  File "/usr/lib/pymodules/python2.7/oopsrepository/oopses.py", line 155, in bucket
    daybucketcount_cf.add(':'.join((field, resolution)), bucketid)
  File "/usr/lib/pymodules/python2.7/pycassa/columnfamily.py", line 945, in add
    allow_retries=self._allow_retries)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 544, in execute
    return getattr(conn, f)(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 137, in new_f
    (self._retry_count, exc.__class__.__name__, exc))
pycassa.pool.MaximumRetryException: Retried 1 times. Last failure was timeout: timed out
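
For reference, the "Retried 1 times" comes from pycassa's connection pool exhausting its retry budget. A rough sketch of the knobs involved when the pool is built (keyspace name and values are illustrative, not what production uses):

import pycassa

# Illustrative values only; the deployed local_config may differ.
pool = pycassa.ConnectionPool(
    keyspace='crashdb',                  # placeholder keyspace name
    server_list=['91.189.89.250:9160'],
    timeout=5,       # per-request socket timeout in seconds (pycassa's default is 0.5)
    max_retries=5,   # failures tolerated before MaximumRetryException is raised
    pool_size=5)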

Any chance of someone taking a look into this?

Thanks,

Haw

Haw Loeung (hloeung) on 2012-08-15
tags: added: canonical-webops-eng
Haw Loeung (hloeung) wrote :

Still seeing this.

hloeung@finfolk:/srv/daisy.ubuntu.com/production-logs/supervisor-childlog$ sudo -u whoopsie supervisorctl -c /srv/daisy.ubuntu.com/production/local_config/supervisor-${HOSTNAME}.conf status
retracer-amd64:retracer-amd64-0 RUNNING pid 2987, uptime 3:53:27
retracer-amd64:retracer-amd64-1 EXITED Aug 24 02:56 AM
retracer-amd64:retracer-amd64-2 RUNNING pid 23200, uptime 1:58:49
retracer-i386:retracer-i386-0 RUNNING pid 10749, uptime 9:17:17
retracer-i386:retracer-i386-1 RUNNING pid 10748, uptime 9:17:17
retracer-i386:retracer-i386-2 RUNNING pid 10747, uptime 9:17:17

hloeung@finfolk:/srv/daisy.ubuntu.com/production-logs/supervisor-childlog$ sudo -u whoopsie tail -n 35 retracer-amd64-1-stderr---supervisor-finfolk-MLpmvu.log
WARNING: /lib/x86_64-linux-gnu/libglib-2.0.so.0.3306.0 is needed, but cannot be mapped to a package
Traceback (most recent call last):
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 364, in <module>
    main()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 361, in main
    retracer.listen()
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 97, in listen
    self.run_forever(channel, self.callback, queue=retrace)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 110, in run_forever
    channel.wait()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 97, in wait
    return self.dispatch_method(method_sig, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/abstract_channel.py", line 117, in dispatch_method
    return amqp_method(self, args, content)
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/channel.py", line 2060, in _basic_deliver
    func(msg)
  File "/srv/daisy.ubuntu.com/production/whoopsie-daisy/process_core.py", line 276, in callback
    self.stack_fam.insert(stacktrace_addr_sig, report)
  File "/usr/lib/pymodules/python2.7/pycassa/columnfamily.py", line 897, in insert
    allow_retries=self._allow_retries)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 544, in execute
    return getattr(conn, f)(*args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 142, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 142, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 142, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 142, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 142, in new_f
    return new_f(self, *args, **kwargs)
  File "/usr/lib/pymodules/python2.7/pycassa/pool.py", line 137, in new_f
    (self._retry_count, exc.__class__.__name__, exc))
pycassa.pool.MaximumRetryException: Retried 6 times. Last failure was error: [Errno 104] Connection reset by peer

JuanJo Ciarlante (jjo) wrote :

FTR happened again (amd64)

Evan Dandrea (ev) wrote :

On Fri, Aug 24, 2012 at 12:33 PM, JuanJo Ciarlante
<email address hidden> wrote:
> FTR happened again (amd64)

I've been looking into this for most of the day and I'm so far at a loss
as to what could be causing it, other than noting that the retracers have been
successfully running against 12.04 since the 15th (they were broken
between 6/18 and 8/15).

We're writing and reading everywhere at ConsistencyLevel ONE. So it
seems really odd that we're getting timeouts when talking to three
separate nodes. We're not running any queries to back populate, and we
haven't introduced any changes that would dramatically increase the
hit to Cassandra (like counting columns, as was the case the last time
the retracers started timing out).
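
For context, this is roughly what pinning everything at ConsistencyLevel ONE looks like at the pycassa level (a sketch only; the keyspace and column family names are placeholders, not the actual daisy setup):

import pycassa
from pycassa.cassandra.ttypes import ConsistencyLevel

pool = pycassa.ConnectionPool('crashdb', server_list=['127.0.0.1:9160'])

# Reads and writes only wait for a single replica to acknowledge, so a
# timeout should not be waiting on agreement from the whole ring.
retrace_stats_fam = pycassa.ColumnFamily(
    pool, 'RetraceStats',
    read_consistency_level=ConsistencyLevel.ONE,
    write_consistency_level=ConsistencyLevel.ONE)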

Can someone on webops tell me what
/srv/daisy.ubuntu.com/production/local_config/local_config.py on
gremlin or cherufe has for cassandra_host? Are we using round robin,
or are the Apache frontends communicating with the cassandra ring
solely through a single node?

As mentioned on IRC, I'm definitely seeing the kind of weird behaviour
we had back before we stopped doing large column counts as part of the
front page of errors.ubuntu.com. That is, my ssh tunnel to jumbee
keeps dying or dropping requests until I restart it.

Haw Loeung (hloeung) wrote :

Hi Evan,

First of all, thank you for spending some time to look into this.

The local_config.py on both gremlin and cherufe has the cassandra_host currently set to 127.0.0.1. This is then balanced across the 3 cassandra backends.

I notice that on finfolk, local_config.py is actually pointing to just jumbee:

# The address of the Cassandra database.
cassandra_host = '91.189.89.250:9160'

hloeung@jumbee:~$ netstat -an --tcp | grep :9160 | wc -l
611

hloeung@nawao:~$ netstat -an --tcp | grep :9160 | wc -l
146

hloeung@tomte:~$ netstat -an --tcp | grep :9160 | wc -l
146

Not very balanced, is it? Let me fix that up.

Haw Loeung (hloeung) wrote :

Hmm, it turns out you can't specify multiple cassandra hosts to try?

For example:

cassandra_host = '91.189.89.250:9160 91.189.89.249:9160'

Evan Dandrea (ev) wrote :

We now specify multiple hosts for the retracers to talk to, and Tom has confirmed that we're seeing much more balanced traffic:

https://pastebin.canonical.com/73229/
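
For reference, pycassa's ConnectionPool takes its servers as a Python list of 'host:port' strings rather than a single space-separated value, which is why the earlier cassandra_host string couldn't express this. A minimal sketch (the keyspace name is a placeholder):

import pycassa

# The pool balances and fails over across every entry in server_list.
cassandra_hosts = ['91.189.89.250:9160', '91.189.89.249:9160']
pool = pycassa.ConnectionPool('crashdb', server_list=cassandra_hosts)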

However, we're still seeing problems. I'm beginning to wonder if we're trying to insert too much data into a single row at once, hitting limits in Thrift. It would be extremely helpful to see a number of recent exceptions. If they're all saying "connection reset by peer" triggered from stack_fam.insert, then I'm pretty sure we need to split apart these large inserts.
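
A rough, hypothetical sketch of what splitting those inserts apart could look like (the helper name and chunk size are made up; stack_fam and report are as in the traceback above):

# Split one large insert into several smaller ones so that no single
# Thrift message exceeds the frame/size limits.
def insert_in_chunks(column_family, key, columns, chunk_size=50):
    items = columns.items()
    for start in range(0, len(items), chunk_size):
        # Each call is a separate Thrift request; all columns still land
        # in the same row.
        column_family.insert(key, dict(items[start:start + chunk_size]))

# e.g. instead of self.stack_fam.insert(stacktrace_addr_sig, report):
#      insert_in_chunks(self.stack_fam, stacktrace_addr_sig, report)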

Unfortunately I cannot check this myself as the log syncing to snakefruit does not appear to be working.

Evan Dandrea (ev) wrote :

This is long since resolved.

Changed in daisy:
status: New → Fix Released