Non-responsive zookeeper leads to spinning agent and traceback

Bug #1073083 reported by David Ames
24
This bug affects 4 people
Affects Status Importance Assigned to Milestone
pyjuju
In Progress
Wishlist
Kapil Thangavelu
txzookeeper
Fix Committed
Wishlist
Kapil Thangavelu

Bug Description

When the zookeeper is non responsive the client agent makes things worse by not backing off its requests.

Traceback:

2012-10-30 08:33:49,921: twisted@ERROR: Traceback (most recent call last):
2012-10-30 08:33:49,922: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/txzookeeper/managed.py", line 319, in _cb_created
2012-10-30 08:33:49,922: twisted@ERROR: if self._check_result(result_code, d, path=path):
2012-10-30 08:33:49,922: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/txzookeeper/client.py", line 221, in _check_result
2012-10-30 08:33:49,922: twisted@ERROR: self, error)
2012-10-30 08:33:49,922: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 134, in maybeDeferred
2012-10-30 08:33:49,922: twisted@ERROR: result = f(*args, **kw)
2012-10-30 08:33:49,922: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1187, in unwindGenerator
2012-10-30 08:33:49,922: twisted@ERROR: return _inlineCallbacks(None, gen, Deferred())
2012-10-30 08:33:49,922: twisted@ERROR: --- <exception caught here> ---
2012-10-30 08:33:49,922: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1045, in _inlineCallbacks
2012-10-30 08:33:49,923: twisted@ERROR: result = g.send(result)
2012-10-30 08:33:49,923: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/txzookeeper/managed.py", line 257, in _cb_connection_error
2012-10-30 08:33:49,923: twisted@ERROR: raise error
2012-10-30 08:33:49,923: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2012-10-30 08:33:49,923: twisted@ERROR: Traceback (most recent call last):
2012-10-30 08:33:49,923: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/txzookeeper/managed.py", line 319, in _cb_created
2012-10-30 08:33:49,923: twisted@ERROR: if self._check_result(result_code, d, path=path):
2012-10-30 08:33:49,923: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/txzookeeper/client.py", line 221, in _check_result
2012-10-30 08:33:49,923: twisted@ERROR: self, error)
2012-10-30 08:33:49,924: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 134, in maybeDeferred
2012-10-30 08:33:49,924: twisted@ERROR: result = f(*args, **kw)
2012-10-30 08:33:49,924: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", line 1187, in unwindGenerator
2012-10-30 08:33:49,924: twisted@ERROR: return _inlineCallbacks(None, gen, Deferred())
2012-10-30 08:33:49,924: twisted@ERROR: --- <exception caught here> ---
2012-10-30 08:33:49,924: twisted@ERROR: File "/usr/lib/python2.7/dist-packages/twisted/internet/defer.py", lin

Related branches

Revision history for this message
Clint Byrum (clint-fewbar) wrote :

Looks like txzookeeper.managed.SessionClient needs to understand ConnectionLossException so it doesn't spray exceptions as this is a normal condition like the others.. AND we need to back off and give the zookeeper server a little breathing room while retrying.

summary: - Non-responsive zookeeer leads to spinning agent and traceback
+ Non-responsive zookeeper leads to spinning agent and traceback
Changed in txzookeeper:
status: New → In Progress
importance: Undecided → High
assignee: nobody → Clint Byrum (clint-fewbar)
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

We were hit by this recently. About 60k backtraces in less than 4h, some as frequent as every 11ms:

root@juju-cstack-instance-67:/var/lib/juju/units/landscape-client-2/units/landscape-client-2# grep zookeeper.ConnectionLossException charm.log
2013-02-04 15:35:17,453: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
zookeeper.ConnectionLossException: connection loss /services/service-0000000050
2013-02-04 15:35:17,470: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,480: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,494: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,508: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,518: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,533: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,547: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,560: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,574: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,585: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
2013-02-04 15:35:17,600: twisted@ERROR: zookeeper.ConnectionLossException: connection loss
(...)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

There must be a memory leak associated with these errors too, because machines that are experiencing it in our case also have juju agents taking up huge amounts of RES RAM and dying in a few minutes:

  824 root 20 0 1897m 1.6g 1240 R 99.4 83.2 3:01.18 /usr/bin/python -m juju.agents.unit --nodaemon --logfile /var/lib/juju/units/landscape-clie

Revision history for this message
Kapil Thangavelu (hazmat) wrote :

Attaching a branch which greatly improves retry/reconnect behavior with backoff and adds some tests around it. The other branch had undefined variables and import errors.

Changed in txzookeeper:
assignee: Clint Byrum (clint-fewbar) → Kapil Thangavelu (hazmat)
Changed in juju:
assignee: nobody → Kapil Thangavelu (hazmat)
status: New → In Progress
importance: Undecided → Critical
milestone: none → 0.8
Changed in txzookeeper:
status: In Progress → Fix Committed
Mark Ramm (mark-ramm)
Changed in juju:
importance: Critical → Wishlist
Changed in txzookeeper:
importance: High → Wishlist
Revision history for this message
Kapil Thangavelu (hazmat) wrote : Re: [Bug 1073083] Re: Non-responsive zookeeper leads to spinning agent and traceback

realstically this is won't fix, we're migrating to juju-core and we'll have
a plan in place for prod pyju migration. The underlying issue is buried in
libzookeeper and the upstream basically treats it as a second class
citizen. There are dozens of extant patches for libzk and the python
bindings that fix real world issues, sadly none directly on this, and gdb
and friends haven't proved effective in determining root cause. The
underlying libzk code here is pretty awful (same really applies to the java
code, google cyclomatic complexity and zookeeper)

On Wed, Sep 18, 2013 at 4:37 AM, Mark Ramm <
<email address hidden>> wrote:

> ** Changed in: juju
> Importance: Critical => Wishlist
>
> ** Changed in: txzookeeper
> Importance: High => Wishlist
>
> --
> You received this bug notification because you are a bug assignee.
> https://bugs.launchpad.net/bugs/1073083
>
> Title:
> Non-responsive zookeeper leads to spinning agent and traceback
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/juju/+bug/1073083/+subscriptions
>

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.