Kazoo Driver Does not retry connections with Zookeeper

Bug #1495663 reported by Rohit Jaiswal
This bug affects 1 person
Affects        Assigned to
ceilometer     Rohit Jaiswal
python-tooz    Joshua Harlow

Bug Description

The driver at https://github.com/openstack/tooz/blob/master/tooz/drivers/zookeeper.py#L339 could proxy more options to KazooClient than just the hosts. For example, retrying connections after a connection failure is one such option that the Tooz KazooDriver could support.
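
A minimal sketch of what such option proxying might look like, assuming the driver receives its options as a plain dict of strings. The helper name `build_client_kwargs` and the option name `connection_retry_max_tries` are hypothetical and are not taken from the tooz source; `timeout`, `randomize_hosts`, `read_only` and `connection_retry` are keyword arguments accepted by KazooClient, and `connection_retry` may be given as a dict of retry options such as `max_tries`.

```python
def _to_bool(value):
    """Interpret common string spellings of a boolean option."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() in ("1", "true", "yes", "on")


def build_client_kwargs(hosts, options):
    """Map driver options to keyword arguments for a Kazoo-style client.

    Hypothetical helper: only whitelisted options are passed through,
    and unknown options are ignored rather than forwarded blindly.
    """
    converters = {
        "timeout": float,
        "randomize_hosts": _to_bool,
        "read_only": _to_bool,
    }
    kwargs = {"hosts": hosts}
    for name, convert in converters.items():
        if name in options:
            kwargs[name] = convert(options[name])

    # Hypothetical option name; it is translated into the dict form
    # that KazooClient accepts for its connection_retry argument.
    max_tries = options.get("connection_retry_max_tries")
    if max_tries is not None:
        kwargs["connection_retry"] = {"max_tries": int(max_tries)}
    return kwargs
```

The resulting dict could then be expanded into the client constructor, for example `KazooClient(**build_client_kwargs(hosts, options))`, assuming kazoo is installed.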

OpenStack Infra (hudson-openstack) wrote : Fix proposed to tooz (master)

Fix proposed to branch: master
Review: https://review.openstack.org/223259

Changed in python-tooz:
assignee: nobody → Joshua Harlow (harlowja)
status: New → In Progress
Rohit Jaiswal (rohit-jaiswal-3) wrote :

Ceilometer uses Tooz for agent coordination, and configurable connection retries would help build resilience against random connection failures.

For example, I see this in the notification agent logs:

(kazoo.client): 2015-09-11 18:49:35,331 DEBUG connection _submit Sending request(xid=2): Create(path=u'/tooz/ceilometer.notification/b279f2ed-fe04-4113-b374-4627745c711c', data='\xc4\x00', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=1)
(kazoo.client): 2015-09-11 18:49:38,485 Level 5 connection _submit Sending request(xid=-2): Ping()
(kazoo.client): 2015-09-11 18:49:41,450 WARNING connection _connect_attempt Connection dropped: outstanding heartbeat ping not received
(kazoo.client): 2015-09-11 18:49:41,450 WARNING connection _connect_attempt Transition to CONNECTING
(kazoo.client): 2015-09-11 18:49:41,450 INFO client _session_callback Zookeeper connection lost
(ceilometer.openstack.common.threadgroup): 2015-09-11 18:49:41,463 ERROR threadgroup wait
Traceback (most recent call last):
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/openstack/common/threadgroup.py", line 145, in wait
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/openstack/common/threadgroup.py", line 47, in wait
    return self.thread.wait()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
    return self._exit_event.wait()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/openstack/common/service.py", line 491, in run_service
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/notification.py", line 143, in start
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/coordination.py", line 125, in join_group
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/tooz/drivers/zookeeper.py", line 427, in get
    return self._handler(self._kazoo_async_result, timeout, **self._kwargs)
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/tooz/drivers/zookeeper.py", line 137, in _join_group_handler
    raise coordination.ToozError(utils.exception_message(e))
(kazoo.client): 2015-09-11 18:49:41,550 WARNING connection zk_loop Failed connecting to Zookeeper within the connection retry policy.
(kazoo.client): 2015-09-11 18:49:41,551 INFO client _session_callback Zookeeper session lost, state: CLOSED
(kazoo.client): 2015-09-11 18:49:41,551 Level ...


Changed in ceilometer:
assignee: nobody → Rohit Jaiswal (rohit-jaiswal-3)
Rohit Jaiswal (rohit-jaiswal-3) wrote :

The above stack trace is observed during notification agent startup in HA mode.

Rohit Jaiswal (rohit-jaiswal-3) wrote :

Also, the implementation of join_group [1] in ceilometer/coordination.py should handle ToozError. Currently it only handles the MemberAlreadyExist and GroupNotCreated exceptions.
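
A rough sketch of the kind of handling meant here, not the actual ceilometer code. The exception classes below are local stand-ins for the tooz exceptions of the same names so that the snippet runs without tooz installed, the coordinator is assumed to be synchronous (the real tooz calls return asynchronous results), and `max_attempts` is a hypothetical parameter.

```python
import logging

LOG = logging.getLogger(__name__)


# Stand-ins for the tooz.coordination exceptions of the same names.
class ToozError(Exception):
    pass


class MemberAlreadyExist(ToozError):
    pass


class GroupNotCreated(ToozError):
    pass


def join_group(coordinator, group_id, max_attempts=3):
    """Try to join group_id; return True on success, False otherwise."""
    for attempt in range(1, max_attempts + 1):
        try:
            coordinator.join_group(group_id)
            return True
        except MemberAlreadyExist:
            # Already a member, so there is nothing left to do.
            return True
        except GroupNotCreated:
            # Create the group; the next loop iteration retries the join.
            coordinator.create_group(group_id)
        except ToozError as exc:
            # Generic coordination failure (for example a lost connection):
            # log it and retry instead of letting the agent crash.
            LOG.warning("Joining group %s failed (attempt %d/%d): %s",
                        group_id, attempt, max_attempts, exc)
    return False
```

The specific exceptions are caught before the generic ToozError because they are subclasses of it in this sketch; with the order reversed, the generic handler would swallow them.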

[1] https://github.com/openstack/ceilometer/blob/master/ceilometer/coordination.py#L115-L133

OpenStack Infra (hudson-openstack) wrote : Fix merged to tooz (master)

Reviewed: https://review.openstack.org/223259
Committed: https://git.openstack.org/cgit/openstack/tooz/commit/?id=4ae41738220e48148cc5ef2a3ecf350e067ff753
Submitter: Jenkins
Branch: master

commit 4ae41738220e48148cc5ef2a3ecf350e067ff753
Author: Joshua Harlow <email address hidden>
Date: Mon Sep 14 11:47:45 2015 -0700

    Allow more kazoo specific client options to be proxied through

    Closes-Bug: #1495663

    Change-Id: I3710845f9e66ab574eb358000b859dc96cbf08a0

Changed in python-tooz:
status: In Progress → Fix Committed
gordon chung (chungg) wrote :

Is this still a problem with ceilometer, or is it addressed by bug 1496982?

Rohit Jaiswal (rohit-jaiswal-3) wrote :

https://review.openstack.org/#/c/224919/ (the fix for bug 1496982) only addresses failures when joining a group, so that the agent gets a chance to initialize its listeners and start up correctly.

Bug 1495663 is about adding a config parameter to cap connection retries, which should be useful for any other operation with Tooz/Zookeeper. I think this is a more generic fix, and it depends on the python-tooz fix above.
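
To illustrate what "capping retries" means in practice, here is a generic sketch of a bounded retry loop with exponential backoff. It is not how kazoo or tooz implement retries; the function name `call_with_retries` and all of its parameters are made up for this example.

```python
import time


def call_with_retries(func, max_retries, delay=0.1, backoff=2.0,
                      retry_on=(ConnectionError,), sleep=time.sleep):
    """Call func(), retrying at most max_retries times on retry_on errors.

    The wait before retry number n (starting at 0) is delay * backoff ** n.
    Once the cap is reached, the last error is re-raised to the caller.
    """
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= max_retries:
                raise
            sleep(delay * backoff ** attempt)
            attempt += 1
```

With `max_retries` read from configuration, an operator can choose between failing fast and riding out short connection drops.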

gordon chung (chungg) wrote :

ack, so a change is still required

Changed in ceilometer:
status: New → Triaged
importance: Undecided → Low
Changed in python-tooz:
milestone: none → 1.27.0
status: Fix Committed → Fix Released
gordon chung (chungg) wrote :

corresponding tooz lib has been released.

Changed in ceilometer:
status: Triaged → Fix Released