Ceilometer uses Tooz for agent coordination, and configurable connection retries would be useful for building resilience against random connection failures. For example, I see this in the notification agent logs:

(kazoo.client): 2015-09-11 18:49:35,331 DEBUG connection _submit Sending request(xid=2): Create(path=u'/tooz/ceilometer.notification/b279f2ed-fe04-4113-b374-4627745c711c', data='\xc4\x00', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=1)
(kazoo.client): 2015-09-11 18:49:38,485 Level 5 connection _submit Sending request(xid=-2): Ping()
(kazoo.client): 2015-09-11 18:49:41,450 WARNING connection _connect_attempt Connection dropped: outstanding heartbeat ping not received
(kazoo.client): 2015-09-11 18:49:41,450 WARNING connection _connect_attempt Transition to CONNECTING
(kazoo.client): 2015-09-11 18:49:41,450 INFO client _session_callback Zookeeper connection lost
(ceilometer.openstack.common.threadgroup): 2015-09-11 18:49:41,463 ERROR threadgroup wait
Traceback (most recent call last):
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/openstack/common/threadgroup.py", line 145, in wait
    x.wait()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/openstack/common/threadgroup.py", line 47, in wait
    return self.thread.wait()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/greenthread.py", line 175, in wait
    return self._exit_event.wait()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/event.py", line 121, in wait
    return hubs.get_hub().switch()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
    return self.greenlet.switch()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/openstack/common/service.py", line 491, in run_service
    service.start()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/notification.py", line 143, in start
    self.partition_coordinator.join_group(self.group_id)
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/ceilometer/coordination.py", line 125, in join_group
    join_req.get()
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/tooz/drivers/zookeeper.py", line 427, in get
    return self._handler(self._kazoo_async_result, timeout, **self._kwargs)
  File "/opt/stack/venv/ceilometer-20150911T173109Z/lib/python2.7/site-packages/tooz/drivers/zookeeper.py", line 137, in _join_group_handler
    raise coordination.ToozError(utils.exception_message(e))
ToozError
(kazoo.client): 2015-09-11 18:49:41,550 WARNING connection zk_loop Failed connecting to Zookeeper within the connection retry policy.
(kazoo.client): 2015-09-11 18:49:41,551 INFO client _session_callback Zookeeper session lost, state: CLOSED
(kazoo.client): 2015-09-11 18:49:41,551 Level 5 connection zk_loop Connection stopped
(oslo_messaging._drivers.impl_rabbit): 2015-09-11 18:49:42,333 ERROR impl_rabbit _error_callback Failed to consume message from queue:
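
To make the request concrete, here is a rough sketch of the kind of configurable retry that could wrap a coordination call such as the join_group() call in the traceback. It is not Ceilometer's or Tooz's actual code: CoordinationError is a hypothetical stand-in for ToozError, and call_with_retries, max_retries, base_delay and max_delay are illustrative names that would presumably map to configuration options.

```python
import time


class CoordinationError(Exception):
    """Hypothetical stand-in for a backend failure such as ToozError."""


def call_with_retries(func, max_retries=5, base_delay=1.0, max_delay=30.0,
                      retry_on=(CoordinationError,), sleep=time.sleep):
    """Call func() and retry with exponential backoff on retryable errors.

    max_retries counts the extra attempts made after the first failure.
    The delay doubles after each failure and is capped at max_delay.
    """
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= max_retries:
                # Retry budget exhausted: let the caller see the error.
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay)
            attempt += 1


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_join():
        # Simulates a join that fails twice before the backend recovers.
        state["calls"] += 1
        if state["calls"] < 3:
            raise CoordinationError("connection lost")
        return "joined"

    print(call_with_retries(flaky_join, base_delay=0.01))
```

With something like this, a transient ZooKeeper disconnect during startup would cost a few delayed attempts instead of ending the service thread, and the error would still propagate once the retry budget is used up.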