schema did not reconnect to zookeeper after connection loss on a single-node setup

Bug #1395319 reported by Vedamurthy Joshi
Affects              Status          Importance  Assigned to     Milestone
Juniper Openstack    Fix Committed   High        Sachin Bansal
  R2.0               Fix Committed   High        Sachin Bansal
  R2.1               Fix Committed   Undecided   Sachin Bansal

Bug Description

Build 2.0-2469, CentOS 6.4, single-node Icehouse setup (sanity setup)

http://anamika.englab.juniper.net:8080/view/SanityView/job/centos64_os_icehouse_Single_Node_Sanity/51/consoleFull

During this run, schema was initially up and running fine.
Around 11/20/2014 08:38:41 AM, it lost its connection to zookeeper and never reconnected.

All logs are in http://10.204.216.50/Docs/bugs/#

After schema was restarted, it started working fine. Relevant log excerpt:
11/20/2014 08:37:44 AM [schema]: Received response(xid=34): u'/id/bgp/route-targets/0008000009'
11/20/2014 08:38:17 AM [schema]: Sending request(xid=35): Delete(path='/id/bgp/route-targets/0008000009', version=-1)
11/20/2014 08:38:17 AM [schema]: Received response(xid=35): True
11/20/2014 08:38:18 AM [schema]: Sending request(xid=36): Delete(path='/id/virtual-networks/0000000009', version=-1)
11/20/2014 08:38:18 AM [schema]: Received response(xid=36): True
11/20/2014 08:38:41 AM [schema]: Sending request(xid=37): Create(path='/id/virtual-networks/0000000009', data='default-domain:TestECMPwithSVMChange-40767172:bridge_vn1-26413928', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=0)
11/20/2014 08:38:41 AM [schema]: Received response(xid=37): u'/id/virtual-networks/0000000009'
11/20/2014 08:38:41 AM [schema]: Sending request(xid=38): Create(path='/id/bgp/route-targets/0008000009', data='default-domain:TestECMPwithSVMChange-40767172:bridge_vn1-26413928:bridge_vn1-26413928', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=0)
11/20/2014 08:38:41 AM [schema]: Received response(xid=38): u'/id/bgp/route-targets/0008000009'
11/20/2014 08:38:54 AM [schema]: Connection dropped: outstanding heartbeat ping not received
11/20/2014 08:38:54 AM [schema]: Transition to CONNECTING
11/20/2014 08:38:54 AM [schema]: Zookeeper connection lost
11/20/2014 08:38:55 AM [schema]: Connecting to 10.204.216.201:2181
11/20/2014 08:38:55 AM [schema]: Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=220, time_out=20000, session_id=92828630529671169, passwd="\x94s\xb2\x06\x1a\x84\xfbt\xbc\xb6\xa1\x86'\xe7\x12\x0c", read_only=None)
11/20/2014 08:38:55 AM [schema]: Zookeeper connection established, state: CONNECTED
11/20/2014 08:39:05 AM [schema]: Sending request(xid=1): Create(path='/id/virtual-networks/0000000010', data='default-domain:TestECMPwithSVMChange-40767172:bridge_vn2-83713662', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=0)
11/20/2014 08:39:05 AM [schema]: Received response(xid=1): u'/id/virtual-networks/0000000010'
11/20/2014 08:39:05 AM [schema]: Sending request(xid=2): Create(path='/id/bgp/route-targets/0008000010', data='default-domain:TestECMPwithSVMChange-40767172:bridge_vn2-83713662:bridge_vn2-83713662', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=0)
11/20/2014 08:39:05 AM [schema]: Received response(xid=2): u'/id/bgp/route-targets/0008000010'
11/20/2014 08:39:37 AM [schema]: Sending request(xid=3): Create(path='/id/bgp/route-targets/0008000011', data='default-domain:TestECMPwithSVMChange-40767172:bridge_vn1-26413928:service-d980050f-36b3-4743-9cc0-4faffe1c4eb3-default-domain_TestECMPwithSVMChange-40767172_bridge_si-12612594_1', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=0)
11/20/2014 08:39:44 AM [schema]: Connection dropped: outstanding heartbeat ping not received
11/20/2014 08:39:44 AM [schema]: Transition to CONNECTING
11/20/2014 08:39:44 AM [schema]: Zookeeper connection lost <<<<<<<<<<<<<<<<<<<<<<<<<<
11/21/2014 04:59:51 PM [schema]: Connecting to 10.204.216.201:2181
11/21/2014 04:59:51 PM [schema]: Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=20000, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
11/21/2014 04:59:51 PM [schema]: Zookeeper connection established, state: CONNECTED

Tags: config
Revision history for this message
Sachin Bansal (sbansal) wrote :

From the above logs, it looks like it reconnected immediately. The last line of the logs indicates so. How did you determine that it did not reconnect?

Sachin Bansal (sbansal)
Changed in juniperopenstack:
status: New → Incomplete
Revision history for this message
Sandip Dey (sandipd) wrote :

We seem to have hit this issue in a recent sanity run.

Logs saved at http://10.204.216.50/Docs/bugs/1395319

contrail/schema-zk.log
====================

perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=0)
12/02/2014 01:36:45 PM [schema]: Received response(xid=34): u'/id/security-groups/id/0000000004'
12/02/2014 01:36:53 PM [schema]: Sending request(xid=35): Delete(path='/id/bgp/route-targets/0008000002', version=-1)
12/02/2014 01:37:06 PM [schema]: Connection dropped: outstanding heartbeat ping not received
12/02/2014 01:37:06 PM [schema]: Transition to CONNECTING
12/02/2014 01:37:06 PM [schema]: Zookeeper connection lost

There is no log entry showing it connecting back.

Changed in juniperopenstack:
status: Incomplete → New
Revision history for this message
Vedamurthy Joshi (vedujoshi) wrote :

Sachin,
Even in my earlier logs, it had not reconnected for more than a day. I had put arrow marks pointing to that:

11/20/2014 08:39:44 AM [schema]: Zookeeper connection lost <<<<<<<<<<<<<<<<<<<<<<<<<<
11/21/2014 04:59:51 PM [schema]: Connecting to 10.204.216.201:2181

Revision history for this message
Sandip Dey (sandipd) wrote :

We hit this issue again in the following sanity run:

[Build CentOS release 6.5 2.0-10~icehouse.el6] Single Node Sanity Report

tags: removed: blocker
Revision history for this message
Sachin Bansal (sbansal) wrote :

On zookeeper connection loss, the schema transformer exits and is restarted by supervisord. However, in this case, it looks like supervisord took more than a day to detect that the process had exited.

From schema-zk.log:
11/20/2014 08:39:44 AM [schema]: Zookeeper connection lost
11/21/2014 04:59:51 PM [schema]: Connecting to 10.204.216.201:2181

From supervisord-config.log:
2014-11-20 08:31:49,806 INFO gave up: contrail-device-manager entered FATAL state, too many start retries too quickly
2014-11-21 16:59:50,562 INFO stopped: contrail-schema (terminated by SIGKILL)
2014-11-21 16:59:50,906 INFO spawned: 'contrail-schema' with pid 17685
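
For context, here is a minimal sketch of the exit-on-connection-loss pattern described in this comment (this is not the actual contrail-schema code; the ZooKeeper address is taken from the logs above purely for illustration). It uses kazoo's connection state listener to exit when the session is lost, so that supervisord can respawn the process:

import os

from kazoo.client import KazooClient, KazooState

# ZooKeeper address as seen in the schema-zk.log excerpts above
zk = KazooClient(hosts='10.204.216.201:2181')

def zk_listener(state):
    # Called by kazoo whenever the connection state changes
    if state == KazooState.LOST:
        # Session is gone: exit immediately and let supervisord restart us.
        # os._exit() is used because the listener is not called from the
        # main application flow.
        os._exit(1)
    elif state == KazooState.SUSPENDED:
        # Connection dropped but the session may still be recoverable;
        # kazoo keeps trying to reconnect on its own.
        pass

zk.add_listener(zk_listener)
zk.start()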

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : master

Review in progress for https://review.opencontrail.org/8858
Submitter: Sachin Bansal (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/8858
Committed: http://github.org/Juniper/contrail-controller/commit/44d559187d99b3c04438d184dabd4536e78fb610
Submitter: Zuul
Branch: master

commit 44d559187d99b3c04438d184dabd4536e78fb610
Author: Sachin Bansal <email address hidden>
Date: Thu Apr 2 11:50:05 2015 -0700

Use gevent.sleep as sleep_func for KazooRetry

By default, KazooRetry uses time.sleep as sleep_func, which makes the whole
process sleep when called. This causes the call to become blocking. We should
use gevent.sleep to avoid this.

Change-Id: I3b9ef36b5da2bbb56f5f96db68fa254678890b33
Closes-Bug: 1395319
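
For illustration, a minimal sketch of the approach named in the commit message, assuming kazoo's public KazooRetry/KazooClient API (this is not the actual contrail-controller change): pass gevent.sleep as sleep_func so that retry back-off yields to the gevent hub instead of blocking the whole process.

import gevent

from kazoo.client import KazooClient
from kazoo.retry import KazooRetry

# sleep_func defaults to time.sleep, which suspends every greenlet in the
# process; gevent.sleep only suspends the calling greenlet.
# max_tries=-1 retries indefinitely.
connection_retry = KazooRetry(max_tries=-1, sleep_func=gevent.sleep)
command_retry = KazooRetry(max_tries=-1, sleep_func=gevent.sleep)

zk = KazooClient(hosts='10.204.216.201:2181',
                 connection_retry=connection_retry,
                 command_retry=command_retry)
zk.start()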

Changed in juniperopenstack:
status: New → Fix Committed
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : R2.1

Review in progress for https://review.opencontrail.org/8999
Submitter: Sachin Bansal (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : R2.0

Review in progress for https://review.opencontrail.org/9000
Submitter: Sachin Bansal (<email address hidden>)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/8999
Committed: http://github.org/Juniper/contrail-controller/commit/6a70a658dca9915969ecd31bd42fe1da3f19dbb4
Submitter: Zuul
Branch: R2.1

commit 6a70a658dca9915969ecd31bd42fe1da3f19dbb4
Author: Sachin Bansal <email address hidden>
Date: Thu Apr 2 11:50:05 2015 -0700

Use gevent.sleep as sleep_func for KazooRetry

By default, KazooRetry uses time.sleep as sleep_func, which makes the whole
process sleep when called. This causes the call to become blocking. We should
use gevent.sleep to avoid this.

Change-Id: I3b9ef36b5da2bbb56f5f96db68fa254678890b33
Closes-Bug: 1395319
(cherry picked from commit 44d559187d99b3c04438d184dabd4536e78fb610)

Revision history for this message
OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/9000
Committed: http://github.org/Juniper/contrail-controller/commit/78248ca63d10be6e9b8380c634136f683864c78f
Submitter: Zuul
Branch: R2.0

commit 78248ca63d10be6e9b8380c634136f683864c78f
Author: Sachin Bansal <email address hidden>
Date: Thu Apr 2 11:50:05 2015 -0700

Use gevent.sleep as sleep_func for KazooRetry

By default, KazooRetry uses time.sleep as sleep_func, which makes the whole
process sleep when called. This causes the call to become blocking. We should
use gevent.sleep to avoid this.

Change-Id: I3b9ef36b5da2bbb56f5f96db68fa254678890b33
Closes-Bug: 1395319
(cherry picked from commit 44d559187d99b3c04438d184dabd4536e78fb610)
