contrail-api should not give up trying to connect to zookeeper

Bug #1691541 reported by Vedamurthy Joshi
Affects            Status         Importance  Assigned to      Milestone
Juniper Openstack  (status tracked in Trunk)
  R3.2             Fix Committed  High        Sahil Sabharwal
  R4.0             Fix Committed  High        Sahil Sabharwal
  Trunk            Fix Committed  High        Sahil Sabharwal

Bug Description

R4.0 Build 6, Ubuntu 16.04.2, Contrail container cluster

This setup has 3 controller containers. During provisioning, the first controller was provisioned correctly and the other two came up a few hours later, so the zk cluster was up only once all 3 containers had come up.

From the contrail-api logs on the first container, it seems that contrail-api tries a number of times to connect to the zk nodes but keeps failing, and after about an hour or so it stops retrying.

contrail-api should keep retrying the connection to zk periodically, forever.

May 17 08:38:12 nodec1 contrail-api[3909]: INFO:api-0:Connecting to 10.204.216.59:2181
May 17 08:38:12 nodec1 contrail-api[3909]: WARNING:api-0:Connection dropped: socket connection error: Connection refused
May 17 08:38:12 nodec1 contrail-api[3909]: INFO:api-0:Connecting to 10.204.216.60:2181
May 17 08:38:12 nodec1 contrail-api[3909]: WARNING:api-0:Connection dropped: socket connection error: Connection refused
May 17 08:38:12 nodec1 contrail-api[3909]: INFO:api-0:Connecting to 10.204.216.58:2181
May 17 08:38:12 nodec1 contrail-api[3909]: DEBUG:api-0:Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=400000, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
May 17 08:38:12 nodec1 contrail-api[3909]: WARNING:api-0:Connection dropped: socket connection broken
---------------

May 17 09:52:44 nodec1 contrail-api[3909]: INFO:api-0:Connecting to 10.204.216.59:2181
May 17 09:52:44 nodec1 contrail-api[3909]: WARNING:api-0:Connection dropped: socket connection error: Connection refused
May 17 09:52:44 nodec1 contrail-api[3909]: INFO:api-0:Connecting to 10.204.216.60:2181
May 17 09:52:44 nodec1 contrail-api[3909]: WARNING:api-0:Connection dropped: socket connection error: Connection refused
May 17 09:52:44 nodec1 contrail-api[3909]: INFO:api-0:Connecting to 10.204.216.58:2181
May 17 09:52:44 nodec1 contrail-api[3909]: DEBUG:api-0:Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=400000, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
May 17 09:52:44 nodec1 contrail-api[3909]: WARNING:api-0:Connection dropped: socket connection broken
May 17 09:52:47 nodec1 contrail-api[3909]: ERROR:contrail-api:Session Event: TCP Connect Fail
May 17 09:52:47 nodec1 contrail-api[3909]: ERROR:contrail-api:SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec1 process_status = [ << module_id = contrail-api instance_id = 0 state = Non-Functional connection_infos = [ << type = Zookeeper name = Zookeeper server_addrs = [ 10.204.216.58:2181, 10.204.216.59:2181, 10.204.216.60:2181, ] status = Down description = >>, << type = Collector name = server_addrs = [ 10.204.216.58:8086, ] status = Initializing description = Idle to Connect on EvIdleHoldTimerExpired >>, ] description = Zookeeper:Zookeeper[], Collector connection down >>, ] >>
May 17 09:52:47 nodec1 contrail-api[3909]: ERROR:contrail-api:SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec1 process_status = [ << module_id = contrail-api instance_id = 0 state = Non-Functional connection_infos = [ << type = Zookeeper name = Zookeeper server_addrs = [ 10.204.216.58:2181, 10.204.216.59:2181, 10.204.216.60:2181, ] status = Down description = >>, << type = Collector name = server_addrs = [ 10.204.216.58:8086, ] status = Down description = Connect to Idle on EvTcpConnectFail >>, ] description = Zookeeper:Zookeeper[], Collector connection down >>, ] >>
May 17 09:52:47 nodec1 contrail-api[3909]: ERROR:contrail-api:SANDESH: [DROP: WrongClientSMState] SandeshModuleClientTrace: data = << name = nodec1:Config:contrail-api:0 client_info = << status = Idle successful_connections = 0 pid = 3909 http_port = 8084 start_time = 1495010291999275 collector_name = collector_ip = 10.204.216.58:8086 collector_list = [ 10.204.216.60:8086, 10.204.216.58:8086, 10.204.216.59:8086, ] >> sm_queue_count = 1 max_sm_queue_count = 3 >>
May 17 09:52:49 nodec1 contrail-api[3909]: WARNING:api-0:Failed connecting to Zookeeper within the connection retry policy.
May 17 09:52:49 nodec1 contrail-api[3909]: INFO:api-0:Zookeeper session lost, state: CLOSED
May 17 09:52:51 nodec1 contrail-api[3909]: ERROR:contrail-api:Session Event: TCP Connect Fail
May 17 09:52:51 nodec1 contrail-api[3909]: ERROR:contrail-api:SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec1 process_status = [ << module_id = contrail-api instance_id = 0 state = Non-Functional connection_infos = [ << type = Zookeeper name = Zookeeper server_addrs = [ 10.204.216.58:2181, 10.204.216.59:2181, 10.204.216.60:2181, ] status = Down description = >>, << type = Collector name = server_addrs = [ 10.204.216.59:8086, ] status = Initializing description = Idle to Connect on EvIdleHoldTimerExpired >>, ] description = Zookeeper:Zookeeper[], Collector connection down >>, ] >>
May 17 09:52:51 nodec1 contrail-api[3909]: ERROR:contrail-api:SANDESH: [DROP: WrongClientSMState] NodeStatusUVE: data = << name = nodec1 process_status = [ << module_id = contrail-api instance_id = 0 state = Non-Functional connection_infos = [ << type = Zookeeper name = Zookeeper server_addrs = [ 10.204.216.58:2181, 10.204.216.59:2181, 10.204.216.60:2181, ] status = Down description = >>, << type = Collector name = server_addrs = [ 10.204.216.59:8086, ] status = Down description = Connect to Idle on EvTcpConnectFail >>, ] description = Zookeeper:Zookeeper[], Collector connection down >>, ] >>
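
For illustration only, the behaviour requested above (keep retrying at a fixed interval instead of giving up once a retry policy is exhausted) could look roughly like the sketch below. This is not the contrail-api code: `connect_forever`, `make_client` and `FlakyClient` are hypothetical names, and the client is only assumed to expose `start()` and `close()` methods.

```python
import time


def connect_forever(make_client, retry_interval=5.0, sleep=time.sleep):
    """Retry until a connection attempt succeeds; never give up.

    make_client is a hypothetical zero-argument factory returning an object
    with start() and close() methods. It is a stand-in, not the real client.
    """
    attempt = 0
    while True:
        attempt += 1
        client = make_client()
        try:
            client.start()  # assumed to raise if the servers are unreachable
            return client, attempt
        except Exception:
            # Broad catch for brevity; real code would name the client's errors.
            client.close()  # release resources held by the failed attempt
            sleep(retry_interval)


class FlakyClient:
    """Hypothetical client whose first few start() calls are refused."""

    failures_left = 3
    closed = 0

    def start(self):
        if FlakyClient.failures_left > 0:
            FlakyClient.failures_left -= 1
            raise ConnectionRefusedError("Connection refused")

    def close(self):
        FlakyClient.closed += 1


if __name__ == "__main__":
    client, attempts = connect_forever(FlakyClient, retry_interval=0.01)
    print("connected after", attempts, "attempts")
```

The loop has no attempt limit, so a cluster that comes up hours late is still picked up on the next try. Closing the client after each failed attempt keeps a long outage from accumulating leftover resources.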

Tags: config
Revision history for this message
OpenContrail Admin (ci-admin-f) wrote : [Review update] master

Review in progress for https://review.opencontrail.org/34433
Submitter: Sachin Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R4.0

Review in progress for https://review.opencontrail.org/34444
Submitter: Sachin Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : [Review update] R3.2

Review in progress for https://review.opencontrail.org/34461
Submitter: Sachin Bansal (<email address hidden>)

OpenContrail Admin (ci-admin-f) wrote : A change has been merged

Reviewed: https://review.opencontrail.org/34433
Committed: http://github.com/Juniper/contrail-controller/commit/c05167b4f6d5bdeadefb95f6d42270a7f6d4a2e2
Submitter: Zuul (<email address hidden>)
Branch: master

commit c05167b4f6d5bdeadefb95f6d42270a7f6d4a2e2
Author: Sachin Bansal <email address hidden>
Date: Wed Aug 9 18:07:17 2017 -0700

Must call zkclient.close after every connect attempt

Zookeeper client opens a socketpair and sends a single byte message.
There is no receive on this socket until it is connected successfully.
If it never connects, we will eventually run out of socket buffer space.
For this reason, we should call close() to close these sockets after
each connection attempt.

Change-Id: Iefdc8481d07bf243d70fb8f7e22792f4c65a6cbf
Closes-Bug: 1691541
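
The buffer exhaustion described in this commit message can be reproduced in isolation with the standard library. The sketch below is not the Zookeeper client's code; it only shows the general effect of writing single bytes into a socketpair whose other end is never read. `bytes_until_full` and its `max_writes` safety cap are hypothetical names chosen for this example.

```python
import socket


def bytes_until_full(max_writes=10_000_000):
    """Write single bytes into a socketpair that nobody reads from.

    Returns how many bytes were accepted before the kernel refused more.
    """
    writer, reader = socket.socketpair()
    writer.setblocking(False)  # fail fast instead of hanging when full
    sent = 0
    try:
        while sent < max_writes:
            try:
                sent += writer.send(b"\0")
            except BlockingIOError:
                break  # socket buffer is full: no room for another byte
    finally:
        # Closing both ends frees the buffered data, which is why calling
        # close() after a failed attempt avoids the problem.
        writer.close()
        reader.close()
    return sent


if __name__ == "__main__":
    print("bytes accepted before the buffer filled:", bytes_until_full())
```

The exact count depends on the operating system and its socket buffer settings, so no particular number should be expected. What matters is that the count is finite, and a long enough run of unread writes will eventually hit it.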

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/34444
Committed: http://github.com/Juniper/contrail-controller/commit/4fb7c67320212a97af55e11b5c8a82adec4bf42e
Submitter: Zuul (<email address hidden>)
Branch: R4.0

commit 4fb7c67320212a97af55e11b5c8a82adec4bf42e
Author: Sachin Bansal <email address hidden>
Date: Wed Aug 9 18:07:17 2017 -0700

Must call zkclient.close after every connect attempt

Zookeeper client opens a socketpair and sends a single byte message.
There is no receive on this socket until it is connected successfully.
If it never connects, we will eventually run out of socket buffer space.
For this reason, we should call close() to close these sockets after
each connection attempt.

Change-Id: Iefdc8481d07bf243d70fb8f7e22792f4c65a6cbf
Closes-Bug: 1691541

OpenContrail Admin (ci-admin-f) wrote :

Reviewed: https://review.opencontrail.org/34461
Committed: http://github.com/Juniper/contrail-controller/commit/df3d5ddb9245f4d47bca9bcdba3178061b439c3b
Submitter: Zuul (<email address hidden>)
Branch: R3.2

commit df3d5ddb9245f4d47bca9bcdba3178061b439c3b
Author: Sachin Bansal <email address hidden>
Date: Wed Aug 9 18:07:17 2017 -0700

Must call zkclient.close after every connect attempt

Zookeeper client opens a socketpair and sends a single byte message.
There is no receive on this socket until it is connected successfully.
If it never connects, we will eventually run out of socket buffer space.
For this reason, we should call close() to close these sockets after
each connection attempt.

Change-Id: Iefdc8481d07bf243d70fb8f7e22792f4c65a6cbf
Closes-Bug: 1691541
