DCS causes ZooKeeper to break when more MXOSRVRs are started

Bug #1369042 reported by Chirag Bhalgami
This bug affects 1 person
Affects: Trafodion
Status: Invalid
Importance: Medium
Assigned to: Tharak Capirala

Bug Description

ZooKeeper log (Actual IP address has been replaced with <ip_address>):
-----
1:46:30.809 AM INFO org.apache.zookeeper.server.NIOServerCnxnFactory
Accepted socket connection from /<ip_address>:42030
1:46:30.809 AM INFO org.apache.zookeeper.server.ZooKeeperServer
Client attempting to establish new session at /<ip_address>:42030
1:46:30.826 AM INFO org.apache.zookeeper.server.ZooKeeperServer
Established session 0x1486b5eb8b4056f with negotiated timeout 30000 for client /<ip_address>:42030
1:46:30.837 AM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /<ip_address> - max is 150
1:46:32.290 AM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /<ip_address> - max is 150
1:46:33.766 AM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /<ip_address> - max is 150
1:46:35.732 AM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /<ip_address> - max is 150
1:46:37.612 AM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /<ip_address> - max is 150
1:46:39.601 AM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /<ip_address> - max is 150
1:46:41.198 AM WARN org.apache.zookeeper.server.NIOServerCnxnFactory
Too many connections from /<ip_address> - max is 150
1:46:41.309 AM INFO org.apache.zookeeper.server.PrepRequestProcessor
Processed session termination for sessionid: 0x1486b5eb8b4056f
1:46:41.317 AM INFO org.apache.zookeeper.server.NIOServerCnxn
Closed socket connection for client /<ip_address>:42030 which had sessionid 0x1486b5eb8b4056f
-----

Starting fewer MXOSRVRs does not break ZooKeeper:

Step 1:
more /opt/trafodion/trafodion/dcs-0.9.0/conf/servers
n007 65

Step 2:
/opt/trafodion/trafodion/dcs-0.9.0/bin/stop-dcs.sh
stopping master.
n007: no server to stop because kill -0 of pid 18464 failed with status 1

Step 3:
vi /opt/trafodion/trafodion/dcs-0.9.0/conf/servers
<update with fewer MXOSRVRs>
more /opt/trafodion/trafodion/dcs-0.9.0/conf/servers
n007 32

Step 4:
/opt/trafodion/trafodion/dcs-0.9.0/bin/start-dcs.sh

*** Information about Trafodion/DCS:
------------------------------------
Trafodion Build : trafodion-20140909_0830.tar.gz
DCS Build : dcs-0.9.0

select major_version, minor_version from trafodion."_MD_".versions where version_type = 'METADATA';

MAJOR_VERSION        MINOR_VERSION
-------------------- --------------------
                   2                    3

Contents from /opt/trafodion/trafodion/dcs-0.9.0/conf/dcs-env.sh
export DCS_OPTS="-XX:+UseConcMarkSweepGC"
export DCS_MANAGES_ZK=false
export DCS_USER_PROGRAM_HOME=$MY_SQROOT

*** Additional Details:
-----------------------
Hadoop Distro : Cloudera CDH 4.5.0
ZooKeeper Version : Zookeeper version: 3.4.5-cdh4.5.0--1

Value of maximum Client Connections in ZooKeeper configuration (maxClientCnxns) is set to 150.
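
For a rough sense of how far the configuration above sits from that limit, the sketch below estimates the number of ZooKeeper sessions implied by a conf/servers entry. It assumes one session per MXOSRVR, one DcsServer per listed host, and a single DcsMaster; the function names and the fallback count of 1 are illustrative assumptions, not part of DCS.

```python
MAX_CLIENT_CNXNS = 150  # value of maxClientCnxns reported in this bug


def parse_servers(text):
    """Parse 'host count' lines such as 'n007 65' into (host, count) pairs."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        host = parts[0]
        # Assumption: a line without a count means a single MXOSRVR.
        count = int(parts[1]) if len(parts) > 1 else 1
        entries.append((host, count))
    return entries


def estimate_sessions(entries):
    """Estimate sessions: MXOSRVRs + one DcsServer per host + one DcsMaster."""
    mxosrvrs = sum(count for _, count in entries)
    return mxosrvrs + len(entries) + 1


for servers in ("n007 65", "n007 32"):
    sessions = estimate_sessions(parse_servers(servers))
    print(f"{servers!r}: about {sessions} sessions (limit {MAX_CLIENT_CNXNS})")
```

Under these assumptions both configurations stay below 150, so an estimate like this is only a starting point; the actual per-client connection count on the ZooKeeper server is the number to check.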

>> ulimit -u
100000

>> /usr/sbin/sshd -T | grep -i max
maxauthtries 6
maxsessions 100
clientalivecountmax 3
maxstartups 200:30:200

tags: added: connectivity-dcs
description: updated
Xu Jian (jian-xu5)
Changed in trafodion:
assignee: nobody → Xu Jian (jian-xu5)
Chirag Bhalgami (chirag-bhalgami) wrote :

Additional details:
----------------------------

DCS starts all MXOSRVRs based on the number defined in /opt/trafodion/trafodion/dcs-0.9.0/conf/servers, but the moment connections are made from the third-party app, not all connections get established and a few of them go into the “Connecting” state.

Increasing value of maxClientCnxns from Cloudera Manager (Cloudera Manager > ZooKeeper > Configuration > Server Default Group > maxClientCnxns) didn't help.

This issue occurs on a single-node cluster but NOT on a multi-node cluster.

Matt Brown (mattbrown-2) wrote :

It would be good to have the stats on your ZooKeeper. Is it possible to use one of the "four letter" ZooKeeper commands (see below) so we can see how many connections (sessions) your ZooKeeper has? You should see only 1 connection each for DcsMaster/DcsServer and MXOSRVR. The example below is for a config with a servers file of "localhost 4".

dcs-0.9.0>echo stat | nc <ip address of server> 2182
Zookeeper version: 3.4.5-1392090, built on 09/30/2012 17:52 GMT
Clients:
 /0:0:0:0:0:0:0:1:36057[1](queued=0,recved=167,sent=167) //This is MXOSRVR
 /0:0:0:0:0:0:0:1:36075[1](queued=0,recved=167,sent=167) //This is MXOSRVR
 /127.0.0.1:44862[1](queued=0,recved=50,sent=50) //This is DcsServer
 /<ip address of server>:35503[0](queued=0,recved=1,sent=0) // Unknown
 /0:0:0:0:0:0:0:1:36067[1](queued=0,recved=167,sent=167) //This is MXOSRVR
 /127.0.0.1:44856[1](queued=0,recved=54,sent=62) //This is DcsMaster
 /0:0:0:0:0:0:0:1:36077[1](queued=0,recved=165,sent=165) //This is MXOSRVR

Latency min/avg/max: 0/0/1
Received: 8
Sent: 8
Connections: 7
Outstanding: 0
Zxid: 0x38
Mode: standalone
Node count: 16
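
Since maxClientCnxns is applied per client IP address, it can help to group the client lines of the stat output by address. The following is a minimal sketch that does this for output shaped like the example above; the regular expression and the function names are assumptions made for illustration, and the sample text is the client list from that example with the trailing comments removed.

```python
import re
from collections import Counter

# Matches client lines such as " /127.0.0.1:44862[1](queued=0,...)".
# The greedy address group keeps IPv6 colons and splits on the last ":port".
CLIENT_LINE = re.compile(r"^\s*/(?P<addr>.+):(?P<port>\d+)\[\d+\]\(")

SAMPLE = """Clients:
 /0:0:0:0:0:0:0:1:36057[1](queued=0,recved=167,sent=167)
 /0:0:0:0:0:0:0:1:36075[1](queued=0,recved=167,sent=167)
 /127.0.0.1:44862[1](queued=0,recved=50,sent=50)
 /<ip address of server>:35503[0](queued=0,recved=1,sent=0)
 /0:0:0:0:0:0:0:1:36067[1](queued=0,recved=167,sent=167)
 /127.0.0.1:44856[1](queued=0,recved=54,sent=62)
 /0:0:0:0:0:0:0:1:36077[1](queued=0,recved=165,sent=165)
"""


def connections_per_ip(stat_output):
    """Count client connections per source address in 'stat' output."""
    counts = Counter()
    for line in stat_output.splitlines():
        match = CLIENT_LINE.match(line)
        if match:
            counts[match.group("addr")] += 1
    return counts


def at_limit(counts, max_client_cnxns):
    """Return the addresses whose connection count has reached the limit."""
    return {ip: n for ip, n in counts.items() if n >= max_client_cnxns}


for ip, n in connections_per_ip(SAMPLE).most_common():
    print(f"{ip}: {n}")
```

On a single-node cluster every DCS process connects from the same host, so one address would be expected to carry most of the connections.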

BTW, here are some other very useful ZooKeeper commands:

ZooKeeper Commands: The Four Letter Words
ZooKeeper responds to a small set of commands. Each command is composed of four letters. You issue the commands to ZooKeeper via telnet or nc, at the client port.

dump
Lists the outstanding sessions and ephemeral nodes. This only works on the leader.

envi
Print details about serving environment

kill
Shuts down the server. This must be issued from the machine the ZooKeeper server is running on.

reqs
List outstanding requests

ruok
Tests if server is running in a non-error state. The server will respond with imok if it is running. Otherwise it will not respond at all.

srst
Reset statistics returned by stat command.

stat
Lists statistics about performance and connected clients.

Here's an example of the ruok command:

$ echo ruok | nc 127.0.0.1 5111
imok
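
The same exchange can be scripted without nc. The sketch below sends a four-letter command over a plain TCP socket and reads until the server closes the connection; the function names and the 5-second timeout are illustrative assumptions, not part of ZooKeeper or DCS.

```python
import socket


def exchange(sock, command):
    """Send a command on a connected socket and read the reply until EOF."""
    sock.sendall(command.encode("ascii"))
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:  # the peer closed the connection after replying
            break
        chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def four_letter_word(host, port, command, timeout=5.0):
    """Connect to a ZooKeeper client port and issue one four-letter command."""
    if len(command) != 4:
        raise ValueError("expected a four-letter command such as 'ruok' or 'stat'")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, command)
```

Calling `four_letter_word("127.0.0.1", 5111, "ruok")` corresponds to the nc example above, and passing `"stat"` instead returns the connection listing as text.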

Xu Jian (jian-xu5)
Changed in trafodion:
status: New → In Progress
Changed in trafodion:
importance: High → Medium
Changed in trafodion:
assignee: Xu Jian (jian-xu5) → Tharak Capirala (capirala-tharaknath)
milestone: none → r1.1
Anuradha (anuradha-hegde) wrote :

Not reproducible as per Chirag.

From: Bhalgami, Chirag
Sent: Tuesday, February 03, 2015 8:48 AM
To: Hegde, Anuradha
Subject: RE: bug 1369042

Hi Anu,

I haven’t seen it and I was not able to reproduce it later... So you can close it.

Thanks,
- Chirag

Changed in trafodion:
status: In Progress → Invalid