Sqstart following ckillall stalls due to orphaned Dcsserver and Dcsmaster

Bug #1411475 reported by Sean Broeder
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Fix Released
Medium
Matt Brown

Bug Description

Hi Sean,

Please see below for output of DcsServer log with embedded commentary. Config is 1 DcsServer with 4 child mxosrvrs. I tried the sqstart->ckillall->sqstart sequence. Net of this is a subsequent sqstart won’t succeed since DcsServer is busy trying to restart its child mxosrvrs. I think you should file Launchpad bug for this and assign to me. I have a fix in DCS to “sqcheck” Trafodion to make sure it’s running prior to launching any mxosrvrs. With it the sequence works without problem.

Matt

<matt> This is the beginning of DcsServer process startup
Thu Jan 15 20:40:52 UTC 2015 Starting server on g4t3016.houston.hp.com
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 515196
max locked memory (kbytes, -l) 49595556
max memory size (kbytes, -m) unlimited
open files (-n) 32767
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 267263
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
2015-01-15 20:40:52,943 INFO org.trafodion.dcs.util.VersionInfo: Dcs 0.9.1
2015-01-15 20:40:52,943 INFO org.trafodion.dcs.util.VersionInfo: Subversion git://g4t3016.houston.hp.com/opt/home/matbrown/git/core/sqf/sql/local_hadoop/src/dcs -r c267963fbd9442ae051041f006b7261a75fc5e2c
2015-01-15 20:40:52,943 INFO org.trafodion.dcs.util.VersionInfo: Compiled by matbrown on Tue Jan 13 21:09:49 UTC 2015
2015-01-15 20:40:54,105 INFO org.trafodion.dcs.server.DcsServer: Connected to ZooKeeper
2015-01-15 20:40:54,152 INFO org.trafodion.dcs.util.DcsNetworkConfiguration: Using local host [g4t3016.houston.hp.com,16.235.163.124]
2015-01-15 20:40:54,152 INFO org.trafodion.dcs.util.DcsNetworkConfiguration: Checking Cloud environment
2015-01-15 20:40:54,155 INFO org.trafodion.dcs.util.DcsNetworkConfiguration: sys_shell.py exec [nova list | grep -v '^+' | grep -w `hostname` | sed 's/.*=\([0-9.]*\), \([0-9.]*\).*$/\1,\2/']
2015-01-15 20:40:54,983 INFO org.trafodion.dcs.script.ScriptManager: Compiling script sys_shell.py
2015-01-15 20:40:57,104 INFO org.trafodion.dcs.util.DcsNetworkConfiguration: sys_shell.py exit code [0], stderr [/bin/sh: nova: command not found]
2015-01-15 20:40:57,104 INFO org.trafodion.dcs.util.DcsNetworkConfiguration: Cloud environment not found
2015-01-15 20:40:57,128 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2015-01-15 20:40:57,166 INFO org.trafodion.dcs.http.HttpServer: Port returned by webServer.getConnectors()[0].getLocalPort() before open() is -1. Opening the listener on 0
2015-01-15 20:40:57,168 INFO org.trafodion.dcs.http.HttpServer: listener.getLocalPort() returned 55485 webServer.getConnectors()[0].getLocalPort() returned 55485
2015-01-15 20:40:57,168 INFO org.trafodion.dcs.http.HttpServer: Jetty bound to port 55485
2015-01-15 20:40:57,168 INFO org.mortbay.log: jetty-6.1.26
2015-01-15 20:40:57,626 INFO org.mortbay.log: Started SelectChannelConnector@0.0.0.0:55485
2015-01-15 20:40:57,634 INFO org.trafodion.dcs.server.ServerManager: User program enabled
2015-01-15 20:40:57,648 INFO org.trafodion.dcs.server.ServerManager: Created znode [/matbrown/dcs/servers/running/g4t3016.houston.hp.com:1:0:1421354457634]
2015-01-15 20:40:57,651 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:1] is not running
2015-01-15 20:40:57,652 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:1 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:57,652 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:2] is not running
2015-01-15 20:40:57,653 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:2 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:57,654 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:3] is not running
2015-01-15 20:40:57,655 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:57,655 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:4] is not running
2015-01-15 20:40:57,655 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:4 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:58,597 INFO org.trafodion.dcs.server.ServerManager: User program exit [0]
2015-01-15 20:40:58,597 INFO org.trafodion.dcs.server.ServerManager: exit code [0]
2015-01-15 20:40:58,597 INFO org.trafodion.dcs.server.ServerManager: User program exit [0]
2015-01-15 20:40:58,597 INFO org.trafodion.dcs.server.ServerManager: exit code [0]
2015-01-15 20:40:58,603 INFO org.trafodion.dcs.server.ServerManager: User program exit [0]
2015-01-15 20:40:58,603 INFO org.trafodion.dcs.server.ServerManager: exit code [0]
2015-01-15 20:40:58,674 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:3] is not running
2015-01-15 20:40:58,674 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:2] is not running
2015-01-15 20:40:58,675 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:4] is not running
2015-01-15 20:40:58,684 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:58,684 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:4 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:58,684 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:2 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:59,555 INFO org.trafodion.dcs.server.ServerManager: User program exit [0]
2015-01-15 20:40:59,555 INFO org.trafodion.dcs.server.ServerManager: exit code [0]
2015-01-15 20:40:59,562 INFO org.trafodion.dcs.server.ServerManager: User program exit [0]
2015-01-15 20:40:59,562 INFO org.trafodion.dcs.server.ServerManager: exit code [0]
2015-01-15 20:40:59,612 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:3] is not running
2015-01-15 20:40:59,618 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:40:59,619 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:2] is not running
2015-01-15 20:40:59,624 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:2 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:41:00,513 INFO org.trafodion.dcs.server.ServerManager: User program exit [0]
2015-01-15 20:41:00,513 INFO org.trafodion.dcs.server.ServerManager: exit code [0]
2015-01-15 20:41:00,566 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:3] is not running
2015-01-15 20:41:00,570 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]

<matt> Everything up to this point is expected as part of startup

/opt/home/matbrown/git/core/sqf>sqcheck
Checking if processes are up.
Checking attempt: 1; user specified max: 2. Execution time in seconds: 0.

The SQ environment is up!

Process Configured Actual Down
------- ---------- ------ ----
DTM 2 2
RMS 4 4
MXOSRVR 4 4

Processing cluster.conf on local host g4t3016.houston.hp.com
[$Z000B10] Shell/shell Version 1.0.1 Release 0.9.1 (Build debug [matbrown], date 13Jan15)
[$Z000B10] %ps
[$Z000B10] NID,PID(os) PRI TYPE STATES NAME PARENT PROGRAM
[$Z000B10] ------------ --- ---- ------- ----------- ----------- ---------------
[$Z000B10] 000,00004934 000 WDG ES--A-- $WDT000 NONE sqwatchdog
[$Z000B10] 000,00004935 000 PSD ES--A-- $PSD000 NONE pstartd
[$Z000B10] 000,00004978 001 DTM ES--A-- $TM0 NONE tm
[$Z000B10] 000,00009264 001 GEN ES--A-- $ZSC000 NONE mxsscp
[$Z000B10] 000,00009296 001 SSMP ES--A-- $ZSM000 NONE mxssmp
[$Z000B10] 000,00010524 001 GEN ES--A-- $Z0008KP NONE mxosrvr
[$Z000B10] 000,00010639 001 GEN ES--A-- $Z0008NZ NONE mxosrvr
[$Z000B10] 000,00010727 001 GEN ES--A-- $Z0008RH NONE mxosrvr
[$Z000B10] 000,00010775 001 GEN ES--A-- $Z0008SV NONE mxosrvr
[$Z000B10] 000,00010900 001 GEN ES--A-- $ZLOBSRV0 NONE mxlobsrvr
[$Z000B10] 000,00013510 001 GEN ES--A-- $Z000B10 NONE shell
[$Z000B10] 001,00004936 000 WDG ES--A-- $WDT001 NONE sqwatchdog
[$Z000B10] 001,00004937 000 PSD ES--A-- $PSD001 NONE pstartd
[$Z000B10] 001,00005195 001 DTM ES--A-- $TM1 NONE tm
[$Z000B10] 001,00009280 001 GEN ES--A-- $ZSC001 NONE mxsscp
[$Z000B10] 001,00009314 001 SSMP ES--A-- $ZSM001 NONE mxssmp
[$Z000B10] 001,00010908 001 GEN ES--A-- $ZLOBSRV1 NONE mxlobsrvr
<matt> Executed “ckillall” here
2015-01-15 20:44:47,688 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:47,688 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:47,688 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [/bin/sh: line 1: 10775 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:47,688 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [/bin/sh: line 1: 10727 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:2 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:47,689 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:47,689 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [/bin/sh: line 1: 10639 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:4 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:47,691 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:47,691 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [/bin/sh: line 1: 10524 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:1 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
<matt> DcsServer recognizes that all have died and tries to restart them
2015-01-15 20:44:47,745 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:2] is not running
2015-01-15 20:44:47,746 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:1] is not running
2015-01-15 20:44:47,746 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:3] is not running
2015-01-15 20:44:47,747 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:4] is not running
2015-01-15 20:44:47,758 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:2 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:47,763 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:1 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:47,770 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:47,770 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:4 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
<matt> Again they start to fail
2015-01-15 20:44:48,862 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:48,862 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:48,862 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [Fatal error ($<unkwn>-17040-17040): process-startup: error returned by monitor, fserr=53/bin/sh: line 1: 17040 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:48,862 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [Fatal error ($<unkwn>-17035-17035): process-startup: error returned by monitor, fserr=53/bin/sh: line 1: 17035 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:4 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:48,863 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:48,863 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [Fatal error ($<unkwn>-17024-17024): process-startup: error returned by monitor, fserr=53/bin/sh: line 1: 17024 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:1 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:48,863 INFO org.trafodion.dcs.server.ServerManager: User program exit [137]
2015-01-15 20:44:48,863 INFO org.trafodion.dcs.server.ServerManager: exit code [137], stderr [Fatal error ($<unkwn>-17009-17009): process-startup: error returned by monitor, fserr=53/bin/sh: line 1: 17009 Killed mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:2 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:48,864 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:3] is not running
2015-01-15 20:44:48,864 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:4] is not running
2015-01-15 20:44:48,865 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:3 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:48,865 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:1] is not running
2015-01-15 20:44:48,865 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:4 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:48,865 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:2] is not running
2015-01-15 20:44:48,865 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:1 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:48,865 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/matbrown/git/core/sqf;. sqenv.sh;mxosrvr -ZKHOST localhost:32570 -RZ g4t3016.houston.hp.com:1:2 -ZKPNODE "/matbrown" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.163.124 -MAXHEAPPCT 0 -STATISTICSINTERVAL 60 -STATISTICSLIMIT 60 -STATISTICSTYPE aggregated -STATISTICSENABLE false -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-15 20:44:51,076 INFO org.trafodion.dcs.server.ServerManager: User program exit [3]
2015-01-15 20:44:51,077 INFO org.trafodion.dcs.server.ServerManager: exit code [3], stderr [Fatal error ($<unkwn>-17277-17277): process-startup: error returned by monitor, fserr=53]
2015-01-15 20:44:51,077 INFO org.trafodion.dcs.server.ServerManager: User program exit [3]
2015-01-15 20:44:51,077 INFO org.trafodion.dcs.server.ServerManager: exit code [3], stderr [Fatal error ($<unkwn>-17268-17268): process-startup: error returned by monitor, fserr=53]
2015-01-15 20:44:51,077 ERROR org.trafodion.dcs.server.ServerManager: Trafodion is not running
2015-01-15 20:44:51,077 ERROR org.trafodion.dcs.server.ServerManager: Trafodion is not running
2015-01-15 20:44:51,078 INFO org.trafodion.dcs.server.ServerManager: User program exit [3]
2015-01-15 20:44:51,078 INFO org.trafodion.dcs.server.ServerManager: exit code [3], stderr [Fatal error ($<unkwn>-17270-17270): process-startup: error returned by monitor, fserr=53]
2015-01-15 20:44:51,078 ERROR org.trafodion.dcs.server.ServerManager: Trafodion is not running
2015-01-15 20:44:51,078 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:2] is not running
2015-01-15 20:44:51,078 INFO org.trafodion.dcs.server.ServerManager: Server handler [1:1] is not running
2015-01-15 20:44:51,079 INFO org.trafodion.dcs.server.ServerManager: User program exit [3]
2015-01-15 20:44:51,079 INFO org.trafodion.dcs.server.ServerManager: exit code [3], stderr [Fatal error ($<unkwn>-17273-17273): process-startup: error returned by monitor, fserr=53]
2015-01-15 20:44:51,079 ERROR org.trafodion.dcs.server.ServerManager: Trafodion is not running
<matt> This just goes on and on

-Matt

From: Broeder, Sean
Sent: Thursday, January 15, 2015 11:32 AM
To: Brown, Matt (Trafodion); Hegde, Anuradha; Capirala, Tharaknath; Narain, Arvind
Subject: RE: orphaned DcsMaster and DcsServer

Hi Matt,
Would it be possible to emit these messages when encountered in sqstart so that people don’t wonder why it is hanging? I don’t recall seeing them.

Thanks,
Sean

master 17274. Stop it first.
localhost: server 17350. Stop it first.

From: Brown, Matt (Trafodion)
Sent: Thursday, January 15, 2015 10:16 AM
To: Broeder, Sean; Hegde, Anuradha; Capirala, Tharaknath; Narain, Arvind
Subject: RE: orphaned DcsMaster and DcsServer

Hi

DCS should be thought of like the underlying Cloudera/HortonWorks HBase install in that DcsMaster/DcsServers may continue to run without Trafodion. Mxosrvrs spawned by DcsServer die when they detect, in zookeeper, that some ephemeral DCS specific znodes have disappeared.

It's DcsServer job to restart any dead mxosrvr. That logic contains delays between attempts so we don’t hammer the cluster.
When sqstart is performed it also tries to start DCS. You should see an error as DcsMaster/DcsServer are already running e.g. below I try to start DCS on my workstation when it's already running.

/opt/home/matbrown/git/dcs-0.9.1>jps | grep Dcs
30614 DcsQuorumPeer
17274 DcsMaster
17350 DcsServer
/opt/home/matbrown/git/dcs-0.9.1>bin/start-dcs.sh --config myconf
master 17274. Stop it first.
localhost: server 17350. Stop it first.

We try very hard not to couple DCS in Trafodion scripts. I’d rather we fix any bugs we find that prevent mxosrvr from continuing to run or shutting down gracefully when something bad happens in Trafodion.

Hope this helps

-----Original Message-----
From: Broeder, Sean
Sent: Thursday, January 15, 2015 9:53 AM
To: Hegde, Anuradha; Capirala, Tharaknath; Narain, Arvind; Brown, Matt (Trafodion)
Subject: RE: orphaned DcsMaster and DcsServer

Hi Anu,
I have not tried using Dcstop. My use of ckillall is to try to emulate a system crash, so stopping processes gracefully doesn't fit into that paradigm very well, but I can give it a try.

Thanks,
Sean

-----Original Message-----
From: Hegde, Anuradha
Sent: Thursday, January 15, 2015 9:44 AM
To: Broeder, Sean; Capirala, Tharaknath; Narain, Arvind; Brown, Matt (Trafodion)
Subject: RE: orphaned DcsMaster and DcsServer

Hi Sean,
   Have u tried using dcstop command ? That works most of the time.

We have had this lingering mxosrvr problem where mxosrvrs won't go away even with sqstop after establishing a connection Just yesterday Tharak delivered a fix for it and it is in gate check currently..

I don’t think we need to include kill of all those mxosrvr processes in ckillall script. Others might have a different opinion...

Anu
-----Original Message-----
From: Broeder, Sean
Sent: Thursday, January 15, 2015 9:37 AM
To: Hegde, Anuradha; Capirala, Tharaknath; Narain, Arvind
Subject: orphaned DcsMaster and DcsServer

Hi,
I have noticed that often when my environment crashes or if I issue a ckillall the DcsMaster and DcsServer processes are left running. A subsequest sqstart seems to cause a system hang and if I remember correctly mxosrvr cores. Certainly I see gdb very active for quite some time. If I kill the DcsMaster and DcsServer processes there is no issue at all.

Do you think it would be appropriate/possible to do a jps and kill these processes as part of ckillall?

Thanks,
Sean

Matt Brown (mattbrown-2)
Changed in trafodion:
status: New → Confirmed
importance: Undecided → Medium
milestone: none → r1.1
Matt Brown (mattbrown-2)
Changed in trafodion:
status: Confirmed → In Progress
Revision history for this message
Trafodion-Gerrit (neo-devtools) wrote : Fix proposed to dcs (master)

Fix proposed to branch: master
Review: https://review.trafodion.org/1164

Revision history for this message
Trafodion-Gerrit (neo-devtools) wrote : Fix merged to dcs (master)

Reviewed: https://review.trafodion.org/1164
Committed: https://github.com/trafodion/dcs/commit/fbdd97e918f28631a8efcf556064fef117c05457
Submitter: Trafodion Jenkins
Branch: master

commit fbdd97e918f28631a8efcf556064fef117c05457
Author: matbrown <email address hidden>
Date: Fri Feb 20 21:10:21 2015 +0000

    DcsServer checks if Trafodion is running before mxosrvr restart

    DcsServer aggresively tried to restart its child mxosrvrs.
    In scenarios where Trafodion had been stopped abruptly e.g.,
    when "ckillall" command was executed, it would cause many
    mxosrvr processes to start then quickly die. This change
    adds check to ensure that Trafodion is running before ever
    attempting a restart of any mxosrvr.

    Change-Id: If430ce94e6071e402786d73d544d2ab23306d1fe
    Closes-Bug: #1411475

Changed in trafodion:
status: In Progress → Fix Committed
Matt Brown (mattbrown-2)
Changed in trafodion:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.