mxosrvrs that don't start sometimes during dcsstart due to port not available, start up during dcsstop and look for non-existent DCSMaster

Bug #1414188 reported by Aruna Sadashiva
This bug affects 1 person
Affects: Trafodion
Status: Fix Released
Importance: High
Assigned to: Anuradha
Milestone: r1.1

Bug Description

Frequently, when we have a large number of mxosrvrs (I had 96 per node), not all mxosrvrs come up during dcsstart. This seems to happen when the instance has been stopped and restarted a few times. I have to use ckillall to clean up.

The mxosrvrs that did not start with dcsstart started once dcsstop was run and the ports were freed up. They then looked for the DCSMaster, which was already gone.
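
To check whether a given port is still held before rerunning dcsstart, a probe like the following may help. This is only a minimal sketch: `port_is_free` is a hypothetical helper, not part of Trafodion or DCS, and it assumes a plain TCP bind on the local host is a fair stand-in for what mxosrvr needs.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can currently bind to (host, port).

    Hypothetical helper for triage only; it says nothing about how
    mxosrvr itself selects or binds its ports.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            # Typically EADDRINUSE: another process still holds the port.
            return False
    return True
```

Running this over the port range configured for the node, before and after dcsstop, would show whether ports are being released late.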

Took a gcore of one of the mxosrvr processes that was not stopping:

Core was generated by `mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0015.houston.hp.com:3:61 -Z'.
#0 0x00007ffff4afc4cd in accept () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install boost-filesystem-1.41.0-11.el6_1.2.x86_64 boost-program-options-1.41.0-11.el6_1.2.x86_64 boost-system-1.41.0-11.el6_1.2.x86_64 cyrus-sasl-lib-2.1.23-13.el6.x86_64 glibc-2.12-1.107.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.9-33.el6.x86_64 libcom_err-1.41.12-12.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 libuuid-2.17.2-12.7.el6.x86_64 nspr-4.9.2-1.el6.x86_64 nss-3.14.0.0-12.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.14.0.0-2.el6.x86_64 openldap-2.4.23-26.el6.x86_64 openssl-1.0.0-20.el6_2.5.x86_64 qpid-cpp-client-0.14-22.el6_3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0 0x00007ffff4afc4cd in accept () from /lib64/libc.so.6
#1 0x00007ffff73639c5 in SB_Trans::Sock_Listener::accept (this=0xeafe40) at sock.cpp:618
#2 0x00007ffff736a516 in SB_Trans::Sock_Stream_Accept_Thread::run (this=0xeb0de0) at sockstream.cpp:2085
#3 0x00007ffff736a243 in sock_stream_accept_thread_fun (pp_arg=0xeb0de0) at sockstream.cpp:2019
#4 0x00007ffff47f5b9f in SB_Thread::Thread::disp (this=0xeb0de0, pp_arg=0xeb0de0) at thread.cpp:211
#5 0x00007ffff47f5ff7 in thread_fun (pp_arg=0xeb0de0) at thread.cpp:307
#6 0x00007ffff47f9290 in sb_thread_sthr_disp (pp_arg=0xeb0f10) at threadl.cpp:253
#7 0x00007ffff45c5851 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt

Thread 6 (Thread 0x7ffff7fba7e0 (LWP 13244)):
#0 0x00007ffff45c943c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ffff63c4f73 in wait_sync_completion (sc=0xeb6640) at src/mt_adaptor.c:85
#2 0x00007ffff63bac80 in zoo_wexists (zh=0xeb6120, path=0xeb6848 "/trafodion/dcs/master", watcher=0, watcherCtx=0x0,
    stat=0x7fffffff3550) at src/zookeeper.c:3516
#3 0x00000000004dc204 in main (argc=29, argv=0x7fffffff3948, envp=<value optimized out>) at SrvrMain.cpp:361

Thread 5 (Thread 0x7fffe5667700 (LWP 13249)):
#0 0x00007ffff45c943c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ffff63c526b in do_completion (v=0xeb6120) at src/mt_adaptor.c:463
#2 0x00007ffff45c5851 in start_thread () from /lib64/libpthread.so.0
#3 0x00007ffff4afb90d in clone () from /lib64/libc.so.6

Thread 4 (Thread 0x7fffe6068700 (LWP 13248)):
#0 0x00007ffff4af2253 in poll () from /lib64/libc.so.6
#1 0x00007ffff63c5482 in do_io (v=0xeb6120) at src/mt_adaptor.c:387
#2 0x00007ffff45c5851 in start_thread () from /lib64/libpthread.so.0
#3 0x00007ffff4afb90d in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x7fffe6a69700 (LWP 13247)):
#0 0x00007ffff4a4699d in sigtimedwait () from /lib64/libc.so.6
#1 0x00007ffff73285f4 in local_monitor_reader (pp_arg=0x7af2) at ../../../monitor/linux/clio.cxx:130
#2 0x00007ffff45c5851 in start_thread () from /lib64/libpthread.so.0
#3 0x00007ffff4afb90d in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x7fffe938d700 (LWP 13246)):
#0 0x00007ffff45c943c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00007ffff47f8656 in SB_Thread::CV::wait (this=0xeb1798)
    at /home/jenkins/workspace/traf-pub-release/trafodion/core/sqf/export/include/seabed/int/thread.inl:552
#2 0x00007ffff47f8732 in SB_Thread::CV::wait (this=0xeb1798, pv_lock=false)
    at /home/jenkins/workspace/traf-pub-release/trafodion/core/sqf/export/include/seabed/int/thread.inl:591
#3 0x00007ffff735cc77 in SB_Sig_Queue::remove (this=0xeb1740) at queue.inl:473
#4 0x00007ffff736a9a7 in SB_Trans::Sock_Stream_Helper_Thread::run (this=0xeb11f0) at sockstream.cpp:2179
#5 0x00007ffff736a26a in sock_helper_thread_fun (pp_arg=0xeb11f0) at sockstream.cpp:2029
#6 0x00007ffff47f5b9f in SB_Thread::Thread::disp (this=0xeb11f0, pp_arg=0xeb11f0) at thread.cpp:211
#7 0x00007ffff47f5ff7 in thread_fun (pp_arg=0xeb11f0) at thread.cpp:307
#8 0x00007ffff47f9290 in sb_thread_sthr_disp (pp_arg=0xeb1900) at threadl.cpp:253
---Type <return> to continue, or q <return> to quit---
#9 0x00007ffff45c5851 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff4afb90d in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7fffe9d8e700 (LWP 13245)):
#0 0x00007ffff4afc4cd in accept () from /lib64/libc.so.6
#1 0x00007ffff73639c5 in SB_Trans::Sock_Listener::accept (this=0xeafe40) at sock.cpp:618
#2 0x00007ffff736a516 in SB_Trans::Sock_Stream_Accept_Thread::run (this=0xeb0de0) at sockstream.cpp:2085
#3 0x00007ffff736a243 in sock_stream_accept_thread_fun (pp_arg=0xeb0de0) at sockstream.cpp:2019
#4 0x00007ffff47f5b9f in SB_Thread::Thread::disp (this=0xeb0de0, pp_arg=0xeb0de0) at thread.cpp:211
#5 0x00007ffff47f5ff7 in thread_fun (pp_arg=0xeb0de0) at thread.cpp:307
#6 0x00007ffff47f9290 in sb_thread_sthr_disp (pp_arg=0xeb0f10) at threadl.cpp:253
#7 0x00007ffff45c5851 in start_thread () from /lib64/libpthread.so.0
#8 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
(gdb)
(gdb) q

The server log has messages like these:

2015-01-22 22:45:26,522 INFO org.trafodion.dcs.server.ServerManager: Server handler [3:10] is not running
2015-01-22 22:45:26,522 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/trafodion/traf_jan18;. sqenv.sh;mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0015.houston.hp.com:3:10 -ZKPNODE "/trafodion" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,523 INFO org.trafodion.dcs.server.ServerManager: Server handler [3:11] is not running
2015-01-22 22:45:26,524 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/trafodion/traf_jan18;. sqenv.sh;mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0015.houston.hp.com:3:11 -ZKPNODE "/trafodion" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,525 INFO org.trafodion.dcs.server.ServerManager: Server handler [3:12] is not running
2015-01-22 22:45:26,525 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/trafodion/traf_jan18;. sqenv.sh;mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0015.houston.hp.com:3:12 -ZKPNODE "/trafodion" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,526 INFO org.trafodion.dcs.server.ServerManager: Server handler [3:13] is not running
2015-01-22 22:45:26,527 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/trafodion/traf_jan18;. sqenv.sh;mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0015.houston.hp.com:3:13 -ZKPNODE "/trafodion" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,527 INFO org.trafodion.dcs.server.ServerManager: Server handler [3:14] is not running
2015-01-22 22:45:26,528 INFO org.trafodion.dcs.server.ServerManager: User program exec [cd /opt/home/trafodion/traf_jan18;. sqenv.sh;mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0015.houston.hp.com:3:14 -ZKPNODE "/trafodion" -CNGTO 60 -ZKSTO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
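
With 96 mxosrvrs per node, it may be easier to pull the affected handlers out of the server log with a script than to read it by eye. The sketch below matches only the "Server handler [...] is not running" lines shown above. `handlers_not_running` is a hypothetical helper, and the report does not say what the two numbers in the brackets mean, so the code returns them as a plain pair.

```python
import re

# Matches lines such as:
#   ... ServerManager: Server handler [3:10] is not running
HANDLER_DOWN = re.compile(r"Server handler \[(\d+):(\d+)\] is not running")


def handlers_not_running(log_lines):
    """Return the bracketed (first, second) number pairs, in log order.

    Hypothetical triage helper; the meaning of the two fields is not
    stated in the report, so no interpretation is applied here.
    """
    found = []
    for line in log_lines:
        match = HANDLER_DOWN.search(line)
        if match:
            found.append((int(match.group(1)), int(match.group(2))))
    return found


sample = [
    "2015-01-22 22:45:26,522 INFO org.trafodion.dcs.server.ServerManager: "
    "Server handler [3:10] is not running",
    "2015-01-22 22:45:26,523 INFO org.trafodion.dcs.server.ServerManager: "
    "Server handler [3:11] is not running",
]
print(handlers_not_running(sample))
```

Counting the returned pairs per restart attempt would show how many handlers the ServerManager is trying to relaunch at once.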

summary: mxosrvrs that dont start sometimes during dcsstart due to port not
- available, start up during dcsstart and look for non existent DCSMaster
+ available, start up during dcsstop and look for non existent DCSMaster
Anuradha (anuradha-hegde) wrote :

Aruna, Is this still an issue?

Changed in trafodion:
milestone: none → r1.1
Changed in trafodion:
assignee: nobody → Anuradha (anuradha-hegde)
Matt Brown (mattbrown-2) wrote :

I recently committed a fix for #1411475, "Sqstart following ckillall stalls due to orphaned Dcsserver and Dcsmaster". That was a case where ckillall was executed and DcsServer aggressively tried to restart mxosrvrs. If mxosrvr starts and cannot find the znode for DcsMaster, it should die. The retry logic of DcsServer should then kick in if Trafodion is running. Let's discuss this.
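
The "die if the DcsMaster znode is missing" behavior could be sketched as a bounded wait. In the backtrace in the bug description, the main thread is blocked in zoo_wexists on "/trafodion/dcs/master". The code below does not use the ZooKeeper API and is not the committed fix: `wait_for_master` and the `master_exists` callable are hypothetical names standing in for whatever existence check the server performs.

```python
import time


def wait_for_master(master_exists, timeout_secs, poll_secs=1.0,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll master_exists() until it returns True or the timeout expires.

    master_exists is a hypothetical zero-argument callable; in a real
    server it would wrap a lookup of the master znode. Returns True if
    the master was seen, False if the deadline passed first.
    """
    deadline = clock() + timeout_secs
    while True:
        if master_exists():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_secs)
```

A caller would exit with a nonzero status when this returns False, leaving the restart decision to DcsServer's retry logic instead of leaving a process parked with no master to register with.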

Aruna Sadashiva (aruna-sadashiva) wrote :

I have not seen this issue in recent builds, so I am closing it for now.

Changed in trafodion:
status: New → Fix Released
Aruna Sadashiva (aruna-sadashiva) wrote :

Have a feeling the fix for non-standard domain names might have fixed this issue.
