mxosrvrs that sometimes do not start during dcsstart because a port is not available start up during dcsstop and look for a nonexistent DCSMaster
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Trafodion | Fix Released | High | Anuradha | | |
Bug Description
Frequently, when we have a large number of mxosrvrs (I had 96 per node), not all of them come up during dcsstart. This seems to happen after the instance has been stopped and restarted a few times. I then have to use ckillall.
The mxosrvrs that did not start during dcsstart started once dcsstop had run and the ports were freed. They then looked for the DCSMaster, which was already gone.
I took a gcore of one of the mxosrvr processes that was not stopping:
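The failure mode described above (a server that keeps waiting for a busy port and only gets it after shutdown has begun) can be illustrated with a bounded bind. This is a minimal sketch, not Trafodion code: `bind_with_deadline` and its parameters are hypothetical names, and the idea that a negative timeout means "wait forever" is an assumption suggested only by the `-PORTBINDTOSECS -1` argument visible in the server log.

```python
import socket
import time


def bind_with_deadline(host, port, timeout_secs, retry_interval=0.05):
    """Try to bind and listen on (host, port), retrying until a deadline.

    Returns the listening socket, or None if the port never became free.
    A negative timeout_secs means "retry forever" (hypothetical convention).
    """
    deadline = None if timeout_secs < 0 else time.monotonic() + timeout_secs
    while True:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            sock.listen(1)
            return sock
        except OSError:
            # Port is still in use (or otherwise unavailable): release the
            # socket and decide whether to give up or try again.
            sock.close()
            if deadline is not None and time.monotonic() >= deadline:
                return None
            time.sleep(retry_interval)
```

With a finite timeout, a server that cannot get its port gives up and exits instead of lingering until a later stop frees the port.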
Core was generated by `mxosrvr -ZKHOST n013:2181,
#0 0x00007ffff4afc4cd in accept () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install boost-filesyste
(gdb) bt
#0 0x00007ffff4afc4cd in accept () from /lib64/libc.so.6
#1 0x00007ffff73639c5 in SB_Trans:
#2 0x00007ffff736a516 in SB_Trans:
#3 0x00007ffff736a243 in sock_stream_
#4 0x00007ffff47f5b9f in SB_Thread:
#5 0x00007ffff47f5ff7 in thread_fun (pp_arg=0xeb0de0) at thread.cpp:307
#6 0x00007ffff47f9290 in sb_thread_sthr_disp (pp_arg=0xeb0f10) at threadl.cpp:253
#7 0x00007ffff45c5851 in start_thread () from /lib64/
#8 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
(gdb) thread apply all bt
Thread 6 (Thread 0x7ffff7fba7e0 (LWP 13244)):
#0 0x00007ffff45c943c in pthread_
#1 0x00007ffff63c4f73 in wait_sync_
#2 0x00007ffff63bac80 in zoo_wexists (zh=0xeb6120, path=0xeb6848 "/trafodion/
stat=
#3 0x00000000004dc204 in main (argc=29, argv=0x7fffffff
Thread 5 (Thread 0x7fffe5667700 (LWP 13249)):
#0 0x00007ffff45c943c in pthread_
#1 0x00007ffff63c526b in do_completion (v=0xeb6120) at src/mt_
#2 0x00007ffff45c5851 in start_thread () from /lib64/
#3 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7fffe6068700 (LWP 13248)):
#0 0x00007ffff4af2253 in poll () from /lib64/libc.so.6
#1 0x00007ffff63c5482 in do_io (v=0xeb6120) at src/mt_
#2 0x00007ffff45c5851 in start_thread () from /lib64/
#3 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7fffe6a69700 (LWP 13247)):
#0 0x00007ffff4a4699d in sigtimedwait () from /lib64/libc.so.6
#1 0x00007ffff73285f4 in local_monitor_
#2 0x00007ffff45c5851 in start_thread () from /lib64/
#3 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fffe938d700 (LWP 13246)):
#0 0x00007ffff45c943c in pthread_
#1 0x00007ffff47f8656 in SB_Thread::CV::wait (this=0xeb1798)
at /home/jenkins/
#2 0x00007ffff47f8732 in SB_Thread::CV::wait (this=0xeb1798, pv_lock=false)
at /home/jenkins/
#3 0x00007ffff735cc77 in SB_Sig_
#4 0x00007ffff736a9a7 in SB_Trans:
#5 0x00007ffff736a26a in sock_helper_
#6 0x00007ffff47f5b9f in SB_Thread:
#7 0x00007ffff47f5ff7 in thread_fun (pp_arg=0xeb11f0) at thread.cpp:307
#8 0x00007ffff47f9290 in sb_thread_sthr_disp (pp_arg=0xeb1900) at threadl.cpp:253
---Type <return> to continue, or q <return> to quit---
#9 0x00007ffff45c5851 in start_thread () from /lib64/
#10 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fffe9d8e700 (LWP 13245)):
#0 0x00007ffff4afc4cd in accept () from /lib64/libc.so.6
#1 0x00007ffff73639c5 in SB_Trans:
#2 0x00007ffff736a516 in SB_Trans:
#3 0x00007ffff736a243 in sock_stream_
#4 0x00007ffff47f5b9f in SB_Thread:
#5 0x00007ffff47f5ff7 in thread_fun (pp_arg=0xeb0de0) at thread.cpp:307
#6 0x00007ffff47f9290 in sb_thread_sthr_disp (pp_arg=0xeb0f10) at threadl.cpp:253
#7 0x00007ffff45c5851 in start_thread () from /lib64/
#8 0x00007ffff4afb90d in clone () from /lib64/libc.so.6
(gdb)
(gdb) q
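In the trace above, Thread 6 shows `main` blocked inside `zoo_wexists`, which fits a process waiting on a ZooKeeper node that may never appear. A minimal sketch of a bounded wait follows. It is not the actual fix: `wait_for_master` is a hypothetical helper, and `master_exists` stands in for whatever ZooKeeper existence check the real code performs.

```python
import time


def wait_for_master(master_exists, timeout_secs, poll_interval=0.05):
    """Poll master_exists() until it returns True or the timeout expires.

    master_exists is a caller-supplied callable (hypothetical stand-in for
    a ZooKeeper existence check). Returns True if the master was seen,
    False if the deadline passed first.
    """
    deadline = time.monotonic() + timeout_secs
    while True:
        if master_exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
```

A caller could exit when this returns False, so a late-starting server does not stay alive after the master has been stopped.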
The server log has messages like these:
2015-01-22 22:45:26,522 INFO org.trafodion.
2015-01-22 22:45:26,522 INFO org.trafodion.
qenv.sh;mxosrvr -ZKHOST n013:2181,
STO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,523 INFO org.trafodion.
2015-01-22 22:45:26,524 INFO org.trafodion.
qenv.sh;mxosrvr -ZKHOST n013:2181,
STO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,525 INFO org.trafodion.
2015-01-22 22:45:26,525 INFO org.trafodion.
qenv.sh;mxosrvr -ZKHOST n013:2181,
STO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,526 INFO org.trafodion.
2015-01-22 22:45:26,527 INFO org.trafodion.
qenv.sh;mxosrvr -ZKHOST n013:2181,
STO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
2015-01-22 22:45:26,527 INFO org.trafodion.
2015-01-22 22:45:26,528 INFO org.trafodion.
qenv.sh;mxosrvr -ZKHOST n013:2181,
STO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 -STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated -STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]
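When comparing many of these launch lines, it can help to pull the `-FLAG value` pairs out of the visible part of the command. The helper below is a rough sketch for triage only: `parse_flags` is a hypothetical name, it handles just the upper-case `-FLAG value` shape seen in these log lines, and it makes no claim about what any flag means (reading `PORTBINDTOSECS` as a port-bind timeout in seconds is an assumption).

```python
import re


def parse_flags(fragment):
    """Extract -FLAG value pairs from a logged mxosrvr command fragment.

    Only tokens of the form -UPPERCASE are treated as flags, so values
    such as "-1" are kept as values. Tokens that do not follow a flag
    (for example a truncated leading word) are skipped.
    """
    tokens = fragment.strip().rstrip("]").split()
    flags = {}
    i = 0
    while i < len(tokens) - 1:
        if re.fullmatch(r"-[A-Z]+", tokens[i]):
            flags[tokens[i][1:]] = tokens[i + 1]
            i += 2
        else:
            i += 1
    return flags


# Fragment copied from the log above; the start of the command is truncated there.
sample = ("STO 180 -EADSCO 0 -TCPADD 16.235.158.30 -MAXHEAPPCT 0 "
          "-STATISTICSINTERVAL 5 -STATISTICSLIMIT 5 -STATISTICSTYPE aggregated "
          "-STATISTICSENABLE true -PORTMAPTOSECS -1 -PORTBINDTOSECS -1]")
flags = parse_flags(sample)
```

Here `flags["PORTBINDTOSECS"]` and `flags["PORTMAPTOSECS"]` both come back as the string `"-1"`, which makes it easy to confirm that every launched server was given the same timeout settings.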
summary: |
- mxosrvrs that dont start sometimes during dcsstart due to port not available, start up during dcsstart and look for non existent DCSMaster
+ mxosrvrs that dont start sometimes during dcsstart due to port not available, start up during dcsstop and look for non existent DCSMaster |
Changed in trafodion: | |
assignee: | nobody → Anuradha (anuradha-hegde) |
Aruna, is this still an issue?