Before fetching resultset of query, if idle timeout expires, mxosrvr cores. There is also an mxssmp core created just before the mxosrvr core most of the time, I did not see the mxssmp core when stats was enabled. To recreate, connect, prepare/execute, but don't fetch the data. After about 15 minutes, server times out and this core is generated.
Core was generated by `mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0014.houston.hp.com:2:87 -'.
Program terminated with signal 6, Aborted.
#0 0x00007ffff4a458a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install boost-filesystem-1.41.0-11.el6_1.2.x86_64 boost-program-options-1.41.0-11.el6_1.2.x86_64 boost-system-1.41.0-11.el6_1.2.x86_64 cyrus-sasl-lib-2.1.23-13.el6.x86_64 glibc-2.12-1.107.el6.x86_64 hadoop-2.3.0+cdh5.1.3+824-1.cdh5.1.3.p0.13.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.9-33.el6.x86_64 libcom_err-1.41.12-12.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 libuuid-2.17.2-12.7.el6.x86_64 nspr-4.9.2-1.el6.x86_64 nss-3.14.0.0-12.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.14.0.0-2.el6.x86_64 openldap-2.4.23-26.el6.x86_64 openssl-1.0.0-20.el6_2.5.x86_64 qpid-cpp-client-0.14-22.el6_3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0 0x00007ffff4a458a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff4a4700d in abort () from /lib64/libc.so.6
#2 0x00007ffff11ab7f0 in assert_botch_abend (
f=0x7ffff439087f "../cli/Statement.cpp", l=982,
m=0x7ffff4390620 "Timeout waiting for control broker.",
c=<value optimized out>) at ../export/NAAbort.cpp:282
#3 0x00007ffff43741ca in CliStatement::releaseTransaction (
this=0x7fffcfcc48d0, allWorkRequests=1,
alwaysSendReleaseMsg=<value optimized out>,
statementRemainsOpen=<value optimized out>) at ../cli/Statement.cpp:982
#4 0x00007ffff4375a9c in CliStatement::close (this=<value optimized out>,
diagsArea=..., inRollback=0) at ../cli/Statement.cpp:1214
#5 0x00007ffff43251d7 in SQLCLI_PerformTasks(CliGlobals *, ULng32, SQLSTMT_ID *, SQLDESC_ID *, SQLDESC_ID *, Lng32, Lng32, typedef __va_list_tag __va_list_tag *, SQLCLI_PTR_PAIRS *, SQLCLI_PTR_PAIRS *) (cliGlobals=0xeead40, tasks=1800,
statement_id=0x4887008, input_descriptor=0x0, output_descriptor=0x0,
num_input_ptr_pairs=0, num_output_ptr_pairs=0, ap=0x0,
input_ptr_pairs=0x0, output_ptr_pairs=0x0) at ../cli/Cli.cpp:3456
#6 0x00007ffff4325a74 in SQLCLI_CloseStmt (cliGlobals=<value optimized out>,
statement_id=<value optimized out>) at ../cli/Cli.cpp:3509
#7 0x00007ffff438b697 in SQL_EXEC_CloseStmt (statement_id=0x4887008)
at ../cli/CliExtern.cpp:1410
#8 0x00007ffff6874028 in SRVR::releaseCachedObject (
internalStmt=<value optimized out>, mxsrvr_substate=NDCS_CONN_IDLE)
at srvrcommon.cpp:673
#9 0x00000000004c5366 in SRVR::connIdleTimerExpired (
timer_tag=<value optimized out>) at SrvrConnect.cpp:3995
#10 0x000000000042b442 in BUILD_TIMER_MSG_CALL (
call_id_=<value optimized out>, request=<value optimized out>,
countRead=<value optimized out>, receive_info=<value optimized out>)
at ../Common/FileSystemSrvr.cpp:598
#11 0x00000000004436ce in CNSKListener::CheckReceiveMessage (this=0xda6240,
cc=@0x7fffffff0764, countRead=16, call_id=<value optimized out>)
at ../Common/Listener.cpp:269
---Type <return> to continue, or q <return> to quit---
#12 0x0000000000462dee in CNSKListenerSrvr::runProgram (this=0xda6240,
TcpProcessName=<value optimized out>, port=<value optimized out>,
TransportTrace=<value optimized out>)
at Interface/linux/Listener_srvr_ps.cpp:494
#13 0x00000000004dcc8d in main (argc=29, argv=0x7fffffff3948,
envp=<value optimized out>) at SrvrMain.cpp:869
(gdb)
Core was generated by `mxssmp SQMON1.1 00003 00003 009146 $ZSM003 15.25.112.49:37638 00011 00003 00005'.
#0 0x0000003d986e971d in accept () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64 glibc-2.12-1.132.el6_5.2.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-15.el6_5.1.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libstdc++-4.4.7-4.el6.x86_64 nspr-4.10.2-1.el6_5.x86_64 nss-3.15.3-6.el6_5.x86_64 nss-softokn-freebl-3.14.3-10.el6_5.x86_64 nss-util-3.15.3-1.el6_5.x86_64 openldap-2.4.23-34.el6_5.1.x86_64 openssl-1.0.1e-16.el6_5.14.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000003d986e971d in accept () from /lib64/libc.so.6
#1 0x00007fa4610be9b5 in SB_Trans::Sock_Listener::accept (this=0x12b3b80)
at sock.cpp:618
#2 0x00007fa4610c5506 in SB_Trans::Sock_Stream_Accept_Thread::run (
this=0x12ba7e0) at sockstream.cpp:2085
#3 0x00007fa4610c5233 in sock_stream_accept_thread_fun (pp_arg=0x12ba7e0)
at sockstream.cpp:2019
#4 0x00007fa460798b9f in SB_Thread::Thread::disp (this=0x12ba7e0,
pp_arg=0x12ba7e0) at thread.cpp:211
#5 0x00007fa460798ff7 in thread_fun (pp_arg=0x12ba7e0) at thread.cpp:307
#6 0x00007fa46079c290 in sb_thread_sthr_disp (pp_arg=0x12ba910)
at threadl.cpp:253
#7 0x0000003d98a079d1 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003d986e8b5d in clone () from /lib64/libc.so.6
(gdb)
MS tracing and gdb confirm that the following is happening:
Thread 1 opens $ZSM000 with 0x400 option, to obtain fnum 2
Thread 1 does BWRITEREADX on fnum 2 (“query started message”) and starts executing the query.
Thread 2 does BWRITEREADX on fnum 2 (“query finished message”)
Thread 2 does BAWAITIOXTS on fnum -2 and gets the reply to the “query finished message”.
Thread 2 does BAWAITIOXTS on fnum -2 and on fnum 2 and eventually times out, and aborts leaving the stack trace show in the bug description.
The final BAWAITIOXTS will not get the completion of the “query started message”, even though the server has replied. This is consistent with thread safe completion, as confirmed by Dave.