mxosrvr cores in CliStatement::releaseTransaction when idle timeout expires when query resultset is not fetched

Bug #1411532 reported by Aruna Sadashiva
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
New
High
Tharak Capirala

Bug Description

Before fetching resultset of query, if idle timeout expires, mxosrvr cores. There is also an mxssmp core created just before the mxosrvr core most of the time, I did not see the mxssmp core when stats was enabled. To recreate, connect, prepare/execute, but don't fetch the data. After about 15 minutes, server times out and this core is generated.

Core was generated by `mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0014.houston.hp.com:2:87 -'.
Program terminated with signal 6, Aborted.
#0 0x00007ffff4a458a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install boost-filesystem-1.41.0-11.el6_1.2.x86_64 boost-program-options-1.41.0-11.el6_1.2.x86_64 boost-system-1.41.0-11.el6_1.2.x86_64 cyrus-sasl-lib-2.1.23-13.el6.x86_64 glibc-2.12-1.107.el6.x86_64 hadoop-2.3.0+cdh5.1.3+824-1.cdh5.1.3.p0.13.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.9-33.el6.x86_64 libcom_err-1.41.12-12.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 libuuid-2.17.2-12.7.el6.x86_64 nspr-4.9.2-1.el6.x86_64 nss-3.14.0.0-12.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.14.0.0-2.el6.x86_64 openldap-2.4.23-26.el6.x86_64 openssl-1.0.0-20.el6_2.5.x86_64 qpid-cpp-client-0.14-22.el6_3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0 0x00007ffff4a458a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff4a4700d in abort () from /lib64/libc.so.6
#2 0x00007ffff11ab7f0 in assert_botch_abend (
    f=0x7ffff439087f "../cli/Statement.cpp", l=982,
    m=0x7ffff4390620 "Timeout waiting for control broker.",
    c=<value optimized out>) at ../export/NAAbort.cpp:282
#3 0x00007ffff43741ca in CliStatement::releaseTransaction (
    this=0x7fffcfcc48d0, allWorkRequests=1,
    alwaysSendReleaseMsg=<value optimized out>,
    statementRemainsOpen=<value optimized out>) at ../cli/Statement.cpp:982
#4 0x00007ffff4375a9c in CliStatement::close (this=<value optimized out>,
    diagsArea=..., inRollback=0) at ../cli/Statement.cpp:1214
#5 0x00007ffff43251d7 in SQLCLI_PerformTasks(CliGlobals *, ULng32, SQLSTMT_ID *, SQLDESC_ID *, SQLDESC_ID *, Lng32, Lng32, typedef __va_list_tag __va_list_tag *, SQLCLI_PTR_PAIRS *, SQLCLI_PTR_PAIRS *) (cliGlobals=0xeead40, tasks=1800,
    statement_id=0x4887008, input_descriptor=0x0, output_descriptor=0x0,
    num_input_ptr_pairs=0, num_output_ptr_pairs=0, ap=0x0,
    input_ptr_pairs=0x0, output_ptr_pairs=0x0) at ../cli/Cli.cpp:3456
#6 0x00007ffff4325a74 in SQLCLI_CloseStmt (cliGlobals=<value optimized out>,
    statement_id=<value optimized out>) at ../cli/Cli.cpp:3509
#7 0x00007ffff438b697 in SQL_EXEC_CloseStmt (statement_id=0x4887008)
    at ../cli/CliExtern.cpp:1410
#8 0x00007ffff6874028 in SRVR::releaseCachedObject (
    internalStmt=<value optimized out>, mxsrvr_substate=NDCS_CONN_IDLE)
    at srvrcommon.cpp:673
#9 0x00000000004c5366 in SRVR::connIdleTimerExpired (
    timer_tag=<value optimized out>) at SrvrConnect.cpp:3995
#10 0x000000000042b442 in BUILD_TIMER_MSG_CALL (
    call_id_=<value optimized out>, request=<value optimized out>,
    countRead=<value optimized out>, receive_info=<value optimized out>)
    at ../Common/FileSystemSrvr.cpp:598
#11 0x00000000004436ce in CNSKListener::CheckReceiveMessage (this=0xda6240,
    cc=@0x7fffffff0764, countRead=16, call_id=<value optimized out>)
    at ../Common/Listener.cpp:269
---Type <return> to continue, or q <return> to quit---
#12 0x0000000000462dee in CNSKListenerSrvr::runProgram (this=0xda6240,
    TcpProcessName=<value optimized out>, port=<value optimized out>,
    TransportTrace=<value optimized out>)
    at Interface/linux/Listener_srvr_ps.cpp:494
#13 0x00000000004dcc8d in main (argc=29, argv=0x7fffffff3948,
    envp=<value optimized out>) at SrvrMain.cpp:869
(gdb)

Core was generated by `mxssmp SQMON1.1 00003 00003 009146 $ZSM003 15.25.112.49:37638 00011 00003 00005'.
#0 0x0000003d986e971d in accept () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64 glibc-2.12-1.132.el6_5.2.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-15.el6_5.1.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libstdc++-4.4.7-4.el6.x86_64 nspr-4.10.2-1.el6_5.x86_64 nss-3.15.3-6.el6_5.x86_64 nss-softokn-freebl-3.14.3-10.el6_5.x86_64 nss-util-3.15.3-1.el6_5.x86_64 openldap-2.4.23-34.el6_5.1.x86_64 openssl-1.0.1e-16.el6_5.14.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000003d986e971d in accept () from /lib64/libc.so.6
#1 0x00007fa4610be9b5 in SB_Trans::Sock_Listener::accept (this=0x12b3b80)
    at sock.cpp:618
#2 0x00007fa4610c5506 in SB_Trans::Sock_Stream_Accept_Thread::run (
    this=0x12ba7e0) at sockstream.cpp:2085
#3 0x00007fa4610c5233 in sock_stream_accept_thread_fun (pp_arg=0x12ba7e0)
    at sockstream.cpp:2019
#4 0x00007fa460798b9f in SB_Thread::Thread::disp (this=0x12ba7e0,
    pp_arg=0x12ba7e0) at thread.cpp:211
#5 0x00007fa460798ff7 in thread_fun (pp_arg=0x12ba7e0) at thread.cpp:307
#6 0x00007fa46079c290 in sb_thread_sthr_disp (pp_arg=0x12ba910)
    at threadl.cpp:253
#7 0x0000003d98a079d1 in start_thread () from /lib64/libpthread.so.0
#8 0x0000003d986e8b5d in clone () from /lib64/libc.so.6
(gdb)

Changed in trafodion:
assignee: nobody → Sandhya Sundaresan (sandhya-sundaresan)
description: updated
Revision history for this message
Mike Hanlon (mike-hanlon) wrote :

MS tracing and gdb confirm that the following is happening:

Thread 1 opens $ZSM000 with 0x400 option, to obtain fnum 2
Thread 1 does BWRITEREADX on fnum 2 (“query started message”) and starts executing the query.
Thread 2 does BWRITEREADX on fnum 2 (“query finished message”)
Thread 2 does BAWAITIOXTS on fnum -2 and gets the reply to the “query finished message”.
Thread 2 does BAWAITIOXTS on fnum -2 and on fnum 2 and eventually times out, and aborts leaving the stack trace show in the bug description.

The final BAWAITIOXTS will not get the completion of the “query started message”, even though the server has replied. This is consistent with thread safe completion, as confirmed by Dave.

Changed in trafodion:
assignee: Sandhya Sundaresan (sandhya-sundaresan) → Mike Hanlon (mike-hanlon)
tags: added: sql-exe
removed: connectivity-mxosrvr
Changed in trafodion:
milestone: r1.0 → r1.1
Revision history for this message
Mike Hanlon (mike-hanlon) wrote :

Since we are using thread-safe file completion, SQL and SQF implicitly require that the same thread that starts a message also completes the message. SQL applications will need to execute a given query in no more than one thread, to prevent a hang such as the one seen in this bug. In the case of this bug, thread 1 starts the query, and thread 2 attempt to cleanup via SRVR::connIdleTimerExpired. To avoid the hand, it will be necessary for the cleanup to happen in thread 1.

tags: added: connectivity-dcs
removed: sql-exe
Changed in trafodion:
assignee: Mike Hanlon (mike-hanlon) → nobody
Changed in trafodion:
assignee: nobody → Tharak Capirala (capirala-tharaknath)
Revision history for this message
Aruna Sadashiva (aruna-sadashiva) wrote :

Did not see this core with 3/22/2015 build.

Changed in trafodion:
status: New → Fix Released
Revision history for this message
Aruna Sadashiva (aruna-sadashiva) wrote :

Reopening this bug again, I am able to reproduce this issue with 5/12/15 daily build.

Changed in trafodion:
status: Fix Released → New
Revision history for this message
Aruna Sadashiva (aruna-sadashiva) wrote :

This bug can be reproduced from odbctest or an odbc program, but not with DBViz/SQuirrel.

Connect using odbctest, execute a query and leave it idle for > 10 mins. You will find the mxosrvr and mxssmp cores.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.