mxosrvr core generated during select count(*)

Bug #1436994 reported by Julie Thai
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Invalid
High
Daniel Lu

Bug Description

Encountered on cluster with 20150324_0830 build.

From client-side, trafci session:
SQL>select count(*) from nosalt1reg;
*** ERROR[1] The message id: problem_with_server_read
*** ERROR[1] The message id: header_not_long_enough
*** ERROR[1] The message id: problem_with_server_read
*** ERROR[1] The message id: header_not_long_enough

Stack trace from core:
Core was generated by `mxosrvr -ZKHOST rhel-qa1.hpl.hp.com:2181 -RZ rhel-qa2.hpl.hp.com:5:1 -ZKPNODE /'.
Program terminated with signal 6, Aborted.
#0 0x0000003e8d632925 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64 glibc-2.12-1.132.el6_5.2.x86_64 hadoop-2.5.0+cdh5.3.2+813-1.cdh5.3.2.p0.17.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-15.el6_5.1.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libstdc++-4.4.7-4.el6.x86_64 nspr-4.10.2-1.el6_5.x86_64 nss-3.15.3-6.el6_5.x86_64 nss-softokn-freebl-3.14.3-10.el6_5.x86_64 nss-util-3.15.3-1.el6_5.x86_64 openldap-2.4.23-34.el6_5.1.x86_64 openssl-1.0.1e-16.el6_5.14.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) where
#0 0x0000003e8d632925 in raise () from /lib64/libc.so.6
#1 0x0000003e8d63408d in abort () from /lib64/libc.so.6
#2 0x00007f5441f6ca55 in os::abort(bool) ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#3 0x00007f54420ecf87 in VMError::report_and_die() ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#4 0x00007f5441f7196f in JVM_handle_linux_signal ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#5 <signal handler called>
#6 0x00000000004c4b88 in odbc_SQLSrvr_ExecDirect_ame_ (objtag_=0x2af42b0,
    call_id_=0x2af4308, dialogueId=479758385, stmtLabel=0x2fcda50 "SQL_CUR_2",
    cursorName=<value optimized out>, stmtExplainLabel=<value optimized out>,
    stmtType=0, sqlStmtType=1,
    sqlString=0x2fcda24 "select count(*) from nosalt1reg", sqlAsyncEnable=0,
    queryTimeout=0, inputRowCnt=0, txnID=0, holdableCursor=0)
    at SrvrConnect.cpp:7785
#7 0x0000000000493aa6 in SQLEXECUTE_IOMessage (objtag_=0x2af42b0,
    call_id_=0x2af4308, operation_id=3012) at Interface/odbcs_srvr.cpp:1728
#8 0x0000000000493b54 in DISPATCH_TCPIPRequest (objtag_=0x2af42b0,
    call_id_=0x2af4308, operation_id=<value optimized out>)
    at Interface/odbcs_srvr.cpp:1793
#9 0x0000000000433872 in BUILD_TCPIP_REQUEST (pnode=0x2af42b0)
    at ../Common/TCPIPSystemSrvr.cpp:603
#10 0x000000000043420d in PROCESS_TCPIP_REQUEST (pnode=0x2af42b0)
    at ../Common/TCPIPSystemSrvr.cpp:581
#11 0x00000000004623f6 in CNSKListenerSrvr::tcpip_listener (arg=0x29bd0d0)
    at Interface/linux/Listener_srvr_ps.cpp:400
#12 0x00007f5441522290 in sb_thread_sthr_disp (pp_arg=0x2acde60)
    at threadl.cpp:253
#13 0x0000003e8da079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003e8d6e8b5d in clone () from /lib64/libc.so.6

Arvind's initial investigation:
Seems like statement allocate failed – pSrvrStmt is null.

2015-03-25 13:44:28,728, ERROR, MXOSRVR, Node Number: 4, CPU: 4, PIN:26182, Process Name:$Z040LD2 , , ,A SQL/MX query failed. SQLCODE: SQL statement allocate failed, sqlcode = -8811, sql error = *** ERROR[8811] Trying to close a statement that is either not in the open state or has not reached EOF. [2015-03-25 13:44:28].
Error Text: <5>.

(gdb) p pSrvrStmt
$18 = (SRVR_STMT_HDL *) 0x0
(gdb)

Revision history for this message
Julie Thai (julie-y-thai) wrote :

Not able to reproduce the issue at will or with any regularity.

Changed in trafodion:
importance: Critical → Medium
Revision history for this message
Julie Thai (julie-y-thai) wrote :

Encountered mxosrvr core again on cluster with 20150329_0830 build:
Core was generated by `mxosrvr -ZKHOST rhel-qa1.hpl.hp.com:2181 -RZ rhel-qa2.hpl.hp.com:5:1 -ZKPNODE /'.
Program terminated with signal 6, Aborted.
#0 0x0000003e8d632925 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.23-13.el6_3.1.x86_64 glibc-2.12-1.132.el6_5.2.x86_64 hadoop-2.5.0+cdh5.3.2+813-1.cdh5.3.2.p0.17.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.10.3-15.el6_5.1.x86_64 libcom_err-1.41.12-18.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 libstdc++-4.4.7-4.el6.x86_64 nspr-4.10.2-1.el6_5.x86_64 nss-3.15.3-6.el6_5.x86_64 nss-softokn-freebl-3.14.3-10.el6_5.x86_64 nss-util-3.15.3-1.el6_5.x86_64 openldap-2.4.23-34.el6_5.1.x86_64 openssl-1.0.1e-16.el6_5.14.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 0x0000003e8d632925 in raise () from /lib64/libc.so.6
#1 0x0000003e8d63408d in abort () from /lib64/libc.so.6
#2 0x00007f2ac1877a55 in os::abort(bool) ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#3 0x00007f2ac19f7f87 in VMError::report_and_die() ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#4 0x00007f2ac187c96f in JVM_handle_linux_signal ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#5 <signal handler called>
#6 0x00000000004c4c58 in odbc_SQLSrvr_ExecDirect_ame_ (objtag_=0x2422c50,
    call_id_=0x2422ca8, dialogueId=1347132364,
    stmtLabel=0x342d230 "SQL_CUR_2", cursorName=<value optimized out>,
    stmtExplainLabel=<value optimized out>, stmtType=0, sqlStmtType=1,
    sqlString=0x342d204 "select count(*) from nosalt1reg", sqlAsyncEnable=0,
    queryTimeout=0, inputRowCnt=0, txnID=0, holdableCursor=0)
    at SrvrConnect.cpp:7805
#7 0x0000000000493ab6 in SQLEXECUTE_IOMessage (objtag_=0x2422c50,
    call_id_=0x2422ca8, operation_id=3012) at Interface/odbcs_srvr.cpp:1728
#8 0x0000000000493b64 in DISPATCH_TCPIPRequest (objtag_=0x2422c50,
    call_id_=0x2422ca8, operation_id=<value optimized out>)
    at Interface/odbcs_srvr.cpp:1793
#9 0x0000000000433882 in BUILD_TCPIP_REQUEST (pnode=0x2422c50)
    at ../Common/TCPIPSystemSrvr.cpp:603
#10 0x000000000043421d in PROCESS_TCPIP_REQUEST (pnode=0x2422c50)
    at ../Common/TCPIPSystemSrvr.cpp:581
#11 0x0000000000462406 in CNSKListenerSrvr::tcpip_listener (arg=0x11c9080)
    at Interface/linux/Listener_srvr_ps.cpp:400
#12 0x00007f2ac0e2d2e0 in sb_thread_sthr_disp (pp_arg=0x12d9df0)
    at threadl.cpp:253
#13 0x0000003e8da079d1 in start_thread () from /lib64/libpthread.so.0
#14 0x0000003e8d6e8b5d in clone () from /lib64/libc.so.6

Changed in trafodion:
importance: Medium → High
Changed in trafodion:
assignee: nobody → Daniel Lu (ping-lu)
Daniel Lu (ping-lu)
Changed in trafodion:
status: New → In Progress
jiegao (jie-gao)
summary:
Revision history for this message
Daniel Lu (ping-lu) wrote :

here is some analysis.

1. in the stack trace, the source line from where the exception is thrown out (at SrvrConnect.cpp:7805) is not relevant to current source code, the code in there do not include any signal handler code corresponding to the stack trace.
2. the exception is from JVM (JVM signal handler), mxosrvr do not call any java code directly.
3. the query is a very regular "select count(*)". mxosrvr should not trigger such kind of JVM exception directly.
4. can't reproduce in workstation.

mark as invalid for now, if it is still reproducible in cluster with latest build, pls reopen it with more information.

Changed in trafodion:
status: In Progress → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.