With stats enabled, OE/DC perf tests generating coring in mxosrvr, ResStatisticsStatement::SendQueryStats():2499

Bug #1410928 reported by Aruna Sadashiva
Affects: Trafodion
Status: Fix Released
Importance: Critical
Assigned to: Tharak Capirala
Milestone: (none listed)

Bug Description

YCSB perf tests on Amethyst generated a bunch of mxosrvr cores.

Here is a stack trace of one core:

Core was generated by `mxosrvr -ZKHOST n013:2181,n014:2181,n015:2181 -RZ g4q0013.houston.hp.com:1:6 -Z'.
Program terminated with signal 6, Aborted.
#0 0x00007ffff4c488a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install boost-filesystem-1.41.0-11.el6_1.2.x86_64 boost-program-options-1.41.0-11.el6_1.2.x86_64 boost-system-1.41.0-11.el6_1.2.x86_64 cyrus-sasl-lib-2.1.23-13.el6.x86_64 glibc-2.12-1.107.el6.x86_64 hadoop-2.3.0+cdh5.1.3+824-1.cdh5.1.3.p0.13.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 krb5-libs-1.9-33.el6.x86_64 libcom_err-1.41.12-12.el6.x86_64 libgcc-4.4.6-4.el6.x86_64 libselinux-2.0.94-5.3.el6.x86_64 libstdc++-4.4.6-4.el6.x86_64 libuuid-2.17.2-12.7.el6.x86_64 nspr-4.9.2-1.el6.x86_64 nss-3.14.0.0-12.el6.x86_64 nss-softokn-freebl-3.12.9-11.el6.x86_64 nss-util-3.14.0.0-2.el6.x86_64 openldap-2.4.23-26.el6.x86_64 openssl-1.0.0-20.el6_2.5.x86_64 qpid-cpp-client-0.14-22.el6_3.x86_64 zlib-1.2.3-27.el6.x86_64
(gdb) bt
#0 0x00007ffff4c488a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff4c4a00d in abort () from /lib64/libc.so.6
#2 0x00007ffff5d51a55 in os::abort(bool) ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#3 0x00007ffff5ed1f87 in VMError::report_and_die() ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#4 0x00007ffff5d5696f in JVM_handle_linux_signal ()
   from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#5 <signal handler called>
#6 0x00007ffff4d48d5f in __strlen_sse42 () from /lib64/libc.so.6
#7 0x00007ffff6a5e696 in length (this=0xecc7f0, bStart=true,
    pSrvrStmt=0x2023c60, inSqlError=0x0, inSqlErrorLength=0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/char_traits.h:263
#8 assign (this=0xecc7f0, bStart=true, pSrvrStmt=0x2023c60, inSqlError=0x0,
    inSqlErrorLength=0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:975
#9 operator= (this=0xecc7f0, bStart=true, pSrvrStmt=0x2023c60,
    inSqlError=0x0, inSqlErrorLength=0)
    at /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/bits/basic_string.h:519
#10 ResStatisticsStatement::SendQueryStats (this=0xecc7f0, bStart=true,
    pSrvrStmt=0x2023c60, inSqlError=0x0, inSqlErrorLength=0)
    at ResStatisticsStatement.cpp:2499
#11 0x000000000042b5b2 in BUILD_TIMER_MSG_CALL (
    call_id_=<value optimized out>, request=<value optimized out>,
    countRead=<value optimized out>, receive_info=<value optimized out>)
    at ../Common/FileSystemSrvr.cpp:598
#12 0x00000000004437f6 in CNSKListener::CheckReceiveMessage (this=0xda93f0,
    cc=@0x7ffffffef324, countRead=16, call_id=<value optimized out>)
    at ../Common/Listener.cpp:269
#13 0x000000000046304e in CNSKListenerSrvr::runProgram (this=0xda93f0,
    TcpProcessName=<value optimized out>, port=<value optimized out>,
    TransportTrace=<value optimized out>)
    at Interface/linux/Listener_srvr_ps.cpp:494
#14 0x00000000004dd72b in main (argc=29, argv=0x7fffffff24a8,
    envp=<value optimized out>) at SrvrMain.cpp:897
(gdb)

Changed in trafodion:
assignee: nobody → Judy Zhao (hongxia-zhao)
Changed in trafodion:
importance: High → Critical
Changed in trafodion:
status: New → Fix Committed
Changed in trafodion:
importance: Critical → High
Changed in trafodion:
status: Fix Committed → In Progress
Aruna Sadashiva (aruna-sadashiva) wrote :

A bunch of cores were generated during the YCSB test with the 1/17 build; bumping this back to critical.

Changed in trafodion:
importance: High → Critical
description: updated
Changed in trafodion:
assignee: Judy Zhao (hongxia-zhao) → Tharak Capirala (capirala-tharaknath)
Tharak Capirala (capirala-tharaknath) wrote :

The core appears to be caused by a lack of synchronization between the timer thread and the main thread when they access a global statement object. Adding this synchronization to the current design without impacting performance will be challenging and time-consuming, since it may require a lot of rework.

Changing the priority from critical to high. I propose that for now we disable the timer-based query statistics code (the root cause of the issue) to avoid coring of mxosrvr, and defer the fix to the next patch/release. This will currently result in losing the periodic statistics updates for long-running queries (by default, those running longer than 60 seconds).
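
For illustration only, the kind of synchronization discussed in the comment above might look roughly like the sketch below. It is not taken from the mxosrvr source: `SrvrStmt`, `g_stmtMutex`, `g_currentStmt`, and the three functions are hypothetical names, and the statement object is reduced to a single string. Every access to the shared pointer takes the same lock, which is also where the performance concern comes from, since the statement path and the timer path would contend on it.

```cpp
#include <mutex>
#include <string>

// Hypothetical stand-in for the per-statement object (pSrvrStmt in the trace).
struct SrvrStmt {
    std::string sqlText;
};

// Hypothetical globals; the real mxosrvr names are not known from this report.
std::mutex g_stmtMutex;
SrvrStmt* g_currentStmt = nullptr;

// Main path: publish the statement that the timer should report on.
void setCurrentStatement(SrvrStmt* stmt) {
    std::lock_guard<std::mutex> lock(g_stmtMutex);
    g_currentStmt = stmt;
}

// Main path: drop the statement. The delete and the reset happen under the
// lock, so the timer path can never observe a freed object.
void dropCurrentStatement() {
    std::lock_guard<std::mutex> lock(g_stmtMutex);
    delete g_currentStmt;
    g_currentStmt = nullptr;
}

// Timer path: copy out what is needed while holding the lock.
// Returns false when there is no live statement to report on.
bool collectStats(std::string& out) {
    std::lock_guard<std::mutex> lock(g_stmtMutex);
    if (g_currentStmt == nullptr) {
        return false;
    }
    out = g_currentStmt->sqlText;
    return true;
}
```

In this sketch the timer callback would call `collectStats` and skip the update whenever it returns false.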

Changed in trafodion:
importance: Critical → High
Changed in trafodion:
importance: High → Critical
Trafodion-Gerrit (neo-devtools) wrote : Fix proposed to core (master)

Fix proposed to branch: master
Review: https://review.trafodion.org/1019

Trafodion-Gerrit (neo-devtools) wrote : Fix merged to core (master)

Reviewed: https://review.trafodion.org/1019
Committed: https://github.com/trafodion/core/commit/93df4ee8593ec4d505a2295cc7ba36035ba9dc6e
Submitter: Trafodion Jenkins
Branch: master

commit 93df4ee8593ec4d505a2295cc7ba36035ba9dc6e
Author: Tharaknath Capirala <email address hidden>
Date: Thu Jan 22 17:50:55 2015 +0000

    Fix for bug #1410928. Mxosrvr coring during performance tests.

    The core seems to be occurring because of a lack of synchronization
    between the timer thread and main thread accessing a global statement
    object. When the statement is dropped the associated pSrvrStmt
    is also deleted and hence the global pointer is invalidated and causes
    the core when accessed. The fix now also nulls the global statement
    pointer so that the timer thread ignores the dropped
    statement.

    Fixes bug #1410928

    Change-Id: I06b15b90325a7b405d4adcca871b58c9dba51729
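
A minimal sketch of the idea in the commit message above, clearing the global pointer when the statement is dropped so that the timer path skips it, could look like this. It is an assumption-laden illustration, not the committed change: `SrvrStmt`, `g_timerStmt`, `startStatement`, `dropStatement`, and `timerTick` are hypothetical names, and the use of `std::atomic` is a choice made for this sketch rather than something the commit states.

```cpp
#include <atomic>
#include <cassert>
#include <string>

// Hypothetical stand-in for the per-statement object (pSrvrStmt in the trace).
struct SrvrStmt {
    std::string sqlText;
};

// Hypothetical global read by the timer path; not the real mxosrvr name.
std::atomic<SrvrStmt*> g_timerStmt{nullptr};

// Main path: register a statement for periodic statistics.
void startStatement(SrvrStmt* stmt) {
    assert(stmt != nullptr);
    g_timerStmt.store(stmt);
}

// Main path: drop a statement. If the global still points at it, clear the
// global first, then free the object, so later timer ticks see nullptr
// instead of a dangling pointer.
void dropStatement(SrvrStmt* stmt) {
    SrvrStmt* expected = stmt;
    g_timerStmt.compare_exchange_strong(expected, nullptr);
    delete stmt;
}

// Timer path: ignore the tick when no statement is registered.
// Returns true when statistics would have been sent.
bool timerTick(std::string& out) {
    SrvrStmt* stmt = g_timerStmt.load();
    if (stmt == nullptr) {
        return false;
    }
    out = stmt->sqlText;
    return true;
}
```

Note that this only protects ticks that start after the drop. A tick that has already loaded the pointer when the drop happens would still need some form of locking or deferred deletion.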

Changed in trafodion:
status: In Progress → Fix Committed
Changed in trafodion:
status: Fix Committed → Fix Released