MXOSRVR abort/core during ramp-up period

Bug #1425650 reported by Gary Hall
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Fix Committed
High
Arvind Narain

Bug Description

This problem occurs randomly, and only during rampup of a performance test with 1024 parallel streams being used. The mxosrvr process aborts, resulting in the client receiving the unexpected exception message "There was a problem reading from the server".

The latest two cores can be found on Zircon4, and have the following stack trace info.

echo /local/cores/1005/core.1424833806.n018.28943.mxosrvr
echo

[Thread debugging using libthread_db enabled]
Core was generated by `mxosrvr -ZKHOST n014.cm.cluster:2181,n015.cm.cluster:2181,n016.cm.cluster:2181'.
Program terminated with signal 6, Aborted.
#0 0x00007ffff4a328a5 in raise () from /lib64/libc.so.6
#0 0x00007ffff4a328a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff4a3400d in abort () from /lib64/libc.so.6
#2 0x00007ffff4a707b7 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff4a760e6 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007ffff6a2ecc9 in SRVR_STMT_HDL::cleanupAll (this=0x29a7770) at csrvrstmt.cpp:632
#5 0x00000000004ccbb6 in SessionWatchDog (arg=<value optimized out>) at SrvrConnect.cpp:853
#6 0x00007ffff45b2851 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ffff4ae890d in clone () from /lib64/libc.so.6

echo /local/cores/1005/core.1424834536.n018.50441.mxosrvr
echo

[Thread debugging using libthread_db enabled]
Core was generated by `mxosrvr -ZKHOST n014.cm.cluster:2181,n015.cm.cluster:2181,n016.cm.cluster:2181'.
Program terminated with signal 6, Aborted.
#0 0x00007ffff4a328a5 in raise () from /lib64/libc.so.6
#0 0x00007ffff4a328a5 in raise () from /lib64/libc.so.6
#1 0x00007ffff4a3400d in abort () from /lib64/libc.so.6
#2 0x00007ffff5d51a55 in os::abort(bool) () from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#3 0x00007ffff5ed1f87 in VMError::report_and_die() () from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#4 0x00007ffff5d5696f in JVM_handle_linux_signal () from /usr/java/jdk1.7.0_67/jre/lib/amd64/server/libjvm.so
#5 <signal handler called>
#6 0x00007ffff4a7b6ec in free () from /lib64/libc.so.6
#7 0x00007ffff6a2ecad in SRVR_STMT_HDL::cleanupAll (this=0x2935780) at csrvrstmt.cpp:627
#8 0x00000000004ccbb6 in SessionWatchDog (arg=<value optimized out>) at SrvrConnect.cpp:853
#9 0x00007ffff45b2851 in start_thread () from /lib64/libpthread.so.0
#10 0x00007ffff4ae890d in clone () from /lib64/libc.so.6
echo

Arvind Narain has provided addition research detail on the above cores:

In both cases pSrvrStmt is trashed.

As per the messages this is happening while setting CQD to disable transactions in repository context.

/opt/hp/squser4/tr101hdlUTT0224/tmp> grep 28943 *
monitor.map.7.n018:BEGIN Wed Feb 25 03:05:39 2015 $Z070NLY (7, 28943:14174) P(-1, -1:-1) mxosrvr
monitor.map.7.n018:END Wed Feb 25 03:10:06 2015 $Z070NLY (7, 28943:14174) P(-1, -1:-1) mxosrvr

2015-02-25 03:10:06,242, ERROR, SQL, Node Number: 0, CPU: 7, PIN: 28943, Process Name: $Z070NLY, SQLCODE: 8804,, *** ERROR[8804] The provided input statement does not exist in the current context.
2015-02-25 03:10:06,242, ERROR, SQL, Node Number: 0, CPU: 7, PIN: 28943, Process Name: $Z070NLY, SQLCODE: 8804,, *** ERROR[8804] The provided input statement does not exist in the current context.
2015-02-25 03:10:06,242, ERROR, MXOSRVR, Node Number: 7, CPU: 7, PIN:28943, Process Name:$Z070NLY , , ,A NonStop Process Service error Failed to skip transaction - *** ERROR[8804] The provided input statement does not exist in the current context. [2015-02-25 03:10:06] has occurred.

/opt/hp/squser4/tr101hdlUTT0224/tmp> grep 50441 *
monitor.map.7.n018:BEGIN Wed Feb 25 03:17:49 2015 $Z071666 (7, 50441:15278) P(-1, -1:-1) mxosrvr
monitor.map.7.n018:END Wed Feb 25 03:22:16 2015 $Z071666 (7, 50441:15278) P(-1, -1:-1) mxosrvr

2015-02-25 03:22:16,023, ERROR, SQL, Node Number: 0, CPU: 7, PIN: 50441, Process Name: $Z071666, SQLCODE: 8804,, *** ERROR[8804] The provided input statement does not exist in the current context.
2015-02-25 03:22:16,023, ERROR, SQL, Node Number: 0, CPU: 7, PIN: 50441, Process Name: $Z071666, SQLCODE: 8804,, *** ERROR[8804] The provided input statement does not exist in the current context.
2015-02-25 03:22:16,023, ERROR, MXOSRVR, Node Number: 7, CPU: 7, PIN:50441, Process Name:$Z071666 , , ,A NonStop Process Service error Failed to skip transaction - *** ERROR[8804] The provided input statement does not exist in the current context. [2015-02-25 03:22:16] has occurred.

Changed in trafodion:
importance: Undecided → High
Changed in trafodion:
assignee: nobody → Arvind Narain (arvind-narain)
milestone: none → r1.0.1
milestone: r1.0.1 → r1.1
Revision history for this message
Arvind Narain (arvind-narain) wrote :

 in the UTT given earlier to address second core seen in bug Bug 1422894 mutex was put in while managing the stmt list maintained by mxosrvr. mutex was being done while adding or deleting from the list.

Now mutex is also done in getSrvrStmt and equivalent where currentStatement gets setup.

Changed in trafodion:
status: New → In Progress
Revision history for this message
Arvind Narain (arvind-narain) wrote :
Changed in trafodion:
status: In Progress → Fix Committed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.