mxosrvr core in SRVR::FETCH2bulk when columns in particular order

Bug #1438775 reported by Aruna Sadashiva
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Fix Released
Critical
Tharak Capirala

Bug Description

First select crashes, but the second one works:

Select exec_start_utc_ts, exec_end_utc_ts, query_id, query_text from “_REPOS_”.metric_query_table;
Select query_text, exec_start_utc_ts, exec_end_utc_ts, query_id from “_REPOS_”.metric_query_table;

This system had about 350 rows in metric_query_table. It worked ok on a system where metric_query_table had 40 rows.

Stack trace:
#0 0x00007ffff4829625 in raise () from /lib64/libc.so.6
#1 0x00007ffff482ad8d in abort () from /lib64/libc.so.6
#2 0x00007ffff4867537 in __libc_message () from /lib64/libc.so.6
#3 0x00007ffff486ce66 in malloc_printerr () from /lib64/libc.so.6
#4 0x00007ffff486f9b3 in _int_free () from /lib64/libc.so.6
#5 0x00007ffff685f53b in SRVR::FETCH2bulk (pSrvrStmt=0x1ca8650) at sqlinterface.cpp:6111
#6 0x00007ffff688dbda in odbc_SQLSrvr_FetchPerf_sme_ (objtag_=<optimized out>, call_id_=<optimized out>,
    returnCode=0x7fffe4fde67c, dialogueId=<optimized out>, stmtLabel=0xed7f04 "SQL_CUR_2", maxRowCnt=100, maxRowLen=0,
    sqlAsyncEnable=0, queryTimeout=0, rowsAffected=0x7fffe4fde678, outValuesFormat=0x7fffe4fde674,
    outputDataValue=0x7fffe4fde650, sqlWarningOrErrorLength=0x7fffe4fde670, sqlWarningOrError=@0x7fffe4fde668: 0x0)
    at srvrothers.cpp:5815
#7 0x00000000004b6907 in odbc_SQLSrvr_Fetch_ame_ (objtag_=0xed8160, call_id_=0xed81b8, dialogueId=1449409261,
    operation_id=<optimized out>, sqlAsyncEnable=0, queryTimeout=0, stmtHandle=0, stmtLabel=0xed7f04 "SQL_CUR_2",
    maxRowCnt=100, maxRowLen=0) at SrvrConnect.cpp:7498
#8 0x000000000048b852 in SQLFETCH_IOMessage (objtag_=0xed8160, call_id_=0xed81b8, operation_id=3009)
    at Interface/odbcs_srvr.cpp:1021
#9 0x0000000000493d7e in DISPATCH_TCPIPRequest (objtag_=0xed8160, call_id_=0xed81b8, operation_id=<optimized out>)
    at Interface/odbcs_srvr.cpp:1803
#10 0x0000000000433872 in BUILD_TCPIP_REQUEST (pnode=0xed8160) at ../Common/TCPIPSystemSrvr.cpp:603
#11 0x000000000043420d in PROCESS_TCPIP_REQUEST (pnode=0xed8160) at ../Common/TCPIPSystemSrvr.cpp:581
#12 0x00000000004623f6 in CNSKListenerSrvr::tcpip_listener (arg=0xda1a50) at Interface/linux/Listener_srvr_ps.cpp:400
#13 0x00007ffff45dd290 in sb_thread_sthr_disp (pp_arg=0xed74d0) at threadl.cpp:253
#14 0x00007ffff43a99d1 in start_thread () from /lib64/libpthread.so.0
#15 0x00007ffff48df8fd in clone () from /lib64/libc.so.6

According to Arvind, crash happens in this part of the code (sqlinterface.cpp):

        //Changes due to cursor issue
        if (pSrvrStmt->maxRowCnt > 0)
        {
                if (pSrvrStmt->outputDescVarBufferLen > 0)
                {
                        if( srvrGlobal->m_FetchBufferSize/pSrvrStmt->outputDescVarBufferLen < pSrvrStmt->maxRowCnt )
                        {
                                if (pSrvrStmt->outputDescVarBuffer != NULL)
                                        delete pSrvrStmt->outputDescVarBuffer;
                                pSrvrStmt->outputDescVarBuffer = NULL;
                                markNewOperator,pSrvrStmt->outputDescVarBuffer =
 new BYTE[pSrvrStmt->maxRowCnt*pSrvrStmt->outputDescVarBufferLen];
                                if (pSrvrStmt->outputDescVarBuffer == NULL)

Tags: sql-exe
Changed in trafodion:
status: New → In Progress
assignee: nobody → Tharak Capirala (capirala-tharaknath)
Revision history for this message
Aruna Sadashiva (aruna-sadashiva) wrote :

Vito did more testing and found more repo table queries that cause mxosrvr cores:

select EXEC_START_UTC_TS,QUERY_ID,USER_NAME,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,USER_NAME,EXPLAIN_PLAN from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,ROLE_NAME,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,ROLE_NAME,EXPLAIN_PLAN from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,SESSION_ID,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,SESSION_ID,EXPLAIN_PLAN from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,STATEMENT_ID,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,STATEMENT_ID,EXPLAIN_PLAN from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,STATEMENT_TYPE,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,STATEMENT_TYPE,EXPLAIN_PLAN from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,STATEMENT_SUBTYPE,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,STATEMENT_SUBTYPE,EXPLAIN_PLAN from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,QUERY_STATUS,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;
select EXEC_START_UTC_TS,QUERY_ID,QUERY_SUB_STATUS,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;

Changed in trafodion:
importance: High → Critical
Revision history for this message
Aruna Sadashiva (aruna-sadashiva) wrote :

Same behavior observed on an upgraded system and clean install.

Revision history for this message
Tharak Capirala (capirala-tharaknath) wrote :

This seems to be a case of memory corruption, which happens right after SQL_EXEC_Fetch() is called. There are a couple of things that seems to trigger this.

1. Executing a query like below where the query_text column is the last in the select list.
select EXEC_START_UTC_TS,QUERY_ID,USER_NAME,QUERY_TEXT from TRAFODION."_REPOS_".METRIC_QUERY_TABLE;

AND

2. The number of rows in the table is greater than or equal to 100.

The corruption occurs along the fetch buffer that is passed to SQL via WSQL_EXEC_SetDescItem() with desc type of SQLDESC_ROWWISE_ROWSET_PTR. The first byte right after the fetch buffer ends seems to be overwritten with a null character. If that particular byte happens to be already allocated and in use then it would cause a core. Below is an output from a gdb session.

(gdb) p pSrvrStmt->outputDescVarBuffer[(pSrvrStmt->maxRowCnt*pSrvrStmt->outputDescVarBufferLen)+1]
$81 = 28 '\034'

6157 retcode = WSQL_EXEC_Fetch(pStmt, pDesc, 0);
(gdb)
[New Thread 0x7fffd1ae5700 (LWP 25017)]
6158 if (retcode < SQL_SUCCESS)

(gdb) p pSrvrStmt->outputDescVarBuffer[(pSrvrStmt->maxRowCnt*pSrvrStmt->outputDescVarBufferLen)+1]
$82 = 0 '\000'

The pSrvrStmt->outputDescVarBuffer is allocated for pSrvrStmt->maxRowCnt*pSrvrStmt->outputDescVarBufferLen.

As can be seen above the byte right after where the buffer ends had a valid value '\034' before the call to fetch and after the call that byte is overwritten by a null character.

Note, this issue does not occur if the query_text column is not the last in the select list even though the table may have > 100 rows.

Since the memory corruption is happening in the CLI call assigning this bug the SQL executor group.

Changed in trafodion:
assignee: Tharak Capirala (capirala-tharaknath) → Sandhya Sundaresan (sandhya-sundaresan)
tags: added: sql-exe
removed: connectivity-mxosrvr
Revision history for this message
Anoop Sharma (anoop-sharma) wrote :

what is the datatype of query_text in the calling app?
Is it null terminated varchar?
what happens if you turn off rowsets and retrieve
as non-rowset fetches?

Revision history for this message
Sandhya Sundaresan (sandhya-sundaresan) wrote :

The calculation of the rowset row sizze needs to be looked at in connectivity. Seems like it's lower than needed.

Row length being moved : 200436

So here is what should be allocated :

11 + (timestamp)
160 + (query_id
  = 1 byte alignment
2+256 + (null indicator +username)

2+4+200000 (null indicator + VC indicator length + query_text length)

Instead we are allocating is 200434 for each row in conn side

Changed in trafodion:
assignee: Sandhya Sundaresan (sandhya-sundaresan) → nobody
Changed in trafodion:
assignee: nobody → Tharak Capirala (capirala-tharaknath)
Revision history for this message
Trafodion-Gerrit (neo-devtools) wrote : Fix proposed to core (master)

Fix proposed to branch: master
Review: https://review.trafodion.org/1524

Revision history for this message
Trafodion-Gerrit (neo-devtools) wrote : Fix merged to core (master)

Reviewed: https://review.trafodion.org/1524
Committed: https://github.com/trafodion/core/commit/7352e31841c1aac60373e0939fdeb493c6e07f76
Submitter: Trafodion Jenkins
Branch: master

commit 7352e31841c1aac60373e0939fdeb493c6e07f76
Author: Tharaknath Capirala <email address hidden>
Date: Fri Apr 17 06:09:17 2015 +0000

    Fix for bug 1438775

    The fetch buffer size calculation did not account for varchar indicator
    length for columns greater than 32K. The indicator length in this case is 4 bytes instead of 2. Currently, the buffer length calculation was using 2, which resulted in allocating insufficient memory leading to a corruption.

    Fixes bug 1438775

    Change-Id: Ib16b6644ca3c7f36d96687a33ea36ad4f0ffe903

Changed in trafodion:
status: In Progress → Fix Committed
Revision history for this message
Aruna Sadashiva (aruna-sadashiva) wrote :

we are not seeing these cores now, tested with rc1.

Changed in trafodion:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.