Trafodion

HPIT benchmark select statement stalls in a scan for a very long time

Bug #1439407 reported by Joanie Cooper on 2015-04-01

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Trafodion	In Progress	High	Joanie Cooper	Trafodion r2.0

Bug Description

When running the HPIT benchmark, a select query can stall in a scan operation for an extremely long time.

This problem is being tracked for the Zircon4 system.

Some stalled query information was captured on 031215.
Core files were taken to help determine where in the tdm_arkesp process the stall was occurring.

Query #10 this time – ran for 7 minutes 25 seconds. Average response time for this query is 17.4 seconds. All other streams had finished their final query by 22:35:10, so we had a five-minute window where no queries were being done. I initiated Guy’s mass-core script after the system had been quiet for a couple of minutes – as you can see from the core timestamps below, cores were being generated over the next five minutes, so well after the query had finished. Hope this will have what we need!

2015-03-12 22:33:12 INFO - 8> Query Number : 36 , Query Type : query01 , Records Returned : 5 , Elasped Time (sec) : 71.613129
2015-03-12 22:40:37 INFO - 8> Query Number : 37 , Query Type : query10 , Records Returned : 193 , Elasped Time (sec) : 445.132296

/opt/hp/squser4/git150312/export/limited-support-tools/LSO> offender -s in_sql
Trafodion Conversational Interface 1.1.0
(c) Copyright 2014 Hewlett-Packard Development Company, LP.
>>>>>>>>+>+>+>+>+>+>+>+>+>+>+>+>+>
CURRENT_TIMESTAMP TIME_IN_SECONDS QUERY_ID EXECUTE_STATE SOURCE_TEXT
-------------------------- -------------------- ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ------------------------------ ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

2015-03-12 22:35:32.464218 143 MXID11005058756212292958769385386000000000206U3333300_1472_SQL_CUR_1 OPEN "select dmth.FISC_YR_QTR_CD as Fiscal_Quarter, prod.PROD_LN_ID as Product_Line, SUM(fact.SO_NET_EXT_US_DLR_AM) K_Dol_Shipment from trafodion.javabench.HPIT_PERF_SUM_F fact, trafodion.javabench.HPIT_DT_MTH_D dmth, trafodion.javabench.HPIT_PROD_LN_D prod, t

--- 1 row(s) selected.
>>

End of MXCI Session

/home/squser4/trafodion_tools> ls -la /home/squser4/trafodion_tools/core*
total 41128452
drwxr-x--- 2 squser4 seaquest 4096 Mar 12 22:42 .
drwxr-x--- 9 squser4 seaquest 4096 Mar 12 22:37 ..
-rw-r----- 1 squser4 seaquest 976295080 Mar 12 22:37 core.n011.31576.tdm_arkesp.31576
-rw-r----- 1 squser4 seaquest 1158368056 Mar 12 22:39 core.n011.31577.tdm_arkesp.31577
-rw-r----- 1 squser4 seaquest 1086642432 Mar 12 22:41 core.n011.44859.tdm_arkesp.44859
-rw-r----- 1 squser4 seaquest 1088708272 Mar 12 22:42 core.n011.44944.tdm_arkesp.44944
-rw-r----- 1 squser4 seaquest 969386464 Mar 12 22:37 core.n012.50278.tdm_arkesp.50278
-rw-r----- 1 squser4 seaquest 1141359920 Mar 12 22:39 core.n012.50279.tdm_arkesp.50279
-rw-r----- 1 squser4 seaquest 1088861744 Mar 12 22:41 core.n012.61995.tdm_arkesp.61995
-rw-r----- 1 squser4 seaquest 1089256656 Mar 12 22:42 core.n012.62084.tdm_arkesp.62084
-rw-r----- 1 squser4 seaquest 1162251440 Mar 12 22:38 core.n013.42936.tdm_arkesp.42936
-rw-r----- 1 squser4 seaquest 1144073632 Mar 12 22:40 core.n013.42937.tdm_arkesp.42937
-rw-r----- 1 squser4 seaquest 1089757208 Mar 12 22:41 core.n013.55731.tdm_arkesp.55731
-rw-r----- 1 squser4 seaquest 1092558152 Mar 12 22:42 core.n013.55809.tdm_arkesp.55809
-rw-r----- 1 squser4 seaquest 1088999328 Mar 12 22:41 core.n014.11097.tdm_arkesp.11097
-rw-r----- 1 squser4 seaquest 1157633808 Mar 12 22:38 core.n014.65209.tdm_arkesp.65209
-rw-r----- 1 squser4 seaquest 1146787720 Mar 12 22:40 core.n014.65210.tdm_arkesp.65210
-rw-r----- 1 squser4 seaquest 1099806256 Mar 12 22:41 core.n015.16154.tdm_arkesp.16154
-rw-r----- 1 squser4 seaquest 976765400 Mar 12 22:37 core.n015.4867.tdm_arkesp.4867
-rw-r----- 1 squser4 seaquest 1147236960 Mar 12 22:39 core.n015.4868.tdm_arkesp.4868
-rw-r----- 1 squser4 seaquest 1148154688 Mar 12 22:40 core.n016.29164.tdm_arkesp.29164
-rw-r----- 1 squser4 seaquest 1101342856 Mar 12 22:41 core.n016.29165.tdm_arkesp.29165
-rw-r----- 1 squser4 seaquest 1101221912 Mar 12 22:42 core.n016.42291.tdm_arkesp.42291
-rw-r----- 1 squser4 seaquest 1323801368 Mar 12 22:39 core.n016.58756.mxosrvr.58756
-rw-r----- 1 squser4 seaquest 1162402856 Mar 12 22:38 core.n017.50710.tdm_arkesp.50710
-rw-r----- 1 squser4 seaquest 1130369512 Mar 12 22:40 core.n017.50711.tdm_arkesp.50711
-rw-r----- 1 squser4 seaquest 1092408640 Mar 12 22:41 core.n017.62594.tdm_arkesp.62594
-rw-r----- 1 squser4 seaquest 1088998728 Mar 12 22:42 core.n017.62741.tdm_arkesp.62741
-rw-r----- 1 squser4 seaquest 1166113968 Mar 12 22:38 core.n018.35152.tdm_arkesp.35152
-rw-r----- 1 squser4 seaquest 1127533264 Mar 12 22:40 core.n018.35153.tdm_arkesp.35153
-rw-r----- 1 squser4 seaquest 1097168672 Mar 12 22:41 core.n018.47005.tdm_arkesp.47005
-rw-r----- 1 squser4 seaquest 1086168032 Mar 12 22:42 core.n018.47141.tdm_arkesp.47141
-rw-r----- 1 squser4 seaquest 1163245432 Mar 12 22:38 core.n019.36590.tdm_arkesp.36590
-rw-r----- 1 squser4 seaquest 1152075416 Mar 12 22:40 core.n019.36591.tdm_arkesp.36591
-rw-r----- 1 squser4 seaquest 1089905864 Mar 12 22:41 core.n019.47700.tdm_arkesp.47700
-rw-r----- 1 squser4 seaquest 1089974176 Mar 12 22:42 core.n019.47847.tdm_arkesp.47847
-rw-r----- 1 squser4 seaquest 952381552 Mar 12 22:37 core.n020.51741.tdm_arkesp.51741
-rw-r----- 1 squser4 seaquest 1120766072 Mar 12 22:39 core.n020.51742.tdm_arkesp.51742
-rw-r----- 1 squser4 seaquest 1088406128 Mar 12 22:41 core.n020.63628.tdm_arkesp.63628
-rw-r----- 1 squser4 seaquest 1086942176 Mar 12 22:42 core.n020.63775.tdm_arkesp.63775

However, not all tdm_arkesp processes were captured…here are some that appear to have been missed…wonder if the query woke up before we hit these?

n015: /home/squser4/trafodion_tools/cores.Z051CYR.150312223718/core.n015.16155.tdm_arkesp.1q4uQr:4: Error in sourced command file:
n015: ptrace: No such process.
n015: gcore: failed to create /home/squser4/trafodion_tools/cores.Z051CYR.150312223718/core.n015.16155.tdm_arkesp.16155
n014: /home/squser4/trafodion_tools/cores.Z051CYR.150312223718/core.n014.11103.tdm_arkesp.qkqI0w:4: Error in sourced command file:
n014: ptrace: No such process.
n014: gcore: failed to create /home/squser4/trafodion_tools/cores.Z051CYR.150312223718/core.n014.11103.tdm_arkesp.11103
n016: /home/squser4/trafodion_tools/cores.Z051CYR.150312223718/core.n016.42292.tdm_arkesp.p9kNa5:4: Error in sourced command file:
n016: ptrace: No such process.
n016: gcore: failed to create /home/squser4/trafodion_tools/cores.Z051CYR.150312223718/core.n016.42292.tdm_arkesp.42292

Additional query information was captured on 040115 by Gary Hall:

OK, I’m back to researching the “stalled” HPIT issue again – now using git150331 with a UTT from Joanie in the hopes of capturing more information. I’ve captured three different “stall” situations tonight, but in all cases the query appears to have completed before all tdm_arkesp process cores had been written, so I’m not sure we’ve captured what is needed to identify the reason for the “stall”. Here are the details for the first run – I can supply the details on the other two “stalled” runs if of value to you. Looking to Joanie and Mile primarily for deciphering of the captured information.

Run window:
Streams running : 2015-04-01 02:00:06
Telling all streams to stop : 2015-04-01 02:05:06
02:16:13 – Message “There was a problem reading from the server” processed in client (lost connection). Assuming it was because of all cores being gathered at that time…possible?

Delayed queries as logged in driver log:
2015-04-01 02:10:40 INFO - 10> Query Number : 5 , Query Type : query07 , Records Returned : 5548 , Elasped Time (sec) : 464.210515
2015-04-01 02:16:12 INFO - 7> Query Number : 12 , Query Type : query02 , Records Returned : 506 , Elasped Time (sec) : 737.760386
2015-04-01 02:16:13 ERROR -
7>----------------------------------------------------------
7> Unexpected EXCEPTION : There was a problem reading from the server
7> Stream Number : 7
7>----------------------------------------------------------

org.trafodion.jdbc.t4.HPT4Exception: There was a problem reading from the server
        at org.trafodion.jdbc.t4.HPT4Messages.createSQLException(HPT4Messages.java:284)
        at org.trafodion.jdbc.t4.InputOutput.doIO(InputOutput.java:373)
        at org.trafodion.jdbc.t4.T4Connection.getReadBuffer(T4Connection.java:154)
        at org.trafodion.jdbc.t4.T4Connection.TerminateDialogue(T4Connection.java:278)
        at org.trafodion.jdbc.t4.InterfaceConnection.close(InterfaceConnection.java:1184)
        at org.trafodion.jdbc.t4.TrafT4Connection.close(TrafT4Connection.java:1474)
        at org.trafodion.jdbc.t4.TrafT4Connection.close(TrafT4Connection.java:97)
        at DatabaseConnector.closeDatabaseConnection(DatabaseConnector.java:167)
        at HPITWorkload2.close(HPITWorkload2.java:537)
        at WorkloadDriver.run(WorkloadDriver.java:693)

Cores stored in /home/squser4/trafodion_tools/cores.Z070EWZ.150401021248/ directory.

Query:
explain options 'f' select /* HPIT Query 02 using template Q2 */ Period.Customer, Period.Net_Revenue/1000 Net_Revenue, (case when Prior_Period.Net_Revenue = 0 then 0 else ((Period.Net_Revenue - Prior_Period.Net_Revenue)/Prior_Period.Net_Revenue) end) Prior_Period_Percent, (case when Prior_Year.Net_Revenue = 0 then 0 else ((Period.Net_Revenue - Prior_Year.Net_Revenue)/Prior_Year.Net_Revenue) end) YOY_Growth from ( select cust.CUST_AMID_LVL_4_NM as Customer, SUM(fact.REV_NET_REV_US_DLR_AM) Net_Revenue from HPIT_PERF_SUM_F fact, HPIT_DT_MTH_D dmth, HPIT_CUST_ACCT_HIER_D cust, HPIT_CTRY_D geo, HPIT_RTE_TO_MKT_D rtm, HPIT_PROD_LN_D prod where fact.DT_MTH_KY = dmth.DT_MTH_KY AND fact.SLDT_CUST_ACCT_HIER_KY = cust.CUST_ACCT_HIER_KY AND fact.CTRY_KY = geo.CTRY_KY AND fact.QTA_PROD_LN_KY = prod.PROD_LN_KY AND fact.RTE_TO_MKT_KY = rtm.RTE_TO_MKT_KY AND rtm.SLS_CHNL_RTE_NM = _UCS2'Direct' AND prod.PROD_HIER_VALU_VOL_NM = _UCS2'Volume' AND geo.ISO_CTRY_NM = _UCS2'United States' AND dmth.FISC_YR_QTR_CD = '2008Q4' Group By cust.CUST_AMID_LVL_4_NM ) Period, ( select cust.CUST_AMID_LVL_4_NM as Customer, SUM(fact.REV_NET_REV_US_DLR_AM) Net_Revenue from HPIT_PERF_SUM_F fact, HPIT_DT_MTH_D dmth, HPIT_CUST_ACCT_HIER_D cust, HPIT_CTRY_D geo, HPIT_RTE_TO_MKT_D rtm, HPIT_PROD_LN_D prod where fact.DT_MTH_KY = dmth.DT_MTH_KY AND fact.SLDT_CUST_ACCT_HIER_KY = cust.CUST_ACCT_HIER_KY AND fact.CTRY_KY = geo.CTRY_KY AND fact.QTA_PROD_LN_KY = prod.PROD_LN_KY AND fact.RTE_TO_MKT_KY = rtm.RTE_TO_MKT_KY AND rtm.SLS_CHNL_RTE_NM = _UCS2'Direct' AND prod.PROD_HIER_VALU_VOL_NM = _UCS2'Volume' AND geo.ISO_CTRY_NM = _UCS2'United States' AND dmth.FISC_YR_QTR_CD = '2008Q3' Group By cust.CUST_AMID_LVL_4_NM ) Prior_Period, ( select cust.CUST_AMID_LVL_4_NM as Customer, SUM(fact.REV_NET_REV_US_DLR_AM) Net_Revenue from HPIT_PERF_SUM_F fact, HPIT_DT_MTH_D dmth, HPIT_CUST_ACCT_HIER_D cust, HPIT_CTRY_D geo, HPIT_RTE_TO_MKT_D rtm, HPIT_PROD_LN_D prod where fact.DT_MTH_KY = dmth.DT_MTH_KY AND fact.SLDT_CUST_ACCT_HIER_KY = cust.CUST_ACCT_HIER_KY AND fact.CTRY_KY = geo.CTRY_KY AND fact.QTA_PROD_LN_KY = prod.PROD_LN_KY AND fact.RTE_TO_MKT_KY = rtm.RTE_TO_MKT_KY AND rtm.SLS_CHNL_RTE_NM = _UCS2'Direct' AND prod.PROD_HIER_VALU_VOL_NM = _UCS2'Volume' AND geo.ISO_CTRY_NM = _UCS2'United States' AND dmth.FISC_YR_QTR_CD = '2007Q4' Group By cust.CUST_AMID_LVL_4_NM ) Prior_Year where Period.Customer = Prior_Period.Customer AND Period.Customer = Prior_Year.Customer Order By 2 desc;

/opt/hp/squser4/git150331/export/limited-support-tools/LSO> offender -s in_sql
Trafodion Conversational Interface 1.1.0
(c) Copyright 2014 Hewlett-Packard Development Company, LP.
>>>>>>>>+>+>+>+>+>+>+>+>+>+>+>+>+>
CURRENT_TIMESTAMP TIME_IN_SECONDS QUERY_ID

EXECUTE_STATE SOURCE_T
EXT
-------------------------- -------------------- ----------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------ ------------------------------ --------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------

2015-04-01 02:15:33.939428 698 MXID11007018269212294613248322679000000000206U3333300_1754_SQL_CUR_1

OPEN "select Period.Customer, Period.Net_Revenue/1000 Net_Revenue, (case when Prior_Period.Net_Revenue = 0 then 0 else ((Period.Net_Revenue - Prior_Period.Net_Revenue)/Prior_Period.Net_Revenue) end) Prior_Period_Percent, (case when Prior_Year.Net_Revenue = 0 t

--- 1 row(s) selected.

Analysis of the query for 040115 confirmed that this select query was a non-DTM-transactional query.

We are trying to reproduce this problem again on Zircon4.

We have an hbase-trx UTT in place to provide additional DTM logging for scans should the next stalled query be transactional. We also will run with SQL C++/JNI/JAVA logging. We want to confirm that the scan requests are successful through the SQL stack.

If it is, we can then focus on hbase DTM transactional or non-DTM-transactional RPC to determine if a call is not receiving a reply.

Tags:

Joanie Cooper (joanie-cooper) on 2015-04-01

Changed in trafodion:
status:	New → In Progress

Revision history for this message

Joanie Cooper (joanie-cooper) wrote on 2015-04-02:

Download full text (19.0 KiB)

The stall was repeated on zircon4 using an instrumented hbase-trx UTT for additional DTM logging.

The following is an email exchange recording details of the stall and the subsequent RS log analysis.

From: Cooper, Joanie
Sent: Thursday, April 02, 2015 12:44 PM
To: Hall, Gary; Hanlon, Mike; Groulx, Guy
Cc: Messavussu, John; Seaquest RD Systems Engineering Services; Sundaresan, Sandhya; Govindarajan, Selvaganes; Du, Justin; Mishra, Atanu; Goyal, Narendra (Trafodion R&D)
Subject: RE: ZIRCON4 - stalled HPIT query - new captured data

Hi,

The DTM instrumentation to the logs was very helpful.

The immediate review on the RS logs, looks like we have 88 openScanner, followed by 88 performScan, followed by 88 closeScanner TrxRegionEndpoint protobuf calls. The DTM trxClient protocol is openScanner/performScan/closeScanner. I have a matching ENTER and EXIT log for each call. As we have 88 ENTER's and 88 EXIT's for each type, I don't see anything that seems to be left hanging. I also only seem to have DTM-transactional scans for the TRAFODION.JAVABENCH.SB_HISTOGRAMS table.

I don’t have any HPIT table entries for a DTM-transactional scan operation.

With this info, we could move on to looking for general HBase logging.

For the time period we are interested in, “2015-04-02 17:46:27” to “2015-04-02 17:57:18”, we have a ten minute period.

We also have numerous RPC scanner timeout messages for the time period in question.

n014: hbase-hbase-regionserver-n014.out:2015-04-02 17:50:02,455 WARN [B.Default
RpcServer.handler=101,queue=23,port=60020] ipc.RpcServer: (responseTooSlow): {"p
rocessingtimems":60804,"call":"Scan(org.apache.hadoop.hbase.protobuf.generated.C
lientProtos$ScanRequest)","client":"172.31.0.17:58349","starttimems":14279969416
51,"queuetimems":0,"class":"HRegionServer","responsesize":1748,"method":"Scan"}

In the past, we have seen some lengthy garbage collection around the same time that
we see timeout messages.

But, I didn’t spot any in this time period. There were a couple earlier, but not around our ten
minute window:

n019: hbase-hbase-regionserver-n019.out.2:2015-04-01 14:27:40,089 WARN [RS_OPEN
_REGION-n019:60020-2.oldTransactionCleaner] util.Sleeper: We slept 70804ms inste
ad of 60000ms, this is likely due to a long garbage collecting pause and it's us
ually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

I thought the “10 minute” time frame might match up to our usual hbase configuration
settings for "hbase.client.scanner.timeout.period" and "hbase.rpc.timeout". But, on zircon4,
they are both set to one hour.

I did spot a couple of WAL warnings in the same time period:

The stall was repeated on zircon4 using an instrumented hbase-trx UTT for additional DTM logging.

The following is an email exchange recording details of the stall and the subsequent RS log analysis.

From: Cooper, Joanie 
Sent: Thursday, April 02, 2015 12:44 PM
To: Hall, Gary; Hanlon, Mike; Groulx, Guy
Cc: Messavussu, John; Seaquest RD Systems Engineering Services; Sundaresan, Sandhya; Govindarajan, Selvaganes; Du, Justin; Mishra, Atanu; Goyal, Narendra (Trafodion R&D)
Subject: RE: ZIRCON4 - stalled HPIT query - new captured data

Hi,

The DTM instrumentation to the logs was very helpful.

The immediate review on the RS logs, looks like we have 88 openScanner, followed by 88 performScan, followed by 88 closeScanner TrxRegionEndpoint protobuf calls.  The DTM trxClient protocol is openScanner/performScan/closeScanner.  I have a matching ENTER and EXIT log for each call.  As we have 88 ENTER's and 88 EXIT's for each type, I don't see anything that seems to be left hanging.  I also only seem to have DTM-transactional scans for the TRAFODION.JAVABENCH.SB_HISTOGRAMS table.

I don’t have any HPIT table entries for a DTM-transactional scan operation.

With this info, we could move on to looking for general HBase logging.

For the time period we are interested in, “2015-04-02 17:46:27” to “2015-04-02 17:57:18”, we have a ten minute period.

We also have numerous RPC scanner timeout messages for the time period in question.

In the past, we have seen some lengthy garbage collection around the same time that
we see timeout messages.

But, I didn’t spot any in this time period.  There were a couple earlier, but not around our ten
minute window:

I did spot a couple of WAL warnings in the same time period:

n013: hbase-hbase-regionserver-n013.out:2015-04-02 17:56:13,398 WARN  [MemStoreFlusher.0] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [911a68cbd0b3f0d526246cfcfddd792d]
n018: hbase-hbase-regionserver-n018.out:2015-04-02 17:55:25,685 WARN  [MemStoreFlusher.0] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [ce2a2de6c9fe9215ae6dc39071286b93]
n014: hbase-hbase-regionserver-n014.out:2015-04-02 17:55:19,772 WARN  [MemStoreFlusher.1] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [ddac489a0a945b3d85a8050f20f4fa36]
n014: hbase-hbase-regionserver-n014.out:2015-04-02 17:55:22,241 WARN  [MemStoreFlusher.0] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [21c8e598bc9c85842c44d4b66b2f612b]
n012: hbase-hbase-regionserver-n012.out:2015-04-02 17:55:07,726 WARN  [MemStoreFlusher.1] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [d1ac24c40bebfa10042e9e4e92361886]
n012: hbase-hbase-regionserver-n012.out:2015-04-02 17:55:14,271 WARN  [MemStoreFlusher.0] wal.FSHLog: Couldn't find oldest seqNum for the region we are about to flush: [1d1bf1ec76d796d27199b464e7331dd6]
n020: hbase-hbase-regionserver-n020.out:2015-04-02 17:55:11,761 WARN  [MemStoreFlusher.1] wal.FSHLog: Couldn't find oldest seqNum fo

But, the error message we are ultimately receiving does indicate we exhausted our RPC retries.  
This does match the warnings I was seeing in the RS logs for this timeframe.

I didn’t find any other WARN, INFO, ERROR or FATAL RS messages of note for this timeframe.

It may be that we are hitting an HBase issue causing the scan request to eventually timeout
and SQL correctly reports the error.

If this is confirmed in the SQL core files, then our only course of action is to do a little
more research on how RPC timeouts occur.  Setting the timeout to one hour did not
help in this situation.

Thoughts?

Thanks,
   Joanie

From: Hall, Gary 
Sent: Thursday, April 02, 2015 11:42 AM
To: Hanlon, Mike; Groulx, Guy; Cooper, Joanie
Cc: Messavussu, John; Seaquest RD Systems Engineering Services; Sundaresan, Sandhya; Govindarajan, Selvaganes; Du, Justin
Subject: ZIRCON4 - stalled HPIT query - new captured data

Joanie supplied a new UTT, I reran the test set, and now have a new set of cores to be evaluated.  It required repeating the test three times before a stall of sufficient length occurred, but the nice thing is all the cores had been written before the stall condition finished (so we should have a good core sample here).  Joanie also checked via DTMCI that there were no active trans IDs at the time of the stall.

The query in question just happened to be the same query as experienced yesterday – HPIT query02.  This is purely random – we have seen several of the HPIT transactions be involved in the stall situation.

Test failure core at /local/cores/1005/core.1427997439.n013.29693.mxosrvr
Cores capture by Guy’s script (the mxosrvr and all corresponding tdm_arkesp processes have been stored in /home/squser4/trafodion_tools/cores.Z020P8D.150402175341/

The stalled query occurred on driver stream #4 – looking at the time log for this stream, the stalled query was started at approx.: 2015-04-02 17:46:27 (immediately after the previous query for this stream had finished).  We do not see an end line item for the query, as this time we have an HBase error message returned at 17:57:18 stating retries exhausted for this query attempt.  The request to stop all queries was initiated at 17:49:05, so this query failed after being stalled for over 7 minutes (and possibly longer…the query started at 17:46:27, and the average response time for query02 is 15 seconds).

/home/squser4/JavaBench/test.zircon4/log/150402.1737.run.HPIT.releasetest> grep "\- 4>" 150402.1737.10A.driver_0.log4j.log 
2015-04-02 17:44:05 INFO  - 4>  Execution Order : 8 , 4 , 9 , 1 , 2 , 7 , 5 , 6 , 3 , 10 , 11
2015-04-02 17:44:35 INFO  - 4> Query Number : 1 ,  Query Type : query08 , Records Returned : 723 , Elasped Time (sec) :  30.175968
2015-04-02 17:44:38 INFO  - 4> Query Number : 2 ,  Query Type : query04 , Records Returned : 31 , Elasped Time (sec) :   2.830527
2015-04-02 17:44:53 INFO  - 4> Query Number : 3 ,  Query Type : query09 , Records Returned : 193 , Elasped Time (sec) :  14.807569
2015-04-02 17:46:27 INFO  - 4> Query Number : 4 ,  Query Type : query01 , Records Returned : 5 , Elasped Time (sec) :  94.396965

2015-04-02 17:57:18 ERROR - 
>----------------------------------------------------------
> Unexpected SQLEXCEPTION,  Error Code = -8448
>   Stream Number : 4 , Query Number : 5 ,  Query Type : query02
>   IT Query 2 : select Period.Customer, Period.Net_Revenue/1000 Net_Revenue, (case when Prior_Period.Net_Revenue = 0 then 0 else ((Period.Net_Revenue - Prior_Period.Net_Revenue)/Prior_Period.Net_Revenue) end) Prior_Period_Percent, (case when Prior_Year.Net_Revenue = 0 then 0 else ((Period.Net_Revenue - Prior_Year.Net_Revenue)/Prior_Year.Net_Revenue) end) YOY_Growth from ( select cust.CUST_AMID_LVL_4_NM as Customer, SUM(fact.REV_NET_REV_US_DLR_AM) Net_Revenue from trafodion.javabench.HPIT_PERF_SUM_F fact, trafodion.javabench.HPIT_DT_MTH_D dmth, trafodion.javabench.HPIT_CUST_ACCT_HIER_D cust, trafodion.javabench.HPIT_CTRY_D geo, trafodion.javabench.HPIT_RTE_TO_MKT_D rtm, trafodion.javabench.HPIT_PROD_LN_D prod where fact.DT_MTH_KY = dmth.DT_MTH_KY AND fact.SLDT_CUST_ACCT_HIER_KY = cust.CUST_ACCT_HIER_KY AND fact.CTRY_KY = geo.CTRY_KY AND fact.QTA_PROD_LN_KY = prod.PROD_LN_KY AND fact.RTE_TO_MKT_KY = rtm.RTE_TO_MKT_KY AND rtm.SLS_CHNL_RTE_NM = _UCS2'Direct' AND prod.PROD_HIER_VALU_VOL_NM = _UCS2'Volume' AND geo.ISO_CTRY_NM = _UCS2'United States' AND dmth.FISC_YR_QTR_CD = '2008Q4' Group By cust.CUST_AMID_LVL_4_NM ) Period, ( select cust.CUST_AMID_LVL_4_NM as Customer, SUM(fact.REV_NET_REV_US_DLR_AM) Net_Revenue from trafodion.javabench.HPIT_PERF_SUM_F fact, trafodion.javabench.HPIT_DT_MTH_D dmth, trafodion.javabench.HPIT_CUST_ACCT_HIER_D cust, trafodion.javabench.HPIT_CTRY_D geo, trafodion.javabench.HPIT_RTE_TO_MKT_D rtm, trafodion.javabench.HPIT_PROD_LN_D prod where fact.DT_MTH_KY = dmth.DT_MTH_KY AND fact.SLDT_CUST_ACCT_HIER_KY = cust.CUST_ACCT_HIER_KY AND fact.CTRY_KY = geo.CTRY_KY AND fact.QTA_PROD_LN_KY = prod.PROD_LN_KY AND fact.RTE_TO_MKT_KY = rtm.RTE_TO_MKT_KY AND rtm.SLS_CHNL_RTE_NM = _UCS2'Direct' AND prod.PROD_HIER_VALU_VOL_NM = _UCS2'Volume' AND geo.ISO_CTRY_NM = _UCS2'United States' AND dmth.FISC_YR_QTR_CD = '2008Q3' Group By cust.CUST_AMID_LVL_4_NM ) Prior_Period, ( select cust.CUST_AMID_LVL_4_NM as Customer, SUM(fact.REV_NET_REV_US_DLR_AM) Net_Revenue from trafodion.javabench.HPIT_PERF_SUM_F fact, trafodion.javabench.HPIT_DT_MTH_D dmth, trafodion.javabench.HPIT_CUST_ACCT_HIER_D cust, trafodion.javabench.HPIT_CTRY_D geo, trafodion.javabench.HPIT_RTE_TO_MKT_D rtm, trafodion.javabench.HPIT_PROD_LN_D prod where fact.DT_MTH_KY = dmth.DT_MTH_KY AND fact.SLDT_CUST_ACCT_HIER_KY = cust.CUST_ACCT_HIER_KY AND fact.CTRY_KY = geo.CTRY_KY AND fact.QTA_PROD_LN_KY = prod.PROD_LN_KY AND fact.RTE_TO_MKT_KY = rtm.RTE_TO_MKT_KY AND rtm.SLS_CHNL_RTE_NM = _UCS2'Direct' AND prod.PROD_HIER_VALU_VOL_NM = _UCS2'Volume' AND geo.ISO_CTRY_NM = _UCS2'United States' AND dmth.FISC_YR_QTR_CD = '2007Q4' Group By cust.CUST_AMID_LVL_4_NM ) Prior_Year where Period.Customer = Prior_Period.Customer AND Period.Customer = Prior_Year.Customer Order By 2 desc
> *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause: 
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
Thu Apr 02 17:56:59 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, java.io.IOException: Call to n020.cm.cluster
/172.31.0.20:60020 failed on local exception: java.io.IOException: Premature EOF from inputStream
Thu Apr 02 17:57:15 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location

java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:592)
. [2015-04-02 17:57:18]
>----------------------------------------------------------
org.trafodion.jdbc.t4.HPT4Exception: *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause: 
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
Thu Apr 02 17:56:59 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, java.io.IOException: Call to n020.cm.cluster/172.31.0.20:60020 failed on local exception: java.io.IOException: Premature EOF from inputStream
Thu Apr 02 17:57:15 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location

java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:592)
. [2015-04-02 17:57:18]
        at org.trafodion.jdbc.t4.HPT4Messages.throwSQLException(HPT4Messages.java:206)
        at org.trafodion.jdbc.t4.InterfaceResultSet.fetch(InterfaceResultSet.java:808)
        at org.trafodion.jdbc.t4.TrafT4ResultSet.next(TrafT4ResultSet.java:2870)
       at HPITWorkload2.runQuery(HPITWorkload2.java:422)
        at WorkloadDriver.run(WorkloadDriver.java:524)
2015-04-02 17:57:18 ERROR - 
4>----------------------------------------------------------
4> Unexpected EXCEPTION : *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause: 
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
Thu Apr 02 17:56:59 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, java.io.IOException: Call to n020.cm.cluster/172.31.0.20:60020 failed on local exception: java.io.IOException: Premature EOF from inputStream
Thu Apr 02 17:57:15 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location

java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:592)
. [2015-04-02 17:57:18]
4>   Stream Number : 4
4>----------------------------------------------------------

org.trafodion.jdbc.t4.HPT4Exception: *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause: 
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.client.RetriesExhaustedException: Failed after attempts=2, exceptions:
Thu Apr 02 17:56:59 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, java.io.IOException: Call to n020.cm.cluster
/172.31.0.20:60020 failed on local exception: java.io.IOException: Premature EOF from inputStream
Thu Apr 02 17:57:15 GMT 2015, org.apache.hadoop.hbase.client.RpcRetryingCaller@2f5a497, org.apache.hadoop.hbase.client.RetriesExhaustedException: Can't get the location

Offender feedback:

/opt/hp/squser4/git150401/export/limited-support-tools/LSO> offender -s in_sql
Trafodion Conversational Interface 1.1.0
(c) Copyright 2014 Hewlett-Packard Development Company, LP.
>>>>>>>>+>+>+>+>+>+>+>+>+>+>+>+>+>
CURRENT_TIMESTAMP           TIME_IN_SECONDS       QUERY_ID                                                                          
                                                                                                                                    
                                                                                                                                    
                                                                                                                                    
                                                                                                                                    
                                                                                            EXECUTE_STATE                   SOURCE_T
EXT
--------------------------  --------------------  ----------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------  ------------------------------  --------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------

2015-04-02 17:51:51.649450                   325  MXID11002029693212294756314431024000000000206U3333300_1629_SQL_CUR_1              
                                                                                                                                    
                                                                                                                                    
                                                                                                                                    
                                                                                                                                    
                                                                                            OPEN                            "select 
Period.Customer, Period.Net_Revenue/1000 Net_Revenue, (case when Prior_Period.Net_Revenue = 0 then 0 else ((Period.Net_Revenue - Pri
or_Period.Net_Revenue)/Prior_Period.Net_Revenue) end) Prior_Period_Percent, (case when Prior_Year.Net_Revenue = 0 t

--- 1 row(s) selected.
>>

Atanu Mishra (atanu-mishra) on 2015-04-28

Changed in trafodion:
milestone:	none → r2.0
tags:	added: sql-exe

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.