ExpHbaseInterface::nextRow sees OutOfOrderScannerNextException on hbase 0.98

Bug #1378127 reported by Weishiun Tsai on 2014-10-06
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Fix Released
Critical
Atanu Mishra

Bug Description

When running QA regression tests, ExpHbaseInterface::nextRow() would often run into hbase OutOfOrderScannerNextException on hbase 0.98. The following shows an example of this exception:

SQL>Insert Into BTA1P001
Values (
'ABAA', 0, 0, 'CAAAAAAA', -- (0)
68, 'AAAA', 1, 0,
2, 'AA', 2,
11, 3, 'BCAAHAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA',
11,
.0, .2, 'ABAA', .6, .0,
0, 8, 8, 'AA', 68, -- (5)
6, 'CBAAAAAA', 6,
1, .1,
'BIAAAAABAAAAAAAA', 1.1, 11, 626,
'BCAAHAAAAAAAAAAA', .0,
'BIAAAAAB', 8, 8, 10,
'ABAA', 10, .00011, .00011 , 11,
'CA', 626,
'AB', 6, 12,
'ABAA', 1968, 468, 69, 1, 9, 9,
'CB', 1, .03,
'ABAA', 11, .01, .06, 6.26,
'BIAAAAAB', 1968, 8,
'CBAAAAAAAAAAAAAA', 68, 2369,
'CBAAAAAAAAAAAAAA', 18, 18, 3,
'AAAAAAAA', 1, 1.1, 1.1,20,
'ABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
)
;
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:451)
. [2014-10-05 21:47:01]
*** ERROR[4082] Object TRAFODION.ARKCASE_ARKT1114.BTA1P001 does not exist or is inaccessible. [2014-10-05 21:47:01]

This is seen on the v1004 build installed on clusters. It may require some new hbase settings so that Trafodion would work more smoothly on hbase 0.98. At this moment, more research is needed to figure out what these settings are.

This is failing our tests quite frequently and quite randomly, so a critical bug report is created to track this problem.

Changed in trafodion:
assignee: nobody → khaled Bouaziz (khaled-bouaziz)
Changed in trafodion:
assignee: khaled Bouaziz (khaled-bouaziz) → Suresh Subbiah (suresh-subbiah)
Suresh Subbiah (suresh-subbiah) wrote :

From https://issues.apache.org/jira/browse/HBASE-11295

So I think I figured out the problem. It is that when a scan request takes too long to process the RPC connection times out. It is not a client timeout issue as there are retries form the client, and it seems like when another RPC connection is reestablished the nextCallSeq information on the client side is lost. Increasing RPC timeout and decreasing scanner caching both work but they also impose performance penalty so I am working to find a way around that.

For Trafodion we could try with these 2 settings in hbase-site.xml

hbase.client.scanner.timeout.period
300000

hbase.rpc.timeout
300000

Also this cqd should help

cqd hbase_num_cache_rows_max '1000' ; -- we could also try 5000 here, current default is 10,000

Changed in trafodion:
status: New → In Progress
Suresh Subbiah (suresh-subbiah) wrote :

From $MYSQROOT/logs/mvqr.log we see about 94 occurrences of this error on the test machine over the last 2 days. The instabce recorded here resulted in a table not error from Trafodion implying that we were scanning the OBJECTS table or its index. This table has about 1000 rows in the test machine and therefore it seems unlikely that any of the workarounds suggested above could actually help. The regionserver is showing no signs of this error in its logs. The 94 errors occurred in 2 waves, the first from 10/05 15:20 - 15:45, and the second from 10/05 22: 46 to 10/06 00:26. These could correspond simply periods when the test was run or more significantly when the system was under some kind of stress.

Next time tests are run it will help if we had a record of every time the error was raised, so that we know which table was being accessed. Also we can run with the setting log4j.category.HBase=DEBUG in $MY_SQROOT/logs/log4cpp.mvqr.config, to collect more debug information

Changed in trafodion:
assignee: Suresh Subbiah (suresh-subbiah) → Atanu Mishra (atanu-mishra)
milestone: none → r0.9
tags: added: dtm
removed: sql-exe
tags: added: hbase
removed: dtm
Changed in trafodion:
status: In Progress → Triaged
milestone: r0.9 → none
Weishiun Tsai (wei-shiun-tsai) wrote :

The QA testing is not seeing this problem anymore ever since the problem https://bugs.launchpad.net/trafodion/+bug/1378926 "SQL statements return error 8448 with java.lang.OutOfMemoryError" was fixed. The theory is that the OutOfMemory problem had somehow stressed the system resource to cause this problem. This case will be closed for now. It can be reopened if it is seen again in the future.

Changed in trafodion:
status: Triaged → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers