Trafodion

ExpHbaseInterface::nextRow sees OutOfOrderScannerNextException on hbase 0.98

Bug #1378127 reported by Weishiun Tsai on 2014-10-06

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Trafodion	Fix Released	Critical	Atanu Mishra

Bug Description

When running QA regression tests, ExpHbaseInterface::nextRow() would often run into hbase OutOfOrderScannerNextException on hbase 0.98. The following shows an example of this exception:

SQL>Insert Into BTA1P001
Values (
'ABAA', 0, 0, 'CAAAAAAA', -- (0)
68, 'AAAA', 1, 0,
2, 'AA', 2,
11, 3, 'BCAAHAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA',
11,
.0, .2, 'ABAA', .6, .0,
0, 8, 8, 'AA', 68, -- (5)
6, 'CBAAAAAA', 6,
1, .1,
'BIAAAAABAAAAAAAA', 1.1, 11, 626,
'BCAAHAAAAAAAAAAA', .0,
'BIAAAAAB', 8, 8, 10,
'ABAA', 10, .00011, .00011 , 11,
'CA', 626,
'AB', 6, 12,
'ABAA', 1968, 468, 69, 1, 9, 9,
'CB', 1, .03,
'ABAA', 11, .01, .06, 6.26,
'BIAAAAAB', 1968, 8,
'CBAAAAAAAAAAAAAA', 68, 2369,
'CBAAAAAAAAAAAAAA', 18, 18, 3,
'AAAAAAAA', 1, 1.1, 1.1,20,
'ABAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
)
;
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:451)
. [2014-10-05 21:47:01]
*** ERROR[4082] Object TRAFODION.ARKCASE_ARKT1114.BTA1P001 does not exist or is inaccessible. [2014-10-05 21:47:01]

This is seen on the v1004 build installed on clusters. It may require some new hbase settings so that Trafodion would work more smoothly on hbase 0.98. At this moment, more research is needed to figure out what these settings are.

This is failing our tests quite frequently and quite randomly, so a critical bug report is created to track this problem.

Tags:

Sandhya Sundaresan (sandhya-sundaresan) on 2014-10-06

Changed in trafodion:
assignee:	nobody → khaled Bouaziz (khaled-bouaziz)

Suresh Subbiah (suresh-subbiah) on 2014-10-07

Changed in trafodion:
assignee:	khaled Bouaziz (khaled-bouaziz) → Suresh Subbiah (suresh-subbiah)

Revision history for this message

Suresh Subbiah (suresh-subbiah) wrote on 2014-10-07:

From https://issues.apache.org/jira/browse/HBASE-11295

So I think I figured out the problem. It is that when a scan request takes too long to process the RPC connection times out. It is not a client timeout issue as there are retries form the client, and it seems like when another RPC connection is reestablished the nextCallSeq information on the client side is lost. Increasing RPC timeout and decreasing scanner caching both work but they also impose performance penalty so I am working to find a way around that.

For Trafodion we could try with these 2 settings in hbase-site.xml

hbase.client.scanner.timeout.period
300000

hbase.rpc.timeout
300000

Also this cqd should help

cqd hbase_num_cache_rows_max '1000' ; -- we could also try 5000 here, current default is 10,000

Changed in trafodion:
status:	New → In Progress

Revision history for this message

Suresh Subbiah (suresh-subbiah) wrote on 2014-10-07:

From $MYSQROOT/logs/mvqr.log we see about 94 occurrences of this error on the test machine over the last 2 days. The instabce recorded here resulted in a table not error from Trafodion implying that we were scanning the OBJECTS table or its index. This table has about 1000 rows in the test machine and therefore it seems unlikely that any of the workarounds suggested above could actually help. The regionserver is showing no signs of this error in its logs. The 94 errors occurred in 2 waves, the first from 10/05 15:20 - 15:45, and the second from 10/05 22: 46 to 10/06 00:26. These could correspond simply periods when the test was run or more significantly when the system was under some kind of stress.

Next time tests are run it will help if we had a record of every time the error was raised, so that we know which table was being accessed. Also we can run with the setting log4j.category.HBase=DEBUG in $MY_SQROOT/logs/log4cpp.mvqr.config, to collect more debug information

Atanu Mishra (atanu-mishra) on 2014-10-10

Changed in trafodion:
assignee:	Suresh Subbiah (suresh-subbiah) → Atanu Mishra (atanu-mishra)
milestone:	none → r0.9
tags:	added: dtm removed: sql-exe

Atanu Mishra (atanu-mishra) on 2014-10-14

tags:

added: hbase
removed: dtm

Atanu Mishra (atanu-mishra) on 2014-10-14

Changed in trafodion:
status:	In Progress → Triaged
milestone:	r0.9 → none

Revision history for this message

Weishiun Tsai (wei-shiun-tsai) wrote on 2014-10-15:

The QA testing is not seeing this problem anymore ever since the problem https://bugs.launchpad.net/trafodion/+bug/1378926 "SQL statements return error 8448 with java.lang.OutOfMemoryError" was fixed. The theory is that the OutOfMemory problem had somehow stressed the system resource to cause this problem. This case will be closed for now. It can be reopened if it is seen again in the future.

Changed in trafodion:
status:	Triaged → Fix Released

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.