Random update statistics failures with hbase OutOfOrderScannerNextException

Bug #1391271 reported by Weishiun Tsai
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Trafodion
Fix Released
High
Barry Fritchman

Bug Description

During the QA table population with the v1029_0830 build and the v1107_0830 build on a Cloudera cluster, update statistics encountered this error 4 times. This OutOfOrderScannerNextException has not shown up for quite some time. Some recent changes in update statics may have caused the error to return again. This bug report is to document this problem, so that some research can be done in this area to figure out what in update statistics may have stressed hbase to return this error again.

Here is the 1 seen from the v1029_0830 build:

SQL>update statistics for table cbase on every column sample random 5 percent;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_WISC32.CBASE encountered an error (8448) from statement Process_Query. [2014-10-31 09:34:00]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-10-31 09:34:00]

Here are the 3 seen from the v1107_0830 build:

SQL>update statistics for table trafodion.g_tpcds1x.inventory on every column sample;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_TPCDS1X.INVENTORY encountered an error (8448) from statement Process_Query. [2014-11-08 13:01:55]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-11-08 13:01:55]

SQL>update statistics for table trafodion.g_tpcds1x.store_sales on every column sample;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_TPCDS1X.STORE_SALES encountered an error (8448) from statement Process_Query. [2014-11-08 13:14:29]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-11-08 13:14:29]

SQL>update statistics for table cbase on every column sample random 5 percent;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_WISC32.CBASE encountered an error (8448) from statement Process_Query. [2014-11-08 16:15:06]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-11-08 16:15:06]

Tags: sql-cmp
description: updated
description: updated
Changed in trafodion:
assignee: nobody → Barry Fritchman (barry-fritchman)
Revision history for this message
Weishiun Tsai (wei-shiun-tsai) wrote :

This is seen on the v1119_0830 build again:

SQL>update statistics for table cbase on every column sample random 5 percent;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_WISC32.CBASE encountered an error (8448) from statement Process_Query. [2014-11-20 14:00:01]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-11-20 14:00:01]

Revision history for this message
Weishiun Tsai (wei-shiun-tsai) wrote :

This seen on the v1209_0830 build on amethyst9 (a Cloudera machine) again:

SQL>update statistics for table trafodion.g_tpcds1x.inventory on every column sample;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_TPCDS1X.INVENTORY encountered an error (8448) from statement Process_Query. [2014-12-10 02:33:05]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-12-10 02:33:05]

*** MISMATCH *** expecting: '--- SQL operation complete.'

SQL>update statistics for table trafodion.g_tpcds1x.store_sales on every column sample;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_TPCDS1X.STORE_SALES encountered an error (8448) from statement Process_Query. [2014-12-10 02:44:37]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-12-10 02:44:37]

*** MISMATCH *** expecting: '--- SQL operation complete.'

SQL>update statistics for table bbase on every column sample random 5 percent;
*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_WISC32.BBASE encountered an error (8448) from statement Process_Query. [2014-12-10 05:06:29]
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
. [2014-12-10 05:06:29]

*** MISMATCH *** expecting: '--- SQL operation complete.'

Revision history for this message
Weishiun Tsai (wei-shiun-tsai) wrote :

Saw it two more times on the v1209_0830 build on amethyst9 (a Cloudera machine) while trying to rerun update stats on g_tpch2x.lineitem. The 3rd try finally succeeded. The g_tpch2x.lineitem table was already populated last night. Seeing this error this morning rules out the possibility that it has anything to do with how close we run update stats with the actual table populating.

>>update statistics for table lineitem on every column sample random 10 percent;

*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_TPCH2X.LINEITEM encountered an error (8448) from statement Process_Query.

*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
.

--- SQL operation failed with errors.
>>
>>update statistics for table lineitem on every column sample random 10 percent;

*** ERROR[9200] UPDATE STATISTICS for table TRAFODION.G_TPCH2X.LINEITEM encountered an error (8448) from statement Process_Query.

*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: org.apache.hadoop.hbase.DoNotRetryIOException: Failed after retry of OutOfOrderScannerNextException: was there a rpc timeout?
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
.

--- SQL operation failed with errors.
>>update statistics for table lineitem on every column sample random 10 percent;

--- SQL operation complete.

Changed in trafodion:
milestone: none → r1.0
Changed in trafodion:
milestone: r1.0 → r1.1
Revision history for this message
Weishiun Tsai (wei-shiun-tsai) wrote :

QA has not seen this problem in the later builds prior to the R1.0.0 release. Presumably this problem has been alleviated by several TM fixes done in R1.0.0. Dev has reported seeing OutOfOrderScannerNextException when running the dev CATMAN test. But it is hard to tell if they are due to the same problem. For now, we will lower the severity of this bug report from Critical to High, and continue to monitor it until the release of R1.1.0. If the problem does not show up again, the bug report will be closed at that time.

Changed in trafodion:
importance: Critical → High
Changed in trafodion:
milestone: r1.1 → r2.0
Revision history for this message
Trafodion-Gerrit (neo-devtools) wrote : Fix proposed to core (master)

Fix proposed to branch: master
Review: https://review.trafodion.org/1702

Changed in trafodion:
status: New → In Progress
Revision history for this message
Trafodion-Gerrit (neo-devtools) wrote : Fix merged to core (master)

Reviewed: https://review.trafodion.org/1702
Committed: https://github.com/trafodion/core/commit/40c23313ed94096c4b720b6f001f8f6f3f8e2e57
Submitter: Trafodion Jenkins
Branch: master

commit 40c23313ed94096c4b720b6f001f8f6f3f8e2e57
Author: Barry Fritchman <email address hidden>
Date: Wed May 27 16:23:49 2015 +0000

    Avoid scanner timeout for Update Statistics

    For performance reasons, Update Stats pushes sampling down into HBase,
    using a filter that returns only randomly selected rows. When the
    sampling rate is very low, as is the case when the default sampling
    protocol (which includes a sample limit of a million rows) is used on
    a very large table, a long time can be taken in the region server
    before returning to Trafodion, with the resultant risk of an
    OutOfOrderScannerNextException. To avoid these timeouts, this fix
    reduces the scanner cache size (the number of rows accumulated before
    returning) used by a given scan based on the sampling rate. If an
    adequate return time can not be achieved in this manner without
    going below the scanner cache minimum prescribed by the
    HBASE_NUM_CACHE_ROWS_MIN cqd, then the scanner cache reduction is
    complemented by a modification of the sampling rate used in HBase.
    The sampling rate used in HBase is increased, but the overall rate
    is maintained by doing supplementary sampling of the returned rows in
    Trafodion. For example, if the original sampling rate is .000001,
    and reducing the scanner cache to the minimum still results in an
    excessive average time spent in the region server, the sampling
    may be split into a .00001 rate in HBase and a .01 rate in Trafodion,
    resulting in the same effective .000001 overall rate.

    Change-Id: Id05ab5063c2c119c21b5c6c002ba9554501bb4e1
    Closes-Bug: #1391271

Changed in trafodion:
status: In Progress → Fix Committed
Revision history for this message
Weishiun Tsai (wei-shiun-tsai) wrote :

This problem has not been seen for a while. Close the bug report.

Changed in trafodion:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.