Getting TM error 97 when our tables split or get moved.

Bug #1274651 reported by Guy Groulx
This bug affects 2 people
Affects: Trafodion
Status: Fix Released
Importance: Critical
Assigned to: John de Roo
Milestone: r1.1

Bug Description

Testing with transactions enabled.
Our system uses Hortonworks and runs HBase 0.94 across 12 nodes.
Our HBase max store size was 1 GB.

We noticed, while loading large tables, that we would get error 97 back from the TM and that the batch of rows was not added.

It turns out that our table was being split and that the TM does not handle splits at the moment.
We also found that after a split, the HBase balancer would move the new region to another region server. When this happened, we got more error 97s.

WORKAROUND (a sketch of the same changes through the Java client API follows this list):
- We changed the MAX STORE SIZE to 100 GB.
- We changed the SPLIT POLICY to org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy, which causes a split only once the max store size is reached. The default for HBase 0.94 and up is the new IncreasingToUpperBoundRegionSplitPolicy, which causes splits more often.
- We turned off the HBase balancer via the hbase shell.
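
For reference, a minimal sketch of the same three changes through the HBase 0.94 Java client API (we applied them via the hbase shell; the table name MYTABLE here is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class SplitWorkaround {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // Raise the max store size to 100 GB so size-based splits become rare.
        HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("MYTABLE"));
        desc.setMaxFileSize(100L * 1024 * 1024 * 1024);

        // Split only when the max store size is actually reached, instead of
        // the 0.94+ default IncreasingToUpperBoundRegionSplitPolicy.
        desc.setValue("SPLIT_POLICY",
                "org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy");

        admin.disableTable("MYTABLE");
        admin.modifyTable(Bytes.toBytes("MYTABLE"), desc);
        admin.enableTable("MYTABLE");

        // Stop the balancer from moving new regions between region servers
        // (same effect as 'balance_switch false' in the hbase shell).
        admin.balanceSwitch(false);

        admin.close();
    }
}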

Tags: dtm
Guy Groulx (guy-groulx)
tags: added: transaction
John de Roo (john-deroo)
Changed in trafodion:
status: New → In Progress
assignee: nobody → John de Roo (john-deroo)
John de Roo (john-deroo) wrote :

Revision 39062. Partial fix merged from the seatrans_2 branch to the datalake_64 branch: added code to the region close path that suspends region splits until there are no active transactions in the region. This can be configured through an environment variable; the default is on. The code already prohibits region splits while transactions are in phase 2, to ensure DB consistency.
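
This is not the actual Trafodion change, but a minimal sketch of the idea using an HBase RegionObserver: hold up a split while transactions are still active in the region. The environment-variable name TM_ENABLE_SPLIT_SUSPEND and the transaction counter are assumptions for illustration.

import java.util.concurrent.atomic.AtomicInteger;

import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;

public class TxnAwareSplitObserver extends BaseRegionObserver {

    // Hypothetical count of transactions active in this region, maintained
    // by the transactional endpoint (begin/commit/abort paths not shown).
    private final AtomicInteger activeTxns = new AtomicInteger(0);

    // Feature toggle defaulting to on, mirroring the environment variable
    // mentioned in the fix (the variable name here is an assumption).
    private static final boolean SUSPEND_SPLITS =
            !"0".equals(System.getenv("TM_ENABLE_SPLIT_SUSPEND"));

    @Override
    public void preSplit(ObserverContext<RegionCoprocessorEnvironment> c) {
        // Hold the split thread until every active transaction has drained,
        // so no in-flight transactional work is lost when the region closes.
        while (SUSPEND_SPLITS && activeTxns.get() > 0) {
            try {
                Thread.sleep(100);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                return; // stop waiting if the split thread is interrupted
            }
        }
    }
}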

tags: added: dtm
removed: transaction
Guy Groulx (guy-groulx)
Changed in trafodion:
importance: High → Critical
information type: Proprietary → Public
Changed in trafodion:
milestone: none → r1.0
Changed in trafodion:
milestone: r1.0 → r1.1
Guy Groulx (guy-groulx) wrote :

January 28th. During a longevity test, a client got:
1>----------------------------------------------------------
1> Unexpected EXCEPTION : *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: PerformScan error on coprocessor call, scannerID: 3082879
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:419)
. [2015-01-28 19:02:59]
1> Stream Number : 1
1>----------------------------------------------------------

102>----------------------------------------------------------
102> Unexpected EXCEPTION : *** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: PerformScan error on coprocessor call, scannerID: 0
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:419)
. [2015-01-28 19:03:00]
102> Stream Number : 102
102>----------------------------------------------------------

The region server log and trafodion.dtm.log files show that a split occurred on a table.
ZooKeeper recovery information was updated with a region and does not get cleared.

Other clients continued to work.

Oliver Bucaojit (oliver-bucaojit) wrote :

Our latest changes to set the closing flag seem to have fixed the split issues we have been seeing. I also have not heard of any new problems related to this bug, so I'll go ahead and mark it as resolved.

The closing-flag change was checked into the stable 1.0.1 branch on Feb 17 and into mainline on Feb 14, so it is in the released code.
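
For the record, a minimal sketch of the closing-flag idea (not the actual patch; all names here are hypothetical): the region-level transactional state sets a volatile flag as soon as the region begins closing for a split or a balancer move, and every transactional call checks it first, failing fast with a retryable error instead of racing the close.

import java.io.IOException;

public class TxnRegionState {

    // Set once the region begins closing (split or balancer move) and never
    // cleared, since the region itself is about to go away.
    private volatile boolean closing = false;

    // Called from the region close path before in-flight work is drained.
    public void markClosing() {
        closing = true;
    }

    // Called at the top of every transactional operation on this region so
    // callers fail fast with a retryable error instead of racing the close.
    public void checkNotClosing() throws IOException {
        if (closing) {
            throw new IOException(
                    "Region is closing; retry against the daughter or moved region");
        }
    }
}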

Changed in trafodion:
status: In Progress → Fix Released