Regionserver looping on active transaction.

Bug #1319965 reported by Guy Groulx
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Trafodion
Fix Released
High
Oliver Bucaojit

Bug Description

I was trying to create a scenario for Trina on the errors we are getting when a split and/or balance happens.
So I tried to use the split command on a table to force a split. It looked like the split command was not working. I started looking at the various logs and found that the regionserver containing my table goes into a loop about a transaction not completing.

Here’s my table.
Table Regions
Name Region Server Start Key End Key Requests
Name Region Server Start Key End Key Requests
TRAFODION.MXOLTP.TBL500,,1400169591672.616691fe476f230bfebe6b7a3907f0b8. n006.cm.cluster:60030 \x00\x07\xBF\x14
TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753. n006.cm.cluster:60030 \x00\x07\xBF\x14
IE: Currently both on n006.

Starting trafci:
SQL>set schema trafodion.mxoltp;

--- SQL operation complete.

SQL>prepare cmd from select count(*) from TBL500;

--- SQL command prepared.

SQL>execute cmd;

(EXPR)
--------------------
             1015500

--- 1 row(s) selected.

SQL>

In hbase shell, I enabled balancer.
split 'TRAFODION.MXOLTP.TBL500’, ‘\x00\x07\x00\x00’

And the regionserver on n006 starts looping on the following:
2014-05-15 18:20:02,910 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to split region TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.
2014-05-15 18:20:02,911 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:03,911 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:04,912 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:05,912 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:06,913 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:07,913 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:08,913 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:09,914 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:10,914 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:11,915 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
^C
[root@n006 hbase]#

Even if I stop my hpdci session, the msgs above continue.

I then have to bounce the regionserver to recover from this. And the spit never actually happens.

Tags: dtm
Changed in trafodion:
assignee: nobody → Oliver (oliver-bucaojit)
Revision history for this message
Oliver Bucaojit (oliver-bucaojit) wrote :

In the transactional region, we delay the split from occurring until there are no active transactions. The transactionstate is held by each region and work would need to be done to move the necessary data to the daughter region on split if we are to support region splitting.

Our current fix for this issue is to have regions delay splitting when there are preparing or active transactions during a split, which is why the log message is being repeated that it is stuck Preparing to close the transactional region. One solution that can be done on the user side is to check for open sqlci or jdbc connections with transactions running and close them. This will allow the region to split immediately.

Another solution that I've implemented would disable the split delaying and transactions that are in flight will then be aborted since they will not be able to communicate with the region after it has been split and relocated. This would be useful in development or if we want to split and wouldn't want the region to get stuck in a loop. I have seen cases where we get stuck in a loop where the C++ side DTM aborts or gets killed and a transaction remains on the HBase region. This property to disable the split delay is below, and is added to conf/hbase-site.xml:
      <property>
        <name>hbase.regionserver.region.split.delay</name>
        <value>false</value>
      </property>

If there is a case where there are no sqlci or transactions running and we see that the HBase region is stuck in this loop waiting for active transactions, and the TM is still running, then there is a bug and we will need to gather more logging and process information to debug this issue. One way to easily check if there is a transaction running from the TM perspective is through dtmci. Running dtmci and using the 'list' command will print out the current transactions and it's state.

Bouncing the system will also get HBase back into a normal state because there will be no active transactions at that point. If there were any prepared transactions then it would go through the recovery flow and get redriven to abort or commit.

Revision history for this message
Atanu Mishra (atanu-mishra) wrote :

Analysis by Narendra --

Later today, I would be checking in these changes to the datalake branch. Basically, only use the TransactionalRegion when the transid is not 0. That way, we will not ‘implicitly’ start a transaction with transid=0. This should help with the Launchpad bug 1319965.

What happens is that once we start a transaction (in the region server) with transid=0, it stays put (as the user did not start a transaction with ‘transid=0’), and hence when the user tries to split the region manually, the splitting does not happen (as the object transactionsById is not empty)

[I applied the same code updates to the other aggregator methods in this class (getMax/Min/Sum…)]

I am hoping that it would help with the bug 1309121 too.

Changed in trafodion:
status: New → In Progress
Changed in trafodion:
status: In Progress → Fix Committed
information type: Proprietary → Public
Alice Chen (alchen)
Changed in trafodion:
milestone: none → r0.8
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.