I was trying to create a scenario for Trina on the errors we are getting when a split and/or balance happens.
So I tried to use the split command on a table to force a split. It looked like the split command was not working. I started looking at the various logs and found that the regionserver containing my table goes into a loop about a transaction not completing.
Here’s my table.
Table Regions
Name Region Server Start Key End Key Requests
Name Region Server Start Key End Key Requests
TRAFODION.MXOLTP.TBL500,,1400169591672.616691fe476f230bfebe6b7a3907f0b8. n006.cm.cluster:60030 \x00\x07\xBF\x14
TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753. n006.cm.cluster:60030 \x00\x07\xBF\x14
IE: Currently both on n006.
Starting trafci:
SQL>set schema trafodion.mxoltp;
--- SQL operation complete.
SQL>prepare cmd from select count(*) from TBL500;
--- SQL command prepared.
SQL>execute cmd;
(EXPR)
--------------------
1015500
--- 1 row(s) selected.
SQL>
In hbase shell, I enabled balancer.
split 'TRAFODION.MXOLTP.TBL500’, ‘\x00\x07\x00\x00’
And the regionserver on n006 starts looping on the following:
2014-05-15 18:20:02,910 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to split region TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.
2014-05-15 18:20:02,911 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:03,911 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:04,912 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:05,912 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:06,913 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:07,913 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:08,913 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:09,914 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:10,914 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
2014-05-15 18:20:11,915 INFO org.apache.hadoop.hbase.regionserver.transactional.TransactionalRegion: Preparing to close transactional region [TRAFODION.MXOLTP.TBL500,\x00\x07\xBF\x14,1400169591672.dc2f54cc3e9aee81aa5635e704d17753.], but still have [0] transactions that are pending commit. And [ 1] active transactions. Sleeping
^C
[root@n006 hbase]#
Even if I stop my hpdci session, the msgs above continue.
I then have to bounce the regionserver to recover from this. And the spit never actually happens.
In the transactional region, we delay the split from occurring until there are no active transactions. The transactionstate is held by each region and work would need to be done to move the necessary data to the daughter region on split if we are to support region splitting.
Our current fix for this issue is to have regions delay splitting when there are preparing or active transactions during a split, which is why the log message is being repeated that it is stuck Preparing to close the transactional region. One solution that can be done on the user side is to check for open sqlci or jdbc connections with transactions running and close them. This will allow the region to split immediately.
Another solution that I've implemented would disable the split delaying and transactions that are in flight will then be aborted since they will not be able to communicate with the region after it has been split and relocated. This would be useful in development or if we want to split and wouldn't want the region to get stuck in a loop. I have seen cases where we get stuck in a loop where the C++ side DTM aborts or gets killed and a transaction remains on the HBase region. This property to disable the split delay is below, and is added to conf/hbase- site.xml:
<name> hbase.regionser ver.region. split.delay< /name>
<value> false</ value>
<property>
</property>
If there is a case where there are no sqlci or transactions running and we see that the HBase region is stuck in this loop waiting for active transactions, and the TM is still running, then there is a bug and we will need to gather more logging and process information to debug this issue. One way to easily check if there is a transaction running from the TM perspective is through dtmci. Running dtmci and using the 'list' command will print out the current transactions and it's state.
Bouncing the system will also get HBase back into a normal state because there will be no active transactions at that point. If there were any prepared transactions then it would go through the recovery flow and get redriven to abort or commit.