Comment 1 for bug 1319965

Oliver Bucaojit (oliver-bucaojit) wrote :

In the transactional region, we delay a split until there are no active transactions. Each region holds its own transaction state, so supporting a split while transactions are in flight would require work to move the necessary data to the daughter regions.

Our current fix for this issue is to have a region delay its split while there are preparing or active transactions, which is why the log keeps repeating the message that it is stuck Preparing to close the transactional region. One workaround on the user side is to check for open sqlci or JDBC connections with running transactions and close them. The region can then split immediately.
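The delay described above can be pictured with a small model. This is only an illustrative sketch, not the actual region code: the class and method names (Region, begin, end, request_split) are hypothetical, and the real implementation retries and logs on its own schedule.

```python
# Toy model of the split-delay behaviour described above. All names here
# (Region, request_split, ...) are hypothetical, not Trafodion/HBase classes.

class Region:
    def __init__(self, name):
        self.name = name
        self.active_txns = set()   # ids of preparing or active transactions
        self.split_done = False

    def begin(self, txn_id):
        self.active_txns.add(txn_id)

    def end(self, txn_id):
        # Called on commit or abort; discard() tolerates unknown ids.
        self.active_txns.discard(txn_id)

    def request_split(self):
        """Try to split; return True if the split went ahead."""
        if self.active_txns:
            # Stand-in for the repeated log message while the split waits.
            print(f"{self.name}: split delayed, "
                  f"{len(self.active_txns)} transaction(s) still open")
            return False
        self.split_done = True
        return True


region = Region("r1")
region.begin(101)
first_try = region.request_split()    # delayed: transaction 101 is open
region.end(101)                       # e.g. the sqlci session is closed
second_try = region.request_split()   # no open transactions, split proceeds
```

In this model the split only goes ahead once the last open transaction has ended, which mirrors why closing the sqlci or JDBC connections unblocks the region.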

Another solution, which I've implemented, disables the split delay. Transactions that are in flight are then aborted, since they can no longer communicate with the region once it has been split and relocated. This is useful in development, or whenever we want the split to go ahead rather than have the region get stuck in a loop. I have seen cases where the region gets stuck in such a loop because the C++ side DTM aborts or gets killed and a transaction remains on the HBase region. The property that disables the split delay is shown below; add it to conf/hbase-site.xml:
      <property>
        <name>hbase.regionserver.region.split.delay</name>
        <value>false</value>
      </property>
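
To confirm which setting a given hbase-site.xml actually carries, something like the following could be used. It is a rough sketch using only the Python standard library; the helper name split_delay_enabled is made up for illustration, and the default of True (delay enabled when the property is absent) is an assumption based on the behaviour described in this comment, not a documented default.

```python
import xml.etree.ElementTree as ET

SPLIT_DELAY_KEY = "hbase.regionserver.region.split.delay"


def split_delay_enabled(site_xml, default=True):
    """Return the split-delay setting found in hbase-site.xml text.

    default=True is an assumption: it reflects that delaying is the
    behaviour described when the property is not set.
    """
    root = ET.fromstring(site_xml.strip())
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == SPLIT_DELAY_KEY:
            value = prop.findtext("value", "").strip().lower()
            return value == "true"
    return default


sample = """
<configuration>
  <property>
    <name>hbase.regionserver.region.split.delay</name>
    <value>false</value>
  </property>
</configuration>
"""

delay_on = split_delay_enabled(sample)                   # property set to false
delay_default = split_delay_enabled("<configuration/>")  # property absent
```

For a real check, the text would be read from conf/hbase-site.xml instead of the inline sample.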

If no sqlci sessions or transactions are running, the TM is still running, and the HBase region is nevertheless stuck in this loop waiting for active transactions, then there is a bug, and we will need to gather more logging and process information to debug it. An easy way to check whether a transaction is running from the TM's perspective is dtmci: run dtmci and use the 'list' command to print the current transactions and their states.

Bouncing the system will also return HBase to a normal state, because there will be no active transactions at that point. Any prepared transactions will go through the recovery flow and be redriven to abort or commit.