Comment 2 for bug 1412487

Joanie Cooper (joanie-cooper) wrote :

We are able to reproduce this problem on a smaller cluster using a smaller number of streams.
HBase and DTM logging were enabled to help in the analysis of this problem.
A unique constraint error occurs when SQL receives a "false" return from a checkAndPut
trx-hbase coprocessor call to the region.
When a regionserver is killed, DTM recovery will begin for the transactions
in progress for the regionserver that is no longer running. HBase will begin
its recovery process as well.
When subsequent requests arrive before HBase and DTM recovery has fully
completed, the "checkAndPut" TrxRegionEndpoint coprocessor method can
encounter exceptions, such as java.io.IOException: NewTransactionStartedBeforeRecoveryCompleted.
Previously, the "checkAndPut" client/server coprocessor calls did not handle all exceptions;
they simply returned "false" to the caller. SQL interprets that "false" as a unique constraint error.
The "checkAndPut" and "checkAndDelete" client/server coprocessor calls have been enhanced
to capture all exceptions and return them instead of a "true" or "false" result.
This allows SQL to return an HBase error rather than a "unique constraint" error.
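
To illustrate the difference, here is a minimal Java sketch of the old and new
behavior. It is not the actual trx-hbase code: the class, interface, and method
names (CheckAndPutSketch, RegionCall, Result, legacyCheckAndPut,
enhancedCheckAndPut, toSqlOutcome) are hypothetical, and the outcome strings
only stand in for whatever SQL actually reports.

```java
import java.io.IOException;

// Hypothetical sketch; none of these names come from the trx-hbase source.
final class CheckAndPutSketch {

    /** Stand-in for the region-side checkAndPut coprocessor operation. */
    interface RegionCall {
        boolean checkAndPut() throws IOException;
    }

    /** Carries either the boolean outcome or the exception that was captured. */
    static final class Result {
        final boolean applied;
        final IOException error; // null when the call completed normally

        private Result(boolean applied, IOException error) {
            this.applied = applied;
            this.error = error;
        }

        static Result ok(boolean applied) { return new Result(applied, null); }
        static Result failed(IOException e) { return new Result(false, e); }
        boolean hasError() { return error != null; }
    }

    /** Old behavior: an exception is swallowed and reported as "false". */
    static boolean legacyCheckAndPut(RegionCall call) {
        try {
            return call.checkAndPut();
        } catch (IOException e) {
            return false; // indistinguishable from a genuine check failure
        }
    }

    /** New behavior: the exception is captured and handed back to the caller. */
    static Result enhancedCheckAndPut(RegionCall call) {
        try {
            return Result.ok(call.checkAndPut());
        } catch (IOException e) {
            return Result.failed(e);
        }
    }

    /** Assumed mapping from a call result to what the SQL layer would report. */
    static String toSqlOutcome(Result r) {
        if (r.hasError()) {
            return "HBase error: " + r.error.getMessage();
        }
        return r.applied ? "OK" : "unique constraint violation";
    }
}
```

With a call that throws during recovery, legacyCheckAndPut yields "false", which
is where the misleading unique constraint error came from, while
enhancedCheckAndPut keeps the exception so it can be reported as an HBase error.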
This fix was delivered, along with other DTM recovery fixes, as part of the commit of job
1021 to the core thread.