HBase RegionServer failure causes DB to become inconsistent.

Bug #1412487 reported by Guy Groulx
Affects: Trafodion
Status: Fix Committed
Importance: Critical
Assigned to: Oliver Bucaojit
Milestone: r1.0

Bug Description

Running a version of ORDERENTRY on our system with high concurrency.
Because of a memory issue, a RegionServer eventually failed.

When we recovered the system, our OE table started returning "UNIQUE CONSTRAINT" errors, telling us that our DB is now inconsistent.

Tags: dtm
Atanu Mishra (atanu-mishra) wrote:

This may have all the information, except for the HBase logs --

In addition to the info currently captured by sqcollectlogs, I've also captured the following:

a) Latest HBase .out files for each node (placed in the /home/squser4/logs/sqinfo.20150118.0543/hbase/ directory)
b) Current Trafodion.dtm.log files for each node (placed in the /home/squser4/logs/sqinfo.20150118.0543/logs/ directory; search for Trafodion.dtm.n____.log files)
c) Latest gc log file for each RegionServer on each node (placed in the /home/squser4/logs/sqinfo.20150118.0543/logs/ directory; search for gc.log-20151117*.n____ files)

/home/squser4/trafodion_tools> sqcollectlogs -a
Collection in progress...
cp: cannot stat `/opt/hp/squser4/git150117/sql/scripts/mon.env': No such file or directory
cp: cannot stat `/opt/hp/squser4/git150117/sql/scripts/shell.env': No such file or directory
Collecting dp2 pstacks
[1] 29914
Collecting monitor pstacks
[2] 29915
Collecting mxosrvr pstacks
[3] 29916
Collecting tdm_arkcmp pstacks
[4] 29917
Collecting tdm_arkesp pstacks
[5] 29918
Collecting tm pstacks
[6] 29919
[1] Done $SQPDSHA "sqnpstack dp2 $PWD/dp2" 2> /dev/null
[4] Done $SQPDSHA "sqnpstack tdm_arkcmp $PWD/tdm_arkcmp" 2> /dev/null
[5]- Done $SQPDSHA "sqnpstack tdm_arkesp $PWD/tdm_arkesp" 2> /dev/null
[2] Done $SQPDSHA "sqnpstack monitor $PWD/monitor" 2> /dev/null
[3]- Done $SQPDSHA "sqnpstack mxosrvr $PWD/mxosrvr" 2> /dev/null
[6]+ Done $SQPDSHA "sqnpstack tm $PWD/tm" 2> /dev/null
Logs and pstacks collected in /home/squser4/logs/sqinfo.20150118_0543

Changed in trafodion:
assignee: nobody → Oliver Bucaojit (oliver-bucaojit)
milestone: none → r1.0
Changed in trafodion:
status: New → In Progress
Joanie Cooper (joanie-cooper) wrote:

We are able to reproduce this problem on a smaller cluster using a smaller number of streams. HBase and DTM logging was enabled to help analyze the problem.

A unique constraint error occurs when SQL receives a "false" return from a checkAndPut trx-hbase coprocessor call to the region.

When a RegionServer is killed, DTM recovery begins for the transactions that were in progress on the RegionServer that is no longer running, and HBase begins its own recovery process as well. If subsequent requests arrive before full HBase and DTM recovery has completed, the "checkAndPut" TrxRegionEndpoint coprocessor method can encounter exceptions, such as java.io.IOException: NewTransactionStartedBeforeRecoveryCompleted.

Previously, the "checkAndPut" client/server coprocessor calls did not handle all exceptions; they simply returned "false" to the caller, and that return caused SQL to post a unique constraint error. The "checkAndPut" and "checkAndDelete" client/server coprocessor calls have been enhanced to capture all exceptions and return them instead of a "true" or "false" result. This allows SQL to return an HBase error rather than a "unique constraint" error.

This fix was delivered, along with other DTM recovery fixes, as part of the commit of job 1021 to the core thread.
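
For illustration, here is a minimal, self-contained Java sketch of the server-side pattern described above. This is not the actual TrxRegionEndpoint code; CheckAndPutResponse and RegionOp are hypothetical stand-ins for the coprocessor's response message and the region-side work:

import java.io.IOException;

public class CheckAndPutSketch {

    /** Hypothetical response carrying either a boolean result or a captured exception. */
    static final class CheckAndPutResponse {
        final boolean succeeded;
        final IOException error; // non-null when the region threw instead of answering

        CheckAndPutResponse(boolean succeeded, IOException error) {
            this.succeeded = succeeded;
            this.error = error;
        }
    }

    /** Stand-in for the region-side check-and-put work, which may throw during recovery. */
    interface RegionOp {
        boolean run() throws IOException;
    }

    // Old behavior: every exception collapsed into "false", so SQL reported a
    // unique constraint error even when the region was merely mid-recovery.
    static boolean checkAndPutOld(RegionOp op) {
        try {
            return op.run();
        } catch (IOException e) {
            return false; // bug: recovery failure indistinguishable from a real duplicate
        }
    }

    // Fixed behavior: the exception is captured and returned with the result,
    // so the client can rethrow it instead of reporting a unique constraint error.
    static CheckAndPutResponse checkAndPutNew(RegionOp op) {
        try {
            return new CheckAndPutResponse(op.run(), null);
        } catch (IOException e) {
            return new CheckAndPutResponse(false, e);
        }
    }

    public static void main(String[] args) {
        // Simulate a request arriving before HBase and DTM recovery have completed.
        RegionOp duringRecovery = () -> {
            throw new IOException("NewTransactionStartedBeforeRecoveryCompleted");
        };
        System.out.println("old: " + checkAndPutOld(duringRecovery));
        CheckAndPutResponse r = checkAndPutNew(duringRecovery);
        System.out.println("new: succeeded=" + r.succeeded + ", error=" + r.error);
    }
}

The point of the shape change is that "false" stays reserved for a genuine check failure (the row already exists), while any exception travels back to the client intact.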

Oliver Bucaojit (oliver-bucaojit) wrote:

Fix 1021 was merged on Jan 22. Better exception handling was added for the client- and server-side checkAndPut/checkAndDelete. This fixes the unique constraint error, which was not the correct error to propagate to the user; the correct exception text is now thrown as an IOException.
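
As a rough sketch of the client-side half, under the same assumptions as the sketch above (hypothetical names, not the actual Trafodion client code): when the endpoint returns a captured error, the client rethrows it as an IOException, so the user sees the HBase error text instead of a unique constraint error.

import java.io.IOException;

public class ClientRethrowSketch {

    /** Hypothetical response shape; mirrors the server-side sketch above. */
    static final class CheckAndPutResponse {
        final boolean succeeded;
        final String errorText; // exception text serialized by the endpoint, or null

        CheckAndPutResponse(boolean succeeded, String errorText) {
            this.succeeded = succeeded;
            this.errorText = errorText;
        }
    }

    static boolean checkAndPutClient(CheckAndPutResponse response) throws IOException {
        if (response.errorText != null) {
            // Fixed path: propagate the region's error text to the caller verbatim.
            throw new IOException(response.errorText);
        }
        // Only a genuine check failure reaches SQL as "false", which is the one
        // case that should map to a unique constraint error.
        return response.succeeded;
    }

    public static void main(String[] args) {
        try {
            checkAndPutClient(new CheckAndPutResponse(false,
                    "NewTransactionStartedBeforeRecoveryCompleted"));
        } catch (IOException e) {
            System.out.println("client rethrew: " + e.getMessage());
        }
    }
}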

Changed in trafodion:
status: In Progress → Fix Committed