HBase RegionServer failure causes DB to become inconsistent.

Bug #1412487 reported by Guy Groulx
Affects: Trafodion
Status: Fix Committed
Importance: Critical
Assigned to: Oliver Bucaojit
Milestone: r1.0

Bug Description

Running a version of ORDERENTRY on our system with high concurrency.
Because of a memory issue, a RegionServer eventually failed.

When we recovered the system, our OE table started returning "UNIQUE CONSTRAINT" errors, telling us that our DB is now inconsistent.

Tags: dtm
Atanu Mishra (atanu-mishra) wrote:

This may have all the information, except for the HBase logs --

In addition to the info currently captured by sqcollectlogs, I've also captured the following:

a) Latest HBase .out files for each node (placed in the /home/squser4/logs/sqinfo.20150118.0543/hbase/ directory)
b) Current Trafodion.dtm.log files for each node (placed in the /home/squser4/logs/sqinfo.20150118.0543/logs/ directory; search for Trafodion.dtm.n____.log files)
c) Latest gc log file for each RegionServer on each node (placed in the /home/squser4/logs/sqinfo.20150118.0543/logs/ directory; search for gc.log-20151117*.n____ files)

/home/squser4/trafodion_tools> sqcollectlogs -a
Collection in progress...
cp: cannot stat `/opt/hp/squser4/git150117/sql/scripts/mon.env': No such file or directory
cp: cannot stat `/opt/hp/squser4/git150117/sql/scripts/shell.env': No such file or directory
Collecting dp2 pstacks
[1] 29914
Collecting monitor pstacks
[2] 29915
Collecting mxosrvr pstacks
[3] 29916
Collecting tdm_arkcmp pstacks
[4] 29917
Collecting tdm_arkesp pstacks
[5] 29918
Collecting tm pstacks
[6] 29919
[1] Done $SQPDSHA "sqnpstack dp2 $PWD/dp2" 2> /dev/null
[4] Done $SQPDSHA "sqnpstack tdm_arkcmp $PWD/tdm_arkcmp" 2> /dev/null
[5]- Done $SQPDSHA "sqnpstack tdm_arkesp $PWD/tdm_arkesp" 2> /dev/null
[2] Done $SQPDSHA "sqnpstack monitor $PWD/monitor" 2> /dev/null
[3]- Done $SQPDSHA "sqnpstack mxosrvr $PWD/mxosrvr" 2> /dev/null
[6]+ Done $SQPDSHA "sqnpstack tm $PWD/tm" 2> /dev/null
Logs and pstacks collected in /home/squser4/logs/sqinfo.20150118_0543

Changed in trafodion:
assignee: nobody → Oliver Bucaojit (oliver-bucaojit)
milestone: none → r1.0
Changed in trafodion:
status: New → In Progress
Joanie Cooper (joanie-cooper) wrote:

We are able to reproduce this problem on a smaller cluster using a smaller number of streams. HBase and DTM logging was enabled to help analyze the problem.

A unique constraint error occurs when SQL receives a "false" return from a checkAndPut trx-hbase coprocessor call to the region.

When a RegionServer is killed, DTM recovery begins for the transactions that were in progress on the RegionServer that is no longer running, and HBase begins its own recovery process as well. If subsequent requests arrive before full HBase and DTM recovery has completed, the "checkAndPut" TrxRegionEndpoint coprocessor method can encounter exceptions, such as java.io.IOException: NewTransactionStartedBeforeRecoveryCompleted.

Previously, the "checkAndPut" client/server coprocessor calls did not handle all exceptions; they simply returned "false" to the caller, and that return caused SQL to post a unique constraint error. The "checkAndPut" and "checkAndDelete" client/server coprocessor calls have been enhanced to capture all exceptions and return them instead of a "true" or "false" result. This allows SQL to return an HBase error rather than a "unique constraint" error.

This fix was delivered, along with other DTM recovery fixes, as part of the commit of job 1021 to the core thread.
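
For illustration, here is a minimal, self-contained Java sketch of the server-side pattern described above. This is not the actual TrxRegionEndpoint code; CheckAndPutResponse and RegionOp are hypothetical stand-ins for the coprocessor's response message and the region-side work:

import java.io.IOException;

public class CheckAndPutSketch {

    /** Hypothetical response carrying either a boolean result or a captured exception. */
    static final class CheckAndPutResponse {
        final boolean succeeded;
        final IOException error; // non-null when the region threw instead of answering

        CheckAndPutResponse(boolean succeeded, IOException error) {
            this.succeeded = succeeded;
            this.error = error;
        }
    }

    /** Stand-in for the region-side check-and-put work, which may throw during recovery. */
    interface RegionOp {
        boolean run() throws IOException;
    }

    // Old behavior: every exception collapsed into "false", so SQL reported a
    // unique constraint error even when the region was merely mid-recovery.
    static boolean checkAndPutOld(RegionOp op) {
        try {
            return op.run();
        } catch (IOException e) {
            return false; // bug: recovery failure indistinguishable from a real duplicate
        }
    }

    // Fixed behavior: the exception is captured and returned with the result,
    // so the client can rethrow it instead of reporting a unique constraint error.
    static CheckAndPutResponse checkAndPutNew(RegionOp op) {
        try {
            return new CheckAndPutResponse(op.run(), null);
        } catch (IOException e) {
            return new CheckAndPutResponse(false, e);
        }
    }

    public static void main(String[] args) {
        // Simulate a request arriving before HBase and DTM recovery have completed.
        RegionOp duringRecovery = () -> {
            throw new IOException("NewTransactionStartedBeforeRecoveryCompleted");
        };
        System.out.println("old: " + checkAndPutOld(duringRecovery));
        CheckAndPutResponse r = checkAndPutNew(duringRecovery);
        System.out.println("new: succeeded=" + r.succeeded + ", error=" + r.error);
    }
}

The point of the shape change is that "false" stays reserved for a genuine check failure (the row already exists), while any exception travels back to the client intact.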

Oliver Bucaojit (oliver-bucaojit) wrote:

Fix 1021 was merged on Jan 22. Better exception handling was added for the client- and server-side checkAndPut/checkAndDelete. This fixes the unique constraint error, which was not the correct error to propagate to the user; the correct exception text is now thrown as an IOException.
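
As a rough sketch of the client-side half, under the same assumptions as the sketch above (hypothetical names, not the actual Trafodion client code): when the endpoint returns a captured error, the client rethrows it as an IOException, so the user sees the HBase error text instead of a unique constraint error.

import java.io.IOException;

public class ClientRethrowSketch {

    /** Hypothetical response shape; mirrors the server-side sketch above. */
    static final class CheckAndPutResponse {
        final boolean succeeded;
        final String errorText; // exception text serialized by the endpoint, or null

        CheckAndPutResponse(boolean succeeded, String errorText) {
            this.succeeded = succeeded;
            this.errorText = errorText;
        }
    }

    static boolean checkAndPutClient(CheckAndPutResponse response) throws IOException {
        if (response.errorText != null) {
            // Fixed path: propagate the region's error text to the caller verbatim.
            throw new IOException(response.errorText);
        }
        // Only a genuine check failure reaches SQL as "false", which is the one
        // case that should map to a unique constraint error.
        return response.succeeded;
    }

    public static void main(String[] args) {
        try {
            checkAndPutClient(new CheckAndPutResponse(false,
                    "NewTransactionStartedBeforeRecoveryCompleted"));
        } catch (IOException e) {
            System.out.println("client rethrew: " + e.getMessage());
        }
    }
}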

Changed in trafodion:
status: In Progress → Fix Committed