Bug #1380734 “Batch upsert into salted table not upserting all r...” : Bugs : Trafodion

Aruna Sadashiva (aruna-sadashiva) on 2014-10-13

description:

updated

Arvind Narain (arvind-narain) on 2014-10-14

tags:

added: dtm
removed: client-jdbc-t4

Revision history for this message

Arvind Narain (arvind-narain) wrote on 2014-10-15:

#1

Download full text (5.0 KiB)

Tested on a workstation. Seeing the following, hence changing tag to dtm for further investigation.

/designs/seaquest/narain/traftests/jdbc/basic>date
Tue Oct 14 23:05:07 UTC 2014
/designs/seaquest/narain/traftests/jdbc/basic>java -Dprop=t4prop batchall 100 u salt
Params accepted: batchall <batchsize, default 1000> <i|u> <salt>
Running with values: batchsize 100, Upsert, salted
bs: 100, total: 100
insertCount array length: 100
Batch All: Total of 0 warning retrieved!!
Batch All: Update count :100
Batch All : select * returned: 0
Batch All: Verifying using Select all
Batch All : ERROR: Expected rows:100 Actual rows: 0
Batch All : select count * returned: 0
Batch All: Verifying using Select count *
Batch All : ERROR: Expected rows:100 Actual rows: 0
Batch All : FAILED
/designs/seaquest/narain/traftests/jdbc/basic>date
Tue Oct 14 23:05:45 UTC 2014

From dtm log:

2014-10-14 23:05:32,378 ERROR transactional.TransactionManager: doCommitX, copro
cessor UnknownTransactionException: java.io.IOException: UnknownTransactionExcep
tion
2014-10-14 23:05:32,378 ERROR transactional.TransactionManager: exception in doC
ommitX : org.apache.hadoop.hbase.client.transactional.UnknownTransactionExceptio
n
2014-10-14 23:05:32,378 INFO transactional.TransactionManager: Got unknown excep
tion during commit. Transaction: [2004]
2014-10-14 23:05:32,378 ERROR transactional.TransactionManager: exception in doA
bortX (ignoring): java.lang.Exception: java.io.IOException: UnknownTransactionEx
ception
2014-10-14 23:05:32,378 INFO transactional.TransactionManager: Got unknown excep
tion during abort. Transaction: [2004]
2014-10-14 23:05:32,418 INFO transactional.TransactionManager: doAbortX -- TRAFO
DION.T4QA.TABALL;hconnection-0x590b4f1a region location being refreshed
2014-10-14 23:05:32,420 DEBUG client.HConnectionManager$HConnectionImplementatio
n: Removed g4t3037.houston.hp.com:49746 as a location of TRAFODION.T4QA.TABALL,\
x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00,1413327929891.f96379bc6f8cd67bc4
e4f7ac7ced031c. for tableName=TRAFODION.T4QA.TABALL from cache
2014-10-14 23:05:32,422 ERROR transactional.TransactionManager: Abort HasExcepti
on true: true
2014-10-14 23:05:32,422 ERROR transactional.TransactionManager: Abort HasExcepti
on true: java.io.IOException: UnknownTransactionException

====
in zz* files after tracing enabled (which may be expected):

23:05:28.761.004 trace(14763): tm_process_msg : tm 0, type 207, msgid 32
23:05:28.761.025 trace(14763): CTmTxMessage::CTmTxMessage(req, msgid) : ENTRY, r
equest(207), msgid(32).
23:05:28.761.030 trace(14763): CTmTxMessage::initialize : ENTRY, msgid(32).
23:05:28.761.034 trace(14763): CTmTxMessage::initialize : EXIT, msgid(32).
23:05:28.761.037 trace(14763): CTmTxMessage::CTmTxMessage : EXIT, msgid(32).
23:05:28.761.043 trace(14763): tm_process_req_status, ID 2002, ENTRY
23:05:28.761.048 trace(14763): tm_notx_check - unable to complete, returning err
or FENOTRANSID
23:05:28.761.051 trace(14763): CTmTxMessage::reply : ENTRY. msgid(32), reply cod
e(208), error(75).
23:05:28.761.069 trace(14763): CTmTxMessage::reply EXIT. XMSG_REPLY_ msgid(32),
error(75).
23:05:28.761.075 trace(14763): CTmTxMessage::~CTmTxMessage :...

Tested on a workstation. Seeing the following, hence changing tag to dtm for further investigation.

/designs/seaquest/narain/traftests/jdbc/basic>date
Tue Oct 14 23:05:07 UTC 2014
/designs/seaquest/narain/traftests/jdbc/basic>java -Dprop=t4prop batchall 100 u salt
Params accepted: batchall <batchsize, default 1000> <i|u> <salt>
Running with values: batchsize 100, Upsert, salted
bs: 100, total: 100
insertCount array length: 100
Batch All: Total of 0 warning retrieved!!
Batch All: Update count :100
Batch All : select * returned: 0
Batch All: Verifying using Select all
Batch All : ERROR: Expected rows:100 Actual rows: 0
Batch All : select count * returned: 0
Batch All: Verifying using Select count *
Batch All : ERROR: Expected rows:100 Actual rows: 0
Batch All : FAILED
/designs/seaquest/narain/traftests/jdbc/basic>date
Tue Oct 14 23:05:45 UTC 2014

From dtm log:

2014-10-14 23:05:32,378 ERROR transactional.TransactionManager: doCommitX, copro
cessor UnknownTransactionException: java.io.IOException: UnknownTransactionExcep
tion
2014-10-14 23:05:32,378 ERROR transactional.TransactionManager: exception in doC
ommitX : org.apache.hadoop.hbase.client.transactional.UnknownTransactionExceptio
n
2014-10-14 23:05:32,378 INFO transactional.TransactionManager: Got unknown excep
tion during commit. Transaction: [2004]
2014-10-14 23:05:32,378 ERROR transactional.TransactionManager: exception in doA
bortX (ignoring): java.lang.Exception: java.io.IOException: UnknownTransactionEx
ception
2014-10-14 23:05:32,378 INFO transactional.TransactionManager: Got unknown excep
tion during abort. Transaction: [2004]
2014-10-14 23:05:32,418 INFO transactional.TransactionManager: doAbortX -- TRAFO
DION.T4QA.TABALL;hconnection-0x590b4f1a region location being refreshed
2014-10-14 23:05:32,420 DEBUG client.HConnectionManager$HConnectionImplementatio
n: Removed g4t3037.houston.hp.com:49746 as a location of TRAFODION.T4QA.TABALL,\
x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00\x00,1413327929891.f96379bc6f8cd67bc4
e4f7ac7ced031c. for tableName=TRAFODION.T4QA.TABALL from cache
2014-10-14 23:05:32,422 ERROR transactional.TransactionManager: Abort HasExcepti
on true: true
2014-10-14 23:05:32,422 ERROR transactional.TransactionManager: Abort HasExcepti
on true: java.io.IOException: UnknownTransactionException

====
in zz* files after tracing enabled (which may be expected):

23:05:28.761.004 trace(14763): tm_process_msg : tm 0, type 207, msgid 32
23:05:28.761.025 trace(14763): CTmTxMessage::CTmTxMessage(req, msgid) : ENTRY, r
equest(207), msgid(32).
23:05:28.761.030 trace(14763): CTmTxMessage::initialize : ENTRY, msgid(32).
23:05:28.761.034 trace(14763): CTmTxMessage::initialize : EXIT, msgid(32).
23:05:28.761.037 trace(14763): CTmTxMessage::CTmTxMessage : EXIT, msgid(32).
23:05:28.761.043 trace(14763): tm_process_req_status, ID 2002, ENTRY
23:05:28.761.048 trace(14763): tm_notx_check - unable to complete, returning err
or FENOTRANSID
23:05:28.761.051 trace(14763): CTmTxMessage::reply : ENTRY. msgid(32), reply cod
e(208), error(75).
23:05:28.761.069 trace(14763): CTmTxMessage::reply EXIT. XMSG_REPLY_ msgid(32), 
error(75).
23:05:28.761.075 trace(14763): CTmTxMessage::~CTmTxMessage : ENTRY/EXIT msg 0x33
db5a0, request(207), msgid(0)
23:05:28.761.079 trace(14763): tm_process_msg EXIT

On clusters seeing other timeout messages including the following :
Caused by: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.31.0.14:41891 remote=n014.cm.cluster/172.31.0.14:60020]
	at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
	at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at java.io.FilterInputStream.read(FilterInputStream.java:133)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection$PingInputStream.read(RpcClient.java:559)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
	at java.io.DataInputStream.readInt(DataInputStream.java:387)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.readResponse(RpcClient.java:1077)
	at org.apache.hadoop.hbase.ipc.RpcClient$Connection.run(RpcClient.java:728)
2014-10-14 23:34:09,425 ERROR transactional.TransactionManager: doCommitX, result size: 0
2014-10-14 23:34:09,458 INFO transactional.TransactionManager: doCommitX -- TRAFODION.T4QA.TABALL;hconnection-0x69ed8d93 location being refreshed
2014-10-14 23:34:09,460 DEBUG client.HConnectionManager$HConnectionImplementation: Removed n014.cm.cluster:60020 as a location of TRAFODION.T4QA.TABALL,,1413329102847.4dbce75193125f5ecf129829381dfcf1. for tableName=TRAFODION.T4QA.TABALL from cache
2014-10-14 23:34:09,473 ERROR transactional.TransactionManager: exception in doCommitX: java.io.IOException: ERROR Received incorrect number of results from coprocessor call

Aruna Sadashiva (aruna-sadashiva) on 2014-10-15

Changed in trafodion:
importance:	High → Critical
milestone:	none → r1.0

Revision history for this message

Atanu Mishra (atanu-mishra) wrote on 2014-11-04:

#2

Download full text (3.1 KiB)

The HBase logs are being filled with the below exception. I haven't seen this exception before so it will need some investigation. I have a feeling it may have to do with region splits, but I can't tell for sure since the logs have rolled over.

If the region split is the problem then one solution would be to change the hbase property increase the size of the region before it gets split and/or increasing the number of partitions in the table. To verify the split we may have to rerun the test and kill HBase/Trafodion immediately after the problem occurs so that the logs don't roll over and get overwritten.

Thanks, Oliver

2014-10-24 21:01:20,093 WARN [RpcServer.handler=15,port=34987] regionserver.HRegion: Failed getting lock in batch put, row=\x00\x00\x00\x05\x80\x00\x00\x00\x00\x00\x00\x0B
org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row out of range for row lock on HRegion TRAFODION.T4QA.TABALL,\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00,1414180756968.7e50c8781644560f8172feea14934af3., startKey='\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00', getEndKey()='\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00', row='\x00\x00\x00\x05\x80\x00\x00\x00\x00\x00\x00\x0B'
        at org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:3421)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:3439)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2382)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2249)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2201)
        at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:2763)
        at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:2085)
        at org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint.commit(TrxRegionEndpoint.java:2561)
        at org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint.commit(TrxRegionEndpoint.java:3278)
        at org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint.commit(TrxRegionEndpoint.java:447)
        at org.apache.hadoop.hbase.coprocessor.transactional.generated.TrxRegionProtos$TrxRegionService.callMethod(TrxRegionProtos.java:24719)
        at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:5541)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.execServiceOnRegion(HRegionServer.java:3300)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.execService(HRegionServer.java:3282)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29501)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run...

The HBase logs are being filled with the below exception.  I haven't seen this exception before so it will need some investigation.  I have a feeling it may have to do with region splits, but I can't tell for sure since the logs have rolled over.

If the region split is the problem then one solution would be to change the hbase property increase the size of the region before it gets split and/or increasing the number of partitions in the table.  To verify the split we may have to rerun the test and kill HBase/Trafodion immediately after the problem occurs so that the logs don't roll over and get overwritten.

Thanks, Oliver

2014-10-24 21:01:20,093 WARN  [RpcServer.handler=15,port=34987] regionserver.HRegion: Failed getting lock in batch put, row=\x00\x00\x00\x05\x80\x00\x00\x00\x00\x00\x00\x0B
org.apache.hadoop.hbase.regionserver.WrongRegionException: Requested row out of range for row lock on HRegion TRAFODION.T4QA.TABALL,\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00,1414180756968.7e50c8781644560f8172feea14934af3., startKey='\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00', getEndKey()='\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x00', row='\x00\x00\x00\x05\x80\x00\x00\x00\x00\x00\x00\x0B'
        at org.apache.hadoop.hbase.regionserver.HRegion.checkRow(HRegion.java:3421)
        at org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:3439)
        at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutation(HRegion.java:2382)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2249)
        at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:2201)
        at org.apache.hadoop.hbase.regionserver.HRegion.doBatchMutate(HRegion.java:2763)
        at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:2085)
        at org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint.commit(TrxRegionEndpoint.java:2561)
        at org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint.commit(TrxRegionEndpoint.java:3278)
        at org.apache.hadoop.hbase.coprocessor.transactional.TrxRegionEndpoint.commit(TrxRegionEndpoint.java:447)
        at org.apache.hadoop.hbase.coprocessor.transactional.generated.TrxRegionProtos$TrxRegionService.callMethod(TrxRegionProtos.java:24719)
        at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:5541)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.execServiceOnRegion(HRegionServer.java:3300)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.execService(HRegionServer.java:3282)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:29501)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2012)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:98)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.consumerLoop(SimpleRpcScheduler.java:160)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler.access$000(SimpleRpcScheduler.java:38)
        at org.apache.hadoop.hbase.ipc.SimpleRpcScheduler$1.run(SimpleRpcScheduler.java:110)
        at java.lang.Thread.run(Thread.java:745)

Changed in trafodion:
assignee:	Arvind Narain (arvind-narain) → Adriana Fuentes (adriana-fuentes)
status:	New → In Progress

Revision history for this message

Adriana Fuentes (adriana-fuentes) wrote on 2014-12-16:

#3

I enabled several logging in the TrxRegionEnpoint.java code and I added several trace lines to figure out when the failure could happen and to figure out the region keys.

I also was able to narrow down to a particular query from Arvind's test to reproduce this issue. This is the table and the query:

CREATE TABLE TCHAR_T4
       (
              A CHAR(5) not null
              , B CHAR(5) not null
              , PRIMARY KEY (A)
       ) salt using 8 partitions
;
upsert into TCHAR_T4 values ('A', 'B'), ('C', 'D'), ('E', 'F');

It gets a WrongRegion exception in the ‘doPrepareX’ method once the processor call is made. And, then the transaction is aborted.

I am not able to see the row key picked when the putRow() is done. Something that might need to be investigated a little more.

What I did find different is that the two transactions that are created, are created because the first one fails with the upsert and sql has a retry logic for any failing query. I disabled this in sqlci and once I tried it you can only see one transaction. You can disable the retry logic in sql by doing the following:
cqd auto_query_retry 'off'

Again this is the query:
CREATE TABLE TCHAR_T4
       (
              A CHAR(5) not null
              , B CHAR(5) not null
              , PRIMARY KEY (A)
       ) salt using 8 partitions
;
upsert into TCHAR_T4 values ('A', 'B'), ('C', 'D'), ('E', 'F');

Changed in trafodion:
assignee:	Adriana Fuentes (adriana-fuentes) → Prashanth Vasudev (vasudev-prashanth)

Revision history for this message

Oliver Bucaojit (oliver-bucaojit) wrote on 2014-12-17:

#4

Checked-in fix: https://review.trafodion.org/#/c/838/ Merged 12/16

-- Email of the fix, details --
I took a look at the issue and used the upsert query to reproduce the problem.

While looking at the tracing I saw that there were 3 put’s being sent to 1 region, yet 2 regions were being registered to. In the client-side code I saw the work was being done in the multi-put method so that made sense to me as to how there was a chance that the row was being sent to the wrong region. In the multi-put, you send a list of puts over to the server and so that is how the rows made its way over to the incorrect region, the startKey that was being used in the coprocessor call came from a different row in the list.

Looking deeper into the multi-put I saw that it puts the HRegionLocations in a HashMap with the list of puts as the value. And that was where the problem was. There is an issue we found before (issue for us, but doesn’t seem to be one for the greater HBase) where the HRegionLocations are equal as long as their server name is equal. That lead us to extend that class and use TransactionRegionLocation instead so we can also compare the HRegionInfo for more granularity.

So by switching the HRegionLocation to TransactionRegionLocation in the multi-put, the HashMap was able to have separate region location keys instead of lumping them all into one key, which fixes the issue. I tested with the below upsert command and with the setup/load scripts from Mike Hanlon.

I’ll look for other locations where the change would be needed and run some tests then check-in the fix.

---

SQL output of successful run.

(c) Copyright 2014 Hewlett-Packard Development Company, LP.
>>drop table TCHAR_T4;

--- SQL operation complete.
>>CREATE TABLE TCHAR_T4
       (
              A CHAR(5) not null
              , B CHAR(5) not null
              , PRIMARY KEY (A)
       ) salt using 8 partitions
;
upsert into TCHAR_T4 values ('A', 'B'), ('C', 'D'), ('E', 'F');
+>+>+>+>+>+>

--- SQL operation complete.
>>
--- 3 row(s) inserted.
>>>>select * from TCHAR_T4;

A B
----- -----

A B
C D
E F

--- 3 row(s) selected.
>>exit

Thanks, Oliver

Checked-in fix: https://review.trafodion.org/#/c/838/ Merged 12/16

-- Email of the fix, details --
I took a look at the issue and used the upsert query to reproduce the problem.

While looking at the tracing I saw that there were 3 put’s being sent to 1 region, yet 2 regions were being registered to.  In the client-side code I saw the work was being done in the multi-put method so that made sense to me as to how there was a chance that the row was being sent to the wrong region.  In the multi-put, you send a list of puts over to the server and so that is how the rows made its way over to the incorrect region, the startKey that was being used in the coprocessor call came from a different row in the list.

Looking deeper into the multi-put I saw that it puts the HRegionLocations in a HashMap with the list of puts as the value.  And that was where the problem was.  There is an issue we found before (issue for us, but doesn’t seem to be one for the greater HBase) where the HRegionLocations are equal as long as their server name is equal.  That lead us to extend that class and use TransactionRegionLocation instead so we can also compare the HRegionInfo for more granularity.

So by switching the HRegionLocation to TransactionRegionLocation in the multi-put, the HashMap was able to have separate region location keys instead of lumping them all into one key, which fixes the issue.  I tested with the below upsert command and with the setup/load scripts from Mike Hanlon.

I’ll look for other locations where the change would be needed and run some tests then check-in the fix.

---

SQL output of successful run.

(c) Copyright 2014 Hewlett-Packard Development Company, LP.
>>drop table TCHAR_T4;

--- SQL operation complete.
>>CREATE TABLE TCHAR_T4
       (
              A             CHAR(5) not null
              , B        CHAR(5) not null
              , PRIMARY KEY (A)
       ) salt using 8 partitions
;
upsert into TCHAR_T4 values ('A', 'B'), ('C', 'D'), ('E', 'F');
+>+>+>+>+>+>

--- SQL operation complete.
>>
--- 3 row(s) inserted.
>>>>select * from TCHAR_T4;

A      B
-----  -----

A      B
C      D
E      F

--- 3 row(s) selected.
>>exit

Thanks, Oliver

Changed in trafodion:
status:	In Progress → Fix Committed
assignee:	Prashanth Vasudev (vasudev-prashanth) → Oliver Bucaojit (oliver-bucaojit)

Revision history for this message

Aruna Sadashiva (aruna-sadashiva) wrote on 2015-01-10:

#5

Verified on 01/08/2015 build, it works now.

Changed in trafodion:
status:	Fix Committed → Fix Released

Trafodion

Batch upsert into salted table not upserting all rows thru jdbc t4 driver

Bug Description

Other bug subscribers

Remote bug watches