So I think I figured out the problem. When a scan request takes too long to process, the RPC connection times out. It is not a client timeout issue, as there are retries from the client, and it seems that when another RPC connection is reestablished, the nextCallSeq information on the client side is lost. Increasing the RPC timeout and decreasing scanner caching both work, but they also impose a performance penalty, so I am working to find a way around that.
For Trafodion, we could try these two settings in hbase-site.xml:
hbase.client.scanner.timeout.period
300000
hbase.rpc.timeout
300000
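As a sketch, the two settings above would look roughly like this in hbase-site.xml. Both values are in milliseconds, so 300000 is 5 minutes. The properties are assumed to go inside the file's existing <configuration> element. Which nodes need the change, and whether a restart is required, should be checked for your installation.

```xml
<!-- Fragment for hbase-site.xml; place inside the existing <configuration> element. -->
<!-- Values are in milliseconds: 300000 ms = 5 minutes. -->
<property>
  <name>hbase.client.scanner.timeout.period</name>
  <value>300000</value>
</property>
<property>
  <name>hbase.rpc.timeout</name>
  <value>300000</value>
</property>
```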
This cqd should also help:
cqd hbase_num_cache_rows_max '1000' ; -- we could also try 5000 here, current default is 10,000
The diagnosis in the first paragraph is quoted from https://issues.apache.org/jira/browse/HBASE-11295