initialize trafodion,upgrade -Metadata Upgrade: failed

Bug #1394702 reported by Chris Tjepkema
This bug affects 1 person
Affects     Status          Importance   Assigned to     Milestone
Trafodion   Fix Committed   High         Joanie Cooper

Bug Description

Upgrading from :
trafodion-20141008_830.tar.gz
To:
trafodion-20141115_0830.tar.gz

A simple query informed me to upgrade metadata:
>>select count(*) from YCSB_TABLE_16;

*** ERROR[1395] Trafodion need to be upgraded on this system due to metadata version mismatch. Do 'initialize trafodion, upgrade' to upgrade metadata. Or do 'initialize trafodion, drop' followed by 'initialize trafodion'. Be aware that the second option would delete all metadata and user objects from trafodion database.

*** ERROR[4082] Object TRAFODION.JAVABENCH.YCSB_TABLE_16 does not exist or is inaccessible.

1) >>initialize trafodion, upgrade;
:
:
Metadata Upgrade: failed

*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: PerformScanResponse exception org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 0 But the nextCallSeq got from client: 1
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
.

2) Noticed that HBase needed to be restarted, so restarted HBase.
Tried the upgrade again and got:
*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: PerformScanResponse exception org.apache.hadoop.hbase.regionserver.WrongRegionException: Request Region Name, TRAFODION._MD_.OBJECTS,,1416354874938.797b6986b9ad8dc5109ebb663afd008d., does not match this region, TRAFODION._MD_.OBJECTS,,1416355295420.f009aa5758d4f9459118d8f7e5bd753b.
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
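
The two region names in the WrongRegionException above can be compared field by field. The following is a minimal sketch, assuming the usual HBase region name layout `<table>,<start key>,<region id>.<encoded name>.` and assuming the region id is the region's creation time in milliseconds; `parse_region_name` is a hypothetical helper, not part of Trafodion or HBase.

```python
def parse_region_name(name):
    """Split an HBase region name into its parts (assumed layout, see lead-in)."""
    table, rest = name.split(",", 1)          # table names contain no commas
    start_key, tail = rest.rsplit(",", 1)     # start key may be empty
    region_id, encoded = tail.rstrip(".").split(".", 1)
    return {
        "table": table,
        "start_key": start_key,
        "region_id": int(region_id),          # assumed: creation time in ms
        "encoded": encoded,
    }


# Region names copied from the error message in step 2.
requested = parse_region_name(
    "TRAFODION._MD_.OBJECTS,,1416354874938.797b6986b9ad8dc5109ebb663afd008d."
)
served = parse_region_name(
    "TRAFODION._MD_.OBJECTS,,1416355295420.f009aa5758d4f9459118d8f7e5bd753b."
)

# Same table and start key, different region id and encoded name.
age_gap_ms = served["region_id"] - requested["region_id"]
print(requested["table"], served["table"], age_gap_ms)
```

Under those assumptions, the client asked for a region that is about seven minutes older than the region the server now holds for the same table and start key, which suggests the client was working from stale region information.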

3) Tried restarting the environment with sqstop and sqstart.
Tried the upgrade again and got:
Metadata Upgrade: failed

*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: PerformScanResponse exception org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 0 But the nextCallSeq got from client: 1
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)

Tags: dtm
Changed in trafodion:
importance: Undecided → High
Anoop Sharma (anoop-sharma) wrote :

The root cause of the upgrade error is these errors from an underlying subsystem. It could
be dtm, hbase or hadoop. We have seen these errors at other times as well.
Forwarding this to the dtm group, who have looked at similar issues.

  java.util.concurrent.ExecutionException: java.io.IOException: PerformScanResponse exception
  org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException:

  java.util.concurrent.ExecutionException: java.io.IOException: PerformScanResponse exception
  org.apache.hadoop.hbase.regionserver.WrongRegionException: Request

tags: added: dtm
Changed in trafodion:
assignee: nobody → Joanie Cooper (joanie-cooper)
Joanie Cooper (joanie-cooper) wrote :

The "java.util.concurrent.ExecutionException: java.io.IOException: PerformScanResponse exception
  org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException" is generated by the TrxRegionEndpoint coprocessor when the client and server coprocessor "PerformScan" operation are not in agreement on the current sequence number. A sequence number is maintained on both the client and server sides. This ensures that scan requests/responses are not processed out of order.
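
The sequence-number check described above can be illustrated with a small simulation. This is a sketch only: `ScannerServer`, `perform_scan` and `scan_all` are hypothetical names, and the real TrxRegionEndpoint logic is not shown in this report. It assumes the server rejects any call whose sequence number differs from the one it expects next.

```python
class OutOfOrderScannerNextError(Exception):
    """Stand-in for HBase's OutOfOrderScannerNextException."""


class ScannerServer:
    """Hypothetical server side of a scan that tracks the expected call sequence."""

    def __init__(self, rows, batch_size=2):
        self.rows = rows
        self.batch_size = batch_size
        self.next_call_seq = 0   # sequence number the server expects next
        self.position = 0        # index of the next row to return

    def perform_scan(self, client_seq):
        # Reject the call if client and server disagree on the sequence number.
        if client_seq != self.next_call_seq:
            raise OutOfOrderScannerNextError(
                f"Expected nextCallSeq: {self.next_call_seq} "
                f"But the nextCallSeq got from client: {client_seq}"
            )
        batch = self.rows[self.position:self.position + self.batch_size]
        self.position += len(batch)
        self.next_call_seq += 1
        return batch


def scan_all(server):
    """Client side: fetch batches in order, incrementing its own sequence number."""
    seq = 0
    rows = []
    while True:
        batch = server.perform_scan(seq)
        if not batch:
            return rows
        rows.extend(batch)
        seq += 1
```

In this model, a client that has already made one call (and so sends sequence number 1) to a server whose scanner state has been reset to 0 gets the same "Expected nextCallSeq: 0 But the nextCallSeq got from client: 1" message seen in the failed upgrade.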

We have seen these exceptions several times recently on our clusters. We have discovered that HBase load balancing appears to be enabled on our clusters. For our previous HBase 0.94 installations, this was disabled. However, for HBase 0.98 installations we may not have retained this procedure when we set up our clusters.

When a region moves while a transaction is active, subsequent transactional interactions with the moved region have lost their previous DTM context. This can lead to a variety of HBase exceptions that cause Trafodion SQL queries to fail.

HBase load balancing can be disabled using the HBase shell :

Example:
# hbase shell
hbase(main):002:0> balance_switch false
true <-- Output will be the last setting of the balance_switch value
0 row(s) in 0.0080 seconds

We believe that this setting should be persistent across HBase instance restarts for HBase 0.98.

We are investigating adding a master observer coprocessor that will disallow load rebalancing for regions that have active DTM transactions. This would allow for load rebalancing throughout the HBase cluster without impacting transactional integrity.
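
The decision such an observer would make can be sketched as follows. This is an illustration of the idea only, not the coprocessor under investigation: `RegionMoveGuard` and `filter_balancer_plan` are hypothetical names, and the sketch assumes the observer can see which regions currently have open DTM transactions.

```python
class RegionMoveGuard:
    """Hypothetical bookkeeping of open transactions per region."""

    def __init__(self):
        self._active = {}  # region name -> set of open transaction ids

    def begin(self, region, txn_id):
        self._active.setdefault(region, set()).add(txn_id)

    def end(self, region, txn_id):
        txns = self._active.get(region)
        if txns is None:
            return
        txns.discard(txn_id)
        if not txns:
            del self._active[region]  # region is idle again

    def may_move(self, region):
        # A region may be moved only when no transaction is open on it.
        return region not in self._active


def filter_balancer_plan(guard, plan):
    """Keep only the planned (region, destination) moves whose region is idle."""
    return [(region, dest) for region, dest in plan if guard.may_move(region)]


guard = RegionMoveGuard()
guard.begin("TRAFODION._MD_.OBJECTS", txn_id=101)
plan = [("TRAFODION._MD_.OBJECTS", "server-b"), ("OTHER_TABLE", "server-c")]
allowed = filter_balancer_plan(guard, plan)
```

With this approach the balancer stays enabled for the cluster as a whole, and only moves that would interrupt an open transaction are held back until the transaction ends.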

Changed in trafodion:
status: New → In Progress
Joanie Cooper (joanie-cooper) wrote :

The trx observer coprocessor has been enhanced to prevent region splits if DTM transactions are active. This fix was delivered on 12/10/14 in delivery https://review.trafodion.org/804.

We will install the 12/12/14 build and run it with an enhanced trx hbase jar file that has additional tracing, so that we can continue working on this problem.

This should allow us to reproduce the OutOfOrderScannerNextException and rule out region splitting as a contributor to the problem.

Joanie Cooper (joanie-cooper) wrote :

I reproduced the “OutOfOrderScannerNextException” matching
LP 1394702 (failure during trafodion upgrade) this morning.

[trafodion@getafix-1 logs]$ sqlci
Trafodion Conversational Interface 0.9.1
(c) Copyright 2014 Hewlett-Packard Development Company, LP.
>>initialize trafodion, upgrade;
Metadata Upgrade: started

Version Check: started
  Metadata need to be upgraded from Version 2.3 to 3.0
Version Check: done

Drop Old Metadata: started
Drop Old Metadata: done

Backup Current Metadata: started
Backup Current Metadata: done

Drop Current Metadata: started
Drop Current Metadata: done

Initialize New Metadata: started
Initialize New Metadata: done

Copy Old Metadata: started
Restore from Old Metadata: started
Restore from Old Metadata: done

Drop Old Metadata: started
Drop Old Metadata: done

Metadata Upgrade: failed

*** ERROR[8448] Unable to access Hbase interface. Call to ExpHbaseInterface::nextRow returned error HBASE_ACCESS_ERROR(-705). Cause:
java.util.concurrent.ExecutionException: java.io.IOException: PerformScanResponse exception org.apache.hadoop.hbase.exceptions.OutOfOrderScannerNextException: Expected nextCallSeq: 0 But the nextCallSeq got from client: 1
java.util.concurrent.FutureTask.report(FutureTask.java:122)
java.util.concurrent.FutureTask.get(FutureTask.java:188)
org.trafodion.sql.HBaseAccess.HTableClient.fetchRows(HTableClient.java:415)
.

--- SQL operation failed with errors.
>>

I moved in a later hbase-trx HDP 0.9.1 jar file, dated December 10th.
This build had some additional logging changes and multiple hbase-trx
changes from the DTM team.

With this version, the “initialize trafodion, upgrade” succeeded.

[trafodion@getafix-1 logs]$ sqlci
Trafodion Conversational Interface 0.9.1
(c) Copyright 2014 Hewlett-Packard Development Company, LP.
>>initialize trafodion, upgrade;
Metadata Upgrade: started

Version Check: started
  Metadata need to be upgraded from Version 2.3 to 3.0
Version Check: done

Drop Old Metadata: started
Drop Old Metadata: done

Backup Current Metadata: started
Backup Current Metadata: done

Drop Current Metadata: started
Drop Current Metadata: done

Initialize New Metadata: started
Initialize New Metadata: done

Copy Old Metadata: started
Copy Old Metadata: done

Validate Metadata Copy: started
Validate Metadata Copy: done

Customize New Metadata: started
  Start: Update COLUMNS
  End: Update COLUMNS
  Start: Update TABLES
  End: Update TABLES
  Start: Update TEXT
  End: Update TEXT
  Start: Update SEQ_GEN
  End: Update SEQ_GEN
Customize New Metadata: done

Delete Old Metadata Info: started
Delete Old Metadata Info: done

Update Metadata Views: started
Update Metadata Views: done

Update Priv Mgr: started
Update Priv Mgr: done

Update Metadata Version: started
Update Metadata Version: done

Drop Old Metadata from Hbase: started
Drop Old Metadata from Hbase: done

Metadata Upgrade to Version 3.0: done

--- SQL operation complete.
>>

I did have full logging on for both runs.

I’ve kept the logs that demonstrate the failures.
Unfortunately, these logs do not contain the additional logging statements,
so it’s not possible to confirm what the problem is.

The problem did not reproduce with ...


Changed in trafodion:
milestone: none → r1.0
Joanie Cooper (joanie-cooper) wrote :

QA has been retesting upgrade. Another problem was found in upgrade, documented in two new LP bugs:

https://bugs.launchpad.net/trafodion/+bug/1408506
initialize trafodion, upgrade -- Metadata Upgrade failed - Critical

https://bugs.launchpad.net/trafodion/+bug/1408504
after upgrade attempt -get schemas gets core - Critical

Testing has just completed with changes that address these two LP bugs, and the upgrade succeeded.

The OutOfOrderScannerNextException was not seen during the upgrade; the build used had all the latest hbase-trx changes applied.

Changed in trafodion:
status: In Progress → Fix Committed