The `reboot-cluster-from-complete-outage` action fails after power-loss binary log corruption

Bug #2044821 reported by John Lettman
This bug affects 1 person
Affects: MySQL InnoDB Cluster Charm
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Under certain circumstances, power loss causes MySQL state files to become corrupted on charm units. If present on any unit, this corruption causes the charm to fail the `reboot-cluster-from-complete-outage` action when it tries to find the most up-to-date binary logs, preventing power-loss recovery.

The following is a snapshot of the action result:
https://pastebin.canonical.com/p/t3QvgHMPFc/

As a result, the following "clone method" workaround is required to recover from this critical outage:

1. Obtain the passwords:
   - `juju run --unit mysql-innodb-cluster/leader leader-get cluster-password`
   - `juju run --unit mysql-innodb-cluster/leader leader-get mysql.passwd`
2. Access each downed unit and clone the instance from the working unit:
   - SSH to the downed unit: `juju ssh mysql-innodb-cluster/XX`
   - Obtain a MySQL shell: `mysql -u root -p # use 'mysql.passwd' when prompted`
   - Clone the working unit (please note **errata** below):
     ```sql
     STOP GROUP_REPLICATION \W;
     SET GLOBAL super_read_only = 0;
     CLONE INSTANCE FROM 'clusteruser'@'[IP of the working unit]':3306 IDENTIFIED BY '[use cluster-password]' REQUIRE SSL;
     ```
3. Join each downed unit back into the cluster:
   - Grab a new MySQL shell in Python mode (the snippet below uses the Python API names): `mysqlsh --py`
   - Join the cluster:
     ```python
     shell.connect('clusteruser:[use cluster-password]@[IP of the working unit]')
     cluster = dba.get_cluster()
     cluster.add_instance('clusteruser:[use cluster-password]@[this units IP]')
     ```
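When several units are down, step 3 can be wrapped in a small helper. The following is a minimal sketch, not part of the charm: the function name `rejoin_units` and its parameters are hypothetical, and it assumes `mysqlsh` is running in Python mode so that the `shell` and `dba` globals can be passed in. It uses only the calls shown above (`shell.connect`, `dba.get_cluster`, `cluster.add_instance`). Percent-encoding the password is an added assumption, made so that characters such as `@` or `:` do not break the connection URI.

```python
from urllib.parse import quote


def rejoin_units(shell, dba, working_ip, downed_ips, cluster_password,
                 user="clusteruser"):
    """Connect to the working unit and add each downed unit back to the cluster.

    `shell` and `dba` are the MySQL Shell globals; they are passed in explicitly
    so the helper can also be exercised with stand-in objects.
    """
    # Assumption: percent-encode the password so it is safe inside a URI.
    encoded_password = quote(cluster_password, safe="")
    shell.connect(f"{user}:{encoded_password}@{working_ip}")
    cluster = dba.get_cluster()
    for ip in downed_ips:
        # One add_instance call per downed unit, as in the manual steps.
        cluster.add_instance(f"{user}:{encoded_password}@{ip}")
    return cluster
```

Inside `mysqlsh --py`, after pasting the definition, it could be called as `rejoin_units(shell, dba, '[IP of the working unit]', ['[IP of downed unit]'], '[use cluster-password]')`. Each unit should already have completed the clone in step 2 before it is added.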

**Errata** for the "clone method":
- Where `CLONE INSTANCE` fails, stating the plugin is not loaded, it may need to be loaded:
  ```sql
  INSTALL PLUGIN clone SONAME 'mysql_clone.so';
  ```
- Where an error is raised regarding `clone_valid_donor_list`, the donor address (the IP of the working unit) may need to be added to that list on the unit being recovered:
  ```sql
  SET GLOBAL clone_valid_donor_list = '[IP of the working unit]:3306';
  ```
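Because the clone statements must be repeated on every downed unit, it can help to generate them rather than edit them by hand. The sketch below is illustrative only: `sql_quote` and `clone_statements` are hypothetical helper names, the two errata statements are opt-in flags, and the donor list is assumed to name the donor (the working unit), as the MySQL clone plugin documentation describes for `clone_valid_donor_list`. The `\W` client shortcut from the manual steps is omitted, since it is a `mysql` client command rather than SQL. The quoting assumes the server's default SQL mode (no `NO_BACKSLASH_ESCAPES`).

```python
def sql_quote(value: str) -> str:
    """Render a value as a MySQL string literal.

    Assumption: default SQL mode, where backslash is an escape character.
    """
    escaped = value.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"


def clone_statements(donor_ip, cluster_password, user="clusteruser", port=3306,
                     load_plugin=False, set_donor_list=False):
    """Return the SQL statements to run on a downed unit, in order."""
    statements = [
        "STOP GROUP_REPLICATION;",
        "SET GLOBAL super_read_only = 0;",
    ]
    if load_plugin:
        # Erratum 1: the clone plugin is not loaded.
        statements.append("INSTALL PLUGIN clone SONAME 'mysql_clone.so';")
    if set_donor_list:
        # Erratum 2: the donor is not in clone_valid_donor_list.
        donor = sql_quote(f"{donor_ip}:{port}")
        statements.append(f"SET GLOBAL clone_valid_donor_list = {donor};")
    statements.append(
        f"CLONE INSTANCE FROM {sql_quote(user)}@{sql_quote(donor_ip)}:{port} "
        f"IDENTIFIED BY {sql_quote(cluster_password)} REQUIRE SSL;"
    )
    return statements
```

For example, `print("\n".join(clone_statements("[IP of the working unit]", "[use cluster-password]", set_donor_list=True)))` prints the statements for pasting into the `mysql` prompt on the downed unit.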

If possible, could this also be made into a separate action?

John Lettman (jplettman)
description: updated