removing an instance and adding it back does not work

Bug #2006760 reported by Rodrigo Barbieri
This bug affects 1 person
Affects: MySQL InnoDB Cluster Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

On a fresh jammy 3-unit deployment using charm revision 39 from the 8.0/stable channel, removing an instance and then adding it back results in an error:

juju run-action mysql-innodb-cluster/leader --wait remove-instance address=10.5.3.85

The remove action fails due to bug LP#1954306, but it actually partially succeeds in removing the instance:

{"address": "10.5.3.85:3306", "instanceErrors":
      ["NOTE: group_replication is stopped."], "memberState": "OFFLINE", "mode": "R/O",
      "readReplicas": {}, "role": "HA", "status": "(MISSING)", "version": "8.0.32"}

The instance is not removed from the cluster, but it is taken offline and group_replication is stopped.
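
For reference, this state can be re-checked at any point with the charm's cluster-status action, which reports the same JSON structure (a minimal example, assuming the action name from the charm's action list):

juju run-action mysql-innodb-cluster/leader --wait cluster-status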

Trying to add it back now:

juju run-action mysql-innodb-cluster/leader --wait add-instance address=10.5.3.85

The action succeeds on the leader, but the status does not change. Trying to work around this and start group_replication again, the only action that does so is update-unit-acls, but it cannot be run due to the condition at [1]. Hacking the code to remove the condition, or starting group_replication manually (see the example after the output below), results in the following state:

{"address": "10.5.3.85:3306",
      "instanceErrors": ["ERROR: GR Recovery channel receiver stopped with an error:
      Fatal error: Invalid (empty) username when attempting to connect to the master
      server. Connection attempt terminated. (13117) at 2023-02-09 15:42:58.656640"],
      "mode": "R/O", "readReplicas": {}, "recovery": {"receiverError": "Fatal error:
      Invalid (empty) username when attempting to connect to the master server. Connection
      attempt terminated.", "receiverErrorNumber": 13117, "state": "CONNECTION_ERROR"}
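
(For clarity, "starting group_replication manually" above means running the standard statement directly on the removed unit, e.g.:

juju ssh mysql-innodb-cluster/1
sudo mysql -e "START GROUP_REPLICATION;"

where the unit name is only an example and root socket access to MySQL is assumed.)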

At this point, another workaround is to forcibly remove the instance, but that hits bugs LP#2006759 and LP#1983158.
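
For reference, the forcible removal can be done through the MySQL Shell AdminAPI directly, roughly as follows; this is a sketch, and the connection user and addresses are examples rather than the charm's actual credentials:

juju ssh mysql-innodb-cluster/leader
mysqlsh clusteruser@10.5.3.82 -e "dba.getCluster().removeInstance('10.5.3.85:3306', {force: true})"

removeInstance() with the force option is the standard AdminAPI call for dropping an unreachable or broken member.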

[1] https://github.com/openstack/charm-mysql-innodb-cluster/blob/0a3bb225c1a653767f542e1f9023ad27735a5bc5/src/lib/charm/openstack/mysql_innodb_cluster.py#L2034

Tags: sts
Alex Kavanagh (ajkavanagh) wrote:

I'm fairly sure this is due to, or related to, "During scale-out of cluster (zaza-openstack-tests) the leader fails to join in the new instance when related to prometheus" (https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/2015256), where the create_user() call for the prometheus user causes a write to the db whilst the unit is configured, but not yet joined, to the cluster; this causes the join_instance() to fail at that point.

I'm going to mark this as a dup, but if you feel it is not, then please un-dup it and add further comments/evidence. Thanks.

Rodrigo Barbieri (rodrigo-barbieri2010) wrote:

@Alex: Yes and no. I found one of the causes of this but hadn't had time to post back. The thing is, there are a lot of usability issues being addressed in [1], and fixing those usability issues exposes the problem that when a unit is removed, it stays in SUPER_READ_ONLY mode, and in that mode it cannot be added back. SSH'ing to the unit and disabling SUPER_READ_ONLY fixes it (see the example below). A possible solution is to disable SUPER_READ_ONLY before trying to add the instance back, or right after removing it, just to make it a clean removal; but fixing the usability issues and exposing the problem was my top priority at [1]. I am curious to see if your patch for the bug marked as duplicate will change anything. I will test that soon.

[1] https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/875041
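
(For clarity, the manual SUPER_READ_ONLY fix described above looks like this; a sketch, with the unit name as an example and root socket access assumed:

juju ssh mysql-innodb-cluster/1
sudo mysql -e "SET GLOBAL super_read_only = OFF;")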
