Charm stuck in waiting after rejoining the cluster

Bug #1983158 reported by Vern Hart
This bug affects 4 people
Affects: MySQL InnoDB Cluster Charm
Status: In Progress
Importance: Medium
Assigned to: Rodrigo Barbieri

Bug Description

Running focal/ussuri.

I was testing availability zone failure, and the node hosting one of the mysql-innodb-cluster units went down.

After the nodes came up, all services recovered fairly quickly except mysql-innodb-cluster/2.

It says:
  Cluster is inaccessible from this instance. Please check logs for details.

Checking the logs I see:
  RuntimeError: Dba.get_cluster: Group replication does not seem to be active in instance '10.103.223.3:3306'
10.103.223.3 is the IP of mysql-innodb-cluster/2.
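One way to confirm that group replication really is down on the affected unit (assuming root access with the cluster password) is to query performance_schema, for example:
  mysql -u root -p -e "SELECT member_host, member_state FROM performance_schema.replication_group_members;"
An empty result, or the member showing OFFLINE/ERROR, matches the "Group replication does not seem to be active" message above.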

I tried the juju action (from a healthy unit) to rejoin the cluster:
  juju run-action mysql-innodb-cluster/0 --wait rejoin-instance address=10.103.223.3
But that failed saying:
  The group_replication_group_name cannot be changed when Group Replication is running
The mysql logs say there may be corruption in the relay log. They also say the member was set to read-only and then left the group (well before I ran the rejoin-instance action).
I suspect running the reboot-cluster-from-complete-outage action on one of the good units would probably work, but that seems like a bigger hammer than necessary.
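For reference, that bigger hammer would presumably be run from one of the good units along these lines:
  juju run-action mysql-innodb-cluster/0 --wait reboot-cluster-from-complete-outage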

I'll try removing and re-adding:
  juju run-action mysql-innodb-cluster/0 --wait remove-instance address=10.103.223.3 force=true
  juju run-action mysql-innodb-cluster/0 --wait add-instance address=10.103.223.3
force=true was necessary because the node is marked ERROR.
These actions succeeded but now the juju status for that unit says:
  Instance not yet configured for clustering

After connecting to mysql in the bad unit (using the pw from leader-get mysql.passwd) I executed:
  stop group_replication;
  reset replica;
Afterwards, running the add-instance action worked and the cluster-status action showed all three nodes joined, with the new one RO, as expected.
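Pieced together, the rough end-to-end recovery sequence (assuming the same unit names and the mysql.passwd leader key mentioned above; adjust for your deployment) was:
  # fetch the cluster password from the leader unit
  juju run -u mysql-innodb-cluster/leader -- leader-get mysql.passwd
  # on the affected unit, stop group replication and discard its relay logs
  juju ssh mysql-innodb-cluster/2
    mysql -u root -p
      stop group_replication;
      reset replica;
  # then, from a healthy unit, re-add the instance and check the result
  juju run-action mysql-innodb-cluster/0 --wait add-instance address=10.103.223.3
  juju run-action mysql-innodb-cluster/0 --wait cluster-status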

However, juju status still shows it's waiting with:
  Instance not yet configured for clustering

I've tried manually running hooks and restarting both mysql and the juju agent on the suspect node, and the status still shows waiting.

Checking the logs, neither mysql nor juju shows any errors and the unit appears to be functioning correctly, so this seems to be a charm bug rather than the actual state of the cluster.

Tags: sts
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Adding an instance back after it has been forcibly removed causes the charm state to go out of sync. The charm currently does not monitor the cluster state properly, so the following command is required to set the charm state back to what it should be after the instance is added back:

juju run -u mysql-innodb-cluster/leader -- leader-set cluster-instance-configured-192-168-0-32=True

replace 192-168-0-32 with the IP of the instance you added back (dots replaced by dashes)
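For example, with the instance from this report (10.103.223.3), that becomes:

juju run -u mysql-innodb-cluster/leader -- leader-set cluster-instance-configured-10-103-223-3=True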

tags: added: sts
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (master)
Alex Kavanagh (ajkavanagh) wrote :

What needs to be resolved is how to fix the situation where the charm's state indicates 'not yet configured' after out-of-band mysql actions (i.e. not charm actions) were used to resolve clustering issues. What needs to happen is for the charm to 'discover' that the underlying mysql cluster is healthy and that the unit is configured, and reflect that, rather than using previously cached information (flags, state on relations, etc.). The charm should always try to resolve its state from the environment, rather than holding on to cached information.
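As a rough illustration, that ground truth is the same information the existing cluster-status action already reports, e.g.:
  juju run-action mysql-innodb-cluster/0 --wait cluster-status
If that output already shows the unit as a healthy member of the cluster, the "Instance not yet configured for clustering" workload status is stale.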

Changed in charm-mysql-innodb-cluster:
importance: Undecided → Medium
status: New → Triaged
Alex Kavanagh (ajkavanagh) wrote :

Note that this is almost certainly related to: https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/2013078, and they are likely to be fixed at the same time.

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Alex. As I commented in the other bug you linked above, I strongly disagree that they are related, and it is very unlikely that they will be fixed at the same time by addressing a single root cause.

Regarding detecting the cluster status and setting the flags appropriately: we've discussed that in the past and agreed not to do it at this time. That is why in my patch above I am merely erroring out so that intervention steps can be pointed out to the user, as a first step that improves the UX of resolving the problem.

But I agree the ideal, though more complex, solution would be to detect the cluster status and set the flags appropriately.

Alex Kavanagh (ajkavanagh) wrote :

Quoting from the above bug:

---

I'll try removing and re-adding:
  juju run-action mysql-innodb-cluster/0 --wait remove-instance address=10.103.223.3 force=true
  juju run-action mysql-innodb-cluster/0 --wait add-instance address=10.103.223.3
force=true was necessary because the node is marked ERROR.
These actions succeeded but now the juju status for that unit says:
  Instance not yet configured for clustering

After connecting to mysql in the bad unit (using the pw from leader-get mysql.passwd) I executed:
  stop group_replication;
  reset replica;
Afterwards, running the add-instance action worked and the cluster-status action showed all three nodes joined, with the new one RO, as expected.

However, juju status still shows it's waiting with:
  Instance not yet configured for clustering

---

i.e. it's related in that an instance was removed and re-added. I'm just sign-posting so that people who read these bugs can see that other bugs also exist around adding and removing instances.

Felipe Reyes (freyes)
Changed in charm-mysql-innodb-cluster:
assignee: nobody → Rodrigo Barbieri (rodrigo-barbieri2010)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/875041
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/fe8c359fc2150cdc50b37488af9a85f2b87129d8
Submitter: "Zuul (22348)"
Branch: master

commit fe8c359fc2150cdc50b37488af9a85f2b87129d8
Author: Rodrigo Barbieri <email address hidden>
Date: Fri Feb 24 10:54:33 2023 -0300

    Improve add/remove actions

    - Improved docs for remove-instance action
      in light of bugs LP#1954306 and LP#2006759
    - Fixed exception handling of failed configure
      step of add-instance action, which resulted in
      bug LP#1983158
    - Fixed exception handling which
      was raising a new exception instead
      of reraising the existing one.

    Closes-bug: #1983158
    Change-Id: I03096239d42cc8fcf93ca22ead6b84766d6f5926
