Cluster stuck on "SystemError: RuntimeError: Cluster.add_instance: RESET MASTER is not allowed because Group Replication is running."

Bug #1912688 reported by Pedro Guimarães
This bug affects 8 people
Affects: MySQL InnoDB Cluster Charm
Status: Fix Released
Importance: High
Assigned to: David Ames

Bug Description

Hi,

Running the latest mysql-innodb-cluster charm on focal / ussuri,
I am seeing the cluster stuck on "Not all instances clustered".

I am aware of LP #1881735 and #1901771.
My issue does not seem to be directly related to either.

Looking at the Juju logs, I can see that the charm consistently fails with:
2021-01-21 19:34:43 INFO juju-log Adding instance, <redacted>.34, to the cluster.
2021-01-21 19:34:43 ERROR juju-log Failed adding instance <redacted>.34 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory

Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '<redacted>.34:33061'. Use the localAddress option to override.

Validating instance configuration at <redacted>.34:3306...
This instance reports its own address as <redacted>.34:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Traceback (most recent call last):
  File "<string>", line 3, in <module>
SystemError: RuntimeError: Cluster.add_instance: RESET MASTER is not allowed because Group Replication is running.

That repeats on essentially every update_status.
Going back to the addInstance command we are trying to execute here, the documentation states:
https://dev.mysql.com/doc/dev/mysqlsh-api-javascript/8.0/classmysqlsh_1_1dba_1_1_cluster.html

auto: let Group Replication choose whether or not a full snapshot has to be taken, based on what the target server supports and the group_replication_clone_threshold sysvar. This is the default value. A prompt will be shown if not possible to safely determine a safe way forward. If interaction is disabled, the operation will be canceled instead.

If I switch the "recoveryMethod" from "clone" to "auto" for that particular addInstance call, it works:
https://pastebin.canonical.com/p/SJgZB8xYwZ/

So, is there any reason why we are using the "clone" recovery method instead of "auto"?
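For reference, these are roughly the two calls being compared, as issued from mysqlsh in Python mode (a sketch only: the account and address are placeholders, not the charm's actual invocation):

    # Sketch only: run inside `mysqlsh --py`, connected to an existing
    # cluster member; 'clusteruser@10.0.0.34:3306' is a placeholder.
    cluster = dba.get_cluster()

    # What the charm does today: force a full physical snapshot.
    cluster.add_instance('clusteruser@10.0.0.34:3306',
                         {'recoveryMethod': 'clone'})

    # The alternative being asked about: let Group Replication decide.
    cluster.add_instance('clusteruser@10.0.0.34:3306',
                         {'recoveryMethod': 'auto'})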

Revision history for this message
David Ames (thedac) wrote :

So, as you can see, the first functional test [1] from [0] failed because it did not cluster with auto. I originally set recoveryMethod to clone due to these kinds of problems. I am open to changing it, but it will need to be thoroughly tested.

Having the logs from all the mysql-innodb-cluster units from your particular failure would be very helpful, as well as the bundle.

I'll test the auto setting myself and report back.

[0] https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/771882
[1] https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_smoke/openstack/charm-mysql-innodb-cluster/771882/2/21322/index.html

Revision history for this message
David Ames (thedac) wrote :

From the functional test:

2021-01-21 21:21:05 INFO juju-log cluster:2: Configuring instance for clustering: 172.17.105.14.
2021-01-21 21:21:06 INFO juju-log cluster:2: Adding instance, 172.17.105.14, to the cluster.
2021-01-21 21:21:07 ERROR juju-log cluster:2: Failed adding instance 172.17.105.14 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
WARNING: A GTID set check of the MySQL instance at '172.17.105.14:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

172.17.105.14:3306 has the following errant GTIDs that do not exist in the cluster:
70fc52c0-5c2e-11eb-88ae-fa163ec8b92c:1-16

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 172.17.105.14:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
ERROR: The target instance must be either cloned or fully provisioned before it can be added to the target cluster.
Traceback (most recent call last):
  File "<string>", line 3, in <module>
SystemError: MYSQLSH (51153): Cluster.add_instance: Instance provisioning required

I confirmed this by repeating the functional test manually. So, rather than the patchset change, I would rather look closer at the particular failure you saw on site.

Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
David Ames (thedac)
Changed in charm-mysql-innodb-cluster:
status: Incomplete → Confirmed
importance: Undecided → High
assignee: nobody → David Ames (thedac)
Revision history for this message
David Ames (thedac) wrote :

OK, we have now captured this failure a couple of times.

The issue is that the first attempt to add the node to the cluster fails due to "contains transactions that do not originate from the cluster":

2021-03-17 23:31:37 ERROR juju-log cluster:7: Failed adding instance 10.5.0.30 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
WARNING: A GTID set check of the MySQL instance at '10.5.0.30:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

10.5.0.30:3306 has the following errant GTIDs that do not exist in the cluster:
aa6b07b2-8778-11eb-ad15-fa163ec0cd82:1-16

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 10.5.0.30:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.5.0.30:33061'. Use the localAddress option to override.

Validating instance configuration at 10.5.0.30:3306...
This instance reports its own address as 10.5.0.30:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.5.0.30:3306 is being cloned from 10.5.0.6:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 72.20 MB transferred in about 1 second (~72.20 MB/s)

Then subsequent attempts to re-add the node fail with "RESET MASTER":

2021-03-17 23:32:08 ERROR juju-log cluster:7: Failed adding instance 10.5.0.30 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory

Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.5.0.30:33061'. Use the localAddress option to override.

Validating instance configuration at 10.5.0.30:3306...
This instance reports its own address as 10.5.0.30:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a f[...]
Traceback (most recent call last):
  ...

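In other words, the two excerpts above line up with the error text itself: once the clone completes, 10.5.0.30 is already a running Group Replication member, so a repeated add_instance with clone recovery tries to RESET MASTER on a live member and is refused. One way to confirm that state from mysqlsh before retrying (illustrative only, not what the charm does) is:

    # Sketch only: run inside `mysqlsh --py` against a healthy member;
    # account and addresses are placeholders.
    cluster = dba.get_cluster()
    topology = cluster.status()['defaultReplicaSet']['topology']

    # If the "failed" instance already appears in the topology, it has in
    # fact joined, and re-running add_instance is unnecessary.
    print('10.5.0.30:3306' in topology)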

Revision history for this message
David Ames (thedac) wrote :
Changed in charm-mysql-innodb-cluster:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/781922
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/8c9920ec6a8a7b6dbc5679c5bef90a9966defc1d
Submitter: "Zuul (22348)"
Branch: master

commit 8c9920ec6a8a7b6dbc5679c5bef90a9966defc1d
Author: David Ames <email address hidden>
Date: Fri Mar 19 08:41:27 2021 -0700

    Do not fail on Cloned recoveryMethod

    When the recoveryMethod clone actually needs to overwrite the remote
    node the mysql-shell unfortunately returns with returncode 1. Both
    "Clone process has finished" and "Group Replication is running"
    actually indicate successful states.

    Handle these two edge cases as successful.

    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/565

    Closes-Bug: #1912688
    Change-Id: Ia0e99feee76f403ba5ed6e631bd0671c017c9c2c
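
A minimal sketch of the handling this commit describes, assuming the charm shells out to mysqlsh and inspects its combined output (the function name, flags, and markers below are illustrative, not the charm's actual code):

    # Sketch only: treat the two known-good outcomes as success even though
    # mysqlsh exits non-zero. Invocation details are illustrative.
    import subprocess

    SUCCESS_MARKERS = (
        "Clone process has finished",
        "Group Replication is running",
    )

    def add_instance_ok(script_path):
        """Run a mysqlsh script that calls cluster.add_instance()."""
        result = subprocess.run(
            ["mysqlsh", "--no-wizard", "-f", script_path],
            capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Clone recovery restarts the target server, so mysqlsh can return 1
        # even when the instance did end up clustered.
        output = result.stdout + result.stderr
        return any(marker in output for marker in SUCCESS_MARKERS)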

Changed in charm-mysql-innodb-cluster:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (stable/21.04)
Revision history for this message
Dongwon Cho (dongwoncho) wrote :

I am not sure if it's something different, but I got this [1] with revision 7 again today and had to `stop group_replication` and `add-instance` manually.
[1] https://pastebin.canonical.com/p/NrPzM8y4zy/
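
For anyone else hitting this state, the manual recovery described above maps roughly onto the following mysqlsh steps (a sketch only: the account and addresses are placeholders, and the charm's `add-instance` action may do more than this):

    # Sketch only: run inside `mysqlsh --py`; account/addresses are
    # placeholders for the stuck unit and a healthy cluster member.
    session = shell.connect('clusteruser@10.5.0.30:3306')
    session.run_sql('STOP GROUP_REPLICATION')   # stop the half-joined member

    # Reconnect to a healthy member and re-add the stuck instance.
    shell.connect('clusteruser@10.5.0.6:3306')
    cluster = dba.get_cluster()
    cluster.add_instance('clusteruser@10.5.0.30:3306',
                         {'recoveryMethod': 'clone'})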

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (stable/21.04)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/791049
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/ed8b3d46271a2018017fd49e0322033f4fbe3d82
Submitter: "Zuul (22348)"
Branch: stable/21.04

commit ed8b3d46271a2018017fd49e0322033f4fbe3d82
Author: David Ames <email address hidden>
Date: Fri Mar 19 08:41:27 2021 -0700

    Do not fail on Cloned recoveryMethod

    When the recoveryMethod clone actually needs to overwrite the remote
    node the mysql-shell unfortunately returns with returncode 1. Both
    "Clone process has finished" and "Group Replication is running"
    actually indicate successful states.

    Handle these two edge cases as successful.

    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/565

    Closes-Bug: #1912688
    Change-Id: Ia0e99feee76f403ba5ed6e631bd0671c017c9c2c
    (cherry picked from commit 8c9920ec6a8a7b6dbc5679c5bef90a9966defc1d)

Changed in charm-mysql-innodb-cluster:
milestone: none → 21.10
Changed in charm-mysql-innodb-cluster:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-mysql-innodb-cluster (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/771882
Reason: This review is > 12 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.
