Cluster stuck on "SystemError: RuntimeError: Cluster.add_instance: RESET MASTER is not allowed because Group Replication is running."

Bug #1912688 reported by Pedro Guimarães
This bug affects 8 people
Affects: MySQL InnoDB Cluster Charm
Status: Fix Released
Importance: High
Assigned to: David Ames

Bug Description

Hi,

Running the latest mysql-innodb-cluster charm on focal / ussuri,
I am seeing the cluster stuck on "Not all instances clustered".

I am aware of LP #1881735 and #1901771.
My issue does not seem to be directly related to either.

Looking at the Juju logs, I can see that the charm consistently fails with:
2021-01-21 19:34:43 INFO juju-log Adding instance, <redacted>.34, to the cluster.
2021-01-21 19:34:43 ERROR juju-log Failed adding instance <redacted>.34 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory

Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '<redacted>.34:33061'. Use the localAddress option to override.

Validating instance configuration at <redacted>.34:3306...
This instance reports its own address as <redacted>.34:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Traceback (most recent call last):
  File "<string>", line 3, in <module>
SystemError: RuntimeError: Cluster.add_instance: RESET MASTER is not allowed because Group Replication is running.

That repeats on essentially every update_status.
Going back to the addInstance command we are trying to execute here, the documentation states:
https://dev.mysql.com/doc/dev/mysqlsh-api-javascript/8.0/classmysqlsh_1_1dba_1_1_cluster.html

auto: let Group Replication choose whether or not a full snapshot has to be taken, based on what the target server supports and the group_replication_clone_threshold sysvar. This is the default value. A prompt will be shown if not possible to safely determine a safe way forward. If interaction is disabled, the operation will be canceled instead.

If I switch the "recoveryMethod" from "clone" to "auto" for that particular addInstance call, it works:
https://pastebin.canonical.com/p/SJgZB8xYwZ/

So, is there any reason why we are using the "clone" recovery method instead of "auto"?
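For reference, these are roughly the two calls being compared, as issued from mysqlsh in Python mode (a sketch only: the account and address are placeholders, not the charm's actual invocation):

    # Sketch only: run inside `mysqlsh --py`, connected to an existing
    # cluster member; 'clusteruser@10.0.0.34:3306' is a placeholder.
    cluster = dba.get_cluster()

    # What the charm does today: force a full physical snapshot.
    cluster.add_instance('clusteruser@10.0.0.34:3306',
                         {'recoveryMethod': 'clone'})

    # The alternative being asked about: let Group Replication decide.
    cluster.add_instance('clusteruser@10.0.0.34:3306',
                         {'recoveryMethod': 'auto'})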

Revision history for this message
David Ames (thedac) wrote :

So, as you can see, the first functional test [1] from [0] failed because it did not cluster with auto. I originally set recoveryMethod to clone due to these kinds of problems. I am open to changing it, but it will need to be thoroughly tested.

Having the logs from all the mysql-innodb-cluster units from your particular failure would be very helpful, as well as the bundle.

I'll test the auto setting myself and report back.

[0] https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/771882
[1] https://openstack-ci-reports.ubuntu.com/artifacts/test_charm_pipeline_func_smoke/openstack/charm-mysql-innodb-cluster/771882/2/21322/index.html

Revision history for this message
David Ames (thedac) wrote :

From the functional test:

2021-01-21 21:21:05 INFO juju-log cluster:2: Configuring instance for clustering: 172.17.105.14.
2021-01-21 21:21:06 INFO juju-log cluster:2: Adding instance, 172.17.105.14, to the cluster.
2021-01-21 21:21:07 ERROR juju-log cluster:2: Failed adding instance 172.17.105.14 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
WARNING: A GTID set check of the MySQL instance at '172.17.105.14:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

172.17.105.14:3306 has the following errant GTIDs that do not exist in the cluster:
70fc52c0-5c2e-11eb-88ae-fa163ec8b92c:1-16

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 172.17.105.14:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
ERROR: The target instance must be either cloned or fully provisioned before it can be added to the target cluster.
Traceback (most recent call last):
  File "<string>", line 3, in <module>
SystemError: MYSQLSH (51153): Cluster.add_instance: Instance provisioning required

I confirmed this by repeating the functional test manually. So, rather than the patchset change, I would rather look closer at the particular failure you saw on site.

Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
David Ames (thedac)
Changed in charm-mysql-innodb-cluster:
status: Incomplete → Confirmed
importance: Undecided → High
assignee: nobody → David Ames (thedac)
Revision history for this message
David Ames (thedac) wrote :

OK, we have now captured this failure a couple of times.

The issue is that the first attempt to add the node to the cluster fails due to "contains transactions that do not originate from the cluster":

2021-03-17 23:31:37 ERROR juju-log cluster:7: Failed adding instance 10.5.0.30 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
WARNING: A GTID set check of the MySQL instance at '10.5.0.30:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

10.5.0.30:3306 has the following errant GTIDs that do not exist in the cluster:
aa6b07b2-8778-11eb-ad15-fa163ec0cd82:1-16

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 10.5.0.30:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.5.0.30:33061'. Use the localAddress option to override.

Validating instance configuration at 10.5.0.30:3306...
This instance reports its own address as 10.5.0.30:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.5.0.30:3306 is being cloned from 10.5.0.6:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 72.20 MB transferred in about 1 second (~72.20 MB/s)

Then subsequent attempts to re-add the node fail with "RESET MASTER":

2021-03-17 23:32:08 ERROR juju-log cluster:7: Failed adding instance 10.5.0.30 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory

Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.5.0.30:33061'. Use the localAddress option to override.

Validating instance configuration at 10.5.0.30:3306...
This instance reports its own address as 10.5.0.30:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a f[...]
Traceback (most recent call last):
  ...

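In other words, the two excerpts above line up with the error text itself: once the clone completes, 10.5.0.30 is already a running Group Replication member, so a repeated add_instance with clone recovery tries to RESET MASTER on a live member and is refused. One way to confirm that state from mysqlsh before retrying (illustrative only, not what the charm does) is:

    # Sketch only: run inside `mysqlsh --py` against a healthy member;
    # account and addresses are placeholders.
    cluster = dba.get_cluster()
    topology = cluster.status()['defaultReplicaSet']['topology']

    # If the "failed" instance already appears in the topology, it has in
    # fact joined, and re-running add_instance is unnecessary.
    print('10.5.0.30:3306' in topology)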

Revision history for this message
David Ames (thedac) wrote :
Changed in charm-mysql-innodb-cluster:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/781922
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/8c9920ec6a8a7b6dbc5679c5bef90a9966defc1d
Submitter: "Zuul (22348)"
Branch: master

commit 8c9920ec6a8a7b6dbc5679c5bef90a9966defc1d
Author: David Ames <email address hidden>
Date: Fri Mar 19 08:41:27 2021 -0700

    Do not fail on Cloned recoveryMethod

    When the recoveryMethod clone actually needs to overwrite the remote
    node the mysql-shell unfortunately returns with returncode 1. Both
    "Clone process has finished" and "Group Replication is running"
    actually indicate successful states.

    Handle these two edge cases as successful.

    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/565

    Closes-Bug: #1912688
    Change-Id: Ia0e99feee76f403ba5ed6e631bd0671c017c9c2c
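
A minimal sketch of the handling this commit describes, assuming the charm shells out to mysqlsh and inspects its combined output (the function name, flags, and markers below are illustrative, not the charm's actual code):

    # Sketch only: treat the two known-good outcomes as success even though
    # mysqlsh exits non-zero. Invocation details are illustrative.
    import subprocess

    SUCCESS_MARKERS = (
        "Clone process has finished",
        "Group Replication is running",
    )

    def add_instance_ok(script_path):
        """Run a mysqlsh script that calls cluster.add_instance()."""
        result = subprocess.run(
            ["mysqlsh", "--no-wizard", "-f", script_path],
            capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Clone recovery restarts the target server, so mysqlsh can return 1
        # even when the instance did end up clustered.
        output = result.stdout + result.stderr
        return any(marker in output for marker in SUCCESS_MARKERS)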

Changed in charm-mysql-innodb-cluster:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (stable/21.04)
Revision history for this message
Dongwon Cho (dongwoncho) wrote :

I am not sure if it's something different, but I got this [1] with revision 7 again today and had to `stop group_replication` and `add-instance` manually.
[1] https://pastebin.canonical.com/p/NrPzM8y4zy/
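
For anyone else hitting this state, the manual recovery described above maps roughly onto the following mysqlsh steps (a sketch only: the account and addresses are placeholders, and the charm's `add-instance` action may do more than this):

    # Sketch only: run inside `mysqlsh --py`; account/addresses are
    # placeholders for the stuck unit and a healthy cluster member.
    session = shell.connect('clusteruser@10.5.0.30:3306')
    session.run_sql('STOP GROUP_REPLICATION')   # stop the half-joined member

    # Reconnect to a healthy member and re-add the stuck instance.
    shell.connect('clusteruser@10.5.0.6:3306')
    cluster = dba.get_cluster()
    cluster.add_instance('clusteruser@10.5.0.30:3306',
                         {'recoveryMethod': 'clone'})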

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (stable/21.04)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/791049
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/ed8b3d46271a2018017fd49e0322033f4fbe3d82
Submitter: "Zuul (22348)"
Branch: stable/21.04

commit ed8b3d46271a2018017fd49e0322033f4fbe3d82
Author: David Ames <email address hidden>
Date: Fri Mar 19 08:41:27 2021 -0700

    Do not fail on Cloned recoveryMethod

    When the recoveryMethod clone actually needs to overwrite the remote
    node the mysql-shell unfortunately returns with returncode 1. Both
    "Clone process has finished" and "Group Replication is running"
    actually indicate successful states.

    Handle these two edge cases as successful.

    func-test-pr: https://github.com/openstack-charmers/zaza-openstack-tests/pull/565

    Closes-Bug: #1912688
    Change-Id: Ia0e99feee76f403ba5ed6e631bd0671c017c9c2c
    (cherry picked from commit 8c9920ec6a8a7b6dbc5679c5bef90a9966defc1d)

Changed in charm-mysql-innodb-cluster:
milestone: none → 21.10
Changed in charm-mysql-innodb-cluster:
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-mysql-innodb-cluster (master)

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/771882
Reason: This review is > 12 weeks without comment and currently blocked by a core reviewer with a -2. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and contacting the reviewer with the -2 on this review to ensure you address their concerns.
