MySQL InnoDB Cluster Charm

Bug #1912688
Comment #3

David Ames (thedac) wrote on 2021-03-18:

OK, we have now captured this failure a couple of times.

The issue is that the first attempt to add the node to the cluster fails because the instance "contains transactions that do not originate from the cluster":

2021-03-17 23:31:37 ERROR juju-log cluster:7: Failed adding instance 10.5.0.30 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
WARNING: A GTID set check of the MySQL instance at '10.5.0.30:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

10.5.0.30:3306 has the following errant GTIDs that do not exist in the cluster:
aa6b07b2-8778-11eb-ad15-fa163ec0cd82:1-16

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 10.5.0.30:3306 with a physical snapshot from an existing cluster member. To use this method by default,
set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.5.0.30:33061'. Use the localAddress option to override.

Validating instance configuration at 10.5.0.30:3306...
This instance reports its own address as 10.5.0.30:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
Monitoring recovery process of the new cluster member. Press ^C to stop monitoring and let it continue in background.
Clone based state recovery is now in progress.

NOTE: A server restart is expected to happen as part of the clone process. If the
server does not support the RESTART command or does not come back after a
while, you may need to manually start it back.

* Waiting for clone to finish...
NOTE: 10.5.0.30:3306 is being cloned from 10.5.0.6:3306
** Stage DROP DATA: Completed
** Stage FILE COPY: Completed
** Stage PAGE COPY: Completed
** Stage REDO COPY: Completed
** Stage FILE SYNC: Completed
** Stage RESTART: Completed
* Clone process has finished: 72.20 MB transferred in about 1 second (~72.20 MB/s)

Subsequent attempts to re-add the node then fail with "RESET MASTER is not allowed because Group Replication is running":

2021-03-17 23:32:08 ERROR juju-log cluster:7: Failed adding instance 10.5.0.30 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory

Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.5.0.30:33061'. Use the localAddress option to override.

Validating instance configuration at 10.5.0.30:3306...
This instance reports its own address as 10.5.0.30:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a fTraceback (most recent call last):
File "<string>", line 3, in <module>
SystemError: RuntimeError: Cluster.add_instance: RESET MASTER is not allowed because Group Replication is running.
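
To tell the two failure modes apart programmatically, something like the following could be used. This is only a sketch: the function name and the returned labels are hypothetical and not part of the charm, and the matching relies solely on the message strings seen in the logs above. Captured output can include terminal colour codes, so they are stripped first.

```python
import re

# ANSI colour sequences, either as a real ESC byte or in caret notation ("^[").
ANSI_RE = re.compile(r"(?:\x1b|\^\[)\[[0-9;]*m")
# One GTID set entry: a server UUID followed by one or more transaction intervals.
GTID_RE = re.compile(
    r"[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}(?::\d+(?:-\d+)?)+",
    re.IGNORECASE,
)


def classify_add_instance_output(output):
    """Return (kind, errant_gtids) for captured add-instance output.

    The kind labels are hypothetical names chosen for this sketch.
    """
    text = ANSI_RE.sub("", output)
    if "RESET MASTER is not allowed because Group Replication is running" in text:
        return "group_replication_running", []
    if "errant GTIDs that do not exist in the cluster" in text:
        # Keep only lines that consist entirely of a GTID set entry.
        gtids = [
            line.strip()
            for line in text.splitlines()
            if GTID_RE.fullmatch(line.strip())
        ]
        return "errant_gtids", gtids
    return "unknown", []
```

The RESET MASTER check comes first because, in the second log, that error is what actually aborts the run.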

So the root cause is the extra (errant) GTIDs found on the new instance. It is only a WARNING, but the command is still returning non-zero. So it *may* be OK to ignore, as the subsequent output indicates that the clone process drops the instance's existing data.
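
If ignoring the non-zero return does turn out to be safe, the check might look roughly like this. It is a heuristic sketch of the hypothesis above, not a confirmed fix: the function name is hypothetical, and the only signals used are the "Clone process has finished" line and the absence of a Python traceback in the captured output.

```python
def clone_recovery_completed(returncode, output):
    """Guess whether an add-instance run can be treated as successful.

    Assumption: a non-zero return is tolerable only when the output shows
    that clone recovery ran to completion and no traceback was raised.
    """
    if returncode == 0:
        return True
    finished = "Clone process has finished" in output
    crashed = "Traceback (most recent call last)" in output
    return finished and not crashed
```

Even when this returns True, it would be prudent to confirm the member's state in the cluster before treating the add as done.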

I will also re-re-re-investigate the 'auto' recoveryMethod setting. It may be that on a second attempt 'auto' would fix things, but we may still need 'clone' for the first attempt.
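
The idea being considered could be sketched as below. This is untested speculation: whether 'auto' actually recovers on a retry is exactly what still needs investigating. The helper names are hypothetical; the only real API detail assumed is that the AdminAPI add-instance call accepts an options dictionary with a recoveryMethod key, as the log output suggests.

```python
def recovery_method_for_attempt(attempt):
    """Pick a recoveryMethod value: 'clone' on the first try, 'auto' on retries.

    attempt is 1-based. The clone-then-auto policy is a hypothesis, not a
    verified fix.
    """
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return "clone" if attempt == 1 else "auto"


def add_instance_options(attempt):
    """Build the options dict that would be passed to the add-instance call."""
    return {"recoveryMethod": recovery_method_for_attempt(attempt)}
```

For example, add_instance_options(1) selects clone, while any later attempt selects auto.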

I'll investigate and report back.
