Intermittent deployment failure, unit stuck in "Instance not yet in the cluster" state

Bug #1881735 reported by Frode Nordahl
Affects: MySQL InnoDB Cluster Charm
Status: In Progress
Importance: High
Assigned to: David Ames
Milestone: 21.04

Bug Description

$ juju status mysql-innodb-cluster
Model Controller Cloud/Region Version SLA Timestamp
zaza-4928e0d8aa32 fnordahl-serverstack serverstack/serverstack 2.7.6 unsupported 10:15:11Z

App Version Status Scale Charm Store Rev OS Notes
mysql-innodb-cluster 8.0.20 waiting 3 mysql-innodb-cluster jujucharms 37 ubuntu

Unit Workload Agent Machine Public address Ports Message
mysql-innodb-cluster/0 waiting idle 0 10.5.0.4 Not all instances clustered
mysql-innodb-cluster/1 waiting idle 1 10.5.0.33 Instance not yet in the cluster
mysql-innodb-cluster/2* waiting idle 2 10.5.0.32 Not all instances clustered

Machine State DNS Inst id Series AZ Message
0 started 10.5.0.4 48dfc189-e9b4-4552-991a-31b86315f5f6 focal nova ACTIVE
1 started 10.5.0.33 2c7d7ef7-b36a-4757-92fd-0190cfc4eb09 focal nova ACTIVE
2 started 10.5.0.32 59223555-6e1d-4d1a-bcb9-df39dd251580 focal nova ACTIVE

Excerpt from leader unit log:
2020-06-02 09:58:01 ERROR juju-log cluster:2: Failed adding instance 10.5.0.33 to cluster: Logger: Tried to log to an uninitialized logger.
WARNING: A GTID set check of the MySQL instance at '10.5.0.33:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

10.5.0.33:3306 has the following errant GTIDs that do not exist in the cluster:
cb5028c2-a4b5-11ea-bf92-fa163ed05024:1-16

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 10.5.0.33:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.5.0.33:33061'. Use the localAddress option to override.

Validating instance configuration at 10.5.0.33:3306...
This instance reports its own address as 10.5.0.33:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
ERROR: Unable to start Group Replication for instance '10.5.0.33:3306'. Please check the MySQL server error log for more information.
Traceback (most recent call last):
  File "<string>", line 3, in <module>
SystemError: RuntimeError: Cluster.add_instance: Group Replication failed to start: MySQL Error 3092 (HY000): 10.5.0.33:3306: The server is not configured properly to be an active member of the group. Please see more details on error log.
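
The traceback above comes from the charm driving mysql-shell (hence the File "<string>" frame). For reference only, a minimal sketch of what that AdminAPI call looks like in `mysqlsh --py`; the user, password, addresses and cluster name below are placeholders and assumptions, not the charm's actual values:

    # Sketch only; `shell` and `dba` are globals inside `mysqlsh --py`.
    shell.connect('clusteradmin:password@10.5.0.4:3306')         # an existing cluster member
    cluster = dba.get_cluster('jujuCluster')                      # cluster name is an assumption
    cluster.add_instance('clusteradmin:password@10.5.0.33:3306',
                         {'recoveryMethod': 'clone'})             # "Clone based recovery" as in the log above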

Revision history for this message
Frode Nordahl (fnordahl) wrote :

Hit this again, new crashdump for reference

Revision history for this message
David Ames (thedac) wrote :

The crashdumps including [0] all have the same failure mode. The leader is trying to add an instance to the cluster:

2020-10-14 11:53:43 INFO juju-log cluster:4: Adding instance, 172.17.112.7, to the cluster.
2020-10-14 11:55:38 ERROR juju-log cluster:4: Failed adding instance 172.17.112.7 to cluster: Logger: Tried to log to an uninitialized logger.
WARNING: A GTID set check of the MySQL instance at '172.17.112.7:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

172.17.112.7:3306 has the following errant GTIDs that do not exist in the cluster:
49f37ae7-0e13-11eb-b788-fa163ea0e822:1-16

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 172.17.112.7:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '172.17.112.7:33061'. Use the localAddress option to override.

Validating instance configuration at 172.17.112.7:3306...
This instance reports its own address as 172.17.112.7:3306
Instance configuration is suitable.
A new instance will be added to the InnoDB cluster. Depending on the amount of
data on the cluster this might take from a few seconds to several hours.
Adding instance to the cluster...
ERROR: Unable to start Group Replication for instance '172.17.112.7:3306'. Please check the MySQL server error log for more information.
Traceback (most recent call last):
  File "<string>", line 3, in <module>
SystemError: RuntimeError: Cluster.add_instance: Group Replication failed to start: MySQL Error 3092 (HY000): 172.17.112.7:3306: The server is not configured properly to be an active member of the group. Please see more details on error log.

2020-10-14 11:55:38 DEBUG juju-log cluster:4: tracer: set flag leadership.changed.cluster-instances-configured
2020-10-14 11:55:38 DEBUG juju-log cluster:4: tracer>
tracer: set flag leadership.set.cluster-instances-configured
tracer: ++ queue handler reactive/mysql_innodb_cluster_handlers.py:168:add_instances_to_cluster
tracer: -- dequeue handler reactive/mysql_innodb_cluster_handlers.py:134:configure_instances_for_clustering
2020-10-14 11:55:38 INFO juju-log cluster:4: Invoking reactive handler: hooks/relations/mysql-router/provides.py:47:joined:db-router
2020-10-14 11:55:39 INFO juju-log cluster:4: Invoking reactive handler: hooks/relations/mysql-router/provides.py:53:changed:db-router
2020-10-14 11:55:39 INFO juju-log cluster:4: Invoking reactive handler: hooks/relations/mysql-innodb-cluster/peers.py:69:joined:cluster
2020-10-14 11:55:39 INFO juju-log cluster:4: Invoking reactive handler: hooks/relations/mysql-innodb-cluster/peers.py:75:changed:cluster
2020-10-14 11:55:39 INFO juju-log cluster:4: Invoking...

David Ames (thedac)
Changed in charm-mysql-innodb-cluster:
status: New → Triaged
importance: Undecided → High
assignee: nobody → David Ames (thedac)
milestone: none → 20.10
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/759301

Changed in charm-mysql-innodb-cluster:
status: Triaged → In Progress
Revision history for this message
David Ames (thedac) wrote :

The theory of the case for [0] is that running configure_instance, add_instance_to_cluster, and/or create_cluster too close together may mean the instance is not yet fully ready for the next step.

This is only a theory, as searching for the error messages was inconclusive and I have been unable to re-create the failure locally.

The [0] change spaces these out to give the instance time to be ready for the next step in the clustering process. We can land this post-release and see if OSCI CI runs improve.

[0] https://review.opendev.org/759301
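
For illustration only (this is not the content of [0]), "spacing these out" can be as simple as retrying each clustering operation with a pause in between, so an instance that is not quite ready gets another chance:

    # Illustration only, not the change under review in [0].
    import time

    def retry_with_pause(operation, attempts=3, pause_seconds=30):
        """Run a no-argument callable, pausing and retrying if it raises."""
        for attempt in range(1, attempts + 1):
            try:
                return operation()
            except Exception:
                if attempt == attempts:
                    raise
                time.sleep(pause_seconds)

    # Hypothetical usage around the charm's clustering helpers:
    # retry_with_pause(lambda: instance.configure_instance(address))
    # retry_with_pause(lambda: instance.add_instance_to_cluster(address))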

Revision history for this message
David Ames (thedac) wrote :

Just a note that we have a distinct bug, LP Bug #1901771 [0], that presents in a similar way to this bug, with "Instance not yet in the cluster" and "Not all instances clustered", but the underlying failure is different and it should remain a separate bug.

[0] https://bugs.launchpad.net/snap-mysql-shell/+bug/1901771

David Ames (thedac)
Changed in charm-mysql-innodb-cluster:
milestone: 20.10 → 21.01
Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

I think I may have hit this.

In a recent test I saw mysql-innodb-cluster fail to cluster, hanging in "waiting/Not all instances clustered" and "waiting/Instance not yet in the cluster" status respectively.

This seems intermittent; I only saw it once in about 4 test runs.

Checking the logs of one unit, I found this error:

-- snip --
2021-01-28 21:03:38 ERROR juju-log Failed adding instance 10.0.8.235 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory

Clone based recovery selected through the recoveryMethod option

NOTE: Group Replication will communicate with other members using '10.0.8.235:33061'. Use the localAddress option to override.
...
Adding instance to the cluster...
Traceback (most recent call last):
  File "<string>", line 3, in <module>
SystemError: RuntimeError: Cluster.add_instance: RESET MASTER is not allowed because Group Replication is running.

-- snip --

Charm version:

cs:~openstack-charmers-next/mysql-innodb-cluster-61

LXD container running focal

Additional diagnostics: https://private-fileshare.canonical.com/~sabaini/e8a3afda-c460-4e61-9d33-135cfc666c0f/juju-crashdump-812b5deb-7fd1-4dde-bdd4-4a37b996c214.tar.xz

David Ames (thedac)
Changed in charm-mysql-innodb-cluster:
milestone: 21.01 → none
Revision history for this message
Maciej Leszczynski (mleszczy) wrote :

Any workaround available?

Revision history for this message
Sérgio Manso (sergiomanso) wrote :

I've also faced this issue in the field with cs:mysql-innodb-cluster-3.

The cluster could not be initialized because one of the units got stuck on the group_replication statement with the exact same message.

Workaround: Get the mysql password, stop group replication on the failed unit, and add it back to the cluster.

1. juju run --unit mysql-innodb-cluster/<leader> leader-get
2. juju ssh <stuck_unit>
3. mysql -u root -p
4. [mysql]> stop group_replication;
5. juju run-action mysql-innodb-cluster/<leader> add-instance address="<stuck_unit_ip>"
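
For anyone doing the recovery by hand with mysql-shell rather than via the charm action, a rough equivalent of steps 4-5 (a sketch only, run with `mysqlsh --py` as a user with cluster-admin rights; the credentials and addresses are placeholders to fill in):

    # Sketch only; `shell` and `dba` are globals inside `mysqlsh --py`.
    # Replace the user, password and <...> placeholders with real values.
    stuck = shell.connect('clusteradmin:password@<stuck_unit_ip>:3306')
    stuck.run_sql('STOP GROUP_REPLICATION')            # step 4
    shell.connect('clusteradmin:password@<leader_ip>:3306')
    cluster = dba.get_cluster()
    cluster.add_instance('clusteradmin:password@<stuck_unit_ip>:3306',
                         {'recoveryMethod': 'clone'})  # roughly what the add-instance action drives (step 5)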

Revision history for this message
Zhanglei Mao (zhanglei-mao) wrote :

Got the same issue and the workaround in #9 works fine. By the way, to get the mysql password, you can run:
 juju run --unit mysql-innodb-cluster/leader leader-get

Revision history for this message
Przemyslaw Lal (przemeklal) wrote :

I hit a similar issue that might be related, but on an already-deployed cluster after a couple of days; please have a look at lp:1917332.

David Ames (thedac)
Changed in charm-mysql-innodb-cluster:
milestone: none → 21.04
Revision history for this message
David Ames (thedac) wrote :

This bug is being resolved in https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1912688; marking this as a duplicate.
