adding a new unit randomly fails on "configure_instance" with "Host is not allowed to connect"
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL InnoDB Cluster Charm | Triaged | Medium | Unassigned |
Bug Description
After successfully deploying a fresh 3-unit cluster, adding a 4th unit intermittently fails with the following error message:
INFO unit.mysql/
ERROR unit.mysql/
Traceback (most recent call last):
File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (1130): Dba.configure_
The error is ignored because of [1], where the return causes the function to end successfully. The leader unit remains healthy and shows no errors, but the newly added unit stays in status "Instance not yet configured for clustering" because the leader-set in [2] is never run.
Since no hook fails, the action is not retried. update-status does not retry it either, so the 4th instance remains in "Instance not yet configured for clustering".
If leader-set is run manually (in place of the skipped call at [2]) to mark the instance as configured, the status changes to "Instance not yet in the cluster": the 4th unit does not appear in cluster-status because the next method "add_instance_
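The failure mode described above can be sketched in a few lines. This is a minimal illustration, not the charm's actual code: DBError is a stand-in for mysqlsh.DBError, the function names and the leader-data key are hypothetical, and the "raising" variant only shows one possible alternative, on the assumption that a failed hook would be retried.

```python
class DBError(Exception):
    """Stand-in for mysqlsh.DBError (hypothetical, not the real class)."""


def failing_configure(address):
    # Simulates the shell call failing with the error code from this report.
    raise DBError("MySQL Error (1130): host is not allowed to connect")


def configure_swallowing(address, leader_data, configure=failing_configure):
    """Mirrors the reported behaviour: log the error, then return normally."""
    try:
        configure(address)
    except DBError as exc:
        print(f"ERROR {exc}")
        return  # the hook ends successfully, so nothing triggers a retry
    # Hypothetical key name; only reached when configure() succeeds.
    leader_data[f"configured-{address}"] = True


def configure_raising(address, leader_data, configure=failing_configure):
    """Alternative sketch: let the error propagate so the hook fails."""
    configure(address)
    leader_data[f"configured-{address}"] = True
```

In the swallowing variant the caller cannot tell the failure from a success, yet the flag is never set, which matches the unit staying in "Instance not yet configured for clustering".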
I investigated further today. Comparing successful and failed deployments, I see the following differences on the 4th unit:
mysql-py []> \sql SHOW GLOBAL VARIABLES LIKE '%replication%'
the above command shows very different results on the two deployments. Then:
mysql-py []> \sql START GROUP_REPLICATION
ERROR: 3093: The START GROUP_REPLICATION command failed since the group is already running.
on the successful deployment ^.
mysql-py []> \sql START GROUP_REPLICATION
ERROR: 3092: The server is not configured properly to be an active member of the group. Please see more details on error log.
on the failed deployment ^. Looking at the logs:
[ERROR] [MY-010381] [Repl] Group Replication plugin is not installed.
mysql-py []> \sql SHOW PLUGINS
group_replication | ACTIVE | GROUP REPLICATION | group_replication.so | GPL
the line ^ is present on the successful deployment but not on the failed one
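The manual comparison above could be turned into an automated check. The helper below is only a sketch: its name is hypothetical, and it assumes the SHOW PLUGINS rows have already been fetched as tuples whose first two fields are the plugin name and status, as in the output line quoted above.

```python
def group_replication_active(plugin_rows):
    """Return True if a row reports the group_replication plugin as ACTIVE.

    plugin_rows: iterable of tuples laid out like SHOW PLUGINS output,
    i.e. (name, status, type, library, license).
    """
    for row in plugin_rows:
        name, status = row[0], row[1]
        if name == "group_replication" and status == "ACTIVE":
            return True
    return False


# Example rows: the first list mimics the successful deployment.
good = [("group_replication", "ACTIVE", "GROUP REPLICATION",
         "group_replication.so", "GPL")]
bad = [("mysql_native_password", "ACTIVE", "AUTHENTICATION", None, "GPL")]
print(group_replication_active(good), group_replication_active(bad))
```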
Looking at the juju logs at TRACE level, I don't see any difference in the output when grepping case-insensitively (grep -i) for the word "group".
Another interesting fact is that when the 4th unit is being added, it automatically becomes a coordinator, invoking the coordinator layer and relation [1].
[1] https://github.com/juju/layer-index/blob/master/layers/coordinator.json