adding a new unit randomly fails on "configure_instance" with "Host is not allowed to connect"

Bug #2013078 reported by Rodrigo Barbieri
This bug affects 2 people
Affects                     Status   Importance  Assigned to  Milestone
MySQL InnoDB Cluster Charm  Triaged  Medium      Unassigned

Bug Description

After a fresh 3-unit cluster has deployed successfully, adding a 4th unit intermittently fails with the following error message:

INFO unit.mysql/2.juju-log server.go:327 cluster:1: Configuring instance for clustering: 10.5.3.181.
ERROR unit.mysql/2.juju-log server.go:327 cluster:1: Failed configuring instance 10.5.3.181: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (1130): Dba.configure_instance: Host '10.5.0.75' is not allowed to connect to this MySQL server

The error is ignored because of the return at [1], which makes the function end successfully. The leader unit remains healthy and error-free, but the newly added unit stays in the status "Instance not yet configured for clustering" because the leader-set at [2] never runs.

Since no hook fails, the operation is not retried, update-status does not retry it either, and the 4th instance remains in "Instance not yet configured for clustering".

If a leader-set command is run manually (in place of the one skipped at [2]) to mark the instance as configured, the status changes to "Instance not yet in the cluster": the 4th unit does not appear in cluster-status because the next method, "add_instance_to_cluster", also fails with the same error ("not allowed") and returns at [3].

[1] https://github.com/openstack/charm-mysql-innodb-cluster/blob/c43ae5147e0303c04fe70d9bd0aaa4bd696939f1/src/lib/charm/openstack/mysql_innodb_cluster.py#L667

[2] https://github.com/openstack/charm-mysql-innodb-cluster/blob/c43ae5147e0303c04fe70d9bd0aaa4bd696939f1/src/lib/charm/openstack/mysql_innodb_cluster.py#L678

[3] https://github.com/openstack/charm-mysql-innodb-cluster/blob/c43ae5147e0303c04fe70d9bd0aaa4bd696939f1/src/lib/charm/openstack/mysql_innodb_cluster.py#L935
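
The control flow described above can be illustrated with a minimal Python sketch. This is not the charm's code: ScriptError, configure_instance_swallowing, configure_instance_strict, run_script, leader_set and the "configured-<address>" key are all hypothetical names chosen for illustration. It only shows why a bare return in the error handler leaves the hook green while the "configured" flag is never set, and what propagating the error would look like instead.

```python
import logging

log = logging.getLogger("sketch")


class ScriptError(Exception):
    """Hypothetical stand-in for a failed mysqlsh invocation."""


def configure_instance_swallowing(address, run_script, leader_set):
    """Pattern described in the report: log the failure, then return."""
    try:
        run_script(address)
    except ScriptError as exc:
        log.error("Failed configuring instance %s: %s", address, exc)
        return  # the caller sees success; leader_set below is skipped
    # Hypothetical key name, used only for this sketch.
    leader_set({"configured-" + address: True})


def configure_instance_strict(address, run_script, leader_set):
    """Alternative: re-raise so the hook or action visibly fails."""
    try:
        run_script(address)
    except ScriptError as exc:
        log.error("Failed configuring instance %s: %s", address, exc)
        raise
    leader_set({"configured-" + address: True})
```

With the first variant a failing run_script produces no exception and no leader_set call, which matches the "no hook failure, no retry" behaviour reported here. With the second, the same failure would surface as an error.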

Tags: sts
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Investigated further today, and comparing successful and failed deployments, I see the following differences on the 4th unit:

mysql-py []> \sql SHOW GLOBAL VARIABLES LIKE '%replication%'

The above command shows very different results between the two deployments. Then:

mysql-py []> \sql START GROUP_REPLICATION
ERROR: 3093: The START GROUP_REPLICATION command failed since the group is already running.

on the successful deployment ^.

mysql-py []> \sql START GROUP_REPLICATION
ERROR: 3092: The server is not configured properly to be an active member of the group. Please see more details on error log.

on the failed deployment ^. Looking at the logs:

[ERROR] [MY-010381] [Repl] Group Replication plugin is not installed.

mysql-py []> \sql SHOW PLUGINS

group_replication | ACTIVE | GROUP REPLICATION | group_replication.so | GPL

the line ^ is present on the successful deployment but not on the failed one
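
To compare several units quickly, the check above could be scripted. The sketch below is a hypothetical helper (plugin_status is not part of the charm or of mysqlsh) that assumes the SHOW PLUGINS output has been captured as pipe-separated text, like the row quoted above, and returns the Status column for a given plugin.

```python
def plugin_status(show_plugins_output, plugin_name="group_replication"):
    """Return the Status column for plugin_name, or None if it is not listed.

    Assumes pipe-separated rows such as:
    group_replication | ACTIVE | GROUP REPLICATION | group_replication.so | GPL
    """
    for line in show_plugins_output.splitlines():
        # Drop optional leading/trailing table borders, then split the cells.
        cells = [cell.strip() for cell in line.strip().strip("|").split("|")]
        if len(cells) >= 2 and cells[0] == plugin_name:
            return cells[1]
    return None
```

A return value of None corresponds to the failed deployment, where the group_replication row is missing entirely.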

Looking at the juju logs at TRACE debug level, I don't see any difference in the output when grepping -i for the word "group".

Another interesting fact is that when the 4th unit is being added, it automatically becomes a coordinator, invoking the coordinator layer and relation [1].

[1] https://github.com/juju/layer-index/blob/master/layers/coordinator.json

tags: added: sts
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Further investigation turned up a few extra things:

1) Retrying hooks and add-instance actions does not make the operation succeed; the same error "host not allowed to connect" persists.

2) I tried several times, in different ways, to enable/install the group_replication plugin, which for some strange reason was not installed on the failed unit. The only way I could get it enabled was by adding "plugin_load_add='group_replication.so'" to the /etc/mysql/mysql.conf.d/mysqld.cnf file.

3) Even after re-enabling the group_replication plugin, re-running hooks and actions still does not work; the same error "host not allowed to connect" persists. I searched the charm code and relation code for anything special being executed to add a "permission to connect", and I even ran the "update-unit-acl" action, but that didn't help either.
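
The manual workaround from item 2 could be sketched as a small text transformation. The helper name ensure_plugin_load_add is hypothetical, and the sketch assumes the option belongs under the [mysqld] section of mysqld.cnf. It works on the file contents as a string and leaves reading and writing the file, and restarting mysqld, to the operator.

```python
def ensure_plugin_load_add(cnf_text, plugin="group_replication.so"):
    """Return cnf_text with a plugin_load_add line under [mysqld].

    The text is returned unchanged if the exact setting is already present.
    """
    setting = "plugin_load_add='{}'".format(plugin)
    lines = cnf_text.splitlines()
    if any(line.strip() == setting for line in lines):
        return cnf_text

    out = []
    inserted = False
    for line in lines:
        out.append(line)
        # Insert right after the first [mysqld] section header.
        if not inserted and line.strip().lower() == "[mysqld]":
            out.append(setting)
            inserted = True
    if not inserted:
        # No [mysqld] section found: append one (an assumption of this sketch).
        out.extend(["[mysqld]", setting])
    return "\n".join(out) + "\n"
```

As noted in item 3, this only brings the plugin back; it does not resolve the "not allowed to connect" error.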

At this point I am quite clueless about how to proceed. I don't know what happens differently, and randomly, between a successful add-unit operation in one deployment and a failed one in another that could leave the group_replication plugin disabled and the host not allowed to connect.
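
As background for the "not allowed to connect" part: to my understanding, MySQL reports error 1130 when no account's host value matches the address the client connects from, and account host values may use the % and _ wildcards. The sketch below is a simplified, hypothetical model of that matching (host_is_allowed is not a MySQL or charm function). It ignores netmask notation and name resolution, and could be used to compare the denied address against the host values found in mysql.user on the new unit.

```python
import re


def host_pattern_to_regex(pattern):
    """Translate a MySQL-style host pattern (% and _ wildcards) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any sequence of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.IGNORECASE)


def host_is_allowed(client_host, account_hosts):
    """True if client_host matches at least one account host pattern."""
    return any(
        host_pattern_to_regex(pattern).match(client_host)
        for pattern in account_hosts
    )
```

For example, a client at 10.5.2.245 would not match a host list of "localhost" and "10.5.0.%" (hypothetical values), which is the kind of mismatch that would produce the error seen in this bug.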

However, what I do find very misleading while re-running all those actions is that add-instance never "fails" even though it never actually works. As can be seen at [1] and [3] in the bug description, it ignores the failures and moves on. If those failures are not ignored, the error is shown to the user, and the same charm status can be reached by running "juju resolved --no-retry" on the leader that has the failed hook (I tested it).

Alex Kavanagh (ajkavanagh) wrote :

I think this may be the same as https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/2015256 "During scale-out of cluster (zaza-openstack-tests) the leader fails to join in the new instance when related to prometheus", or at least along a similar line. There are some fixes going in around handling failures in the database and ensuring that the charm doesn't attempt to write to the db when it is partitioned from the cluster, which *may* fix this situation.

Changed in charm-mysql-innodb-cluster:
status: New → Triaged
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

@Alex, I haven't looked at the patches yet (I will!), but I just wanted to point out that no prometheus relation was present in the reproducer for this bug. So hopefully the 2nd part of your explanation is the one that addresses it. I will test your patches soon and try to reproduce the problem with them applied.

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I reproduced this today with latest/edge rev 52. The issue is still there. The steps to reproduce are simply to deploy 3 units of mysql-innodb-cluster and, once the cluster is idle, deploy a 4th unit. The issue is random, so I deployed several models in parallel: some of them hit the issue and others didn't. No prometheus unit is involved.

2023-05-29 13:48:11 INFO unit.mysql/2.juju-log server.go:316 cluster:0: Configuring instance for clustering: 10.5.4.1.
2023-05-29 13:48:11 ERROR unit.mysql/2.juju-log server.go:316 cluster:0: Failed configuring instance 10.5.4.1: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 1, in <module>
mysqlsh.DBError: MySQL Error (1130): Dba.configure_instance: Host '10.5.2.245' is not allowed to connect to this MySQL server

2023-05-29 13:48:12 INFO unit.mysql/2.juju-log server.go:316 cluster:0: Adding instance, 10.5.4.1, to the cluster.
2023-05-29 13:48:12 ERROR unit.mysql/2.juju-log server.go:316 cluster:0: Failed adding instance 10.5.4.1 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
ERROR: Unable to connect to the target instance '10.5.4.1:3306'. Please verify the connection settings, make sure the instance is available and try again.
Traceback (most recent call last):
  File "<string>", line 3, in <module>
mysqlsh.DBError: MySQL Error (1130): Cluster.add_instance: Could not open connection to '10.5.4.1:3306': Host '10.5.2.245' is not allowed to connect to this MySQL server
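
Since the issue only shows up in some of the parallel models, scanning each model's log for the denial can help sort failed deployments from successful ones. The sketch below is a hypothetical helper (find_denied_hosts is not an existing tool) that assumes the mysqlsh error lines have the shape shown above and extracts the failing operation together with the host that was refused.

```python
import re

# Matches lines such as:
# mysqlsh.DBError: MySQL Error (1130): Dba.configure_instance: Host '...' is not allowed to connect ...
DENIED = re.compile(
    r"MySQL Error \((?P<code>\d+)\): (?P<operation>[\w.]+): "
    r".*?Host '(?P<host>[^']+)' is not allowed to connect"
)


def find_denied_hosts(log_text):
    """Return (operation, denied_host) pairs for each error 1130 line."""
    results = []
    for line in log_text.splitlines():
        match = DENIED.search(line)
        if match and match.group("code") == "1130":
            results.append((match.group("operation"), match.group("host")))
    return results
```

Run against the excerpt above, this would report both Dba.configure_instance and Cluster.add_instance being refused for the same connecting host, 10.5.2.245.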

Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

I have also tried the add-instance and rejoin-instance actions to fix the issue, but I still get the same error and the 4th instance remains in the "Instance not yet configured for clustering" state.

Alex Kavanagh (ajkavanagh) wrote :

This is another bug related to adding/removing instances; as such, it should probably be grouped with the other one. The other bug is: https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1983158

Changed in charm-mysql-innodb-cluster:
importance: Undecided → Medium
Rodrigo Barbieri (rodrigo-barbieri2010) wrote :

Hi Alex. I'm not sure whether there is a misunderstanding on your end, but I strongly disagree that those 2 issues are related.

This bug is not about removing an instance and adding it back to the cluster; it is about scaling the cluster up by adding brand NEW units. It is also random, while the other is reproduced more consistently. This one is also unrelated to flags (at least on the surface, there is nothing that setting flags would fix) and has a clear error message in the logs: Host '10.5.2.245' is not allowed to connect to this MySQL server

The other, by contrast, is more clearly a flag-related problem.

Alex Kavanagh (ajkavanagh) wrote :

They are related, as they are both about adding and removing instances.
