During scale-out of cluster (zaza-openstack-tests) the leader fails to join in the new instance when related to prometheus

Bug #2015256 reported by Alex Kavanagh
Affects: MySQL InnoDB Cluster Charm
Status: Fix Committed
Importance: High
Assigned to: Alex Kavanagh

Bug Description

This appears to be an intermittent bug, and may be related to the other scale-out bugs. I think it may be due to the prometheus code, but I'm not entirely sure yet.

Juju status:

mysql-innodb-cluster/1* active idle 2 172.16.0.77 Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
mysql-innodb-cluster/3 blocked idle 6 172.16.0.213 Cluster is inaccessible from this instance. Please check logs for details.
mysql-innodb-cluster/4 active idle 7 172.16.0.113 Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.

(ignore unit /4 - it is added during test_802 AFTER test_801 has failed).

--

When adding a unit to a cluster that was reduced to 2 units, test_801_add_unit fails {1}. This is due to the following error in the mysql-innodb-cluster/1 unit log:

2023-04-03 16:06:03 ERROR unit.mysql-innodb-cluster/1.juju-log server.go:316 cluster:1: Failed adding instance 172.16.0.213 to cluster: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
WARNING: A GTID set check of the MySQL instance at '172.16.0.213:3306' determined that it contains transactions that do not originate from the cluster, which must be discarded before it can join the cluster.

172.16.0.213:3306 has the following errant GTIDs that do not exist in the cluster:
04f79476-d239-11ed-81bb-fa163e64798f:1-32

WARNING: Discarding these extra GTID events can either be done manually or by completely overwriting the state of 172.16.0.213:3306 with a physical snapshot from an existing cluster member. To use this method by default, set the 'recoveryMethod' option to 'clone'.

Having extra GTID events is not expected, and it is recommended to investigate this further and ensure that the data can be removed prior to choosing the clone recovery method.
Clone based recovery selected through the recoveryMethod option
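
For reference, the errant transactions that mysqlsh warns about can be confirmed by comparing gtid_executed on the joining instance against an existing cluster member. A minimal sketch using mysql-connector-python; the credentials are placeholders, not from this deployment:

# Sketch: confirm the errant GTIDs by comparing the joining instance's
# gtid_executed with a current cluster member's.  GTID_SUBTRACT(a, b)
# returns the transactions in `a` that are absent from `b`.
# Credentials are placeholders for illustration only.
import mysql.connector

CREDS = {"user": "clusteruser", "password": "secret"}  # placeholder creds


def gtid_executed(host):
    conn = mysql.connector.connect(host=host, **CREDS)
    try:
        cur = conn.cursor()
        cur.execute("SELECT @@GLOBAL.gtid_executed")
        return cur.fetchone()[0]
    finally:
        conn.close()


member_gtids = gtid_executed("172.16.0.77")      # existing R/W member
candidate_gtids = gtid_executed("172.16.0.213")  # unit trying to join

conn = mysql.connector.connect(host="172.16.0.77", **CREDS)
cur = conn.cursor()
cur.execute("SELECT GTID_SUBTRACT(%s, %s)", (candidate_gtids, member_gtids))
print("Errant GTIDs on 172.16.0.213:", cur.fetchone()[0])  # e.g. 04f79476-...:1-32
conn.close()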

--

I.e. it looks like some local activity has taken place on the new unit (172.16.0.213). I *think* it may be the prometheus hook, but I'm not entirely sure. From the mysql-innodb-cluster/3 unit log:

tracer: ++ queue handler reactive/prometheus_mysql_exporter_handlers.py:12:create_local_prometheus_exporter_user
tracer: ++ queue handler reactive/prometheus_mysql_exporter_handlers.py:201:prometheus_connected
tracer: ++ queue handler reactive/prometheus_mysql_exporter_handlers.py:68:snap_install_prometheus_exporter
2023-04-03 16:04:48 INFO unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Invoking reactive handler: hooks/relations/mysql-router/provides.py:47:joined:db-router
2023-04-03 16:04:49 INFO unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Invoking reactive handler: hooks/relations/mysql-router/provides.py:53:changed:db-router
2023-04-03 16:04:49 INFO unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Invoking reactive handler: reactive/prometheus_mysql_exporter_handlers.py:12:create_local_prometheus_exporter_user
2023-04-03 16:04:49 WARNING unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: This instance is not yet clustered: cannot determine the cluster status.
2023-04-03 16:04:49 WARNING unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Cannot determine the cluster primary RW node for writes.
2023-04-03 16:04:49 DEBUG unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Opening db connection for root@None
2023-04-03 16:04:49 DEBUG unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Revoke all privileges for use 'prom_exporter'@'172.16.0.213'
2023-04-03 16:04:49 DEBUG unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Grant PROCESS, REPLICATION CLIENT for use 'prom_exporter'@'172.16.0.213'
2023-04-03 16:04:49 DEBUG unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Revoke all privileges for use 'prom_exporter'@'localhost'
2023-04-03 16:04:49 DEBUG unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Grant PROCESS, REPLICATION CLIENT for use 'prom_exporter'@'localhost'
2023-04-03 16:04:49 DEBUG unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: tracer>
tracer: set flag local.prom-exporter.user-created
tracer: -- dequeue handler reactive/prometheus_mysql_exporter_handlers.py:12:create_local_prometheus_exporter_user
2023-04-03 16:04:49 INFO unit.mysql-innodb-cluster/3.juju-log server.go:316 prometheus:10: Create prometheus mysql exporter user in the mysql db

--

I think that if this is done with group replication 'on', then these statements are written to the binlog, and the unit won't cluster at that point as it is effectively split-brained.
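
To illustrate the mechanism only (this is not the charm's eventual fix): with GTIDs and binary logging enabled, any CREATE USER/GRANT/REVOKE run on the still-standalone instance is assigned a local GTID, which later shows up as errant. The usual way to keep purely local statements out of the binlog is to disable sql_log_bin for the session; a hedged sketch, with placeholder credentials:

# Sketch: keep purely local statements out of the binary log (and hence out
# of gtid_executed) by disabling sql_log_bin for the session.  This shows the
# general MySQL mechanism; it is NOT the approach the committed fix takes
# (the fix defers the user creation instead).  Credentials are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="root", password="secret")
cur = conn.cursor()

cur.execute("SET SESSION sql_log_bin = 0")   # statements below are not logged
cur.execute("CREATE USER IF NOT EXISTS 'prom_exporter'@'localhost' "
            "IDENTIFIED BY 'exporter-pass'")
cur.execute("GRANT PROCESS, REPLICATION CLIENT ON *.* "
            "TO 'prom_exporter'@'localhost'")
cur.execute("SET SESSION sql_log_bin = 1")   # restore normal logging

conn.close()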

From the error.log for /3

2023-04-03T16:05:55.654401Z 22 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.'
2023-04-03T16:05:55.660494Z 22 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv4 localhost address to the allowlist. It is mandatory that it is added.'
2023-04-03T16:05:55.660528Z 22 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Automatically adding IPv6 localhost address to the allowlist. It is mandatory that it is added.'
2023-04-03T16:05:55.673395Z 27 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.

--

I think the telltale is "master_log_pos= 4", which implies that something has been written to the master log; the log will then have some entries, and thus the unit won't be clustered in (at least I think that's what is happening).

So the culprit may be "create_local_prometheus_exporter_user", which seems to try to clear out a user and its grants in the local database?
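
If that is the case, the obvious mitigation (and roughly what the eventual fix does, per the review linked below) is to not touch the local database until the instance is actually a cluster member. A minimal sketch of such a guard in reactive style; instance_is_clustered() and create_exporter_user() are hypothetical stand-ins for the charm's real helpers, and the relation flag name is illustrative:

# Sketch of the guard: only create the exporter user once this unit is an
# online member of the InnoDB cluster, so the REVOKE/GRANT statements are
# replicated through the cluster rather than creating errant local GTIDs.
# instance_is_clustered() and create_exporter_user() are hypothetical
# helpers; the 'endpoint.prometheus.joined' flag name is illustrative.
from charms.reactive import when, when_not, set_flag


@when('endpoint.prometheus.joined')
@when_not('local.prom-exporter.user-created')
def create_local_prometheus_exporter_user():
    if not instance_is_clustered():
        # Not yet joined: do nothing now; the handler re-runs once the
        # unit has been added to the cluster.
        return
    create_exporter_user()
    set_flag('local.prom-exporter.user-created')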

--

{1} job output from zaza:

2023-04-03 15:57:40.721762 | focal-medium | 2023-04-03 15:57:40 [INFO] test_801_add_unit (zaza.openstack.charm_tests.mysql.tests.MySQLInnoDBClusterScaleTest)
2023-04-03 15:57:40.721791 | focal-medium | 2023-04-03 15:57:40 [INFO] Add mysql-innodb-cluster node.
2023-04-03 15:57:40.721805 | focal-medium | 2023-04-03 15:57:40 [INFO] ...
2023-04-03 15:57:40.721816 | focal-medium | 2023-04-03 15:57:40 [INFO] Wait till model is idle ...
2023-04-03 15:57:41.225888 | focal-medium | 2023-04-03 15:57:41 [INFO] Adding unit after removed unit ...
2023-04-03 15:57:41.637882 | focal-medium | 2023-04-03 15:57:41 [INFO] Wait until 3 units ...
2023-04-03 15:57:41.761492 | focal-medium | 2023-04-03 15:57:41 [INFO] Wait for application states ...
2023-04-03 15:57:41.762321 | focal-medium | 2023-04-03 15:57:41 [INFO] Waiting for application states to reach targeted states.
2023-04-03 15:57:41.763875 | focal-medium | 2023-04-03 15:57:41 [INFO] Waiting for an application to be present
2023-04-03 15:57:41.764417 | focal-medium | 2023-04-03 15:57:41 [INFO] Now checking workload status and status messages
2023-04-03 15:57:42.270052 | focal-medium | 2023-04-03 15:57:42 [INFO] Application prometheus2 is ready.
2023-04-03 15:57:42.276635 | focal-medium | 2023-04-03 15:57:42 [INFO] Application keystone is ready.
2023-04-03 15:57:42.283250 | focal-medium | 2023-04-03 15:57:42 [INFO] Application keystone-mysql-router is ready.
2023-04-03 15:57:42.294215 | focal-medium | 2023-04-03 15:57:42 [INFO] Application vault is ready.
2023-04-03 15:57:42.297183 | focal-medium | 2023-04-03 15:57:42 [INFO] Application vault-mysql-router is ready.
2023-04-03 16:07:41.863754 | focal-medium | 2023-04-03 16:07:41 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:17:42.029986 | focal-medium | 2023-04-03 16:17:42 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:27:42.037384 | focal-medium | 2023-04-03 16:27:42 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:37:42.335575 | focal-medium | 2023-04-03 16:37:42 [INFO] Applications left: mysql-innodb-cluster
2023-04-03 16:42:42.777747 | focal-medium | 2023-04-03 16:42:42 [INFO] TIMEOUT: Workloads didn't reach acceptable status:
2023-04-03 16:42:42.778121 | focal-medium | 2023-04-03 16:42:42 [INFO] Timed out waiting for 'mysql-innodb-cluster/3'. The workload status is 'waiting' which is not one of '['active']'
2023-04-03 16:42:42.778160 | focal-medium | 2023-04-03 16:42:42 [INFO] Timed out waiting for 'mysql-innodb-cluster/3'. The workload status message is 'Instance not yet in the cluster' which is not one of '['ready', 'Ready', 'Unit is ready']'
2023-04-03 16:42:42.779682 | focal-medium | 2023-04-03 16:42:42 [INFO] ERROR

summary: During scale-out of cluster (zaza-openstack-tests) the leader fails to
- join in the new instance
+ join in the new instance when related to prometheus
Changed in charm-mysql-innodb-cluster:
assignee: nobody → Alex Kavanagh (ajkavanagh)
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (master)
Changed in charm-mysql-innodb-cluster:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/879541
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/2b944827d5d2ac5cc529a689693aa62e1e11d57f
Submitter: "Zuul (22348)"
Branch: master

commit 2b944827d5d2ac5cc529a689693aa62e1e11d57f
Author: Alex Kavanagh <email address hidden>
Date: Wed Apr 5 11:50:47 2023 +0100

    Fix create_user() when unit is out of cluster

    When a unit is configured for the cluster but not yet joined, the
    create_user() function would write to the local database. This would
    then break the cluster; instead check for this condition and delay the
    user creation until the unit is clustered.

    Change-Id: I9badf28a22edede19302ff5115b1dee16f2b17f7
    Closes-Bug: #2015256

Changed in charm-mysql-innodb-cluster:
status: In Progress → Fix Committed