Bug #1926449 “Deployment blocked with “Cluster is inaccessible f...” : Bugs : MySQL InnoDB Cluster Charm

Revision history for this message

David Ames (thedac) wrote on 2021-04-28:

#1

Download full text (14.7 KiB)

Initial log parsing

The cluster comes up fine before vault sends out certificates. Once the cluster receives certificates the cluster goes into a rolling restart as expected.

However, the first node to restart reports problems immediately with "Failed to stop the group replication applier thread" during shutdown

Here is a sorted (interleaved grep from the mysql error logs:
Formatted version: https://pastebin.ubuntu.com/p/xf572df7vD/

mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:44.880786Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-0ubuntu0.20.04.1).:
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:45.073488Z 11 [ERROR] [MY-011535] [Repl] Plugin group_replication reported: 'Failed to stop the group replication applier thread.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:20:48.510785Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.54
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:20:48.515514Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:20:48.526308Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.54
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:20:48.527036Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:50.688199Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:50.695954Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:56.752996Z 0 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:58.223924Z 10 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.::
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:21:00.623164Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:21:00.626408Z 2 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.168.33.52:3306.'
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:21:00.630934Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.1...

Initial log parsing

The cluster comes up fine before vault sends out certificates. Once the cluster receives certificates the cluster goes into a rolling restart as expected.

However, the first node to restart reports problems immediately with "Failed to stop the group replication applier thread" during shutdown

Here is a sorted (interleaved grep from the mysql error logs:
Formatted version: https://pastebin.ubuntu.com/p/xf572df7vD/

mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:44.880786Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-0ubuntu0.20.04.1).:
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:45.073488Z 11 [ERROR] [MY-011535] [Repl] Plugin group_replication reported: 'Failed to stop the group replication applier thread.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:20:48.510785Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.54
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:20:48.515514Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:20:48.526308Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.54
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:20:48.527036Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:50.688199Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:50.695954Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:56.752996Z 0 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:20:58.223924Z 10 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.::
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:21:00.623164Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:21:00.626408Z 2 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.168.33.52:3306.'
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:21:00.630934Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:21:01.649666Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:21:01.655107Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:18.998153Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-0ubuntu0.20.04.1).:
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:19.114028Z 11 [ERROR] [MY-011535] [Repl] Plugin group_replication reported: 'Failed to stop the group replication applier thread.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:22:22.124782Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:22:22.124896Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.54
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:22:22.125859Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.55
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:22:22.125958Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.54
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:24.242854Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:24.246264Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:28.327116Z 0 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:28.968854Z 10 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.::
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:22:30.946956Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:30.948047Z 2 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.168.33.52:3306.'
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:22:30.950223Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:31.950612Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Incremental recovery from a group donor'
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:31.952268Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:22:49.088142Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user <via user signal>. Shutting down mysqld (Version: 8.0.23-0ubuntu0.20.04.1).:
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:22:52.641130Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.52
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:22:52.641166Z 0 [System] [MY-011500] [Repl] Plugin group_replication reported: 'Primary server with address 192.168.33.52:3306 left the group. Electing new Primary.'
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:52.641220Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.52
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:22:52.641234Z 0 [Warning] [MY-011512] [Repl] Plugin group_replication reported: 'Unable to set any member as primary. No suitable candidate.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:52.641267Z 0 [System] [MY-011500] [Repl] Plugin group_replication reported: 'Primary server with address 192.168.33.52:3306 left the group. Electing new Primary.'
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:52.641417Z 0 [Warning] [MY-011512] [Repl] Plugin group_replication reported: 'Unable to set any member as primary. No suitable candidate.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:22:53.642661Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.55:3306, 192.168.33.54
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:22:53.643062Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.55:3306, 192.168.33.54
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:22:54.768072Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:22:54.771359Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.':
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:22:59.568274Z 0 [System] [MY-013587] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' is starting.':
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:23:00.368795Z 19 [System] [MY-010597] [Repl] 'CHANGE MASTER TO FOR CHANNEL 'group_replication_applier' executed'. Previous state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='<NULL>', master_port= 0, master_log_file='', master_log_pos= 4, master_bind=''.::
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:23:02.268857Z 0 [ERROR] [MY-011526] [Repl] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: 5ddbba38-a808-11eb-91b0-00163e82404d
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:23:02.268984Z 0 [ERROR] [MY-011522] [Repl] Plugin group_replication reported: 'The member contains transactions not present in the group. The member will now exit the group.':
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:23:02.269024Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:23:02.269556Z 0 [Warning] [MY-011512] [Repl] Plugin group_replication reported: 'Unable to set any member as primary. No suitable candidate.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:23:02.271344Z 0 [Warning] [MY-011512] [Repl] Plugin group_replication reported: 'Unable to set any member as primary. No suitable candidate.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:23:03.275743Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:23:03.280878Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.52:3306, 192.168.33.55
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:23:04.291181Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.52
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:23:04.291281Z 0 [Warning] [MY-011512] [Repl] Plugin group_replication reported: 'Unable to set any member as primary. No suitable candidate.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:23:04.293165Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.52
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:23:04.293364Z 0 [Warning] [MY-011512] [Repl] Plugin group_replication reported: 'Unable to set any member as primary. No suitable candidate.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:23:05.294435Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.55:3306, 192.168.33.54
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:23:05.294855Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.55:3306, 192.168.33.54
mysql-innodb-cluster_2/var/log/mysql/error.log 2021-04-28T11:23:06.224372Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:33:42.076811Z 21 [ERROR] [MY-011574] [Repl] Plugin group_replication reported: 'Maximum number of retries when trying to connect to a donor reached. Aborting group replication incremental recovery.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:33:42.076917Z 21 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.':
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:33:42.077014Z 21 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:33:43.994039Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.54
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:33:44.995974Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.55:3306 on view 16196040323665586
mysql-innodb-cluster_0/var/log/mysql/error.log 2021-04-28T11:33:45.459239Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:34:04.977160Z 22 [ERROR] [MY-011575] [Repl] Plugin group_replication reported: 'All donors left. Aborting group replication incremental recovery.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:34:04.977321Z 22 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:34:04.978371Z 22 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.':
mysql-innodb-cluster_1/var/log/mysql/error.log 2021-04-28T11:34:08.296074Z 0 [System] [MY-011504] [Repl] Plugin group_replication reported: 'Group membership changed: This member has left the group.'

Things I noticed:
The leader node went last (but that *should* not matter

This could be an asynchronous process in mysql that is racing the charm.

I'll keep digging.

Revision history for this message

David Ames (thedac) wrote on 2021-04-28:

#2

Confirmed the leader restarting last is a non-issue by manually testing.

The root cause appears to be the failure during shutdown: "Failed to stop the group replication applier thread."

Unfortunately, the mysql documentation is sparse beyond acknowledging this is an error [0] that exists.

[0] https://dev.mysql.com/doc/mysql-errors/8.0/en/server-error-reference.html#error_er_grp_rpl_applier_thd_stop_error

Revision history for this message

Alex Kavanagh (ajkavanagh) wrote on 2023-05-31:

#3

Setting to incomplete as it's been over 2 years since the bug was reported. It's possible that the bug has been fixed in more recent versions of the charm. This bug will time out in about 2 months.

Changed in charm-mysql-innodb-cluster:
status:	New → Incomplete

Revision history for this message

Moises Emilio Benzan Mora (moisesbenzan) wrote on 2023-06-26:

#4

Seen once more this week in the following two runs:
- https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a
- https://solutions.qa.canonical.com/testruns/5f64870e-6428-4701-b97c-5f13044b7bcb

Artifacts for those can be found at the bottom of the page on each of those runs.

Pd: Over the last year we have seen this bug numerous times, see https://solutions.qa.canonical.com/bugs/1926449, for all occurrences and artifacts.

Changed in charm-mysql-innodb-cluster:
status:	Incomplete → New

Alex Kavanagh (ajkavanagh) on 2023-06-26

Changed in charm-mysql-innodb-cluster:
status:	New → Incomplete

Revision history for this message

Alex Kavanagh (ajkavanagh) wrote on 2023-06-26:

#5

Please provide the juju log files, and mysql/error.log file, from an instance of the the mysql-innodb-cluster where this has happened, and not just a link to the sqa test run. Although the links are useful, it's much more useful to have the logs actually in the bug for future reference. Thanks.

Revision history for this message

Jeffrey Chang (modern911) wrote on 2023-07-19:

#6

Download full text (4.3 KiB)

This is from latest occurrence, https://oil-jenkins.canonical.com/artifacts/785b0abe-24e7-4075-a046-70bc44ecfa39/generated/generated/openstack/juju-crashdump-openstack-2023-07-18-16.25.40.tar.gz

From 9ffd5995-11e6-443d-b2c5-c6e1235e7de9/0/lxd/8/var/log/juju/unit-mysql-innodb-cluster-0.log

We saw many occurrences of this generic error.
2023-07-18 15:35:22 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
File "<string>", line 2, in <module>
mysqlsh.Error: Shell Error (51314): Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)

This is kind new to me.
2023-07-18 15:44:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action reboot-cluster-from-complete-outage
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:380 running action "reboot-cluster-from-complete-outage" on 1
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server {unix @/var/lib/juju/agents/unit-mysql-innodb-cluster-0/agent.socket <nil>}
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Loading state
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Leader handling coordinator requests
2023-07-18 15:44:14 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Failed rebooting from complete outage: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Restoring the default cluster from complete outage...

Traceback (most recent call last):
File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The active session instance (10.246.170.29:3306) isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. Please use the most up to date instance: '10.246.168.165:3306'.

From 9ffd5995-11e6-443d-b2c5-c6e1235e7de9/0/lxd/8/var/log/mysql/error.log
2023-07-18T13:24:05.812325Z 14 [ERROR] [MY-011685] [Repl] Plugin group_replication reported: 'The group_replication_group_name option is mandatory'
2023-07-18T13:24:05.812350Z 14 [ERROR] [MY-011660] [Repl] Plugin group_replication reported: 'Unable to start Group Replication on boot'

2023-07-18T15:21:42.644552Z 539422 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 539422'
2023-07-18T15:21:42.644562Z 274917 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 274917'
2023-07-18T15:21:42.644569Z 539422 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2023-07-18T15:21:42.644582Z 274917 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2023-07-18T15:21:42.647084Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'

202...

This is from latest occurrence, https://oil-jenkins.canonical.com/artifacts/785b0abe-24e7-4075-a046-70bc44ecfa39/generated/generated/openstack/juju-crashdump-openstack-2023-07-18-16.25.40.tar.gz

From 9ffd5995-11e6-443d-b2c5-c6e1235e7de9/0/lxd/8/var/log/juju/unit-mysql-innodb-cluster-0.log

We saw many occurrences of this generic error.
2023-07-18 15:35:22 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
mysqlsh.Error: Shell Error (51314): Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)

This is kind new to me.
2023-07-18 15:44:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action reboot-cluster-from-complete-outage
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:380 running action "reboot-cluster-from-complete-outage" on 1
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server  {unix @/var/lib/juju/agents/unit-mysql-innodb-cluster-0/agent.socket <nil>}
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Loading state
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Leader handling coordinator requests
2023-07-18 15:44:14 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Failed rebooting from complete outage: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Restoring the default cluster from complete outage...

Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The active session instance (10.246.170.29:3306) isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. Please use the most up to date instance: '10.246.168.165:3306'.

From 9ffd5995-11e6-443d-b2c5-c6e1235e7de9/0/lxd/8/var/log/mysql/error.log
2023-07-18T13:24:05.812325Z 14 [ERROR] [MY-011685] [Repl] Plugin group_replication reported: 'The group_replication_group_name option is mandatory'
2023-07-18T13:24:05.812350Z 14 [ERROR] [MY-011660] [Repl] Plugin group_replication reported: 'Unable to start Group Replication on boot'

2023-07-18T15:21:42.644552Z 539422 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 539422'
2023-07-18T15:21:42.644562Z 274917 [ERROR] [MY-011615] [Repl] Plugin group_replication reported: 'Error while waiting for conflict detection procedure to finish on session 274917'
2023-07-18T15:21:42.644569Z 539422 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2023-07-18T15:21:42.644582Z 274917 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2023-07-18T15:21:42.647084Z 0 [System] [MY-011651] [Repl] Plugin group_replication reported: 'Plugin 'group_replication' has been stopped.'

2023-07-18T15:35:42.614671Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.165:33061 when joining a group. My local port is: 33061.'
2023-07-18T15:35:42.614830Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.171:33061 when joining a group. My local port is: 33061.'
2023-07-18T15:35:42.616817Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error on opening a connection to peer node 10.246.168.165:33061 when joining a group. My local port is: 33061.'
2023-07-18T15:35:42.616842Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Error connecting to all peers. Member join failed. Local port: 33061'
2023-07-18T15:35:42.682630Z 0 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 33061'
2023-07-18T15:35:55.933788Z 4 [ERROR] [MY-011640] [Repl] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2023-07-18T15:35:55.933853Z 4 [ERROR] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'

Changed in charm-mysql-innodb-cluster:
status:	Incomplete → New

Revision history for this message

Alex Kavanagh (ajkavanagh) wrote on 2023-07-21:

#7

I spent a while looking at the logs, and as far as I can tell, this is due to a complete outage of the mysql cluster (which is part of the test). The action to restore the cluster is to run "reboot-cluster-from-complete-outage" on the unit that has the GTID superset of transactions.

According to the guide [1] this action is run on any node, and then either the cluster reboots OR it tells you which node to pick. From the logs (from the crashdump):

ag reboot-cluster-from-complete-outage
0/lxd/8/juju-show-status-log/mysql-innodb-cluster_0
19:18 Jul 2023 15:44:12Z juju-unit executing running action reboot-cluster-from-complete-outage

0/lxd/8/var/log/juju/unit-mysql-innodb-cluster-0.log
76723:2023-07-18 15:44:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action reboot-cluster-from-complete-outage
76724:2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:380 running action "reboot-cluster-from-complete-outage" on 1
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server {unix @/var/lib/juju/agents/unit-mysql-innodb-cluster-0/agent.socket <nil>}
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Loading state
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Leader handling coordinator requests
2023-07-18 15:44:14 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Failed rebooting from complete outage: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Restoring the default cluster from complete outage...

Traceback (most recent call last):
File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The active session instance (10.246.170.29:3306) isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. Please use the most up to d
ate instance: '10.246.168.165:3306'.

e.g. the node the test ran the action on wasn't the latest and thus replied with the correct unit IP address. However, the logs then don't show any other action on the other units.

I suspect that the test needs to be updated to detect this condition and then select the correct unit.

I'm setting this to invalid; however, if further information indicates that the above assessment isn't correct then please re-open the bug.

Note: it could be argued that the charms should then chat amongst themselves and run the action on the corresponding unit that has the GTID; that would be a feature request as currently the charms don't provide that feature. If that feature is desired then please open a NEW bug with the feature request.

[1] https://docs.openstack.org/charm-guide/latest/admin/managing-power-events.html#mysql-innodb-cluster

I spent a while looking at the logs, and as far as I can tell, this is due to a complete outage of the mysql cluster (which is part of the test).  The action to restore the cluster is to run "reboot-cluster-from-complete-outage" on the unit that has the GTID superset of transactions.

According to the guide [1] this action is run on any node, and then either the cluster reboots OR it tells you which node to pick.  From the logs (from the crashdump):

ag reboot-cluster-from-complete-outage
0/lxd/8/juju-show-status-log/mysql-innodb-cluster_0
19:18 Jul 2023 15:44:12Z  juju-unit  executing    running action reboot-cluster-from-complete-outage

0/lxd/8/var/log/juju/unit-mysql-innodb-cluster-0.log
76723:2023-07-18 15:44:12 DEBUG juju.worker.uniter agent.go:20 [AGENT-STATUS] executing: running action reboot-cluster-from-complete-outage
76724:2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:380 running action "reboot-cluster-from-complete-outage" on 1
2023-07-18 15:44:12 DEBUG juju.worker.uniter.runner runner.go:728 starting jujuc server  {unix @/var/lib/juju/agents/unit-mysql-innodb-cluster-0/agent.socket <nil>}
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Loading state
2023-07-18 15:44:13 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 coordinator.DelayedActionCoordinator Leader handling coordinator requests
2023-07-18 15:44:14 ERROR unit.mysql-innodb-cluster/0.juju-log server.go:316 Failed rebooting from complete outage: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Restoring the default cluster from complete outage...

Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.reboot_cluster_from_complete_outage: The active session instance (10.246.170.29:3306) isn't the most updated in comparison with the ONLINE instances of the Cluster's metadata. Please use the most up to d
ate instance: '10.246.168.165:3306'.

e.g. the node the test ran the action on wasn't the latest and thus replied with the correct unit IP address.  However, the logs then don't show any other action on the other units.

I suspect that the test needs to be updated to detect this condition and then select the correct unit.

I'm setting this to invalid; however, if further information indicates that the above assessment isn't correct then please re-open the bug.

Note: it could be argued that the charms should then chat amongst themselves and run the action on the corresponding unit that has the GTID; that would be a feature request as currently the charms don't provide that feature.  If that feature is desired then please open a NEW bug with the feature request.

[1] https://docs.openstack.org/charm-guide/latest/admin/managing-power-events.html#mysql-innodb-cluster

Changed in charm-mysql-innodb-cluster:
status:	New → Invalid

Revision history for this message

Bas de Bruijne (basdbruijne) wrote on 2023-07-24:

#8

Thanks for the investigation. I'm afraid that the test run Jeffrey pointed out is a bit of a red herring, it had some manual intervention from me while it was running which has polluted the crashdumps.

I took a look at https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a, which has the following state:
=====
mysql-innodb-cluster/0 active idle 3/lxd/1 10.246.166.115 Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
  logrotated/10 active idle 10.246.166.115 Unit is ready.
mysql-innodb-cluster/1* active idle 4/lxd/1 10.246.167.161 Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
  logrotated/11 active idle 10.246.167.161 Unit is ready.
mysql-innodb-cluster/2 blocked idle 5/lxd/2 10.246.166.105 Cluster is inaccessible from this instance. Please check logs for details.
  logrotated/12 active idle 10.246.166.105 Unit is ready.
=====

The crashdumps can be downloaded here: https://oil-jenkins.canonical.com/artifacts/b49db99c-959f-417e-beca-cf4a2521709a/generated/generated/kubernetes-maas/juju-crashdump-kubernetes-maas-2023-06-24-19.11.12.tar.gz

Units 0 and 1 are clustered, but 2 did not join for some reason. In the crashdump, the juju relation between units 2 and 1 was clearly enabled.

The following message comes up a lot in the logs:
=====
2023-06-24 15:33:04 ERROR unit.mysql-innodb-cluster/2.juju-log server.go:325 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: This function is not available through a session to an instance belonging to an unmanaged replication group
=====

I don't see anything suspicious otherwise. There are some messages about a leader change, is it possible that this issue is due to an unfortunate timing of a leader change?

Thanks for the investigation. I'm afraid that the test run Jeffrey pointed out is a bit of a red herring, it had some manual intervention from me while it was running which has polluted the crashdumps.

I took a look at https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a, which has the following state:
=====
mysql-innodb-cluster/0                active    idle   3/lxd/1  10.246.166.115            Unit is ready: Mode: R/O, Cluster is ONLINE and can tolerate up to ONE failure.
  logrotated/10                       active    idle            10.246.166.115            Unit is ready.
mysql-innodb-cluster/1*               active    idle   4/lxd/1  10.246.167.161            Unit is ready: Mode: R/W, Cluster is ONLINE and can tolerate up to ONE failure.
  logrotated/11                       active    idle            10.246.167.161            Unit is ready.
mysql-innodb-cluster/2                blocked   idle   5/lxd/2  10.246.166.105            Cluster is inaccessible from this instance. Please check logs for details.
  logrotated/12                       active    idle            10.246.166.105            Unit is ready.
=====

The crashdumps can be downloaded here: https://oil-jenkins.canonical.com/artifacts/b49db99c-959f-417e-beca-cf4a2521709a/generated/generated/kubernetes-maas/juju-crashdump-kubernetes-maas-2023-06-24-19.11.12.tar.gz

Units 0 and 1 are clustered, but 2 did not join for some reason. In the crashdump, the juju relation between units 2 and 1 was clearly enabled.

The following message comes up a lot in the logs:
=====
2023-06-24 15:33:04 ERROR unit.mysql-innodb-cluster/2.juju-log server.go:325 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: This function is not available through a session to an instance belonging to an unmanaged replication group
=====

I don't see anything suspicious otherwise. There are some messages about a leader change, is it possible that this issue is due to an unfortunate timing of a leader change?

Changed in charm-mysql-innodb-cluster:
status:	Invalid → New

Revision history for this message

Jeffrey Chang (modern911) wrote on 2023-07-24:

#9

Please ignore comment #6, there's some human intervention to push through the test.

Checked other runs below and feel it's better to track in LP#1998356.
https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a
https://solutions.qa.canonical.com/testruns/5f64870e-6428-4701-b97c-5f13044b7bcb

Revision history for this message

Alex Kavanagh (ajkavanagh) wrote on 2023-07-25:

#10

> The following message comes up a lot in the logs:
=====
2023-06-24 15:33:04 ERROR unit.mysql-innodb-cluster/2.juju-log server.go:325 Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: This function is not available through a session to an instance belonging to an unmanaged replication group
=====

Yes, that's an unhandled issue on the 8.0/stable branch; there's a fix in progress on the master branch that will get backported.

Revision history for this message

Alex Kavanagh (ajkavanagh) wrote on 2023-07-25:

#11

> Checked other runs below and feel it's better to track in LP#1998356.
> https://solutions.qa.canonical.com/testruns/b49db99c-959f-417e-beca-cf4a2521709a
> https://solutions.qa.canonical.com/testruns/5f64870e-6428-4701-b97c-5f13044b7bcb

There don't appear to be any mysql logs/unit logs in the dumps; or I'm looking in the wrong place. Please could you link at least to the mysql/error.log and mysql unit logs for the failed runs.

Revision history for this message

Dagmawi Biru (dagbiru) wrote on 2023-07-26:

#12

- MySQL 8.0.33, yoga/stable

I'm also seeing this issue in my environment, being triggered soon after the initial setup of vault (unsealing + root ca generation):
---
Traceback (most recent call last):
File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: Unable to find a primary member in the cluster
---

and

---
unit-mysql-innodb-cluster-2: 20:31:05 ERROR unit.mysql-innodb-cluster/2.juju-log Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
File "<string>", line 2, in <module>
RuntimeError: Dba.get_cluster: Group replication does not seem to be active in instance '172.16.66.165:3306'
---

MySQL InnoDB Cluster Charm

Deployment blocked with "Cluster is inaccessible from this instance"

Bug Description

Other bug subscribers

Remote bug watches