From keystone/1:

(keystone.server.flask.request_processing.middleware.auth_context): 2023-04-01 08:36:25,698 ERROR (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') (Background on this error at: https://sqlalche.me/e/14/e3q8)
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1802, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 148, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 310, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 548, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 775, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1163, in read
    self._read_result_packet(first_packet)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1236, in _read_result_packet
    self._read_rowdata_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1270, in _read_rowdata_packet
    packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 725, in _read_packet
    packet.raise_for_error()
  File "/usr/lib/python3/dist-packages/pymysql/protocol.py", line 221, in raise_for_error
    err.raise_mysql_exception(self._data)
  File "/usr/lib/python3/dist-packages/pymysql/err.py", line 143, in raise_mysql_exception
    raise errorclass(errno, errval)
pymysql.err.OperationalError: (1053, 'Server shutdown in progress')

So "Server shutdown in progress" is a bit ominous.

From mysql-innodb-cluster/0 at the same time:

2023-04-01T08:36:23.178312Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.61:3306'
2023-04-01T08:36:23.178353Z 0 [System] [MY-011500] [Repl] Plugin group_replication reported: 'Primary server with address 192.168.33.61:3306 left the group. Electing new Primary.'
2023-04-01T08:36:23.178515Z 0 [System] [MY-011507] [Repl] Plugin group_replication reported: 'A new primary with address 192.168.33.83:3306 was elected. The new primary will execute all previous group transactions before allowing writes.'
2023-04-01T08:36:23.178770Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.83:3306, 192.168.33.84:3306 on view 16803355385089891:10.'
2023-04-01T08:36:23.179472Z 68 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-04-01T08:36:23.179579Z 68 [System] [MY-011511] [Repl] Plugin group_replication reported: 'This server is working as secondary member with primary member address 192.168.33.83:3306.'
2023-04-01T08:37:31.674902Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 192.168.33.61:3306, 192.168.33.83:3306, 192.168.33.84:3306 on view 16803355385089891:11.'
2023-04-01T08:37:36.190735Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 192.168.33.61:3306 was declared online within the replication group.'
2023-04-01T08:38:15.493860Z 0 [Warning] [MY-011499] [Repl] Plugin group_replication reported: 'Members removed from the group: 192.168.33.83:3306'
2023-04-01T08:38:15.493903Z 0 [System] [MY-011500] [Repl] Plugin group_replication reported: 'Primary server with address 192.168.33.83:3306 left the group. Electing new Primary.'

i.e. members left the cluster and then rejoined during the attempted keystone-manager.py connection (the log lines above are from mysql-innodb-cluster/0, whose local address is 192.168.33.84). With the cluster temporarily at 2 units it would go RO until the newly elected primary had applied all previous group transactions, and an out-of-cluster unit would refuse connections if that was the unit the mysql-router was connected to.

Why did the members leave the cluster and then rejoin? mysql-innodb-cluster/0 appeared to be handling a certificates-relation-changed hook from vault/0 at the same time:

2023-04-01 08:36:07 INFO unit.mysql-innodb-cluster/0.juju-log server.go:316 certificates:454: Reactive main running for hook certificates-relation-changed

and from the machine.lock:

2023-04-01 08:36:15 unit-mysql-innodb-cluster-0: mysql-innodb-cluster/0 uniter (run relation-changed (454; unit: vault/0) hook), waited 0s, held 9s

--

So the cluster dropped out due to vault providing certificates and the mysql-innodb-cluster units restarting. It then took roughly 2 minutes 30 seconds to re-form the cluster, by which time the keystone unit had run out of retries and given up connecting to it.

Additionally, mysql-innodb-cluster/1 also dropped out at 08:36:

2023-04-01T08:36:22.413986Z 0 [System] [MY-013172] [Server] Received SHUTDOWN from user . Shutting down mysqld (Version: 8.0.32-0ubuntu0.20.04.2).

And mysql-innodb-cluster/2:

2023-04-01T08:37:20.236588Z 0 [System] [MY-010910] [Server] /usr/sbin/mysqld: Shutdown complete (mysqld 8.0.32-0ubuntu0.20.04.2) (Ubuntu).
...
2023-04-01T08:39:28.616528Z 0 [System] [MY-011492] [Repl] Plugin group_replication reported: 'The member with address 192.168.33.83:3306 was declared online within the replication group.'

So the cluster was still going up and down until 08:39:28.

--

Summary & thoughts
------------------

3 minutes is actually a long time for the mysql servers to update certificates and re-join the cluster. Essentially, the cluster was offline for 3 minutes; during that window no other hook in the system that needed to perform a mysql database write could succeed, and any such hook would have to be deferred.

It may be that the system was being deployed in a resource-constrained environment, and thus the timeouts on database access were exceeded.

keystone (as a classic charm) either has to wait longer in the hook, or hook retries have to be used so that the failing hook is retried until it succeeds - both are suboptimal (see the retry sketch below).

Alternatively, the hook could be deferred (see the deferral sketch below); in a reactive or ops-framework charm this is reasonably easy to achieve; unfortunately, classic charms do not have this capability, and there are no (current) plans to add it.

The mysql-innodb-cluster charm could defer all router-db relations until it has achieved a TLS-enabled cluster, if vault is related to it; this would prevent any other related charm from attempting to access the database until the cluster has formed AND been upgraded to TLS (if vault is related). This might be achieved using goal-state to decide whether certificates are going to arrive (see the goal-state sketch below).

As a mitigation strategy, it may be preferable to deploy vault + mysql-innodb-cluster *first*, wait for them to settle into a cluster with TLS enabled (see the settling sketch below), and then do the rest of the deployment. This would at least work around the race hazard of the mysql server bootstrapping itself into TLS mode.
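--

Retry sketch: a minimal, illustrative way of riding out a ~3 minute cluster reformation by retrying the database operation up to a deadline. This is not the keystone charm's actual code; the DSN, retry budget and interval are assumptions.

    # Minimal sketch (not keystone/charm code): retry a DB operation long
    # enough to ride out a ~3 minute cluster reformation. The DSN, budget
    # and interval are illustrative assumptions.
    import time

    import sqlalchemy
    from sqlalchemy import text
    from sqlalchemy.exc import OperationalError

    DSN = "mysql+pymysql://keystone:secret@127.0.0.1:3306/keystone"  # assumed
    RETRY_BUDGET = 300    # seconds; longer than the observed ~3 minute outage
    RETRY_INTERVAL = 10   # seconds between attempts

    def run_with_retry(statement):
        # pool_pre_ping drops stale connections such as the one that hit
        # (2013, 'Lost connection to MySQL server during query') above.
        engine = sqlalchemy.create_engine(DSN, pool_pre_ping=True)
        deadline = time.monotonic() + RETRY_BUDGET
        while True:
            try:
                with engine.connect() as conn:
                    return conn.execute(text(statement)).fetchall()
            except OperationalError:
                # Covers both 2013 'Lost connection ...' and
                # 1053 'Server shutdown in progress' from the traceback above.
                if time.monotonic() >= deadline:
                    raise
                time.sleep(RETRY_INTERVAL)

    if __name__ == "__main__":
        print(run_with_retry("SELECT 1"))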
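Deferral sketch: a minimal ops-framework example of deferring an event until the database becomes reachable again (keystone itself is a classic charm, so this does not apply to it directly). The _database_ready() and _write_to_database() helpers are hypothetical placeholders, not any real charm's API.

    # Minimal ops-framework sketch of deferring an event until the database
    # is reachable again. _database_ready()/_write_to_database() are
    # hypothetical placeholders.
    import ops

    class ExampleCharm(ops.CharmBase):
        def __init__(self, framework):
            super().__init__(framework)
            framework.observe(self.on.config_changed, self._on_config_changed)

        def _on_config_changed(self, event):
            if not self._database_ready():
                # Re-queue this event; the framework re-emits deferred events
                # at the start of the next hook invocation, by which time the
                # cluster may have re-formed.
                event.defer()
                return
            self._write_to_database()

        def _database_ready(self):
            return True  # placeholder: e.g. try a cheap query via mysql-router

        def _write_to_database(self):
            pass  # placeholder: the actual database write

    if __name__ == "__main__":
        ops.main(ExampleCharm)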
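Goal-state sketch: a rough illustration of how the mysql-innodb-cluster charm might decide whether to hold back router-db relation data, assuming it asks Juju's goal-state hook tool whether a certificates relation is expected. The 'certificates' relation name and the cluster_formed/tls_ready flags are assumptions, not the charm's actual implementation.

    # Rough sketch, assuming the charm consults the 'goal-state' hook tool
    # (only available inside a hook) to see whether a certificates relation
    # (e.g. to vault) is expected. Names and flags here are assumptions.
    import json
    import subprocess

    def goal_state():
        out = subprocess.check_output(["goal-state", "--format=json"])
        return json.loads(out)

    def certificates_expected():
        # If a 'certificates' relation appears in goal-state, a TLS provider
        # such as vault is going to send certificates eventually.
        return "certificates" in goal_state().get("relations", {})

    def ok_to_publish_router_db(cluster_formed, tls_ready):
        # Hold back router-db relation data until the cluster has formed,
        # and additionally until TLS is configured when certificates are
        # expected.
        if not cluster_formed:
            return False
        if certificates_expected() and not tls_ready:
            return False
        return True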
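Settling sketch: one way to implement the "deploy vault + mysql-innodb-cluster first" ordering is to poll 'juju status --format=json' until both applications report an active status before deploying the rest of the model. The application names and the simple 'active' check are assumptions about the particular deployment.

    # Sketch of waiting for vault + mysql-innodb-cluster to settle before
    # deploying the rest of the model. Application names and the 'active'
    # check are assumptions about the particular deployment.
    import json
    import subprocess
    import time

    APPS = ("vault", "mysql-innodb-cluster")

    def settled(status, name):
        app = status.get("applications", {}).get(name, {})
        return app.get("application-status", {}).get("current") == "active"

    def wait_until_settled(timeout=1800, interval=30):
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            status = json.loads(
                subprocess.check_output(["juju", "status", "--format=json"]))
            if all(settled(status, app) for app in APPS):
                return
            time.sleep(interval)
        raise TimeoutError("vault/mysql-innodb-cluster did not settle in time")

    if __name__ == "__main__":
        wait_until_settled()
        # ...now deploy keystone and the rest of the applications.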