Hi, I've been going through the logs that were handed off internally[0] and found some unexpected failures that I believe could be affecting the behaviour of Magnum. I will list them first and then explain what each could mean separately:

1. barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen
2. pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
3. oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED [...]
4. ConnectionResetError: [Errno 104] Connection reset by peer
5. keystoneauth1.exceptions.http.Unauthorized: The request you have made requires authentication.

About (1), this error is present in the logs ~14k times[1]; the first occurrence is on January 27th and the last one in the log is on June 19th. The absence of a healthy Barbican service prevents Magnum from establishing a connection to k8s, since Barbican is where the secrets (e.g. private keys) are stored and read from. A reproduction sketch follows the traceback:

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/magnum/service/periodic.py", line 106, in _update_health_status
    monitor.poll_health_status()
  File "/usr/lib/python3/dist-packages/magnum/drivers/common/k8s_monitor.py", line 55, in poll_health_status
    k8s_api = k8s.create_k8s_api(self.context, self.cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 145, in create_k8s_api
    return K8sAPI(context, cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 114, in __init__
    self.cert_file) = create_client_files(cluster, context)
  File "/usr/lib/python3/dist-packages/magnum/conductor/handlers/common/cert_manager.py", line 159, in create_client_files
    magnum_cert.get_decrypted_private_key()))
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/cert_manager.py", line 46, in get_decrypted_private_key
    return operations.decrypt_key(self.get_private_key(),
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/barbican_cert_manager.py", line 52, in get_private_key
    return self._cert_container.private_key.payload
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 193, in payload
    self._fetch_payload()
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 271, in _fetch_payload
    payload = self._api._get_raw(payload_url, headers=headers)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 83, in _get_raw
    return self.request(path, 'GET', *args, **kwargs).content
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 63, in request
    self._check_status_code(resp)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 97, in _check_status_code
    raise exceptions.HTTPServerError(
barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen - please contact site administrator.
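To confirm the failure is on the Barbican side rather than in Magnum, the exact call that fails in the traceback (accessing Secret.payload, which lazily GETs the payload) can be reproduced in isolation. A minimal sketch, assuming valid credentials and a known secret href; every value below is a placeholder, not taken from the deployment:

    from keystoneauth1.identity import v3
    from keystoneauth1 import session
    from barbicanclient import client
    from barbicanclient import exceptions

    auth = v3.Password(auth_url='http://keystone:5000/v3',   # placeholder
                       username='admin', password='secret',  # placeholder credentials
                       project_name='admin',
                       user_domain_name='Default',
                       project_domain_name='Default')
    sess = session.Session(auth=auth)
    barbican = client.Client(session=sess)

    secret_ref = 'http://barbican:9311/v1/secrets/<uuid>'  # placeholder href
    try:
        secret = barbican.secrets.get(secret_ref)
        # reading .payload triggers the same GET that raises the 500 above
        print('payload length:', len(secret.payload))
    except exceptions.HTTPServerError as exc:
        print('server-side failure, check the Barbican service logs:', exc)

If this raises the same HTTPServerError for a secret that exists, the problem is entirely on the Barbican side, and its own logs should show the real stack trace behind "Secret payload retrieval failure".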
About (2), this is a better understood error in general: the database server dropped the client (in this case the magnum-conductor process). This can happen for numerous reasons, more data is needed to understand it, and it is out of the scope of this bug. You are most likely seeing this issue across all the control plane services; if you detect it exclusively in the Magnum service, please provide the data you used to reach that conclusion so we can look into why Magnum may be misbehaving. This error is less common in the logs[2], although almost all of the occurrences were in June[3]. A common client-side mitigation is sketched after the footnotes.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1245, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 581, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 706, in _read_bytes
    raise err.OperationalError(
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')

About (3), during June the RabbitMQ service was having issues and services couldn't connect to it[4][5]. RabbitMQ is a critical component for the control plane services (as important as MySQL). [Errno 111] ECONNREFUSED means nothing was accepting connections on the broker's port at that moment; a quick reachability probe is sketched after the footnotes.

About (4), the ConnectionResetError[6][7] may be the symptom of a more general error in the underlying network, or of a large performance degradation where services are not able to respond to queries on time: the client gives up and closes the connection (raising a timeout error on its side), and when the server later tries to write the response to the socket there is no one on the other side, so the write fails. A self-contained demonstration of this pattern is sketched after the footnotes.

About (5), I don't have enough logs to share insights yet. We do know that Keystone and Magnum are correctly configured, since you can launch k8s clusters, yet the periodic loop fails consistently[8]. I will be looking into Magnum's source code to understand the consequences of this scheduled task not being able to complete its job; a way to reproduce the failing token request in isolation is also sketched after the footnotes.

[0] 8508554a99e603a898dbe9b70cbc0c2d  magnum-conductor.log
[1] $ grep barbicanclient.exceptions.HTTPServerError magnum-conductor-error.log | wc -l
    14147
[2] $ grep pymysql.err.OperationalError magnum-conductor-error.log | wc -l
    373
[3] $ grep pymysql.err.OperationalError magnum-conductor-error.log | grep 2023-06 | wc -l
    363
[4] $ grep "Connection failed" magnum-conductor-error.log | grep rabbit | wc -l
    3099
[5] $ grep "Connection failed" magnum-conductor-error.log | grep rabbit | grep 2023-06 | wc -l
    3068
[6] $ grep "ConnectionResetError" magnum-conductor-error.log | wc -l
    305
[7] $ grep "ConnectionResetError" magnum-conductor-error.log | grep 2023-06 | wc -l
    301
[8] $ grep "keystoneauth1.exceptions.http.Unauthorized" magnum-conductor-error.log | wc -l
    90482
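For (2), a minimal sketch of the standard SQLAlchemy-level mitigation (this is not Magnum's actual engine setup, and the DSN is a placeholder): pre-pinging pooled connections and recycling them before the server's idle timeout makes the client replace dropped connections transparently instead of failing mid-query:

    from sqlalchemy import create_engine, text

    engine = create_engine(
        'mysql+pymysql://magnum:password@db-host/magnum',  # placeholder DSN
        pool_pre_ping=True,  # test each pooled connection with a lightweight ping before reuse
        pool_recycle=3600,   # retire connections before MySQL's wait_timeout drops them
    )
    with engine.connect() as conn:
        print(conn.execute(text('SELECT 1')).scalar())

This does not address the root cause (why the server dropped the client in the first place), but it makes the symptom far less disruptive.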
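For (3), [Errno 111] ECONNREFUSED means the TCP connection attempt was actively refused, i.e. nothing was listening on the broker's port. A trivial probe (hostname and port are placeholders) to distinguish "service down" from other failure modes:

    import socket

    try:
        socket.create_connection(('rabbitmq-host', 5672), timeout=5).close()
        print('broker port reachable')
    except ConnectionRefusedError:
        print('ECONNREFUSED: nothing listening (service down or not bound to this address)')
    except socket.timeout:
        print('timed out: host unreachable or traffic silently dropped (e.g. firewall)')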
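For (4), a self-contained demonstration of the pattern described above, using only the loopback interface (nothing from the deployment): the client gives up and closes the connection with a TCP RST, and the server's late write then fails with [Errno 104] on Linux:

    import socket
    import struct
    import threading
    import time

    def slow_server():
        srv = socket.socket()
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(('127.0.0.1', 9999))
        srv.listen(1)
        conn, _ = srv.accept()
        time.sleep(1)                       # simulate a server too slow to answer in time
        try:
            conn.sendall(b'late response')  # the peer has already reset the connection
        except ConnectionResetError as exc:
            print('server saw:', exc)       # [Errno 104] Connection reset by peer

    threading.Thread(target=slow_server, daemon=True).start()
    time.sleep(0.2)
    cli = socket.create_connection(('127.0.0.1', 9999))
    cli.sendall(b'request')
    # SO_LINGER with a zero timeout makes close() send a RST, like an impatient client
    cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
    cli.close()
    time.sleep(2)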
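For (5), the exception comes from keystoneauth1, so the failing token request can be reproduced in isolation with the same credentials the conductor uses (which identity the periodic task authenticates with is exactly what I still need to confirm in the source; all values below are placeholders):

    from keystoneauth1.identity import v3
    from keystoneauth1.exceptions.http import Unauthorized
    from keystoneauth1 import session

    auth = v3.Password(auth_url='http://keystone:5000/v3',        # placeholder
                       username='magnum-user', password='secret',  # placeholder credentials
                       project_name='service',
                       user_domain_name='Default',
                       project_domain_name='Default')
    sess = session.Session(auth=auth)
    try:
        sess.get_token()
        print('token issued, these credentials are valid')
    except Unauthorized as exc:
        print('same 401 the conductor logs:', exc)

If this fails with the conductor's credentials, the ~90k Unauthorized entries[8] would point at stale or revoked credentials rather than at the periodic task itself.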