Comment 21 for bug 2015870

Felipe Reyes (freyes) wrote :

Hi,

I've been going through the logs that were handed off internally[0], where I
found some unexpected failures that I believe could be affecting the behaviour
of Magnum. I will list them first and then explain what each could mean
separately.

1. barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen
2. pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
3. oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED [...]
4. ConnectionResetError: [Errno 104] Connection reset by peer
5. keystoneauth1.exceptions.http.Unauthorized: The request you have made requires authentication.

About (1), this error is present in the logs ~14k times[1]; the first occurrence
is on January 27th and the last one in the log is on June 19th. The absence of a
healthy Barbican service prevents Magnum from establishing a connection to k8s,
since Barbican is where the secrets (e.g. private keys) are stored and read from.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/magnum/service/periodic.py", line 106, in _update_health_status
    monitor.poll_health_status()
  File "/usr/lib/python3/dist-packages/magnum/drivers/common/k8s_monitor.py", line 55, in poll_health_status
    k8s_api = k8s.create_k8s_api(self.context, self.cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 145, in create_k8s_api
    return K8sAPI(context, cluster)
  File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 114, in __init__
    self.cert_file) = create_client_files(cluster, context)
  File "/usr/lib/python3/dist-packages/magnum/conductor/handlers/common/cert_manager.py", line 159, in create_client_files
    magnum_cert.get_decrypted_private_key()))
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/cert_manager.py", line 46, in get_decrypted_private_key
    return operations.decrypt_key(self.get_private_key(),
  File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/barbican_cert_manager.py", line 52, in get_private_key
    return self._cert_container.private_key.payload
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 193, in payload
    self._fetch_payload()
  File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 271, in _fetch_payload
    payload = self._api._get_raw(payload_url, headers=headers)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 83, in _get_raw
    return self.request(path, 'GET', *args, **kwargs).content
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 63, in request
    self._check_status_code(resp)
  File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 97, in _check_status_code
    raise exceptions.HTTPServerError(
barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen - please contact site administrator.
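
To narrow this down, the failing request can be reproduced outside of Magnum
with a short python-barbicanclient script. This is only a minimal sketch: the
auth parameters and the secret reference below are placeholders to be filled
in from your deployment.

from keystoneauth1 import session
from keystoneauth1.identity import v3
from barbicanclient import client

# All values below are placeholders, copy them from your deployment.
auth = v3.Password(auth_url="http://keystone.example:5000/v3",
                   username="admin", password="***",
                   project_name="admin",
                   user_domain_name="default",
                   project_domain_name="default")
barbican = client.Client(session=session.Session(auth=auth))
secret = barbican.secrets.get("https://barbican.example:9311/v1/secrets/<uuid>")
# Accessing .payload performs the same GET on .../payload that fails in the
# traceback above.
print(secret.payload)

If this script hits the same HTTPServerError, the problem lives in Barbican
(or its secret store backend) rather than in Magnum.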

About (2), this is a better understood error in general: the database server
dropped the client (in this case the magnum-conductor process). This can happen
for numerous reasons, more data is needed to understand it, and it's out of the
scope of this bug. You are most likely seeing this issue across all the control
plane services; if you see it exclusively in the Magnum service, please provide
us the data you used to reach that conclusion, so we can look into why Magnum
may be misbehaving. This error is less common in the logs[2], although almost
all of the occurrences were in June[3].

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1245, in _execute_context
    self.dialect.do_execute(
  File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 581, in do_execute
    cursor.execute(statement, parameters)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
    result = self._query(query)
  File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
    conn.query(q)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
    result.read()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
    first_packet = self.connection._read_packet()
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
    packet_header = self._read_bytes(4)
  File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 706, in _read_bytes
    raise err.OperationalError(
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
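
While the root cause is likely outside Magnum, there are two well-known
SQLAlchemy-level mitigations for stale pooled connections, sketched below with
a placeholder URL; oslo.db exposes equivalent options (e.g.
connection_recycle_time) in the [database] section of magnum.conf.

from sqlalchemy import create_engine

engine = create_engine(
    "mysql+pymysql://magnum:***@db-host/magnum",  # placeholder URL
    pool_pre_ping=True,   # test each pooled connection before handing it out
    pool_recycle=3600,    # recycle connections before the server's wait_timeout
)

Note these only help with connections dropped while idle in the pool; a
connection lost in the middle of a query, as in the traceback above, points at
the server or the network in between.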

About (3), during June the rabbitmq service was having issues and services
couldn't connect to it[4][5]. RabbitMQ is a critical component for control
plane services (as important as MySQL).
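
A quick way to check reachability from an affected unit is to probe the broker
with kombu, the transport library oslo.messaging uses; the host and credentials
in this sketch are placeholders.

from kombu import Connection

# Placeholder URL, copy the real transport_url from magnum.conf.
with Connection("amqp://openstack:***@rabbit-host:5672//") as conn:
    # Raises kombu.exceptions.OperationalError when the broker is unreachable,
    # matching the ECONNREFUSED seen in the logs.
    conn.ensure_connection(max_retries=3)
    print("rabbitmq reachable")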

About (4), the ConnectionResetError[6][7] may be the symptom of a more general
error in the underlying network, or of a large performance degradation where
services are not able to respond to queries on time: the client closes the
connection (raising a timeout error on its side), and when the server later
tries to write the response to the socket there is no one on the other side,
so the write fails.
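
That sequence can be reproduced in isolation. The following self-contained
sketch (Linux semantics; the exact errno can vary by platform) shows a client
aborting a connection and the server's late write failing with the same
errno 104.

import socket
import struct
import threading
import time

srv = socket.socket()
srv.bind(("127.0.0.1", 0))
srv.listen(1)

def slow_server():
    conn, _ = srv.accept()
    time.sleep(1)                      # simulate a slow, overloaded service
    try:
        conn.sendall(b"late reply")    # the peer is gone by now
    except ConnectionResetError as exc:
        print("server saw:", exc)      # [Errno 104] Connection reset by peer

threading.Thread(target=slow_server).start()

cli = socket.create_connection(srv.getsockname())
# SO_LINGER with a zero timeout makes close() send an RST, mimicking a client
# that aborts after a timeout instead of shutting down gracefully.
cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
cli.close()
time.sleep(2)                          # give the server thread time to fail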

About (5), I don't have enough logs to share insights. We do know that
Keystone and Magnum are correctly configured, since you can launch k8s
clusters, yet the main loop fails consistently[8]. I will be looking into
Magnum's source code to understand the consequences of this scheduled task
being unable to complete its job.
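
In the meantime, a quick check that rules out bad service credentials is to
request a token with keystoneauth1 directly; every value in this sketch is a
placeholder to be copied from magnum.conf.

from keystoneauth1 import session
from keystoneauth1.identity import v3

# All values below are placeholders, copy them from magnum.conf.
auth = v3.Password(auth_url="http://keystone.example:5000/v3",
                   username="magnum", password="***",
                   project_name="services",
                   user_domain_name="default",
                   project_domain_name="default")
sess = session.Session(auth=auth)
# Raises keystoneauth1.exceptions.http.Unauthorized when the credentials are
# rejected, the same exception as in (5).
print(sess.get_token())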

[0] 8508554a99e603a898dbe9b70cbc0c2d magnum-conductor.log
[1] $ grep barbicanclient.exceptions.HTTPServerError magnum-conductor-error.log | wc -l
14147
[2] $ grep pymysql.err.OperationalError magnum-conductor-error.log | wc -l
373
[3] $ grep pymysql.err.OperationalError magnum-conductor-error.log | grep 2023-06 | wc -l
363
[4] $ grep "Connection failed" magnum-conductor-error.log | grep rabbit | wc -l
3099
[5] $ grep "Connection failed" magnum-conductor-error.log | grep rabbit | grep 2023-06 | wc -l
3068
[6] $ grep "ConnectionResetError" magnum-conductor-error.log | wc -l
305
[7] $ grep "ConnectionResetError" magnum-conductor-error.log | grep 2023-06 | wc -l
301
[8] $ grep "keystoneauth1.exceptions.http.Unauthorized" magnum-conductor-error.log | wc -l
90482