Hi,

I've been going through the logs that were handed off internally[0], where I
found some unexpected failures that I believe could be affecting the behaviour
of Magnum. I will list them and explain what they could mean separately.
1. barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen
2. pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
3. oslo.messaging._drivers.impl_rabbit [-] Connection failed: [Errno 111] ECONNREFUSED [...]
4. ConnectionResetError: [Errno 104] Connection reset by peer
5. keystoneauth1.exceptions.http.Unauthorized: The request you have made requires authentication.
About (1), this error is present in the logs ~14k times[1]; the first occurrence
is on January 27th and the last one in the log is on June 19th. The absence of a
healthy Barbican service prevents Magnum from establishing a connection to k8s,
since Barbican is where the secrets (e.g. private keys) are stored and read from.
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/magnum/service/periodic.py", line 106, in _update_health_status
monitor.poll_health_status()
File "/usr/lib/python3/dist-packages/magnum/drivers/common/k8s_monitor.py", line 55, in poll_health_status
k8s_api = k8s.create_k8s_api(self.context, self.cluster)
File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 145, in create_k8s_api
return K8sAPI(context, cluster)
File "/usr/lib/python3/dist-packages/magnum/conductor/k8s_api.py", line 114, in __init__
self.cert_file) = create_client_files(cluster, context)
File "/usr/lib/python3/dist-packages/magnum/conductor/handlers/common/cert_manager.py", line 159, in create_client_files
magnum_cert.get_decrypted_private_key()))
File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/cert_manager.py", line 46, in get_decrypted_private_key
return operations.decrypt_key(self.get_private_key(),
File "/usr/lib/python3/dist-packages/magnum/common/cert_manager/barbican_cert_manager.py", line 52, in get_private_key
return self._cert_container.private_key.payload
File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 193, in payload
self._fetch_payload()
File "/usr/lib/python3/dist-packages/barbicanclient/v1/secrets.py", line 271, in _fetch_payload
payload = self._api._get_raw(payload_url, headers=headers)
File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 83, in _get_raw
return self.request(path, 'GET', *args, **kwargs).content
File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 63, in request
self._check_status_code(resp)
File "/usr/lib/python3/dist-packages/barbicanclient/client.py", line 97, in _check_status_code
raise exceptions.HTTPServerError(
barbicanclient.exceptions.HTTPServerError: Internal Server Error: Secret payload retrieval failure seen - please contact site administrator.
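As a quick way to check Barbican's health independently of Magnum, you can
reproduce the same payload retrieval with python-barbicanclient directly. This
is a minimal sketch: the auth URL, credentials and secret reference below are
placeholders, not values from your deployment.

from keystoneauth1.identity import v3
from keystoneauth1 import session
from barbicanclient import client

# Placeholder credentials; use the ones magnum-conductor is configured with.
auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='admin', password='secret',
                   project_name='admin',
                   user_domain_name='Default', project_domain_name='Default')
barbican = client.Client(session=session.Session(auth=auth))

# Use any secret ref belonging to a cluster; accessing .payload triggers the
# same GET on the payload URL that fails in the traceback above.
secret = barbican.secrets.get('http://controller:9311/v1/secrets/<uuid>')
print(secret.payload)

If this fails with the same HTTPServerError outside of Magnum, the problem is
squarely on the Barbican side.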
About (2), this is a better understood error in general: the database dropped
the client (in this case the magnum-conductor process). This can happen for
numerous reasons, more data is needed to understand it, and it's out of the
scope of this bug. You are most likely seeing this issue across all the
control plane services; if you detect it exclusively in the Magnum service,
please provide us the data you used to reach that conclusion, so we can look
into why Magnum may be misbehaving. This error is less common in the logs[2],
although almost all of the occurrences were in June[3].
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/base.py", line 1245, in _execute_context
self.dialect.do_execute(
File "/usr/lib/python3/dist-packages/sqlalchemy/engine/default.py", line 581, in do_execute
cursor.execute(statement, parameters)
File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 170, in execute
result = self._query(query)
File "/usr/lib/python3/dist-packages/pymysql/cursors.py", line 328, in _query
conn.query(q)
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 517, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 732, in _read_query_result
result.read()
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 1075, in read
first_packet = self.connection._read_packet()
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 657, in _read_packet
packet_header = self._read_bytes(4)
File "/usr/lib/python3/dist-packages/pymysql/connections.py", line 706, in _read_bytes
raise err.OperationalError(
pymysql.err.OperationalError: (2013, 'Lost connection to MySQL server during query')
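For what it's worth, the usual mitigation for connections dropped by the
server is connection-pool hygiene. A minimal sketch of the SQLAlchemy knobs
involved (this is not Magnum's actual code, and the DSN is a placeholder);
oslo.db exposes the recycle setting as connection_recycle_time in the
[database] section of the service config:

from sqlalchemy import create_engine, text

engine = create_engine(
    'mysql+pymysql://magnum:password@controller/magnum',  # placeholder DSN
    pool_pre_ping=True,   # test pooled connections before use, discard dead ones
    pool_recycle=3600,    # recycle connections before the server's idle timeout
)
with engine.connect() as conn:
    conn.execute(text('SELECT 1'))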
About (3), during June the rabbitmq service was having issues and services
couldn't connect to it[4][5]. RabbitMQ is a critical component for control
plane services (as important as MySQL).
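The ECONNREFUSED in the log means nothing was listening on the broker's port
at all. A trivial check from the node running magnum-conductor (the hostname
below is a placeholder; 5672 is the default AMQP port):

import socket

try:
    # Replace 'controller' with the rabbit host from magnum.conf's transport_url.
    with socket.create_connection(('controller', 5672), timeout=3):
        print('rabbitmq port reachable')
except ConnectionRefusedError as exc:
    print('refused:', exc)  # [Errno 111] Connection refused, as in the log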
About (4), the ConnectionResetError[6][7] may be the symptom of a more general
error in the underlying network, or of a large performance degradation where
services are not able to respond to queries in time: the client gives up and
closes the connection (raising a timeout error on its side), and when the
server later tries to write the response to the socket there is no one on the
other side, so the write fails.
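That failure mode can be reproduced in isolation. A self-contained sketch (not
Magnum code) where a client aborts on timeout and the slow server then hits
[Errno 104] on its write; the abortive close via SO_LINGER stands in for a
client tearing down a timed-out request:

import socket
import struct
import threading
import time

def slow_server():
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(('127.0.0.1', 50007))
    srv.listen(1)
    conn, _ = srv.accept()
    time.sleep(1)  # respond slower than the client's deadline
    try:
        conn.sendall(b'late response')
    except (ConnectionResetError, BrokenPipeError) as exc:
        print('server saw:', exc)  # typically [Errno 104] Connection reset by peer
    finally:
        conn.close()
        srv.close()

t = threading.Thread(target=slow_server)
t.start()
time.sleep(0.1)

cli = socket.create_connection(('127.0.0.1', 50007), timeout=0.2)
try:
    cli.recv(1024)  # times out: the server is still "processing"
except socket.timeout:
    # Abort the connection (RST) instead of a graceful FIN close.
    cli.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
cli.close()
t.join()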
About (5), this error appears ~90k times in the logs[8], but I don't have
enough logs to share insights. We do know that Keystone and Magnum are
correctly configured, since you can launch k8s clusters, although the main
loop fails consistently. I will be looking into Magnum's source code to
understand the consequences of this scheduled task not being able to complete
its job.
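In the meantime, a quick way to test whether a given credential set is the
culprit is to request a token with it directly via keystoneauth1. All values
below are placeholders; use the credentials from Magnum's config:

from keystoneauth1.identity import v3
from keystoneauth1 import session

auth = v3.Password(auth_url='http://controller:5000/v3',
                   username='magnum', password='secret',
                   project_name='service',
                   user_domain_name='Default', project_domain_name='Default')
sess = session.Session(auth=auth)
# Raises keystoneauth1.exceptions.http.Unauthorized on bad credentials,
# the same exception seen in (5).
print('token obtained:', sess.get_token()[:8] + '...')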
[0] 8508554a99e603a898dbe9b70cbc0c2d magnum-conductor.log
[1] $ grep barbicanclient.exceptions.HTTPServerError magnum-conductor-error.log | wc -l
14147
[2] $ grep pymysql.err.OperationalError magnum-conductor-error.log | wc -l
373
[3] $ grep pymysql.err.OperationalError magnum-conductor-error.log | grep 2023-06 | wc -l
363
[4] $ grep "Connection failed" magnum-conductor-error.log | grep rabbit | wc -l
3099
[5] $ grep "Connection failed" magnum-conductor-error.log | grep rabbit | grep 2023-06 | wc -l
3068
[6] $ grep "ConnectionResetError" magnum-conductor-error.log | wc -l
305
[7] $ grep "ConnectionResetError" magnum-conductor-error.log | grep 2023-06 | wc -l
301
[8] $ grep "keystoneauth1.exceptions.http.Unauthorized" magnum-conductor-error.log | wc -l
90482