Activity log for bug #2022859

Date Who What changed Old value New value Message
2023-06-04 16:14:09 Arun Neelicattu bug added bug
2023-06-04 16:17:19 Arun Neelicattu bug task added charm-kubeapi-load-balancer
2023-06-04 16:19:03 Arun Neelicattu description

Old value:

Observed Behaviour
------------------
1. kubernetes-control-plane units never get to active/idle state.
2. kubernetes-control-plane units continuously reacts to client relation / config change events causing restarts.
3. easyrsa units detect client relation changes triggering certification revocation and generates new certificates with different SANs.

Would be great if someone has a workaround for this that we can use.

Probably Root Cause
-------------------
When multiple DNS records exists for the host, python's socket.getfqdn() calls will provide inconsistent results between calls due to https://github.com/python/cpython/issues/49254.

For example the following commands were consecutively executed on one of the control plane machines.

    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    10-XXX-XXX-XXX.example.net
    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    juju-bf6a17-0.example.net

This causes the SAN list generated for the certificate request to be different every time.

https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7258630cf0a5560a665ed1e4770cc8f2f52013c4/reactive/kubernetes_control_plane.py#L1570C7-L1588

This then triggers a certificate change and the cycle continues.

Proposed Fix
------------
A potential fix here could be to replace the call to `socket.getfqdn()` to the patched method in the cpython upstream issue. For convenience, I have put the code into a gist at https://gist.github.com/abn/c4165a6d288e5f7137bdec5a4db199d1.

Alternatively, you can simply replace the call with the following.

    socket.getaddrinfo(socket.gethostname(), None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME)

Ideally the fix can be replicated and/or reused across all the charms that request for certs.

New value:

Observed Behaviour
------------------
1. kubernetes-control-plane units never get to active/idle state.
2. kubernetes-control-plane units continuously reacts to client relation / config change events causing restarts.
3. easyrsa units detect client relation changes triggering certification revocation and generates new certificates with different SANs.

Would be great if someone has a workaround for this that we can use.

Probable Root Cause
-------------------
When multiple DNS records exists for the host, python's socket.getfqdn() calls will provide inconsistent results between calls due to https://github.com/python/cpython/issues/49254.

For example the following commands were consecutively executed on one of the control plane machines.

    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    10-XXX-XXX-XXX.example.net
    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    juju-bf6a17-0.example.net

This causes the SAN list generated for the certificate request to be different every time.

https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7258630cf0a5560a665ed1e4770cc8f2f52013c4/reactive/kubernetes_control_plane.py#L1570C7-L1588

This then triggers a certificate change and the cycle continues.

Proposed Fix
------------
A potential fix here could be to replace the call to `socket.getfqdn()` to the patched method in the cpython upstream issue. For convenience, I have put the code into a gist at https://gist.github.com/abn/c4165a6d288e5f7137bdec5a4db199d1.

Alternatively, you can simply replace the call with the following.

    socket.getaddrinfo(socket.gethostname(), None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME)

Ideally the fix can be replicated and/or reused across all the charms that request for certs.
2023-06-04 19:15:52 Arun Neelicattu description

Old value:

Observed Behaviour
------------------
1. kubernetes-control-plane units never get to active/idle state.
2. kubernetes-control-plane units continuously reacts to client relation / config change events causing restarts.
3. easyrsa units detect client relation changes triggering certification revocation and generates new certificates with different SANs.

Would be great if someone has a workaround for this that we can use.

Probable Root Cause
-------------------
When multiple DNS records exists for the host, python's socket.getfqdn() calls will provide inconsistent results between calls due to https://github.com/python/cpython/issues/49254.

For example the following commands were consecutively executed on one of the control plane machines.

    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    10-XXX-XXX-XXX.example.net
    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    juju-bf6a17-0.example.net

This causes the SAN list generated for the certificate request to be different every time.

https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7258630cf0a5560a665ed1e4770cc8f2f52013c4/reactive/kubernetes_control_plane.py#L1570C7-L1588

This then triggers a certificate change and the cycle continues.

Proposed Fix
------------
A potential fix here could be to replace the call to `socket.getfqdn()` to the patched method in the cpython upstream issue. For convenience, I have put the code into a gist at https://gist.github.com/abn/c4165a6d288e5f7137bdec5a4db199d1.

Alternatively, you can simply replace the call with the following.

    socket.getaddrinfo(socket.gethostname(), None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME)

Ideally the fix can be replicated and/or reused across all the charms that request for certs.

New value:

Observed Behaviour
------------------
1. kubernetes-control-plane units never get to active/idle state.
2. kubernetes-control-plane units continuously reacts to client relation / config change events causing restarts.
3. easyrsa units detect client relation changes triggering certification revocation and generates new certificates with different SANs.

Would be great if someone has a workaround for this that we can use.

Probable Root Cause
-------------------
When multiple DNS records exists for the host, python's socket.getfqdn() calls will provide inconsistent results between calls due to https://github.com/python/cpython/issues/49254.

For example the following commands were consecutively executed on one of the control plane machines.

    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    10-XXX-XXX-XXX.example.net
    root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
    juju-bf6a17-0.example.net

This causes the SAN list generated for the certificate request to be different every time.

https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7258630cf0a5560a665ed1e4770cc8f2f52013c4/reactive/kubernetes_control_plane.py#L1570C7-L1588

This then triggers a certificate change and the cycle continues.

Proposed Fix
------------
A potential fix here could be to replace the call to `socket.getfqdn()` to the patched method in the cpython upstream issue. For convenience, I have put the code into a gist at https://gist.github.com/abn/c4165a6d288e5f7137bdec5a4db199d1.

Alternatively, you can simply replace the call with the following.

    socket.getaddrinfo(socket.gethostname(), None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME)

Ideally the fix can be replicated and/or reused across all the charms that request for certs.

Workaround
----------
Should adjust the command to your specific case.

    juju exec --all -- bash -c 'sudo sed -i s/"127.0.0.1 localhost"/"127.0.0.1 $(hostname -f) localhost"/ /etc/hosts'

Since the hosts file has precedence, this seems to have at least mitigated the issue for now.
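
Note on the proposed `socket.getaddrinfo(...)` replacement: that call returns a list of `(family, type, proto, canonname, sockaddr)` tuples rather than a hostname string, so a caller would still need to pick the canonical name out of the result. The following is a minimal sketch of one way to do that, not the code from the gist or from the charm. The function name `stable_fqdn` and the fallback to the plain hostname are assumptions made for illustration, and the value returned still depends on how the resolver on the machine is configured.

```python
import socket


def stable_fqdn(hostname=None):
    """Return the canonical name reported by getaddrinfo for this host.

    Hypothetical helper: falls back to the plain hostname when resolution
    fails or when the resolver reports no canonical name.
    """
    name = hostname or socket.gethostname()
    try:
        # Same arguments as the one-liner in the report:
        # host, port, family, type, proto, flags.
        results = socket.getaddrinfo(
            name, None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME
        )
    except OSError:
        # socket.gaierror is a subclass of OSError.
        return name
    for _family, _type, _proto, canonname, _sockaddr in results:
        # Return the first non-empty canonical name.
        if canonname:
            return canonname
    return name


if __name__ == "__main__":
    print(stable_fqdn())
```

Running this a few times on an affected unit and comparing the output with `socket.getfqdn()` would show whether the result is consistent there before relying on it for the SAN list.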