Kubernetes Control Plane Charm

Bug #2022859
Comment #0

Arun Neelicattu (arun-neelicattu) wrote on 2023-06-04:

Observed Behaviour
------------------
1. kubernetes-control-plane units never get to active/idle state.
2. kubernetes-control-plane units continuously react to client relation / config change events, causing restarts.
3. easyrsa units detect client relation changes, which triggers certificate revocation and generates new certificates with different SANs.

Would be great if someone has a workaround for this that we can use.

Probable Root Cause
-------------------
When multiple DNS records exist for the host, Python's socket.getfqdn() can return different results from one call to the next due to https://github.com/python/cpython/issues/49254.

For example, the following commands were executed consecutively on one of the control plane machines.

root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
10-XXX-XXX-XXX.example.net
root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
juju-bf6a17-0.example.net

This causes the SAN list generated for the certificate request to be different every time. https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7258630cf0a5560a665ed1e4770cc8f2f52013c4/reactive/kubernetes_control_plane.py#L1570C7-L1588

This then triggers a certificate change and the cycle continues.
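
To check whether a given machine is affected by the inconsistent lookups described above, something like the following could be used. This is only a minimal sketch: the helper name `distinct_fqdns` and the number of attempts are made up for illustration and are not part of the charm.

```python
import socket


def distinct_fqdns(attempts=20):
    """Call socket.getfqdn() repeatedly and return the set of distinct results."""
    return {socket.getfqdn() for _ in range(attempts)}


if __name__ == "__main__":
    names = distinct_fqdns()
    # More than one entry means getfqdn() is not stable on this host.
    print(sorted(names))
```

If the printed list has more than one entry, the SAN list built from socket.getfqdn() can differ between hook runs on that machine.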

Proposed Fix
------------
A potential fix here could be to replace the call to `socket.getfqdn()` with the patched function from the upstream CPython issue. For convenience, I have put the code into a gist at https://gist.github.com/abn/c4165a6d288e5f7137bdec5a4db199d1.

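Alternatively, you can replace the call with the following. Note that it returns a list of tuples rather than a string; the canonical name is the fourth field of the first entry.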

socket.getaddrinfo(socket.gethostname(), None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME)
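
As a rough sketch of how the getaddrinfo() call above could be wrapped into a drop-in replacement, assuming the canonical name reported by the resolver is what should end up in the SAN list. The function name `stable_fqdn` is hypothetical, and the fallback to the plain hostname is my assumption about the desired behaviour rather than something taken from the charm.

```python
import socket


def stable_fqdn(name=""):
    """Return the canonical name from getaddrinfo(), or the hostname on failure."""
    name = name.strip()
    if not name or name == "0.0.0.0":
        name = socket.gethostname()
    try:
        addrs = socket.getaddrinfo(
            name, None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME
        )
    except OSError:
        # Resolution failed; fall back to the unqualified name (assumption).
        return name
    for addr in addrs:
        # Each entry is (family, type, proto, canonname, sockaddr).
        if addr[3]:
            return addr[3]
    return name


if __name__ == "__main__":
    print(stable_fqdn())
```

Unlike socket.getfqdn(), this does not go through gethostbyaddr() and the alias list, so the result should not depend on which reverse record happens to come back first.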

Ideally the fix can be replicated and/or reused across all the charms that request certificates.