Control plane crashloop due to cert regeneration caused by inconsistent SANs

Bug #2022859 reported by Arun Neelicattu
This bug affects 1 person
Affects                          Status  Importance  Assigned to  Milestone
Kubernetes API Load Balancer     New     Undecided   Unassigned
Kubernetes Control Plane Charm   New     Undecided   Unassigned

Bug Description

Observed Behaviour
------------------
1. kubernetes-control-plane units never reach the active/idle state.
2. kubernetes-control-plane units continuously react to client relation / config change events, causing restarts.
3. easyrsa units detect client relation changes, which triggers certificate revocation and the generation of new certificates with different SANs.

Would be great if someone has a workaround for this that we can use.

Probable Root Cause
-------------------
When multiple DNS records exist for the host, Python's socket.getfqdn() can return different results from one call to the next due to https://github.com/python/cpython/issues/49254.

For example, the following commands were executed consecutively on one of the control plane machines.

root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
10-XXX-XXX-XXX.example.net
root@juju-bf6a17-0:/# python3 -c 'import socket; print(socket.getfqdn())'
juju-bf6a17-0.example.net

This causes the SAN list generated for the certificate request to be different every time. https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/7258630cf0a5560a665ed1e4770cc8f2f52013c4/reactive/kubernetes_control_plane.py#L1570C7-L1588

This then triggers a certificate change and the cycle continues.
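
To check whether a given unit is affected, the two manual calls above can be repeated in a loop. The following is only a minimal sketch, not code from the charm; the helper names sample_fqdns, is_stable and report are made up for illustration, and the lookup is injectable so the logic can be exercised without DNS.

```python
import socket
from collections import Counter
from typing import Callable


def sample_fqdns(attempts: int = 20,
                 lookup: Callable[[], str] = socket.getfqdn) -> Counter:
    """Call the lookup repeatedly and count each distinct name returned."""
    return Counter(lookup() for _ in range(attempts))


def is_stable(samples: Counter) -> bool:
    """Treat the host as stable only if every call returned the same name."""
    return len(samples) <= 1


def report(samples: Counter) -> str:
    """Render one line per distinct name, followed by a verdict line."""
    lines = [f"{count:3d}  {name}" for name, count in samples.most_common()]
    lines.append("stable" if is_stable(samples) else "UNSTABLE")
    return "\n".join(lines)
```

Running print(report(sample_fqdns())) on a control plane machine should list more than one name and end with UNSTABLE if the machine shows the behaviour described here.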

Proposed Fix
------------
A potential fix here could be to replace the call to `socket.getfqdn()` with the patched method from the upstream cpython issue. For convenience, I have put the code into a gist at https://gist.github.com/abn/c4165a6d288e5f7137bdec5a4db199d1.

Alternatively, you can simply replace the call with the following, which reads the canonical name from the first result returned by getaddrinfo().

socket.getaddrinfo(socket.gethostname(), None, 0, socket.SOCK_DGRAM, 0, socket.AI_CANONNAME)[0][3]

Ideally, the fix can be replicated and/or reused across all the charms that request certificates.
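
As a rough illustration of how the getaddrinfo() alternative might be wrapped for reuse, here is a minimal sketch. The function name stable_fqdn and the injectable resolver argument are hypothetical, and this is neither the code from the gist nor the charm's implementation. It assumes the resolver returns a consistent canonical name for the host, which I have not verified beyond the affected machines.

```python
import socket
from typing import Callable


def stable_fqdn(hostname: str = "",
                resolver: Callable = socket.getaddrinfo) -> str:
    """Return the canonical name for hostname, else the hostname itself."""
    hostname = hostname or socket.gethostname()
    try:
        results = resolver(hostname, None, 0, socket.SOCK_DGRAM, 0,
                           socket.AI_CANONNAME)
    except socket.gaierror:
        # Resolution failed; fall back to the bare hostname.
        return hostname
    # Each result is (family, type, proto, canonname, sockaddr). With
    # AI_CANONNAME set, the canonical name is carried by the first entry.
    for _family, _type, _proto, canonname, _sockaddr in results:
        if canonname:
            return canonname
    return hostname
```

The fallback to the plain hostname is a design choice for this sketch, so that a SAN is still produced when resolution fails; a charm may prefer to raise instead.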

Workaround
----------
Adjust the command to suit your environment.

juju exec --all -- bash -c 'sudo sed -i s/"127.0.0.1 localhost"/"127.0.0.1 $(hostname -f) localhost"/ /etc/hosts'

Because /etc/hosts is normally consulted before DNS, this seems to have at least mitigated the issue for now.
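
To preview what the substitution does to a hosts file before running it on every unit, the same edit can be written as a pure function. This is only a sketch; pin_fqdn is a hypothetical name, and it matches the "127.0.0.1 localhost" text literally, whereas sed treats the dots as regex wildcards.

```python
def pin_fqdn(hosts_text: str, fqdn: str) -> str:
    """Insert fqdn before 'localhost' on the '127.0.0.1 localhost' entry."""
    needle = "127.0.0.1 localhost"
    replacement = f"127.0.0.1 {fqdn} localhost"
    # Like sed without the g flag, replace only the first match on each line.
    lines = hosts_text.splitlines(keepends=True)
    return "".join(line.replace(needle, replacement, 1) for line in lines)
```

Applying it a second time leaves the text unchanged, because the rewritten line no longer contains the original pattern.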
