System Config
System Controller
Description of failure
This issue came up from the ME lab.
It was observed that the kube-controller-manager pod emits a large number of error messages, as shown below. Over a period of one hour there are over 5k entries like this, half of which are marked with the Error log level.
Impact of Failure
Minor
Impact on Customer
A large number of these error messages eventually reach monitoring systems.
Was the issue reproduced internally
Yes
Timeline based on log analysis:
Below are the error messages seen in the logs:
sysadmin@controller-0:~$ kubectl -n kube-system logs --since=5m --tail 15 --timestamps kube-controller-manager-controller-0
2023-10-26T13:45:12.463375446Z I1026 13:45:12.463317 1 event.go:294] "Event occurred" object="controller-1" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-1 status is now: CIDRNotAvailable"
2023-10-26T13:45:13.275175308Z E1026 13:45:13.275087 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:13.275205782Z I1026 13:45:13.275122 1 event.go:294] "Event occurred" object="worker-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node worker-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:16.378798271Z E1026 13:45:16.378697 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:16.378857639Z I1026 13:45:16.378751 1 event.go:294] "Event occurred" object="controller-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:16.588729223Z E1026 13:45:16.588648 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:16.588759244Z I1026 13:45:16.588719 1 event.go:294] "Event occurred" object="controller-1" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-1 status is now: CIDRNotAvailable"
2023-10-26T13:45:17.395793922Z E1026 13:45:17.395695 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:17.395843933Z I1026 13:45:17.395780 1 event.go:294] "Event occurred" object="worker-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node worker-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:20.557402365Z E1026 13:45:20.557334 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:20.557427086Z I1026 13:45:20.557370 1 event.go:294] "Event occurred" object="controller-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:20.711208011Z E1026 13:45:20.711143 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:20.711239072Z I1026 13:45:20.711183 1 event.go:294] "Event occurred" object="controller-1" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-1 status is now: CIDRNotAvailable"
2023-10-26T13:45:21.521863050Z E1026 13:45:21.521758 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:21.521891349Z I1026 13:45:21.521811 1 event.go:294] "Event occurred" object="worker-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node worker-0 status is now: CIDRNotAvailable"
sysadmin@controller-0:~$
Within one hour, there are over 5k entries like this, half of which are marked with the Error log level:
sysadmin@controller-0:~$ kubectl -n kube-system logs --since=60m kube-controller-manager-controller-0 | grep -c -e "CIDRNotAvailable" -e "CIDR allocation failed"
5272
sysadmin@controller-0:~$ kubectl -n kube-system logs --since=60m kube-controller-manager-controller-0 | grep -Ec "^E[0-9]{4}.*CIDR"
2636
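The two grep counts above can also be reproduced offline. Below is a small sketch that splits entries by klog severity (the sample lines are shortened copies of the log output shown earlier; the counts here apply only to this four-line sample):

```python
import re

# Shortened copies of the kube-controller-manager log lines shown above
lines = [
    'E1026 13:45:13.275087 1 controller_utils.go:256] ... CIDR allocation failed ...',
    'I1026 13:45:13.275122 1 event.go:294] ... reason="CIDRNotAvailable" ...',
    'E1026 13:45:16.378697 1 controller_utils.go:256] ... CIDR allocation failed ...',
    'I1026 13:45:16.378751 1 event.go:294] ... reason="CIDRNotAvailable" ...',
]

# klog encodes severity in the first character of each line: E = Error, I = Info
total = sum(1 for l in lines
            if "CIDRNotAvailable" in l or "CIDR allocation failed" in l)
errors = sum(1 for l in lines
             if re.match(r'^E\d{4}', l) and "CIDR" in l)
print(total, errors)  # 4 2 -> half of the matching entries are Error level
```

This mirrors the two grep pipelines: the first counter matches either message text, the second keeps only klog Error-prefixed lines.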
CIDR-specific config in the manifest (the same as in production and in Planning):
sysadmin@controller-0:~$ sudo sh -c 'grep cidr /etc/kubernetes/manifests/*.yaml'
/etc/kubernetes/manifests/kube-controller-manager.yaml: - --allocate-node-cidrs=true
/etc/kubernetes/manifests/kube-controller-manager.yaml: - --cluster-cidr=2001:db8:4::/64
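With --cluster-cidr=2001:db8:4::/64 and a per-node mask of /64 (kube-controller-manager's documented IPv6 default for node-cidr-mask-size; it is not set explicitly in this manifest, so the default is an assumption), the cluster range splits into exactly one per-node pod CIDR. A minimal sketch of that arithmetic, using the value from the manifest above:

```python
import ipaddress

# cluster-cidr from the kube-controller-manager manifest
cluster_cidr = ipaddress.ip_network("2001:db8:4::/64")
# per-node mask size (assumed default for IPv6; not set in the manifest)
node_mask = 64

# Number of per-node pod CIDRs = 2^(node_mask - cluster prefix length)
subnets = list(cluster_cidr.subnets(new_prefix=node_mask))
print(len(subnets))   # 1 -> the first node processed takes the only range;
                      #      every other node logs CIDRNotAvailable
print(subnets[0])     # 2001:db8:4::/64
```

This matches the behavior in the logs: one node gets the whole range, and the allocator retries endlessly for the remaining nodes.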
Describe of kube-controller-manager-controller-0
Name: kube-controller-manager-controller-0
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: controller-0/2607:f160:10:923e:ce:290:0:11
Start Time: Thu, 19 Oct 2023 23:02:27 +0000
Labels: component=kube-controller-manager
tier=control-plane
Annotations: kubernetes.io/config.hash: 530328cce679946a96e2dd5e1e49c0b2
kubernetes.io/config.mirror: 530328cce679946a96e2dd5e1e49c0b2
kubernetes.io/config.seen: 2023-10-19T22:46:57.788196843Z
kubernetes.io/config.source: file
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: 2607:f160:10:923e:ce:290:0:11
IPs:
IP: 2607:f160:10:923e:ce:290:0:11
Controlled By: Node/controller-0
Containers:
kube-controller-manager:
Container ID: containerd://54fdf1b7ef682ba7d8e954d61e4df5cb799b835f525f070e50328a752897433d
Image: registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4
Image ID: registry.local:9001/k8s.gcr.io/kube-controller-manager@sha256:da588c9f0e65e93317f5e016603d1ed7466427e9e0cf8b028c505bf30837f7dd
Port: <none>
Host Port: <none>
Command:
kube-controller-manager
--allocate-node-cidrs=true
--authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
--authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
--bind-address=127.0.0.1
--client-ca-file=/etc/kubernetes/pki/ca.crt
--cluster-cidr=2001:db8:4::/64
--cluster-name=kubernetes
--cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
--cluster-signing-key-file=/etc/kubernetes/pki/ca.key
--controllers=*,bootstrapsigner,tokencleaner
--flex-volume-plugin-dir=/var/opt/libexec/kubernetes/kubelet-plugins/volume/exec/
--kubeconfig=/etc/kubernetes/controller-manager.conf
--leader-elect=true
--node-monitor-grace-period=20s
--node-monitor-period=2s
--pod-eviction-timeout=30s
--requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
--root-ca-file=/etc/kubernetes/pki/ca.crt
--service-account-private-key-file=/etc/kubernetes/pki/sa.key
--service-cluster-ip-range=2001:db8:4::/112
--use-service-account-credentials=true
State: Running
Started: Wed, 25 Oct 2023 19:14:25 +0000
Last State: Terminated
Reason: Error
Exit Code: 2
Started: Wed, 25 Oct 2023 18:50:46 +0000
Finished: Wed, 25 Oct 2023 19:06:42 +0000
Ready: True
Restart Count: 7
Requests:
cpu: 0
Liveness: http-get https://127.0.0.1:10257/healthz delay=10s timeout=15s period=10s #success=1 #failure=8
Startup: http-get https://127.0.0.1:10257/healthz delay=10s timeout=15s period=10s #success=1 #failure=24
Environment: <none>
Mounts:
/etc/ca-certificates from etc-ca-certificates (ro)
/etc/kubernetes/controller-manager.conf from kubeconfig (ro)
/etc/kubernetes/pki from k8s-certs (ro)
/etc/pki from etc-pki (ro)
/etc/ssl/certs from ca-certs (ro)
/usr/local/share/ca-certificates from usr-local-share-ca-certificates (ro)
/usr/share/ca-certificates from usr-share-ca-certificates (ro)
/var/opt/libexec/kubernetes/kubelet-plugins/volume/exec/ from flexvolume-dir (rw)
Conditions:
Type Status
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
ca-certs:
Type: HostPath (bare host directory volume)
Path: /etc/ssl/certs
HostPathType: DirectoryOrCreate
etc-ca-certificates:
Type: HostPath (bare host directory volume)
Path: /etc/ca-certificates
HostPathType: DirectoryOrCreate
etc-pki:
Type: HostPath (bare host directory volume)
Path: /etc/pki
HostPathType: DirectoryOrCreate
flexvolume-dir:
Type: HostPath (bare host directory volume)
Path: /var/opt/libexec/kubernetes/kubelet-plugins/volume/exec/
HostPathType: DirectoryOrCreate
k8s-certs:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/pki
HostPathType: DirectoryOrCreate
kubeconfig:
Type: HostPath (bare host directory volume)
Path: /etc/kubernetes/controller-manager.conf
HostPathType: FileOrCreate
usr-local-share-ca-certificates:
Type: HostPath (bare host directory volume)
Path: /usr/local/share/ca-certificates
HostPathType: DirectoryOrCreate
usr-share-ca-certificates:
Type: HostPath (bare host directory volume)
Path: /usr/share/ca-certificates
HostPathType: DirectoryOrCreate
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Pulled 19h (x2 over 6d12h) kubelet Container image "registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4" already present on machine
Normal Created 19h (x2 over 6d12h) kubelet Created container kube-controller-manager
Normal Started 19h (x2 over 6d12h) kubelet Started container kube-controller-manager
Normal SandboxChanged 18h kubelet Pod sandbox changed, it will be killed and re-created.
Normal Pulled 18h kubelet Container image "registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4" already present on machine
Normal Created 18h kubelet Created container kube-controller-manager
Normal Started 18h kubelet Started container kube-controller-manager
Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/900874
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/1acbbb1e89f7506c11be788d17545a0251eb8556
Submitter: "Zuul (22348)"
Branch: master
commit 1acbbb1e89f7506c11be788d17545a0251eb8556
Author: akantek <email address hidden>
Date: Fri Nov 10 15:17:50 2023 -0300
Turn off allocate-node-cidrs in kube-controller-manager
The error message "CIDR allocation failed; there are no remaining
CIDRs left to allocate in the accepted range" was detected with high
occurrence in the log.
By default allocate-node-cidrs is turned on to allocate an address
range for each node. But if cluster-cidr is set with a /64 mask (or
/24 for an IPv4 install) while node-cidr-mask-size uses its default
of /64 (or /24 in IPv4), there is no range left for the other nodes:
controller-0 receives all the available pod addresses.
But StarlingX uses Calico as the CNI solution, and Calico does not
use the addresses allocated by kube-controller-manager to distribute
addresses among the pods. That is why this configuration was not
noticed.
This change turns off allocate-node-cidrs during install to remove
the error message.
Test Plan

[PASS] install an AIO-DX in IPv6 and validate that there are no more
       CIDRNotAvailable events in the k8s nodes (using kubectl
       describe node controller-1)
[PASS] validate that there is no podCIDR address range allocated to
       k8s nodes (with kubectl get nodes \
       -o jsonpath='{range .items[*]}{@.metadata.name}{"\t"}\
       {@.spec.podCIDR}{"\n"}{end}')
[PASS] create user pods and validate that the address allocation is
       done to all pods in all nodes, following the ranges determined
       by calico (with kubectl get blockaffinities)
Closes-Bug: 2044115
Change-Id: I31e900757e9a94d21fb86e533b6ba46ebfe50e39
Signed-off-by: Andre Kantek <email address hidden>