CIDR allocation failed error messages in kube-controller-manager pod

Bug #2044115 reported by Andre Kantek
Affects: StarlingX
Status: Fix Released
Importance: Low
Assigned to: Andre Kantek

Bug Description

System Config
System Controller

Description of failure
This issue came up in an ME lab.

It was observed that the kube-controller-manager pod emits a large number of error messages like those shown below. Over a period of one hour there were over 5k such entries, half of them logged at the Error level.

Impact of Failure
Minor

Impact on Customer
A large number of these error messages eventually reach monitoring systems.

Was the issue reproduced internally
Yes

Timeline based on log analysis:

Below are the error messages seen in the logs:

sysadmin@controller-0:~$ kubectl -n kube-system logs --since=5m --tail 15 --timestamps kube-controller-manager-controller-0
2023-10-26T13:45:12.463375446Z I1026 13:45:12.463317 1 event.go:294] "Event occurred" object="controller-1" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-1 status is now: CIDRNotAvailable"
2023-10-26T13:45:13.275175308Z E1026 13:45:13.275087 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:13.275205782Z I1026 13:45:13.275122 1 event.go:294] "Event occurred" object="worker-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node worker-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:16.378798271Z E1026 13:45:16.378697 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:16.378857639Z I1026 13:45:16.378751 1 event.go:294] "Event occurred" object="controller-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:16.588729223Z E1026 13:45:16.588648 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:16.588759244Z I1026 13:45:16.588719 1 event.go:294] "Event occurred" object="controller-1" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-1 status is now: CIDRNotAvailable"
2023-10-26T13:45:17.395793922Z E1026 13:45:17.395695 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:17.395843933Z I1026 13:45:17.395780 1 event.go:294] "Event occurred" object="worker-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node worker-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:20.557402365Z E1026 13:45:20.557334 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:20.557427086Z I1026 13:45:20.557370 1 event.go:294] "Event occurred" object="controller-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-0 status is now: CIDRNotAvailable"
2023-10-26T13:45:20.711208011Z E1026 13:45:20.711143 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:20.711239072Z I1026 13:45:20.711183 1 event.go:294] "Event occurred" object="controller-1" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node controller-1 status is now: CIDRNotAvailable"
2023-10-26T13:45:21.521863050Z E1026 13:45:21.521758 1 controller_utils.go:256] Error while processing Node Add/Delete: failed to allocate cidr from cluster cidr at idx:0: CIDR allocation failed; there are no remaining CIDRs left to allocate in the accepted range
2023-10-26T13:45:21.521891349Z I1026 13:45:21.521811 1 event.go:294] "Event occurred" object="worker-0" fieldPath="" kind="Node" apiVersion="v1" type="Normal" reason="CIDRNotAvailable" message="Node worker-0 status is now: CIDRNotAvailable"
sysadmin@controller-0:~$
In one hour there are over 5k entries like this, half of which are marked with the Error log level:

sysadmin@controller-0:~$ kubectl -n kube-system logs --since=60m kube-controller-manager-controller-0 | grep -c -e "CIDRNotAvailable" -e "CIDR allocation failed"
5272
sysadmin@controller-0:~$ kubectl -n kube-system logs --since=60m kube-controller-manager-controller-0 | grep -Ec "^E[0-9]{4}.*CIDR"
2636
CIDR-specific config in the manifest (the same as used throughout production and in planning):

sysadmin@controller-0:~$ sudo sh -c 'grep cidr /etc/kubernetes/manifests/*.yaml'
/etc/kubernetes/manifests/kube-controller-manager.yaml: - --allocate-node-cidrs=true
/etc/kubernetes/manifests/kube-controller-manager.yaml: - --cluster-cidr=2001:db8:4::/64
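
The cluster CIDR above is a /64, which matters because kube-controller-manager carves a per-node pod CIDR out of it using node-cidr-mask-size, which defaults to 64 for IPv6 (24 for IPv4) when the flag is not set, as in this manifest. A minimal arithmetic sketch, assuming those defaults:

sysadmin@controller-0:~$ # per-node CIDRs available = 2^(node mask - cluster prefix)
sysadmin@controller-0:~$ echo $(( 2 ** (64 - 64) ))   # IPv6: /64 cluster CIDR, /64 node mask
1
sysadmin@controller-0:~$ echo $(( 2 ** (24 - 24) ))   # IPv4: /24 cluster CIDR, /24 node mask
1

With exactly one allocatable range, the first node served (controller-0 here) takes it, and every other node repeatedly fails with CIDRNotAvailable.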

Output of kubectl describe for kube-controller-manager-controller-0:

Name: kube-controller-manager-controller-0
Namespace: kube-system
Priority: 2000001000
Priority Class Name: system-node-critical
Node: controller-0/2607:f160:10:923e:ce:290:0:11
Start Time: Thu, 19 Oct 2023 23:02:27 +0000
Labels: component=kube-controller-manager
                      tier=control-plane
Annotations: kubernetes.io/config.hash: 530328cce679946a96e2dd5e1e49c0b2
                      kubernetes.io/config.mirror: 530328cce679946a96e2dd5e1e49c0b2
                      kubernetes.io/config.seen: 2023-10-19T22:46:57.788196843Z
                      kubernetes.io/config.source: file
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Running
IP: 2607:f160:10:923e:ce:290:0:11
IPs:
  IP: 2607:f160:10:923e:ce:290:0:11
Controlled By: Node/controller-0
Containers:
  kube-controller-manager:
    Container ID: containerd://54fdf1b7ef682ba7d8e954d61e4df5cb799b835f525f070e50328a752897433d
    Image: registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4
    Image ID: registry.local:9001/k8s.gcr.io/kube-controller-manager@sha256:da588c9f0e65e93317f5e016603d1ed7466427e9e0cf8b028c505bf30837f7dd
    Port: <none>
    Host Port: <none>
    Command:
      kube-controller-manager
      --allocate-node-cidrs=true
      --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
      --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
      --bind-address=127.0.0.1
      --client-ca-file=/etc/kubernetes/pki/ca.crt
      --cluster-cidr=2001:db8:4::/64
      --cluster-name=kubernetes
      --cluster-signing-cert-file=/etc/kubernetes/pki/ca.crt
      --cluster-signing-key-file=/etc/kubernetes/pki/ca.key
      --controllers=*,bootstrapsigner,tokencleaner
      --flex-volume-plugin-dir=/var/opt/libexec/kubernetes/kubelet-plugins/volume/exec/
      --kubeconfig=/etc/kubernetes/controller-manager.conf
      --leader-elect=true
      --node-monitor-grace-period=20s
      --node-monitor-period=2s
      --pod-eviction-timeout=30s
      --requestheader-client-ca-file=/etc/kubernetes/pki/front-proxy-ca.crt
      --root-ca-file=/etc/kubernetes/pki/ca.crt
      --service-account-private-key-file=/etc/kubernetes/pki/sa.key
      --service-cluster-ip-range=2001:db8:4::/112
      --use-service-account-credentials=true
    State: Running
      Started: Wed, 25 Oct 2023 19:14:25 +0000
    Last State: Terminated
      Reason: Error
      Exit Code: 2
      Started: Wed, 25 Oct 2023 18:50:46 +0000
      Finished: Wed, 25 Oct 2023 19:06:42 +0000
    Ready: True
    Restart Count: 7
    Requests:
      cpu: 0
    Liveness: http-get https://127.0.0.1:10257/healthz delay=10s timeout=15s period=10s #success=1 #failure=8
    Startup: http-get https://127.0.0.1:10257/healthz delay=10s timeout=15s period=10s #success=1 #failure=24
    Environment: <none>
    Mounts:
      /etc/ca-certificates from etc-ca-certificates (ro)
      /etc/kubernetes/controller-manager.conf from kubeconfig (ro)
      /etc/kubernetes/pki from k8s-certs (ro)
      /etc/pki from etc-pki (ro)
      /etc/ssl/certs from ca-certs (ro)
      /usr/local/share/ca-certificates from usr-local-share-ca-certificates (ro)
      /usr/share/ca-certificates from usr-share-ca-certificates (ro)
      /var/opt/libexec/kubernetes/kubelet-plugins/volume/exec/ from flexvolume-dir (rw)
Conditions:
  Type Status
  Initialized True
  Ready True
  ContainersReady True
  PodScheduled True
Volumes:
  ca-certs:
    Type: HostPath (bare host directory volume)
    Path: /etc/ssl/certs
    HostPathType: DirectoryOrCreate
  etc-ca-certificates:
    Type: HostPath (bare host directory volume)
    Path: /etc/ca-certificates
    HostPathType: DirectoryOrCreate
  etc-pki:
    Type: HostPath (bare host directory volume)
    Path: /etc/pki
    HostPathType: DirectoryOrCreate
  flexvolume-dir:
    Type: HostPath (bare host directory volume)
    Path: /var/opt/libexec/kubernetes/kubelet-plugins/volume/exec/
    HostPathType: DirectoryOrCreate
  k8s-certs:
    Type: HostPath (bare host directory volume)
    Path: /etc/kubernetes/pki
    HostPathType: DirectoryOrCreate
  kubeconfig:
    Type: HostPath (bare host directory volume)
    Path: /etc/kubernetes/controller-manager.conf
    HostPathType: FileOrCreate
  usr-local-share-ca-certificates:
    Type: HostPath (bare host directory volume)
    Path: /usr/local/share/ca-certificates
    HostPathType: DirectoryOrCreate
  usr-share-ca-certificates:
    Type: HostPath (bare host directory volume)
    Path: /usr/share/ca-certificates
    HostPathType: DirectoryOrCreate
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: :NoExecute op=Exists
Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Pulled 19h (x2 over 6d12h) kubelet Container image "registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4" already present on machine
  Normal Created 19h (x2 over 6d12h) kubelet Created container kube-controller-manager
  Normal Started 19h (x2 over 6d12h) kubelet Started container kube-controller-manager
  Normal SandboxChanged 18h kubelet Pod sandbox changed, it will be killed and re-created.
  Normal Pulled 18h kubelet Container image "registry.local:9001/k8s.gcr.io/kube-controller-manager:v1.24.4" already present on machine
  Normal Created 18h kubelet Created container kube-controller-manager
  Normal Started 18h kubelet Started container kube-controller-manager

Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote: Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/c/starlingx/ansible-playbooks/+/900874
Committed: https://opendev.org/starlingx/ansible-playbooks/commit/1acbbb1e89f7506c11be788d17545a0251eb8556
Submitter: "Zuul (22348)"
Branch: master

commit 1acbbb1e89f7506c11be788d17545a0251eb8556
Author: akantek <email address hidden>
Date: Fri Nov 10 15:17:50 2023 -0300

    Turn off allocate-node-cidrs in kube-controller-manager

    The error message "CIDR allocation failed; there are no remaining
    CIDRs left to allocate in the accepted range" was being generated
    with high occurrence in the log.

    By default allocate-node-cidrs is turned on, so kube-controller-manager
    allocates an address range to each node. But if cluster-cidr is set
    with a /64 mask (or /24 for an IPv4 install) while node-cidr-mask-size
    keeps its default of 64 (or 24 for IPv4), there is no range left for
    the other nodes: controller-0 receives all available pod addresses.

    StarlingX, however, uses Calico as its CNI solution, and Calico does
    not use the addresses allocated by kube-controller-manager to
    distribute addresses among the pods. That is why this configuration
    went unnoticed.

    This change turns the option off during install to remove the error
    message.

    Test Plan
    [PASS] install an AIO-DX in IPv6 and validate that there are no more
           CIDRNotAvailable events on the k8s nodes (using kubectl
           describe node controller-1)
    [PASS] validate that there are no podCIDR address ranges allocated to
           k8s nodes (with kubectl get nodes \
             -o jsonpath='{range .items[*]}{@.metadata.name}{"\t"}\
             {@.spec.podCIDR}{"\n"}{end}')
    [PASS] create user pods and validate that address allocation is done
           for all pods on all nodes, following the ranges determined
           by Calico (with kubectl get blockaffinities)

    Closes-Bug: 2044115

    Change-Id: I31e900757e9a94d21fb86e533b6ba46ebfe50e39
    Signed-off-by: Andre Kantek <email address hidden>
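
As a quick sketch of the test-plan checks above on a live system (assuming standard kubectl; blockaffinities is a Calico CRD and exists only where Calico is installed, as it is on StarlingX):

sysadmin@controller-0:~$ kubectl get events -A --field-selector reason=CIDRNotAvailable
sysadmin@controller-0:~$ kubectl get nodes -o custom-columns=NAME:.metadata.name,PODCIDR:.spec.podCIDR
sysadmin@controller-0:~$ kubectl get blockaffinities

With the fix in place, no new CIDRNotAvailable events should appear, and the PODCIDR column is expected to show <none> for every node, since Calico rather than kube-controller-manager owns pod address assignment.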

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Low
tags: added: stx.9.0 stx.networking
Changed in starlingx:
assignee: nobody → Andre Kantek (akantek)
OpenStack Infra (hudson-openstack) wrote: Fix merged to stx-puppet (master)

Reviewed: https://review.opendev.org/c/starlingx/stx-puppet/+/900875
Committed: https://opendev.org/starlingx/stx-puppet/commit/1c39bc6c894edbbd9d9ef0f679e8e0b213789c6c
Submitter: "Zuul (22348)"
Branch: master

commit 1c39bc6c894edbbd9d9ef0f679e8e0b213789c6c
Author: Andre Kantek <email address hidden>
Date: Mon Nov 13 16:03:06 2023 -0300

    Turn off allocate-node-cidrs in kube-controller-manager

    The error message "CIDR allocation failed; there are no remaining
    CIDRs left to allocate in the accepted range" was being generated
    with high occurrence in the log.

    By default allocate-node-cidrs is turned on, so kube-controller-manager
    allocates an address range to each node. But if cluster-cidr is set
    with a /64 mask (or /24 for an IPv4 install) while node-cidr-mask-size
    keeps its default of 64 (or 24 for IPv4), there is no range left for
    the other nodes: controller-0 receives all available pod addresses.

    StarlingX, however, uses Calico as its CNI solution, and Calico does
    not use the addresses allocated by kube-controller-manager to
    distribute addresses among the pods. That is why this configuration
    went unnoticed.

    This change turns the option off on existing installations, removing
    the error message when the fix is delivered as a software patch.

    Test Plan
    [PASS] install an AIO-DX in IPv6, apply this change as a patch, and
           validate that the occurrence of CIDRNotAvailable events on
           the k8s nodes stops growing (kubectl describe node
           controller-1)
    [PASS] create user pods and validate that address allocation is done
           for all pods on all nodes, following the ranges determined
           by Calico (with kubectl get blockaffinities)

    Closes-Bug: 2044115

    Change-Id: I62b0cb482873703b4b266708fdba279aafd7c5c1
    Signed-off-by: Andre Kantek <email address hidden>
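
For a system that received this change as a software patch, a minimal sketch of the check described above (run the commands twice, some time apart; the counts should stop growing once the patch is applied):

sysadmin@controller-0:~$ kubectl describe node controller-1 | grep -c CIDRNotAvailable
sysadmin@controller-0:~$ kubectl -n kube-system logs --since=60m kube-controller-manager-controller-0 | grep -c "CIDR allocation failed"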
