One of calico instance stuck in "Configuring Calico"

Bug #2064145 reported by Jeffrey Chang
This bug affects 2 people
Affects: Calico Charm
Status: In Progress
Importance: High
Assigned to: Unassigned

Bug Description

In this SQA run - https://solutions.qa.canonical.com/testruns/18cd9a82-ba5a-4938-98a5-2be7e8d6e5a5
When deploying Charmed Kubernetes on bare metal with the Calico charm rev 105 on focal,
one of the calico instances is stuck in "Configuring Calico",
and kubernetes-control-plane is pending on the calico service.

Error logs
2024-04-26 16:00:11 INFO unit.calico/1.juju-log server.go:325 Configured Calico IP pool.
2024-04-26 16:00:11 ERROR unit.calico/1.juju-log server.go:325 b'resource does not exist: Node(duosion) with error: <nil>\n'
2024-04-26 16:00:11 ERROR unit.calico/1.juju-log server.go:325 b'null\n'
2024-04-26 16:00:11 ERROR unit.calico/1.juju-log server.go:325 Failed to configure node.
Traceback (most recent call last):
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'duosion']' returned non-zero exit status 1.
2024-04-26 16:00:11 ERROR unit.calico/1.juju-log server.go:325 Failed to configure Calico, will retry.
Traceback (most recent call last):
  File "./src/charm.py", line 174, in _install_or_upgrade
    self._configure_calico()
  File "./src/charm.py", line 125, in _configure_calico
    self._configure_node()
  File "./src/charm.py", line 305, in _configure_node
    raise e
  File "./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'duosion']' returned non-zero exit status 1.
2024-04-26 16:00:11 DEBUG unit.calico/1.juju-log server.go:325 Deferring <InstallEvent via CalicoCharm/on/install[1]>.
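The failing call chain in the traceback can be summarized with a simplified sketch (hypothetical helper; the real logic lives in `_calicoctl_get`/`calicoctl` in src/charm.py). calicoctl exits non-zero and prints `resource does not exist: Node(...)` on stderr when the Node object has not been created yet; one way such a wrapper could tolerate a missing resource instead of raising is:

```python
import subprocess

CALICOCTL = "/opt/calicoctl/calicoctl"

def calicoctl_get(resource, name, runner=subprocess.check_output):
    """Simplified sketch of the call chain shown in the traceback above.

    Hypothetical helper, not the charm's actual code. calicoctl exits
    non-zero and writes 'resource does not exist: ...' to stderr when
    the requested object (here, the Node) has not been created yet.
    """
    cmd = [CALICOCTL, "get", "-o", "yaml", "--export", resource, name]
    try:
        return runner(cmd, stderr=subprocess.PIPE)
    except subprocess.CalledProcessError as err:
        if b"resource does not exist" in (err.stderr or b""):
            # Treat a not-yet-created Node as absent rather than fatal,
            # instead of failing the hook and deferring the event.
            return None
        raise
```

With the behavior from the logs (non-zero exit, `b'null\n'` on stdout, the "resource does not exist" message on stderr), this returns None rather than propagating `CalledProcessError` up through `_configure_node`.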

Revision history for this message
Michael Fischer (michaelandrewfischer) wrote (last edit ):

Having this issue in revision 105 on jammy. All of the calico nodes are stuck in this infinite loop, failing to configure. If I ssh into any of the calico nodes and run the command /opt/calicoctl/calicoctl get -o yaml --export node novel-bird manually, it neither returns null nor exits with a non-zero status of 1.

command:
/opt/calicoctl/calicoctl get -o yaml --export node novel-bird

output:
apiVersion: projectcalico.org/v3
kind: Node
metadata:
  annotations:
    projectcalico.org/kube-labels: '{"beta.kubernetes.io/arch":"amd64","beta.kubernetes.io/os":"linux","juju-application":"kubernetes-control-plane","juju-charm":"kubernetes-control-plane","kubernetes.io/arch":"amd64","kubernetes.io/hostname":"novel-bird","kubernetes.io/os":"linux","node-role.kubernetes.io/control-plane":""}'
  creationTimestamp: null
  labels:
    beta.kubernetes.io/arch: amd64
    beta.kubernetes.io/os: linux
    juju-application: kubernetes-control-plane
    juju-charm: kubernetes-control-plane
    kubernetes.io/arch: amd64
    kubernetes.io/hostname: novel-bird
    kubernetes.io/os: linux
    node-role.kubernetes.io/control-plane: ""
  name: novel-bird
spec:
  addresses:
  - address: 192.168.2.38
    type: InternalIP
  orchRefs:
  - nodeName: novel-bird
    orchestrator: k8s
status: {}

error logs:

unit-calico-3: 09:10:37 ERROR unit.calico/3.juju-log b'resource does not exist: Node(novel-bird) with error: <nil>\n'
unit-calico-3: 09:10:37 ERROR unit.calico/3.juju-log b'null\n'
unit-calico-3: 09:10:37 ERROR unit.calico/3.juju-log Failed to configure node.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 640, in _calicoctl_get
    output = self.calicoctl(*args)
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 632, in calicoctl
    return subprocess.check_output(cmd, env=env, stderr=subprocess.PIPE, timeout=timeout)
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['/opt/calicoctl/calicoctl', 'get', '-o', 'yaml', '--export', 'node', 'novel-bird']' returned non-zero exit status 1.
unit-calico-3: 09:10:37 ERROR unit.calico/3.juju-log Failed to configure Calico, will retry.
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 174, in _install_or_upgrade
    self._configure_calico()
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 125, in _configure_calico
    self._configure_node()
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 305, in _configure_node
    raise e
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 298, in _configure_node
    node = self._calicoctl_get("node", node_name)
  File "/var/lib/juju/agents/unit-calico-3/charm/./src/charm.py", line 640, in _calicoctl_get
    output = s...


Adam Dyess (addyess)
Changed in charm-calico:
milestone: none → 1.30+ck1
status: New → Triaged
Adam Dyess (addyess)
summary: - One of calico instace stuck in "Configuring Calico"
+ One of calico instance stuck in "Configuring Calico"
Adam Dyess (addyess)
Changed in charm-calico:
milestone: 1.30+ck1 → 1.30
Revision history for this message
Adam Dyess (addyess) wrote :

Instrumenting the charm with more debug logging:

https://github.com/charmed-kubernetes/charm-calico/pull/107

Changed in charm-calico:
milestone: 1.30 → 1.30+ck1
Revision history for this message
Adam Dyess (addyess) wrote :

We continue to see this in fresh deployments, and it may be associated with LP#2064305.

Changed in charm-calico:
status: Triaged → Fix Committed
importance: Undecided → Medium
importance: Medium → High
Revision history for this message
Adam Dyess (addyess) wrote :

If this is encountered, there seems to be some collision from having two certificate authority charms in the cluster.

For instance:

etcd may be related to easyrsa to provide certificates
calico and kubernetes-control-plane are related to vault for certificates

There's really no NEED for two certificate authorities. One should drive toward using a SINGLE certificate authority source.
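A quick way to spot that situation is to scan the model for certificate-authority charms. The helper below is a hypothetical sketch operating on parsed `juju status --format=json` output; the `CA_CHARMS` set (easyrsa and vault, per the deployment described above) and the `charm-name`/`charm` JSON keys are assumptions:

```python
# Assumption for this sketch: the CA charms in play are easyrsa and vault,
# as in the deployment described in the comment above.
CA_CHARMS = {"easyrsa", "vault"}

def certificate_authorities(status: dict) -> set:
    """Return application names whose charm is a known certificate authority.

    `status` is the parsed output of `juju status --format=json`; the
    "charm-name"/"charm" keys used here are assumptions about that format.
    """
    cas = set()
    for app_name, app in status.get("applications", {}).items():
        charm = app.get("charm-name") or app.get("charm", "")
        if any(ca in charm for ca in CA_CHARMS):
            cas.add(app_name)
    return cas

def single_ca(status: dict) -> bool:
    """True when at most one certificate authority application is deployed."""
    return len(certificate_authorities(status)) <= 1
```

If `single_ca(...)` returns False, the consolidation suggested above applies: re-relate everything to one CA source (e.g. move etcd's certificates relation from easyrsa to vault).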

Revision history for this message
Adam Dyess (addyess) wrote (last edit ):

I see in the crashdumps from solqa this awesome folder called `pod-logs`, which is super spiffy.

Using the provided run [0] from 4/26/2024, I noticed that one of the calico node pods didn't start after 4 hours. It seems to be blocking the charm from going active.

I wish there was a way to see the events on these pods too. I suppose we could add these logs to the charm debug log -- but it might be cool if this was part of the crashdump:

```
kubectl events -A -oyaml
```

[0] https://solutions.qa.canonical.com/testruns/18cd9a82-ba5a-4938-98a5-2be7e8d6e5a5

Revision history for this message
Adam Dyess (addyess) wrote :

The following link helps to track down examples of this bug in various solqa runs:
https://solutions.qa.canonical.com/bugs/2064145

Revision history for this message
Adam Dyess (addyess) wrote :

Adding ANOTHER patch to try to catch these failures.

https://github.com/charmed-kubernetes/charm-calico/pull/108

Revision history for this message
Michael Fischer (michaelandrewfischer) wrote :

Trying again with the 1.30 release on MAAS bare metal; it results in an endless configuration loop for all calico instances. See the debug output below:

unit-calico-3: 00:47:39 INFO unit.calico/3.juju-log cni:17: Configured Calico IP pool.
unit-calico-3: 00:47:39 ERROR unit.calico/3.juju-log cni:17: env={'JUJU_UNIT_NAME': 'calico/3', 'JUJU_VERSION': '3.5.1', 'JUJU_CHARM_HTTP_PROXY': '', 'APT_LISTCHANGES_FRONTEND': 'none', 'JUJU_CONTEXT_ID': 'calico/3-cni-relation-changed-1657795354183363859', 'JUJU_AGENT_SOCKET_NETWORK': 'unix', 'JUJU_API_ADDRESSES': '192.168.2.56:17070', 'JUJU_CHARM_HTTPS_PROXY': '', 'JUJU_AGENT_SOCKET_ADDRESS': '@/var/lib/juju/agents/unit-calico-3/agent.socket', 'JUJU_MODEL_NAME': 'k8s', 'JUJU_DISPATCH_PATH': 'hooks/cni-relation-changed', 'JUJU_AVAILABILITY_ZONE': 'default', 'JUJU_REMOTE_UNIT': 'kubernetes-control-plane/1', 'JUJU_CHARM_DIR': '/var/lib/juju/agents/unit-calico-3/charm', 'TERM': 'tmux-256color', 'JUJU_RELATION': 'cni', 'PATH': '/var/lib/juju/tools/unit-calico-3:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin', 'JUJU_RELATION_ID': 'cni:17', 'JUJU_METER_STATUS': 'AMBER', 'JUJU_HOOK_NAME': 'cni-relation-changed', 'LANG': 'C.UTF-8', 'CLOUD_API_VERSION': '', 'DEBIAN_FRONTEND': 'noninteractive', 'JUJU_SLA': 'unsupported', 'JUJU_MODEL_UUID': '746a6374-c778-49f8-88fe-f85e86ebceab', 'JUJU_MACHINE_ID': '13', 'JUJU_CHARM_FTP_PROXY': '', 'JUJU_METER_INFO': 'not set', 'PWD': '/var/lib/juju/agents/unit-calico-3/charm', 'JUJU_PRINCIPAL_UNIT': 'kubernetes-control-plane/1', 'JUJU_CHARM_NO_PROXY': '127.0.0.1,localhost,::1', 'PYTHONPATH': 'lib:venv', 'CHARM_DIR': '/var/lib/juju/agents/unit-calico-3/charm', 'JUJU_REMOTE_APP': 'kubernetes-control-plane', 'OPERATOR_DISPATCH': '1', 'ETCD_ENDPOINTS': 'https://192.168.2.2:2379,https://192.168.2.40:2379,https://192.168.2.71:2379', 'ETCD_KEY_FILE': '/opt/calicoctl/etcd-key', 'ETCD_CERT_FILE': '/opt/calicoctl/etcd-cert', 'ETCD_CA_CERT_FILE': '/opt/calicoctl/etcd-ca'}
unit-calico-3: 00:47:39 ERROR unit.calico/3.juju-log cni:17: out=null

unit-calico-3: 00:47:39 ERROR unit.calico/3.juju-log cni:17: err=time="2024-07-17T05:47:39Z" level=info msg="Log level set to debug"
time="2024-07-17T05:47:39Z" level=info msg="Executing config command"
time="2024-07-17T05:47:39Z" level=info msg="Config file: /etc/calico/calicoctl.cfg cannot be read - reading config from environment"
time="2024-07-17T05:47:39Z" level=debug msg="Datastore type isn't set, trying to detect it"
time="2024-07-17T05:47:39Z" level=debug msg="EtcdEndpoints specified, detected etcdv3."
time="2024-07-17T05:47:39Z" level=info msg="Loaded client config: apiconfig.CalicoAPIConfigSpec{DatastoreType:\"etcdv3\", EtcdConfig:apiconfig.EtcdConfig{EtcdEndpoints:\"https://192.168.2.2:2379,https://192.168.2.40:2379,https://192.168.2.71:2379\", EtcdDiscoverySrv:\"\", EtcdUsername:\"\", EtcdPassword:\"\", EtcdKeyFile:\"/opt/calicoctl/etcd-key\", EtcdCertFile:\"/opt/calicoctl/etcd-cert\", EtcdCACertFile:\"/opt/calicoctl/etcd-ca\", EtcdKey:\"\", EtcdCert:\"\", EtcdCACert:\"\", EtcdFIPSModeEnabled:false}, KubeConfig:apiconfig.KubeConfig{Kubeconfig:\"\", K8sAPIEndpoint:\"\", K8sKeyFile:\"\", K8sCertFile:\"\", K8sCAFile:\"\", K8sAPIToken:\"...

Adam Dyess (addyess)
Changed in charm-calico:
milestone: 1.30+ck1 → 1.30+ck2
status: Fix Committed → In Progress
Revision history for this message
Michael Fischer (michaelandrewfischer) wrote (last edit ):

This bug, in my case, is caused by kube-proxy failing to start on the kubernetes-control-plane and kubernetes-worker nodes. The ipset package is not installed, which causes the kube-proxy process to crash when the kubernetes-control-plane and kubernetes-worker apps are configured with {mode: ipvs, ipvs: {strictARP: true}}.

see https://bugs.launchpad.net/charm-calico/+bug/2045651
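Given that diagnosis, a minimal preflight check before switching kube-proxy to ipvs mode could simply verify the ipset binary is on PATH (a hypothetical helper, not charm code; the ipset dependency is taken from the comment above):

```python
import shutil

def has_binary(name: str = "ipset") -> bool:
    """Return True when `name` is found on PATH.

    kube-proxy in ipvs mode relies on the ipset userspace tool; per the
    diagnosis above, a missing ipset binary crashes kube-proxy and leaves
    calico stuck in its configuration loop.
    """
    return shutil.which(name) is not None
```

If `has_binary()` is False, install the package (e.g. `sudo apt-get install ipset`) before configuring {mode: ipvs, ipvs: {strictARP: true}}.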

*****************

kubectl logs calico-node-2glrn -n kube-system -c install-cni:

2024-07-19 14:59:08.896 [INFO][1] cni-installer/<nil> <nil>: Running as a Kubernetes pod
2024-07-19 14:59:08.896 [INFO][1] cni-installer/<nil> <nil>: Installing any TLS assets
2024-07-19 14:59:08.973 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bandwidth"
2024-07-19 14:59:08.973 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/bandwidth
2024-07-19 14:59:09.158 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico"
2024-07-19 14:59:09.158 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico
2024-07-19 14:59:09.362 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/calico-ipam"
2024-07-19 14:59:09.362 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/calico-ipam
2024-07-19 14:59:09.367 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/flannel"
2024-07-19 14:59:09.367 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/flannel
2024-07-19 14:59:09.372 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/host-local"
2024-07-19 14:59:09.372 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/host-local
2024-07-19 14:59:09.378 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/loopback"
2024-07-19 14:59:09.378 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/loopback
2024-07-19 14:59:09.385 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/portmap"
2024-07-19 14:59:09.385 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/portmap
2024-07-19 14:59:09.391 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/tuning"
2024-07-19 14:59:09.391 [INFO][1] cni-installer/<nil> <nil>: Installed /host/opt/cni/bin/tuning
2024-07-19 14:59:09.391 [INFO][1] cni-installer/<nil> <nil>: Wrote Calico CNI binaries to /host/opt/cni/bin

2024-07-19 14:59:09.494 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.27.3
2024-07-19 14:59:09.494 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
2024-07-19 14:59:09.494 [WARNING][1] cni-installer/<nil> <nil>: Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
2024-07-19 14:59:39.522 [ERROR][1] cni-installer/<nil> <nil>: Unable to create token for CNI kubeconfig error=Post "https://10.152.183.1:443/api/v1/namespaces/kube-system/serviceaccounts/calico-cni-plugin/token": dial tcp 10.152.183.1:443: i/o timeout
2024-07-1...

