kube-ovn stuck: Waiting to retry configuring Kube-OVN

Bug #2007162 reported by Bas de Bruijne
Affects                          Status        Importance  Assigned to   Milestone
Kube OVN Charm                   Triaged       Medium      Unassigned
Kubernetes Control Plane Charm   Fix Released  Medium      George Kraft  1.27
Kubernetes Worker Charm          Fix Released  Medium      George Kraft  1.27

Bug Description

In test run https://solutions.qa.canonical.com/v2/testruns/412c05fd-2a4d-4b9f-86f6-1497ddabbd1c, which is ck8s on bare-metal MAAS Focal with kube-ovn, the run fails because the kube-ovn unit is stuck waiting:

```
kubernetes-control-plane/1* waiting idle 4/kvm/0 10.244.8.137 6443/tcp Waiting for 3 kube-system pods to start
  canonical-livepatch/10 active idle 10.244.8.137 Running kernel 5.4.0-139.156-generic, patchState: nothing-to-apply (source version/commit f1e83ae)
  containerd/7 active idle 10.244.8.137 Container runtime available
  kube-ovn/7 active idle 10.244.8.137
  ntp/10 active idle 10.244.8.137 123/udp chrony: Ready
kubernetes-control-plane/2 waiting idle 5/kvm/0 10.244.8.139 6443/tcp Waiting for 3 kube-system pods to start
  canonical-livepatch/11 active idle 10.244.8.139 Running kernel 5.4.0-139.156-generic, patchState: nothing-to-apply (source version/commit f1e83ae)
  containerd/8 active idle 10.244.8.139 Container runtime available
  kube-ovn/8 waiting idle 10.244.8.139 Waiting to retry configuring Kube-OVN
  ntp/11 active idle 10.244.8.139 123/udp chrony: Ready
```

In the logs we see:
```
2023-02-13 10:02:08 ERROR unit.kube-ovn/8.juju-log server.go:316 Traceback (most recent call last):
  File "./src/charm.py", line 342, in configure_kube_ovn
    self.wait_for_kube_ovn_cni()
  File "./src/charm.py", line 655, in wait_for_kube_ovn_cni
    self.wait_for_rollout("daemonset/kube-ovn-cni")
  File "./src/charm.py", line 666, in wait_for_rollout
    self.kubectl(
  File "./src/charm.py", line 464, in kubectl
    return check_output(cmd)
  File "/usr/lib/python3.8/subprocess.py", line 415, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['kubectl', '--kubeconfig', '/root/.kube/config', 'rollout', 'status', '-n', 'kube-system', 'daemonset/kube-ovn-cni', '--timeout', '1s']' returned non-zero exit status 1.
```

Unfortunately, this doesn't tell us much about what is actually going wrong.

Crashdumps and configs are found here:
https://oil-jenkins.canonical.com/artifacts/412c05fd-2a4d-4b9f-86f6-1497ddabbd1c/index.html

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :
tags: added: cdo-qa foundations-engine
Revision history for this message
Jeffrey Chang (modern911) wrote :

application channels in SKU (solutionsqa/fkb/sku/master-kubernetes-focal-baremetal-ovn)
        kubernetes-control-plane: latest/edge
        kubernetes-worker: latest/edge
        kube-ovn: latest/edge

Revision history for this message
George Kraft (cynerva) wrote :

The kube-ovn-controller pod logs this error repeatedly:

E0213 10:02:01.196909 6 subnet.go:186] error syncing 'ovn-default': gateway 192.168.0.1 is not in cidr 192.168.252.0/22, requeuing

This prevents it from annotating nodes and other resources with the information that kube-ovn-cni needs to progress. The kube-ovn-cni containers then end up in CrashLoopBackOff, which leaves the kube-ovn charm stuck waiting forever.

I see in bundle.yaml that the kube-ovn charm is configured with `default-cidr: 192.168.252.0/22`. For this to work, it also needs to be configured with `default-gateway: 192.168.252.1`. I think if you add that to the bundle, it should work.
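
For reference, the kube-ovn section of the bundle would then look something like this (bindings omitted; the gateway value is the one suggested above, so substitute whatever gateway address the environment actually uses):

```
kube-ovn:
  channel: latest/edge
  charm: kube-ovn
  options:
    default-cidr: 192.168.252.0/22
    default-gateway: 192.168.252.1
```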

Revision history for this message
George Kraft (cynerva) wrote :

There is a second issue with this deployment, which is that it's running Kubernetes 1.23 on latest/edge charms that do not support Kubernetes 1.23. I see that the k8s snap channel config is unspecified in the bundle or overlays, so it's using the default value from the charms, which is out-of-date[1][2].

We can fix that default on our end easily enough. I'll open a couple of PRs.

[1]: https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/blob/45e994db6052740cc43de8f80c87cf16c335fa09/config.yaml#L108
[2]: https://github.com/charmed-kubernetes/charm-kubernetes-worker/blob/8a3751d57d8c3b715da5871913cde5e1e9c54433/config.yaml#L14
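
In the meantime, the deployment can avoid relying on the charm default by pinning the snap channel explicitly in the bundle, along these lines (1.26/stable is only an illustrative value; use whichever Kubernetes release the rest of the SKU expects):

```
kubernetes-control-plane:
  options:
    channel: 1.26/stable
kubernetes-worker:
  options:
    channel: 1.26/stable
```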

Changed in charm-kubernetes-master:
importance: Undecided → Medium
Changed in charm-kubernetes-worker:
importance: Undecided → Medium
Changed in charm-kube-ovn:
status: New → Invalid
Changed in charm-kubernetes-master:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-worker:
assignee: nobody → George Kraft (cynerva)
Changed in charm-kubernetes-master:
status: New → In Progress
Changed in charm-kubernetes-worker:
status: New → In Progress
Revision history for this message
George Kraft (cynerva) wrote :

PRs to fix the charm default k8s versions:
https://github.com/charmed-kubernetes/charm-kubernetes-control-plane/pull/271
https://github.com/charmed-kubernetes/charm-kubernetes-worker/pull/136
https://github.com/charmed-kubernetes/charm-kubernetes-e2e/pull/27
https://github.com/charmed-kubernetes/bundle/pull/873

These should fix the incorrect k8s version in the deployment, but you will still need to set the kube-ovn default-gateway config to get it working.

Revision history for this message
George Kraft (cynerva) wrote :

It occurs to me that the kube-ovn charm should be doing more to inform the user about the mismatched config. This is a condition that the charm can easily check for, and it should enter a Blocked status with a clear message.
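
As a rough sketch of what that check could look like (illustrative only, not the charm's actual code; the function name is made up, and the charm would surface the returned message via a Blocked status):

```python
import ipaddress
from typing import Optional


def cidr_gateway_mismatch(cidr_str: str, gateway_str: str) -> Optional[str]:
    """Return an error message if the gateway does not fall inside the CIDR."""
    cidr = ipaddress.ip_network(cidr_str, strict=False)
    gateway = ipaddress.ip_address(gateway_str)
    if gateway not in cidr:
        return f"default-gateway {gateway} is not in default-cidr {cidr}"
    return None


# With the config from this bug:
#   cidr_gateway_mismatch("192.168.252.0/22", "192.168.0.1")
# returns "default-gateway 192.168.0.1 is not in default-cidr 192.168.252.0/22",
# which the charm could set as a Blocked status message instead of retrying forever.
```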

I was a little quick to mark this as invalid for kube-ovn. Sorry about that. :)

Changed in charm-kube-ovn:
importance: Undecided → Medium
status: Invalid → Triaged
Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

I'm surprised that this is a configuration issue since we have a passing rate of about 50% for this config, e.g. https://solutions.qa.canonical.com/v2/testruns/2a3a4099-b601-4d16-9157-9ec766869a37 is successful.

Revision history for this message
George Kraft (cynerva) wrote :

Interesting. It looks like the kube-ovn charm config differs between failing and passing runs. In failing runs the bundle.yaml has:

kube-ovn:
  bindings:
    ? ''
    : oam-space
  channel: latest/edge
  charm: kube-ovn
  options:
    default-cidr: 192.168.252.0/22

In passing runs the bundle.yaml has:

kube-ovn:
  bindings:
    ? ''
    : oam-space
  channel: latest/edge
  charm: kube-ovn

This is true for all of the test runs that I checked:

https://solutions.qa.canonical.com/v2/testruns/412c05fd-2a4d-4b9f-86f6-1497ddabbd1c (failed)
https://solutions.qa.canonical.com/v2/testruns/55e4a7b8-fd06-436c-806e-35277c83969c (success)
https://solutions.qa.canonical.com/v2/testruns/2a3a4099-b601-4d16-9157-9ec766869a37 (success)
https://solutions.qa.canonical.com/v2/testruns/53c2386d-807b-4317-a744-a7c51a07d6fc (failed)

These runs are all under the same SKU, but somehow the generated bundle.yaml has different options for kube-ovn. I'm not really sure how to dig further into that side of the issue, but it does still make sense to me as a configuration issue, at least from the charm's perspective.

Revision history for this message
Bas de Bruijne (basdbruijne) wrote :

Thanks for digging into that; I found the problem. We have some hardware-specific config that we merge into the SKUs, and for some reason we were adding the default-cidr in lab0.

Revision history for this message
Kevin W Monroe (kwmonroe) wrote :

PRs from comment #5 are merged.

Changed in charm-kubernetes-master:
status: In Progress → Fix Committed
Changed in charm-kubernetes-worker:
status: In Progress → Fix Committed
Changed in charm-kubernetes-master:
milestone: none → 1.27
Changed in charm-kubernetes-worker:
milestone: none → 1.27
Changed in charm-kubernetes-master:
status: Fix Committed → Fix Released
Changed in charm-kubernetes-worker:
status: Fix Committed → Fix Released