Activity log for bug #2008267

Date Who What changed Old value New value Message
2023-02-23 15:02:19 Nikolay Vinogradov bug added bug
2023-02-23 15:03:42 Nikolay Vinogradov attachment added model-stuck-juju-status.txt https://bugs.launchpad.net/charm-calico/+bug/2008267/+attachment/5649578/+files/model-stuck-juju-status.txt
2023-02-23 15:04:35 Nikolay Vinogradov description Hi team. We're facing intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of deployment that stuck for more than 24hrs (see the attached juju status output). First of all, I'll give a few observations: - In this specific case etcd and calico charms are co-located on the same machines; - Sometimes calico charm tries to access etcd cluster before it was actually initialized (race condition?); - calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266 - Calico charm may take machine-wide juju lock while calling calicoctl We suppose that if all those factors come together there is a chance for the deployment to become stuck like in the aforementioned sample. I'm not sure if this is a calico or etcd charm bug, filing on calico-charm project initially. Please feel free to reassign it to the proper project. Hi team. We're facing intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of deployment that stuck for more than 24hrs (see the attached juju status output). First of all, I'll give a few observations: - In this specific case etcd and calico charms are co-located on the same machines (if etcd units were placed in a lxd containers there would be no calico); - Sometimes calico charm tries to access etcd cluster before it was actually initialized (race condition?); - calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266 - Calico charm may take machine-wide juju lock while calling calicoctl We suppose that if all those factors come together there is a chance for the deployment to become stuck like in the aforementioned sample. I'm not sure if this is a calico or etcd charm bug, filing on calico-charm project initially. Please feel free to reassign it to the proper project.
2023-02-23 15:08:00 Nikolay Vinogradov description Hi team. We're facing intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of deployment that stuck for more than 24hrs (see the attached juju status output). First of all, I'll give a few observations: - In this specific case etcd and calico charms are co-located on the same machines (if etcd units were placed in a lxd containers there would be no calico); - Sometimes calico charm tries to access etcd cluster before it was actually initialized (race condition?); - calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266 - Calico charm may take machine-wide juju lock while calling calicoctl We suppose that if all those factors come together there is a chance for the deployment to become stuck like in the aforementioned sample. I'm not sure if this is a calico or etcd charm bug, filing on calico-charm project initially. Please feel free to reassign it to the proper project. Hi team. We're facing intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample. First of all, I'll give a few observations: - In this specific case etcd and calico charms are co-located on the same machines (if etcd units were placed in a lxd containers there would be no calico); - Sometimes calico charm tries to access etcd cluster before it was actually initialized (race condition?); - calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266 - Calico charm may take machine-wide juju lock while calling calicoctl We suppose that if all those factors come together there is a chance for the deployment to become stuck like in the aforementioned sample. I'm not sure if this is a calico or etcd charm bug, filing on calico-charm project initially. Please feel free to reassign it to the proper project.
2023-02-23 15:09:46 Nikolay Vinogradov description Hi team. We're facing intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample. First of all, I'll give a few observations: - In this specific case etcd and calico charms are co-located on the same machines (if etcd units were placed in a lxd containers there would be no calico); - Sometimes calico charm tries to access etcd cluster before it was actually initialized (race condition?); - calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266 - Calico charm may take machine-wide juju lock while calling calicoctl We suppose that if all those factors come together there is a chance for the deployment to become stuck like in the aforementioned sample. I'm not sure if this is a calico or etcd charm bug, filing on calico-charm project initially. Please feel free to reassign it to the proper project. Hi team. We're facing intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample. First of all, I'll give a few observations: - In this specific case etcd and calico charms are co-located on the same machines (if etcd units were placed in a lxd containers there would be no calico); - Sometimes calico charm tries to access etcd cluster before it was actually initialized (race condition?); - calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266 - Calico charm may take machine-wide juju lock while calling calicoctl We suppose that if all those factors come together there is a chance for the deployment to become stuck like in the aforementioned sample: calico charm calls calicoctl to save some calico data, such as pool configuration into etcd cluster before the cluster has been initialized, which causes calicoctl to hang. 
Because the charm calls calicoctl while holding the juju machine lock, the whole machine freezes waiting for calicoctl to terminate, which never happens because of the calicoctl issue. Since calico runs on all the K8s nodes, the whole model appears to get stuck. I'm not sure if this is a calico or etcd charm bug, so I'm filing it on the calico-charm project initially. Please feel free to reassign it to the proper project.
2023-02-23 17:55:38 George Kraft charm-calico: importance Undecided High
2023-02-23 17:55:43 George Kraft bug task added charm-etcd
2023-02-23 17:55:47 George Kraft charm-etcd: importance Undecided High
2023-02-23 17:55:50 George Kraft charm-calico: status New Triaged
2023-02-23 17:55:51 George Kraft charm-etcd: status New Triaged
2023-02-24 10:13:54 Adam Broadbent bug added subscriber Adam Broadbent
2023-02-28 15:44:21 Nikolay Vinogradov bug added subscriber Canonical Field High
2023-03-02 17:48:46 Adam Dyess charm-calico: status Triaged Fix Committed
2023-03-07 02:27:46 Kevin W Monroe charm-etcd: status Triaged Fix Committed
2023-03-07 02:27:56 Kevin W Monroe charm-calico: assignee Adam Dyess (addyess)
2023-03-07 02:27:58 Kevin W Monroe charm-calico: milestone 1.26+ck3
2023-03-07 02:28:05 Kevin W Monroe charm-etcd: assignee George Kraft (cynerva)
2023-03-07 02:28:08 Kevin W Monroe charm-etcd: milestone 1.26+ck3
2023-03-15 21:54:06 Adam Dyess tags backport-needed
2023-03-16 14:05:48 Adam Dyess tags backport-needed
2023-03-20 20:31:40 Adam Dyess charm-calico: status Fix Committed Fix Released
2023-03-20 20:32:08 Adam Dyess charm-etcd: status Fix Committed Fix Released
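
The hang described in the bug description (calicoctl blocking indefinitely while the charm holds the machine lock) can be bounded by putting a timeout on the subprocess call. The following is a minimal Python sketch of that idea, not the fix that was actually committed to either charm. The helper name run_with_timeout is hypothetical, and the demo uses a sleeping Python child process as a stand-in for a hung calicoctl.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, output).

    ok is False if the command exits non-zero or does not finish within
    timeout_s seconds. On timeout, subprocess.run kills the child before
    raising TimeoutExpired, so the caller is never blocked indefinitely.
    """
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            timeout=timeout_s,
            check=False,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out after {}s".format(timeout_s)
    return proc.returncode == 0, proc.stdout.decode(errors="replace")


if __name__ == "__main__":
    # Stand-in for a command that never returns (assumption: this mimics
    # calicoctl waiting on an etcd cluster that is not yet initialized).
    hung = [sys.executable, "-c", "import time; time.sleep(5)"]
    ok, output = run_with_timeout(hung, 0.5)
    print(ok, output)
```

With a wrapper like this, a caller could treat a timeout as "etcd not ready yet", release whatever lock it holds, and retry on a later hook instead of blocking every other unit on the machine.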