2023-02-23 15:02:19 |
Nikolay Vinogradov |
bug |
|
|
added bug |
2023-02-23 15:03:42 |
Nikolay Vinogradov |
attachment added |
|
model-stuck-juju-status.txt https://bugs.launchpad.net/charm-calico/+bug/2008267/+attachment/5649578/+files/model-stuck-juju-status.txt |
|
2023-02-23 15:04:35 |
Nikolay Vinogradov |
description |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of a deployment that has been stuck for more than 24 hours (see the attached juju status output).
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines;
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of a deployment that has been stuck for more than 24 hours (see the attached juju status output).
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
|
2023-02-23 15:08:00 |
Nikolay Vinogradov |
description |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of a deployment that has been stuck for more than 24 hours (see the attached juju status output).
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
|
2023-02-23 15:09:46 |
Nikolay Vinogradov |
description |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample: the calico charm calls calicoctl to save calico data, such as the pool configuration, into the etcd cluster before the cluster has been initialized, which causes calicoctl to hang. Because the charm calls calicoctl while holding the juju machine lock, the whole machine freezes waiting for calicoctl to terminate, which never happens due to the calicoctl issue. And because calico runs on all the K8s nodes, the whole model appears to get stuck.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
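[Editor's note] The deadlock described above suggests an obvious mitigation direction: bound the calicoctl call with a timeout so that an uninitialized etcd cannot hold the machine-wide juju lock forever. The sketch below is hypothetical, not the actual charm fix; the helper name and calling convention are assumptions for illustration only.

```python
import subprocess

def run_with_timeout(cmd, timeout=60):
    """Run a command (e.g. ["calicoctl", "get", "pool"]) with a hard timeout.

    Hypothetical sketch: if etcd is not yet initialized, calicoctl can hang
    indefinitely (projectcalico/calico#5266). Bounding the call means the
    charm can release the juju machine lock and retry on a later hook run.

    Returns the CompletedProcess on success, or None on timeout so the
    caller can defer instead of blocking the whole machine.
    """
    try:
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,  # raises TimeoutExpired instead of hanging
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
```

A caller that gets None back would typically set a waiting status and return from the hook, rather than looping, so the juju machine lock is released promptly.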
|
2023-02-23 17:55:38 |
George Kraft |
charm-calico: importance |
Undecided |
High |
|
2023-02-23 17:55:43 |
George Kraft |
bug task added |
|
charm-etcd |
|
2023-02-23 17:55:47 |
George Kraft |
charm-etcd: importance |
Undecided |
High |
|
2023-02-23 17:55:50 |
George Kraft |
charm-calico: status |
New |
Triaged |
|
2023-02-23 17:55:51 |
George Kraft |
charm-etcd: status |
New |
Triaged |
|
2023-02-24 10:13:54 |
Adam Broadbent |
bug |
|
|
added subscriber Adam Broadbent |
2023-02-28 15:44:21 |
Nikolay Vinogradov |
bug |
|
|
added subscriber Canonical Field High |
2023-03-02 17:48:46 |
Adam Dyess |
charm-calico: status |
Triaged |
Fix Committed |
|
2023-03-07 02:27:46 |
Kevin W Monroe |
charm-etcd: status |
Triaged |
Fix Committed |
|
2023-03-07 02:27:56 |
Kevin W Monroe |
charm-calico: assignee |
|
Adam Dyess (addyess) |
|
2023-03-07 02:27:58 |
Kevin W Monroe |
charm-calico: milestone |
|
1.26+ck3 |
|
2023-03-07 02:28:05 |
Kevin W Monroe |
charm-etcd: assignee |
|
George Kraft (cynerva) |
|
2023-03-07 02:28:08 |
Kevin W Monroe |
charm-etcd: milestone |
|
1.26+ck3 |
|
2023-03-15 21:54:06 |
Adam Dyess |
tags |
|
backport-needed |
|
2023-03-16 14:05:48 |
Adam Dyess |
tags |
backport-needed |
|
|
2023-03-20 20:31:40 |
Adam Dyess |
charm-calico: status |
Fix Committed |
Fix Released |
|
2023-03-20 20:32:08 |
Adam Dyess |
charm-etcd: status |
Fix Committed |
Fix Released |
|