2023-02-23 15:02:19 |
Nikolay Vinogradov |
bug |
|
|
added bug |
2023-02-23 15:03:42 |
Nikolay Vinogradov |
attachment added |
|
model-stuck-juju-status.txt https://bugs.launchpad.net/charm-calico/+bug/2008267/+attachment/5649578/+files/model-stuck-juju-status.txt |
|
2023-02-23 15:04:35 |
Nikolay Vinogradov |
description |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of a deployment that has been stuck for more than 24 hours (see the attached juju status output).
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines;
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of a deployment that has been stuck for more than 24 hours (see the attached juju status output).
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
|
2023-02-23 15:08:00 |
Nikolay Vinogradov |
description |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck. This is a sample of a deployment that has been stuck for more than 24 hours (see the attached juju status output).
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
|
2023-02-23 15:09:46 |
Nikolay Vinogradov |
description |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
Hi team.
We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck: see the attached juju status output sample.
First, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in lxd containers, there would be no calico);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl doesn't have a timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide juju lock while calling calicoctl.
We suspect that if all of those factors come together, there is a chance for the deployment to become stuck, as in the aforementioned sample: the calico charm calls calicoctl to save calico data, such as the pool configuration, into the etcd cluster before the cluster has been initialized, which causes calicoctl to hang. Because the charm calls calicoctl while holding the juju machine lock, the whole machine freezes waiting for calicoctl to terminate, which never happens due to the calicoctl issue. And because calico runs on all the K8s nodes, the whole model appears to get stuck.
I'm not sure if this is a calico or etcd charm bug; I'm filing it on the charm-calico project initially. Please feel free to reassign it to the proper project. |
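[Editor's note] The deadlock described above suggests an obvious mitigation direction: bound the calicoctl call with a timeout so that an uninitialized etcd cannot hold the machine-wide juju lock forever. The sketch below is hypothetical, not the actual charm fix; the helper name and calling convention are assumptions for illustration only.

```python
import subprocess

def run_with_timeout(cmd, timeout=60):
    """Run a command (e.g. ["calicoctl", "get", "pool"]) with a hard timeout.

    Hypothetical sketch: if etcd is not yet initialized, calicoctl can hang
    indefinitely (projectcalico/calico#5266). Bounding the call means the
    charm can release the juju machine lock and retry on a later hook run.

    Returns the CompletedProcess on success, or None on timeout so the
    caller can defer instead of blocking the whole machine.
    """
    try:
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,  # raises TimeoutExpired instead of hanging
            check=True,
        )
    except subprocess.TimeoutExpired:
        return None
```

A caller that gets None back would typically set a waiting status and return from the hook, rather than looping, so the juju machine lock is released promptly.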
|
2023-02-23 17:55:38 |
George Kraft |
charm-calico: importance |
Undecided |
High |
|
2023-02-23 17:55:43 |
George Kraft |
bug task added |
|
charm-etcd |
|
2023-02-23 17:55:47 |
George Kraft |
charm-etcd: importance |
Undecided |
High |
|
2023-02-23 17:55:50 |
George Kraft |
charm-calico: status |
New |
Triaged |
|
2023-02-23 17:55:51 |
George Kraft |
charm-etcd: status |
New |
Triaged |
|
2023-02-24 10:13:54 |
Adam Broadbent |
bug |
|
|
added subscriber Adam Broadbent |
2023-02-28 15:44:21 |
Nikolay Vinogradov |
bug |
|
|
added subscriber Canonical Field High |
2023-03-02 17:48:46 |
Adam Dyess |
charm-calico: status |
Triaged |
Fix Committed |
|
2023-03-07 02:27:46 |
Kevin W Monroe |
charm-etcd: status |
Triaged |
Fix Committed |
|
2023-03-07 02:27:56 |
Kevin W Monroe |
charm-calico: assignee |
|
Adam Dyess (addyess) |
|
2023-03-07 02:27:58 |
Kevin W Monroe |
charm-calico: milestone |
|
1.26+ck3 |
|
2023-03-07 02:28:05 |
Kevin W Monroe |
charm-etcd: assignee |
|
George Kraft (cynerva) |
|
2023-03-07 02:28:08 |
Kevin W Monroe |
charm-etcd: milestone |
|
1.26+ck3 |
|
2023-03-15 21:54:06 |
Adam Dyess |
tags |
|
backport-needed |
|
2023-03-16 14:05:48 |
Adam Dyess |
tags |
backport-needed |
|
|
2023-03-20 20:31:40 |
Adam Dyess |
charm-calico: status |
Fix Committed |
Fix Released |
|
2023-03-20 20:32:08 |
Adam Dyess |
charm-etcd: status |
Fix Committed |
Fix Released |
|