CK model deployment sometimes gets stuck when etcd and calico are colocated on the same machine

Bug #2008267 reported by Nikolay Vinogradov
Affects        Status         Importance   Assigned to     Milestone
Calico Charm   Fix Released   High         Adam Dyess      1.26+ck3
Etcd Charm     Fix Released   High         George Kraft    1.26+ck3

Bug Description

Hi team.

We're facing an intermittent issue with certain Charmed Kubernetes deployments getting stuck; see the attached juju status output sample.

First of all, a few observations:
- In this specific case the etcd and calico charms are co-located on the same machines (if the etcd units were placed in LXD containers there would be no calico alongside them);
- Sometimes the calico charm tries to access the etcd cluster before it has actually been initialized (race condition?);
- calicoctl has no timeout in case something goes wrong: https://github.com/projectcalico/calico/issues/5266
- The calico charm may take the machine-wide Juju lock while calling calicoctl.

We suppose that when all of these factors come together, the deployment can get stuck as in the aforementioned sample: the calico charm calls calicoctl to save calico data, such as the pool configuration, into the etcd cluster before the cluster has been initialized, which causes calicoctl to hang. Since the charm holds the Juju machine lock while calling calicoctl, the whole machine freezes waiting for calicoctl to terminate, which never happens because of the calicoctl issue. And because calico runs on all the K8s nodes, the whole model appears to get stuck.
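
For illustration, the problematic pattern reduces to a calicoctl invocation with no timeout, made from a hook that already holds the machine lock. A hypothetical sketch only; the wrapper name, arguments and file path below are illustrative, not the charm's actual code:

    import subprocess

    def calicoctl(*args):
        # No timeout: if etcd is not yet initialized, calicoctl blocks forever.
        # The reactive hook calling this already holds the Juju machine lock,
        # so every other hook on the machine queues up behind it.
        return subprocess.check_output(['calicoctl', *args])

    # e.g. applying the IP pool configuration before etcd is ready never
    # returns, so the whole machine's hook queue freezes behind this call.
    calicoctl('apply', '-f', '/tmp/pool.yaml')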

I'm not sure whether this is a calico or etcd charm bug; I'm filing it against the calico charm project initially. Please feel free to reassign it to the proper project.

George Kraft (cynerva) wrote :

Thanks for the report. I would say this affects both etcd and calico.

For etcd: Units are sending cluster connection details before etcd is ready[1]. It should delay sending cluster connection details until after etcd has successfully registered (i.e. wait for the "etcd.registered" flag).
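
A minimal sketch of what such a guard could look like in the reactive layer, assuming the charms.reactive flag API; the relation name ('db'), the interface method, and the helper functions are assumptions, not necessarily the charm's real identifiers:

    from charms.reactive import when, endpoint_from_flag

    @when('etcd.registered', 'db.connected')
    def send_cluster_connection_details():
        # Publish client connection details only after this unit has
        # successfully registered with the etcd cluster, so consumers such
        # as calico never see an endpoint for an uninitialized cluster.
        db = endpoint_from_flag('db.connected')
        # cluster_string() and etcd_version() stand in for whatever helpers
        # the layer uses; set_connection_string() is an assumed interface call.
        db.set_connection_string(cluster_string(), version=etcd_version())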

For calico: Units are letting a hung calicoctl process block the machine lock indefinitely. It should wrap calicoctl calls[2] with a timeout so that the cluster can eventually unstick itself in case of similar issues.
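
A minimal sketch of such a wrapper, using a plain subprocess timeout (the timeout value is illustrative):

    import subprocess

    # Illustrative timeout; the value the charm actually uses may differ.
    CALICOCTL_TIMEOUT = 60  # seconds

    def calicoctl(*args):
        # A hung calicoctl (e.g. when etcd is not yet initialized) now raises
        # subprocess.TimeoutExpired, failing the hook so Juju can retry it
        # later instead of holding the machine lock forever.
        return subprocess.check_output(['calicoctl', *args],
                                       timeout=CALICOCTL_TIMEOUT)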

As a workaround, I suspect that if you kill the hung calicoctl processes repeatedly, Juju will eventually get through its backlog of hooks and allow the etcd units to progress.
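
A rough sketch of that workaround, to be run on each affected machine (the pgrep/SIGKILL loop and interval are illustrative):

    import os
    import signal
    import subprocess
    import time

    # Repeatedly kill hung calicoctl processes so the blocked hooks error out
    # and Juju can work through its backlog of queued hooks.
    while True:
        pids = subprocess.run(['pgrep', '-x', 'calicoctl'],
                              capture_output=True, text=True).stdout.split()
        for pid in pids:
            os.kill(int(pid), signal.SIGKILL)
        time.sleep(30)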

[1]: https://github.com/charmed-kubernetes/layer-etcd/blob/ae98be0046953ced628f682eee266d0e875a62b0/reactive/etcd.py#L283-L287
[2]: https://github.com/charmed-kubernetes/layer-calico/blob/2287a08ea5c7940bbe9b07be179e1da15b51cba1/reactive/calico.py#L615-L624

Changed in charm-calico:
importance: Undecided → High
Changed in charm-etcd:
importance: Undecided → High
Changed in charm-calico:
status: New → Triaged
Changed in charm-etcd:
status: New → Triaged
Adam Dyess (addyess)
Changed in charm-calico:
status: Triaged → Fix Committed
Changed in charm-etcd:
status: Triaged → Fix Committed
Changed in charm-calico:
assignee: nobody → Adam Dyess (addyess)
milestone: none → 1.26+ck3
Changed in charm-etcd:
assignee: nobody → George Kraft (cynerva)
milestone: none → 1.26+ck3
Adam Dyess (addyess)
tags: added: backport-needed
Adam Dyess (addyess) wrote :

Rebased calico with a single commit that was not in release_1.26 (a cherry-pick wasn't necessary):
https://github.com/charmed-kubernetes/layer-calico/commit/a164af47a9824e17742d732675d4edeedabfb159

Cherry-picked the backport into etcd release_1.26:
https://github.com/charmed-kubernetes/layer-etcd/commit/42c81a73515c99545972398f0ce57a5fc8ae7117

tags: removed: backport-needed
Adam Dyess (addyess)
Changed in charm-calico:
status: Fix Committed → Fix Released
Changed in charm-etcd:
status: Fix Committed → Fix Released