Comment 2 for bug 2008267

George Kraft (cynerva) wrote :

Thanks for the report. I would say this affects both etcd and calico.

For etcd: Units are sending cluster connection details before etcd is ready[1]. The charm should delay sending those details until after etcd has successfully registered (i.e. wait for the "etcd.registered" flag).

For calico: Units are letting a hung calicoctl process hold the machine lock indefinitely. The charm should wrap calicoctl calls[2] with a timeout so that the cluster can eventually unstick itself if similar issues occur.
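
A rough sketch of what such a timeout wrapper could look like, using only Python's standard subprocess module. This is not the charm's actual code: the function name run_with_timeout and the 60-second default are assumptions picked for illustration.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds=60):
    """Run cmd and return the CompletedProcess.

    Hypothetical helper: the name and the 60-second default are
    illustrative assumptions, not taken from layer-calico.
    """
    # subprocess.run kills the child and waits for it when the timeout
    # expires, then raises subprocess.TimeoutExpired. The hook then fails
    # instead of hanging, so the machine lock is released.
    return subprocess.run(
        cmd,
        check=True,  # raise CalledProcessError on a non-zero exit code
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout_seconds,
    )
```

A call such as run_with_timeout(["calicoctl", "version"]) would then raise subprocess.TimeoutExpired rather than block forever, and the caller can decide whether to log and retry on the next hook.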

As a workaround, I suspect that if you repeatedly kill hung calicoctl processes, Juju will eventually get through its backlog of hooks and allow the etcd units to progress.

[1]: https://github.com/charmed-kubernetes/layer-etcd/blob/ae98be0046953ced628f682eee266d0e875a62b0/reactive/etcd.py#L283-L287
[2]: https://github.com/charmed-kubernetes/layer-calico/blob/2287a08ea5c7940bbe9b07be179e1da15b51cba1/reactive/calico.py#L615-L624