Intermittently, Calico is not set up properly when upgrading
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Calico Charm | Fix Released | Medium | George Kraft |
Canal Charm | Fix Released | Medium | George Kraft |
Kubernetes Control Plane Charm | Invalid | Undecided | Unassigned |
Bug Description
Hello
When upgrading Calico, the charm intermittently fails to set up the calico container or the network. Sometimes the calico container is not created; sometimes the network does not appear to be configured.
I assume this is a race condition. We may need to add a flag, but I'm not sure where it should go.
The reproducer is as follows:
1. Get the bundle[1] and the overlay for Calico[2].
2. Deploy them:
$ juju deploy ./dev-bundle.yaml --overlay calico-overlay.yaml
3. Create a deployment:
$ kubectl create deployment test-nginx --image=nginx
4. Upgrade the calico charm:
$ juju upgrade-charm calico
5. Deploy a new pod or increase the replicas:
$ kubectl edit deployments test-nginx
- Pods are not created, and you will see a lot of errors related to Calico.
Bundles:
[1] https:/
[2] https:/
I created a script to run this:
https:/
I can only reproduce the first case (the calico container is not there),
but my colleague found the following in the other case:
datastoreReady is still false, as shown by:
ETCDCTL_API=3 etcdctl --endpoints https:/
{"kind"
Could you please advise me about this?
Thanks.
Changed in charm-calico:
milestone: none → 1.16+ck1
Changed in charm-canal:
milestone: none → 1.16+ck1
Changed in charm-calico:
importance: Undecided → Medium
Changed in charm-canal:
importance: Undecided → Medium
Changed in charm-calico:
status: In Progress → Fix Committed
Changed in charm-canal:
status: In Progress → Fix Committed
Changed in charm-calico:
status: Fix Committed → Fix Released
Changed in charm-canal:
status: Fix Committed → Fix Released
This appears to be a bug in both the calico and canal charms.
FWICT, the problem arises when the calico/canal charm does not invoke canal_upgrade.complete(). This occurs when the leader unit is on the same node that a kubernetes-master is installed on. canal_upgrade.complete() is called as part of the upgrade_v3_complete() method, which requires that the network policy controller is deployed. The problem is that *only* a worker node will actually deploy the network policy controller, so if the leader is on a master, the calico.npc.deployed flag will never be set and the upgrade will never be marked as completed.
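A simplified sketch of the reactive flow described above; the handler and flag names follow this description rather than the exact charm source:

```python
# Simplified sketch, not the actual charm code: the worker-only guard
# keeps a leader running on a master from ever setting calico.npc.deployed.
from charms.reactive import endpoint_from_flag, set_state, when, when_not

@when('cni.is-worker')              # only worker units satisfy this guard
@when_not('calico.npc.deployed')
def deploy_network_policy_controller():
    # ... render and apply the policy controller manifest ...
    set_state('calico.npc.deployed')

@when('calico.npc.deployed')        # never set when the leader is on a master
def upgrade_v3_complete():
    # 'canal.upgrade' is a placeholder flag name for the upgrade endpoint.
    canal_upgrade = endpoint_from_flag('canal.upgrade')
    canal_upgrade.complete()        # so this is never invoked
```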
Furthermore, the network policy controller is deployed by *all* of the worker nodes, which on the surface doesn't seem necessary, as nothing in the rendered deployment YAML is specific to the local node. I believe calico.npc.deployed should be moved to leader storage to ensure the controller is only deployed a single time.
The immediate fix should be relatively straightforward: remove the 'cni.is-worker' check from the deploy_network_policy_controller method in reactive/calico.py (sketched below).
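A hedged sketch of that change, with the same caveats as the sketch above:

```python
# Sketch of the immediate fix: drop the worker-only guard so any unit,
# including a leader colocated with a master, can deploy the controller.
from charms.reactive import set_state, when_not

@when_not('calico.npc.deployed')    # @when('cni.is-worker') removed
def deploy_network_policy_controller():
    # ... render and apply the policy controller manifest ...
    set_state('calico.npc.deployed')
```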
I believe an improved version would use leader storage only (and not local storage) for the calico.npc.deployed state. The leadership.is_leader flag should then be added to deploy_network_policy_controller so that only one node deploys the policy controller (see the sketch below).
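A sketch of that variant, assuming the charms.leadership layer (which exposes leadership.is_leader and leadership.set.* flags) and the charmhelpers leader storage helpers; the 'npc-deployed' key name is hypothetical:

```python
# Sketch of the improved fix: gate on leadership and keep the deployed
# marker in Juju leader storage so exactly one unit applies the manifest.
from charmhelpers.core.hookenv import leader_set
from charms.reactive import when, when_not

@when('leadership.is_leader')             # flag from the charms.leadership layer
@when_not('leadership.set.npc-deployed')  # backed by leader storage, not local state
def deploy_network_policy_controller():
    # ... render and apply the policy controller manifest ...
    leader_set({'npc-deployed': 'true'})  # visible to all units, survives leader changes
```

Handlers that currently gate on calico.npc.deployed (such as upgrade_v3_complete) would then gate on the leader-storage-backed flag instead, so completion no longer depends on where the leader happens to be running.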