Intermittently, Calico is not set up properly when upgrading
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Calico Charm | Fix Released | Medium | George Kraft |
Canal Charm | Fix Released | Medium | George Kraft |
Kubernetes Control Plane Charm | Invalid | Undecided | Unassigned |
Bug Description
Hello
When upgrading Calico, the charm intermittently fails to set up the calico container or the network. Sometimes the calico container is not created; sometimes the network does not appear to be configured.
I assume this is a race condition. We may need to add a flag, but I'm not sure where it should go.
The reproducer is as follows:
1. Get the bundle[1] and the overlay for Calico[2].
2. Deploy them:
$ juju deploy ./dev-bundle.yaml --overlay calico-overlay.yaml
3. Create a deployment:
$ kubectl create deployment test-nginx --image=nginx
4. Upgrade the calico charm:
$ juju upgrade-charm calico
5. Deploy a new pod or increase the replicas:
$ kubectl edit deployments test-nginx
- Pods are not created, and you will see a lot of errors related to Calico.
Bundles:
[1] https:/
[2] https:/
I created a script to run this:
https:/
I can only reproduce the first case (the calico container is not there),
but my colleague found the following in the other case:
datastoreReady is still false, as shown by:
ETCDCTL_API=3 etcdctl --endpoints https:/
{"kind"
Could you please advise me about this?
Thanks.
Changed in charm-calico:
milestone: none → 1.16+ck1
Changed in charm-canal:
milestone: none → 1.16+ck1
Changed in charm-calico:
importance: Undecided → Medium
Changed in charm-canal:
importance: Undecided → Medium
Changed in charm-calico:
status: In Progress → Fix Committed
Changed in charm-canal:
status: In Progress → Fix Committed
Changed in charm-calico:
status: Fix Committed → Fix Released
Changed in charm-canal:
status: Fix Committed → Fix Released
This appears to be a bug in both the calico and canal charms.
FWICT, the problem arises when the calico/canal charm does not invoke canal_upgrade.complete(). This occurs when the leader unit is on the same node that a kubernetes-master is installed on. canal_upgrade.complete() is called as part of the upgrade_v3_complete() method, which requires that the network policy controller is deployed. The problem is that *only* a worker node will actually deploy the network policy controller, so if the leader is on a master, the calico.npc.deployed flag will never be set and the upgrade will never be marked as completed.
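A simplified sketch of the reactive flow described above; the handler and flag names follow this description rather than the exact charm source:

```python
# Simplified sketch, not the actual charm code: the worker-only guard
# keeps a leader running on a master from ever setting calico.npc.deployed.
from charms.reactive import endpoint_from_flag, set_state, when, when_not

@when('cni.is-worker')              # only worker units satisfy this guard
@when_not('calico.npc.deployed')
def deploy_network_policy_controller():
    # ... render and apply the policy controller manifest ...
    set_state('calico.npc.deployed')

@when('calico.npc.deployed')        # never set when the leader is on a master
def upgrade_v3_complete():
    # 'canal.upgrade' is a placeholder flag name for the upgrade endpoint.
    canal_upgrade = endpoint_from_flag('canal.upgrade')
    canal_upgrade.complete()        # so this is never invoked
```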
Furthermore, the network policy controller is deployed by *all* of the worker nodes, which on the surface doesn't seem necessary, as nothing in the rendered deployment YAML is specific to the local node. I believe calico.npc.deployed should be moved to leader storage to ensure the controller is only deployed a single time.
The immediate fix should be relatively straightforward: remove the 'cni.is-worker' check from the deploy_network_policy_controller method in reactive/calico.py (sketched below).
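A hedged sketch of that change, with the same caveats as the sketch above:

```python
# Sketch of the immediate fix: drop the worker-only guard so any unit,
# including a leader colocated with a master, can deploy the controller.
from charms.reactive import set_state, when_not

@when_not('calico.npc.deployed')    # @when('cni.is-worker') removed
def deploy_network_policy_controller():
    # ... render and apply the policy controller manifest ...
    set_state('calico.npc.deployed')
```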
I believe an improved version would use leader storage only (and not local storage) for the calico.npc.deployed state. The leadership.is_leader flag should then be added to deploy_network_policy_controller so that only one node deploys the policy controller (see the sketch below).
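A sketch of that variant, assuming the charms.leadership layer (which exposes leadership.is_leader and leadership.set.* flags) and the charmhelpers leader storage helpers; the 'npc-deployed' key name is hypothetical:

```python
# Sketch of the improved fix: gate on leadership and keep the deployed
# marker in Juju leader storage so exactly one unit applies the manifest.
from charmhelpers.core.hookenv import leader_set
from charms.reactive import when, when_not

@when('leadership.is_leader')             # flag from the charms.leadership layer
@when_not('leadership.set.npc-deployed')  # backed by leader storage, not local state
def deploy_network_policy_controller():
    # ... render and apply the policy controller manifest ...
    leader_set({'npc-deployed': 'true'})  # visible to all units, survives leader changes
```

Handlers that currently gate on calico.npc.deployed (such as upgrade_v3_complete) would then gate on the leader-storage-backed flag instead, so completion no longer depends on where the leader happens to be running.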