Intermittently calico is not set properly when upgrading

Bug #1844605 reported by Seyeong Kim
This bug affects 2 people
Affects                         Status        Importance  Assigned to   Milestone
Calico Charm                    Fix Released  Medium      George Kraft
Canal Charm                     Fix Released  Medium      George Kraft
Kubernetes Control Plane Charm  Invalid       Undecided   Unassigned

Bug Description

Hello

When upgrading the calico charm, Calico is intermittently not set up properly: sometimes the calico container is not created, and sometimes the network does not seem to be configured.

I assume this is a race condition. We may need to add a flag, but I'm not sure where to put it.

The reproducer is as follows:

1. Prepare the bundle [1] and the overlay for Calico [2].
2. Deploy:
$ juju deploy ./dev-bundle.yaml --overlay calico-overlay.yaml
3. Create a deployment:
$ kubectl create deployment test-nginx --image=nginx
4. Upgrade the calico charm:
$ juju upgrade-charm calico
5. Deploy a new pod or increase the replicas:
$ kubectl edit deployments test-nginx
- Pods are not created, and many Calico-related errors appear.

bundles

https://pastebin.canonical.com/p/PkPtR8m9t8/
https://pastebin.canonical.com/p/t62qbKx39D/

I created a script to run this:

https://pastebin.canonical.com/p/Rf6cTy3gVP/

I can only reproduce the first case (the calico container is missing).

However, my colleague found that in the other case datastoreReady is still false:

$ ETCDCTL_API=3 etcdctl --endpoints https://172.30.1.174:2379 --cacert /opt/calicoctl/etcd-ca --cert /opt/calicoctl/etcd-cert --key /opt/calicoctl/etcd-key get /calico/resources/v3/projectcalico.org/clusterinformations/default
/calico/resources/v3/projectcalico.org/clusterinformations/default
{"kind":"ClusterInformation","apiVersion":"projectcalico.org/v3","metadata":{"name":"default","uid":"e41d8df0-cc87-11e9-9fae-c03fd5623e36","creationTimestamp":"2019-09-01T07:12:43Z"},"spec":{"clusterGUID":"132df401ab59436d93a3fcb14dab4632","clusterType":"k8s","calicoVersion":"v3.6.1","datastoreReady":false}}
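
The check on datastoreReady can be scripted. The following is a minimal sketch, assuming the JSON value printed by etcdctl has already been captured as a string. The helper name datastore_ready is hypothetical and not part of calicoctl or the charm; the sample value is copied from the output in this report.

```python
import json


def datastore_ready(cluster_info_json: str) -> bool:
    """Return spec.datastoreReady from a ClusterInformation JSON document."""
    doc = json.loads(cluster_info_json)
    # Assumption of this sketch: a missing field is treated as "not ready".
    return bool(doc.get("spec", {}).get("datastoreReady", False))


# Value copied from the etcdctl output in this report.
sample = (
    '{"kind":"ClusterInformation","apiVersion":"projectcalico.org/v3",'
    '"metadata":{"name":"default","uid":"e41d8df0-cc87-11e9-9fae-c03fd5623e36",'
    '"creationTimestamp":"2019-09-01T07:12:43Z"},'
    '"spec":{"clusterGUID":"132df401ab59436d93a3fcb14dab4632",'
    '"clusterType":"k8s","calicoVersion":"v3.6.1","datastoreReady":false}}'
)

if __name__ == "__main__":
    print("datastoreReady:", datastore_ready(sample))
```

For the sample above the helper reports that the datastore is not ready, which matches the failed upgrade state described here.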

Could you please advise me on this?

Thanks.

Tags: sts
Billy Olsen (billy-olsen) wrote :

This appears to be a bug in both the calico and canal charms.

From what I can tell, the problem arises when the calico/canal charm does not invoke canal_upgrade.complete(). This occurs when the leader unit is on the same node as a kubernetes-master. canal_upgrade.complete() is called as part of the upgrade_v3_complete() method, which requires that the network policy controller is deployed. The problem is that *only* a worker node will actually deploy the network policy controller, so if the leader is on a master, the calico.npc.deployed flag will never be set and the upgrade will never be marked as completed.

Furthermore, the network policy controller is deployed by *all* of the worker nodes, which on the surface doesn't feel necessary, as nothing in the rendered deployment yaml is specific to the local node. I believe the calico.npc.deployed flag should be moved to leader storage to ensure the controller is only deployed a single time.

The immediate fix should be relatively straightforward: remove the 'cni.is-worker' check from the deploy_network_policy_controller method in reactive/calico.py.

I believe an improved version would be to use leader storage only (and not local storage) for the calico.npc.deployed flag. The leadership.is_leader flag should then be added to the deploy_network_policy_controller method so that only one node deploys the policy controller.
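
To illustrate the analysis above, here is a toy model in plain Python. It is not the charm's code and does not use charms.reactive: the Unit class, the variant names, and the unit names are hypothetical, and only the calico.npc.deployed flag name comes from this report. It compares the original behaviour, the immediate fix, and the leader-storage version.

```python
from dataclasses import dataclass, field

NPC_FLAG = "calico.npc.deployed"  # flag name taken from this report


@dataclass
class Unit:
    """Hypothetical stand-in for one calico/canal subordinate unit."""
    name: str
    is_leader: bool
    is_worker: bool
    flags: set = field(default_factory=set)  # local (per-unit) flags


def leader_sees_npc_deployed(units, variant):
    """Simulate one hook pass and report whether the leader can finish the upgrade.

    variant "original":  only worker units deploy; the flag is local to the unit.
    variant "immediate": the worker check is removed; the flag is still local.
    variant "leader":    only the leader deploys; the flag lives in leader data.
    """
    leader_data = {}
    for unit in units:
        if variant == "original" and unit.is_worker:
            unit.flags.add(NPC_FLAG)
        elif variant == "immediate":
            unit.flags.add(NPC_FLAG)
        elif variant == "leader" and unit.is_leader:
            leader_data[NPC_FLAG] = True

    leader = next(u for u in units if u.is_leader)
    if variant == "leader":
        return leader_data.get(NPC_FLAG, False)
    return NPC_FLAG in leader.flags


def upgrade_completes(variant, leader_on_master):
    """Build a two-unit model (one on a master, one on a worker) and run it."""
    units = [
        Unit("on-master", is_leader=leader_on_master, is_worker=False),
        Unit("on-worker", is_leader=not leader_on_master, is_worker=True),
    ]
    return leader_sees_npc_deployed(units, variant)


if __name__ == "__main__":
    for variant in ("original", "immediate", "leader"):
        for on_master in (True, False):
            print(variant, "leader_on_master =", on_master,
                  "->", upgrade_completes(variant, on_master))
```

In this model only the original variant with the leader on a master fails to complete, which would also explain why the problem appears intermittently: it depends on which unit happens to be elected leader.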

Changed in charm-kubernetes-master:
status: New → Invalid
Changed in charm-canal:
status: New → Confirmed
Changed in charm-calico:
status: New → Confirmed
Nick Niehoff (nniehoff) wrote :

The status of the charm is also incorrectly reported as active/idle, with the standard message of Flannel subnet x.x.x.x/z, when this error state occurs, making it hard to identify the problem.

George Kraft (cynerva) wrote :

I agree completely with Billy's analysis and proposed solution. Thanks for that. I'm working on this now.

Changed in charm-calico:
assignee: nobody → George Kraft (cynerva)
Changed in charm-canal:
assignee: nobody → George Kraft (cynerva)
Changed in charm-calico:
status: Confirmed → In Progress
Changed in charm-canal:
status: Confirmed → In Progress
George Kraft (cynerva) wrote :
Changed in charm-calico:
milestone: none → 1.16+ck1
Changed in charm-canal:
milestone: none → 1.16+ck1
Changed in charm-calico:
importance: Undecided → Medium
Changed in charm-canal:
importance: Undecided → Medium
George Kraft (cynerva)
Changed in charm-calico:
status: In Progress → Fix Committed
Changed in charm-canal:
status: In Progress → Fix Committed
Kevin W Monroe (kwmonroe) wrote :
Changed in charm-calico:
status: Fix Committed → Fix Released
Changed in charm-canal:
status: Fix Committed → Fix Released