This appears to be a bug in both the calico and canal charms.
FWICT, the problem arises when the calico/canal charm never invokes canal_upgrade.complete(). This happens when the leader unit runs on the same node as a kubernetes-master. canal_upgrade.complete() is called as part of the upgrade_v3_complete() method, which requires that the network policy controller be deployed. The problem is that *only* worker nodes will actually deploy the network policy controller, so if the leader is on a master, the calico.npc.deployed flag will never be set and the upgrade will never be marked as completed.
Furthermore, the network policy controller is deployed by *all* of the worker nodes, which on the surface doesn't feel necessary, as nothing in the rendered deployment yaml is specific to the local node. I believe the calico.npc.deployed flag should be moved to leader storage to ensure the controller is only deployed a single time.
The immediate fix should be relatively straightforward: remove the 'cni.is-worker' check from the deploy_network_policy_controller method in reactive/calico.py.
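To illustrate the effect of dropping that check, here is a minimal self-contained model (not actual charm code) of the handler's gating condition. The 'etcd.available' prerequisite flag is an assumption for illustration; only 'cni.is-worker' and 'calico.npc.deployed' come from the report.

```python
# Hypothetical model of the reactive gate on deploy_network_policy_controller.
# With the fix, the handler no longer requires the 'cni.is-worker' flag,
# so a leader colocated with a kubernetes-master can still deploy the NPC.

def npc_handler_ready(flags):
    """Return True if the deploy handler would fire given the unit's flags."""
    # 'etcd.available' stands in for the handler's other prerequisites
    # (assumed, not taken from the actual charm source).
    return ('etcd.available' in flags
            and 'calico.npc.deployed' not in flags)

# Leader unit on a master node: 'cni.is-worker' is never set, but the
# handler can now fire anyway, so the upgrade gets marked complete.
assert npc_handler_ready({'etcd.available'})

# Once deployed, the flag blocks re-deployment on this unit.
assert not npc_handler_ready({'etcd.available', 'calico.npc.deployed'})
```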
I believe an improved version would use leader storage only (and not local storage) for the calico.npc.deployed flag. The leadership.is_leader flag should then be added to deploy_network_policy_controller so that only one node deploys the policy controller.
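The improved approach can be sketched as follows. This is a hedged model, not the charm's actual code: a plain dict stands in for Juju leader storage, and the function name maybe_deploy_npc is hypothetical. The point is that the deployed marker is shared cluster-wide and only the leader ever acts on it.

```python
# Sketch of leader-only deployment with the 'calico.npc.deployed' marker
# kept in leader storage (modeled here as a dict shared by all units).

def maybe_deploy_npc(is_leader, leader_data):
    """Deploy the network policy controller at most once, from the leader.

    Returns True if this call performed the deployment.
    """
    if not is_leader:
        return False                              # followers never deploy
    if leader_data.get('calico.npc.deployed'):
        return False                              # already deployed cluster-wide
    # ... render and apply the NPC deployment yaml here ...
    leader_data['calico.npc.deployed'] = 'true'   # visible to every unit
    return True

leader_data = {}
assert maybe_deploy_npc(True, leader_data)        # leader deploys once
assert not maybe_deploy_npc(True, leader_data)    # idempotent on re-run
assert not maybe_deploy_npc(False, leader_data)   # non-leaders are no-ops
```

Because the flag lives in leader storage rather than per-unit local state, a leadership change cannot cause a second deployment: the new leader sees the marker and skips the handler.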