Calico tries to build new nodes with already reserved IPs from missing units

Bug #1899007 reported by Diko Parvanov
This bug affects 2 people
Affects: Calico Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Oct 8 10:17:20 juju-3f93e1-kubernetes-53 charm-env[3920]: 2020-10-08 10:17:20.961 [INFO][10] startup.go 385: Initialize BGP data
Oct 8 10:17:20 juju-3f93e1-kubernetes-53 charm-env[3920]: 2020-10-08 10:17:20.961 [INFO][10] startup.go 479: Using IPv4 address from environment: IP=10.66.0.21
Oct 8 10:17:20 juju-3f93e1-kubernetes-53 charm-env[3920]: 2020-10-08 10:17:20.962 [INFO][10] startup.go 512: IPv4 address 10.66.0.21 discovered on interface ens3
Oct 8 10:17:20 juju-3f93e1-kubernetes-53 charm-env[3920]: 2020-10-08 10:17:20.962 [INFO][10] startup.go 455: Node IPv4 changed, will check for conflicts
Oct 8 10:17:20 juju-3f93e1-kubernetes-53 charm-env[3920]: 2020-10-08 10:17:20.964 [WARNING][10] startup.go 943: Calico node 'juju-3f93e1-kubernetes-41' is already using the IPv4 address 10.66.0.21.
Oct 8 10:17:20 juju-3f93e1-kubernetes-53 charm-env[3920]: 2020-10-08 10:17:20.964 [WARNING][10] startup.go 1122: Terminating
Oct 8 10:17:20 juju-3f93e1-kubernetes-53 charm-env[3920]: Calico node failed to start
Oct 8 10:17:21 juju-3f93e1-kubernetes-53 containerd[30033]: time="2020-10-08T10:17:21.062749970Z" level=info msg="shim disconnected" id=calico-node
Oct 8 10:17:21 juju-3f93e1-kubernetes-53 systemd[1]: calico-node.service: Main process exited, code=exited, status=1/FAILURE
Oct 8 10:17:21 juju-3f93e1-kubernetes-53 systemd[1]: calico-node.service: Failed with result 'exit-code'.

root@juju-3f93e1-kubernetes-53:~# calicoctl get nodes -o wide
NAME                        ASN       IPV4            IPV6
juju-3f93e1-kubernetes-29   (64512)   10.66.0.5/32
juju-3f93e1-kubernetes-30   (64512)   10.66.0.10/32
juju-3f93e1-kubernetes-31   (64512)   10.66.0.11/32
juju-3f93e1-kubernetes-32   (64512)   10.66.0.16/32
juju-3f93e1-kubernetes-35   (64512)   10.66.0.15/32
juju-3f93e1-kubernetes-39   (64512)   10.66.0.18/32
juju-3f93e1-kubernetes-4    (64512)   10.66.0.14/32
juju-3f93e1-kubernetes-40   (64512)   10.66.0.19/32
juju-3f93e1-kubernetes-41   (64512)   10.66.0.21/32
juju-3f93e1-kubernetes-42   (64512)   10.66.0.22/32
juju-3f93e1-kubernetes-46   (64512)   10.66.0.23/32
juju-3f93e1-kubernetes-47   (64512)   10.66.0.27/32
juju-3f93e1-kubernetes-48   (64512)   10.66.0.28/32
juju-3f93e1-kubernetes-6    (64512)   10.66.0.13/32

However, machine juju-3f93e1-kubernetes-41 no longer exists in the model (it was removed at some point), so its Calico node entry is stale.

Using charm revision 689.

Revision history for this message
George Kraft (cynerva) wrote :

Thanks for the report. Looks like the calico charm needs to implement a stop hook to clean up after itself.

To work around this, you should be able to delete the old node manually. I -think- this command should work:

juju run --unit calico/leader -- calicoctl delete node juju-3f93e1-kubernetes-41
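A minimal sketch of such a stop hook (illustrative only, not the charm's actual code; it assumes the charm's reactive framework, that the Calico node name matches the unit's hostname, and that calicoctl and its datastore configuration are already set up by the charm):

# Hypothetical stop hook: delete this unit's Calico Node resource on
# teardown so its IP is not left reserved for a machine that is gone.
import socket
import subprocess

from charms.reactive import hook

@hook('stop')
def remove_calico_node():
    node_name = socket.gethostname()  # assumes node name == unit hostname
    try:
        subprocess.check_call(['calicoctl', 'delete', 'node', node_name])
    except subprocess.CalledProcessError:
        pass  # node entry already gone, nothing to clean up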

Changed in charm-calico:
importance: Undecided → Medium
status: New → Triaged
Revision history for this message
Diko Parvanov (dparv) wrote :

Yes, cleanup was done properly with calicoctl delete. Another thing: if you force-remove the machine, or run juju remove-unit --force --no-wait, the stop hook won't be executed and stale entries will be left in the db. Maybe before starting up, the calico charm could check whether a node with its IP already exists and, if it does, delete it and retry - that would cover the case where the unit got an IP that was already released.
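A rough sketch of that startup check (illustrative only; the JSON field names assume the calicoctl v3 Node resource layout and should be verified, and calicoctl is assumed to be configured for the cluster datastore):

# Hypothetical pre-start check: if another Node resource already holds
# this unit's IP (e.g. left behind by a force-removed unit), delete it
# so calico-node can start cleanly.
import json
import socket
import subprocess

def purge_conflicting_node(my_ip):
    out = subprocess.check_output(['calicoctl', 'get', 'nodes', '-o', 'json'])
    for node in json.loads(out).get('items', []):
        name = node['metadata']['name']
        addr = node.get('spec', {}).get('bgp', {}).get('ipv4Address', '')
        if name != socket.gethostname() and addr.split('/')[0] == my_ip:
            subprocess.check_call(['calicoctl', 'delete', 'node', name])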

Revision history for this message
George Kraft (cynerva) wrote :

Good point. Cleaning up conflicting entries during startup sounds good to me.
