IPv6: calico binds to the floating IP after pod restart, causing failures on swact
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | Fix Released | High | Joseph Richard |
Bug Description
Brief Description
-----------------
Occasionally, Calico BGP peering fails after a swact.
This is the result of Calico choosing the floating IP on the cluster-host. The unit IP should be used instead. This happens if the calico-node pod restarts on the same host that currently has the floating IP.
If a system is in this condition, a swact results in the floating IP moving, so Calico loses communication with the BGP peers.
Severity
--------
Major - calico issues after swact
Steps to Reproduce
------------------
- Bring up system
- Check address calico is using >> should be the unit IP
- Restart the calico pod on the host w/ the floating IP
- Check the address calico is using >> will now be the floating IP
- Perform a swact
- Verify that calico loses peering with the BGP peers
Expected Behavior
------------------
calico should always use the unit IP address
Actual Behavior
----------------
calico uses the floating IP address if the calico pod is restarted
Reproducibility
---------------
Reproducible given the steps above
System Configuration
-------
Any 2-node system w/ IPv6 configured
Branch/Pull Time/Commit
-------
Seen on stx master 2020-06-27, but is a day 1 issue
Last Pass
---------
Unknown
Timestamp/Logs
--------------
N/A
Test Activity
-------------
Regression Testing
Workaround
----------
Changed in starlingx: | |
assignee: | nobody → Joseph Richard (josephrichard) |
summary: |
- calico binds to the floating IP causing failures on swact
+ calico binds to the floating IP after pod restart, causing failures on swact |
description: | updated |
tags: | added: stx.containers stx.networking |
summary: |
- calico binds to the floating IP after pod restart, causing failures on swact
+ IPv6: calico binds to the floating IP after pod restart, causing failures on swact |
description: | updated |
Changed in starlingx: | |
importance: | Undecided → High |
status: | New → Triaged |
tags: | added: stx.4.0 |
Calico uses the can-reach option for node IP auto detection to find its IP address on the cluster-host network. To do this, it opens a UDP socket to the can-reach destination and checks which source IP address the kernel selects as the socket's local address.
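The mechanism described above can be sketched in a few lines of Python. This is an illustration of the general technique (connect a UDP socket, then read back the local address), not Calico's actual implementation; the function name `source_address_for` and the default port are assumptions made for the example.

```python
import socket


def source_address_for(destination: str, port: int = 80) -> str:
    """Return the local address the kernel would use to reach `destination`.

    `destination` must be a numeric IPv4 or IPv6 address.
    """
    # Resolve the numeric address to learn its family (AF_INET or AF_INET6)
    # and to get a sockaddr tuple in the right shape for connect().
    family, socktype, proto, _, sockaddr = socket.getaddrinfo(
        destination, port, type=socket.SOCK_DGRAM, flags=socket.AI_NUMERICHOST
    )[0]
    with socket.socket(family, socktype, proto) as sock:
        # connect() on a UDP socket sends no packets. It only performs the
        # route lookup and source address selection for the destination.
        sock.connect(sockaddr)
        # The first element of the local socket name is the chosen source IP.
        return sock.getsockname()[0]
```

Run on a controller that currently holds the floating IP, with that floating IP passed as `destination`, this would be expected to return the floating IP itself, which matches the behavior reported in this bug.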
Linux uses the rules defined in RFC6724[0] for source address selection. The first rule stipulates that the same address as the destination be chosen if the address is local.
In the scenario described, the current Calico configuration uses the floating IP as the auto detection address. On the host that holds the floating IP, that address is local, so Rule 1 selects the floating IP as the node IP address.
To avoid selecting the floating IP address due to Rule 1, the Calico can-reach auto detection address should be configured to use the controller-0 cluster-host address. Since the floating IP is marked as deprecated, the controller unit address on the cluster-host network will be chosen instead. This will work for all hosts, even if controller-0 is not currently available, since the auto detection address is used only to identify the network and the local IP address.
[0] https://tools.ietf.org/html/rfc6724
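The reasoning above can be checked with a toy model of source address selection. This is a deliberately simplified sketch that models only two of the RFC 6724 rules (Rule 1, prefer the same address, and Rule 3, avoid deprecated addresses); the kernel applies more rules than this. All addresses and names below (`fd00::1`, `Candidate`, `select_source`) are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from ipaddress import IPv6Address


@dataclass(frozen=True)
class Candidate:
    address: str
    deprecated: bool = False


def select_source(destination: str, candidates: list) -> str:
    """Pick a source address using only Rule 1 and Rule 3 of RFC 6724."""
    if not candidates:
        raise ValueError("no candidate source addresses")
    dest = IPv6Address(destination)
    # Rule 1: if the destination is one of the local addresses, use it.
    for cand in candidates:
        if IPv6Address(cand.address) == dest:
            return cand.address
    # Rule 3: otherwise prefer addresses that are not deprecated.
    preferred = [cand for cand in candidates if not cand.deprecated]
    return (preferred or candidates)[0].address


# Hypothetical cluster-host addresses.
FLOATING = "fd00::1"
CONTROLLER_0 = "fd00::2"
CONTROLLER_1 = "fd00::3"

# controller-0 is active and holds the (deprecated) floating IP.
active_c0 = [Candidate(FLOATING, deprecated=True), Candidate(CONTROLLER_0)]
standby_c1 = [Candidate(CONTROLLER_1)]
# After a swact, controller-1 holds the floating IP instead.
active_c1 = [Candidate(FLOATING, deprecated=True), Candidate(CONTROLLER_1)]

# can-reach = floating IP: Rule 1 picks the floating IP on the active host.
bug_case = select_source(FLOATING, active_c0)
# can-reach = controller-0 address: every host ends up with its unit IP.
fixed_c0 = select_source(CONTROLLER_0, active_c0)
fixed_c1_standby = select_source(CONTROLLER_0, standby_c1)
fixed_c1_active = select_source(CONTROLLER_0, active_c1)
```

In this model, `bug_case` is the floating IP, while the three `fixed_*` results are each host's own unit address, whether or not that host currently holds the floating IP.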