IPv6: calico binds to the floating IP after pod restart, causing failures on swact

Bug #1885582 reported by Ghada Khalil
This bug affects 4 people
Affects: StarlingX
Status: Fix Released
Importance: High
Assigned to: Joseph Richard
Milestone: (none listed)

Bug Description

Brief Description
-----------------
Occasionally, Calico BGP peering fails after a swact.

This happens because Calico selects the floating IP on the cluster-host network instead of the unit IP. It occurs when the calico-node pod restarts on the host that currently holds the floating IP.

If a system is in this condition, a swact results in the floating IP moving, so Calico loses communication with the BGP peers.

Severity
--------
Major - calico issues after swact

Steps to Reproduce
------------------
- Bring up system
- Check address calico is using >> should be the unit IP
- Restart the calico pod on the host w/ the floating IP
- Check the address calico is using >> will now be the floating IP
- Perform a swact
- Verify that calico loses peering with the BGP peers

Expected Behavior
------------------
calico should always use the unit IP address

Actual Behavior
----------------
calico uses the floating IP address if the calico pod is restarted

Reproducibility
---------------
Reproducible given the steps above

System Configuration
--------------------
Any 2-node system w/ IPv6 configured

Branch/Pull Time/Commit
-----------------------
Seen on stx master 2020-06-27, but is a day 1 issue

Last Pass
---------
Unknown

Timestamp/Logs
--------------
N/A

Test Activity
-------------
Regression Testing

Workaround
----------

Ghada Khalil (gkhalil)
Changed in starlingx:
assignee: nobody → Joseph Richard (josephrichard)
Ghada Khalil (gkhalil)
summary: - calico binds to the floating IP causing failures on swact
+ calico binds to the floating IP after pod restart, causing failures on
+ swact
Ghada Khalil (gkhalil)
description: updated
Matt Peters (mpeters-wrs) wrote : Re: calico binds to the floating IP after pod restart, causing failures on swact

Calico uses the can-reach option to auto-detect the node IP address on the cluster-host network. To do this, it opens a UDP socket to the can-reach destination and checks which source IP address is selected as the local address.
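
The detection technique described above can be sketched in a few lines of Python. This is only an illustration of the approach, not Calico's own code; the function name `detect_source_address` and the default port are assumptions made for the example.

```python
import ipaddress
import socket


def detect_source_address(destination: str, port: int = 80) -> str:
    """Return the local address the kernel would use to reach `destination`.

    Hypothetical helper for illustration; the port value is arbitrary.
    """
    family = (
        socket.AF_INET6
        if ipaddress.ip_address(destination).version == 6
        else socket.AF_INET
    )
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        # connect() on a UDP socket sends no packets. It only makes the
        # kernel pick a route and a source address for the destination.
        sock.connect((destination, port))
        # The first element of the socket name is the chosen local address.
        return sock.getsockname()[0]
```

Called with a destination that is itself a local address (the loopback address, for example), the function returns that same address.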

Linux uses the rules defined in RFC 6724 [0] for source address selection. Rule 1 stipulates that if the destination address is itself a local address, that same address is chosen as the source.

In the scenario described, Calico is currently configured to use the floating IP as the auto-detection address, so on the host that holds the floating IP, the selected node IP address is the floating IP address.

To avoid selecting the floating IP address due to Rule 1, the Calico can-reach auto detection address should be configured to use the controller-0 cluster host address. Since the floating IP is marked as deprecated, the controller unit address of the cluster-host network will be chosen. This will work for all hosts, even if controller-0 is not currently available, since the auto detection address is used only to identify the network and the local IP address.
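
The reasoning above can be checked against a deliberately simplified model of source address selection. The sketch below implements only two of the RFC 6724 rules (Rule 1, prefer the same address, and Rule 3, avoid deprecated addresses) and is not the kernel's algorithm. All names and the 2001:db8:: addresses are hypothetical values chosen for the example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    address: str
    deprecated: bool = False


def select_source(candidates, destination):
    """Pick a source address using only RFC 6724 Rule 1 and Rule 3."""
    dest = ipaddress.ip_address(destination)
    # Rule 1: prefer the candidate that equals the destination.
    for cand in candidates:
        if ipaddress.ip_address(cand.address) == dest:
            return cand.address
    # Rule 3: avoid deprecated addresses when an alternative exists.
    preferred = [c for c in candidates if not c.deprecated]
    return (preferred or candidates)[0].address


# Hypothetical cluster-host addresses (documentation prefix).
FLOATING = "2001:db8::2"
CONTROLLER_0 = "2001:db8::3"
CONTROLLER_1 = "2001:db8::4"

# Addresses present on each controller while it holds the floating IP.
c0_active = [Candidate(FLOATING, deprecated=True), Candidate(CONTROLLER_0)]
c1_active = [Candidate(FLOATING, deprecated=True), Candidate(CONTROLLER_1)]

bug_case = select_source(c0_active, FLOATING)        # floating IP (Rule 1)
fix_on_c0 = select_source(c0_active, CONTROLLER_0)   # unit IP (Rule 1)
fix_on_c1 = select_source(c1_active, CONTROLLER_0)   # unit IP (Rule 3)
```

With the floating IP as the destination, the active controller picks the floating IP. With the controller-0 address as the destination, each controller picks its own unit address in this model.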

[0] https://tools.ietf.org/html/rfc6724

Ghada Khalil (gkhalil)
tags: added: stx.containers stx.networking
Ghada Khalil (gkhalil)
summary: - calico binds to the floating IP after pod restart, causing failures on
- swact
+ IPv6: calico binds to the floating IP after pod restart, causing
+ failures on swact
description: updated
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → High
status: New → Triaged
tags: added: stx.4.0
OpenStack Infra (hudson-openstack) wrote : Fix proposed to ansible-playbooks (master)

Fix proposed to branch: master
Review: https://review.opendev.org/738720

Changed in starlingx:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to ansible-playbooks (master)

Reviewed: https://review.opendev.org/738720
Committed: https://git.openstack.org/cgit/starlingx/ansible-playbooks/commit/?id=075d3fcdfcc0e3e3480918fde22d16163df401ab
Submitter: Zuul
Branch: master

commit 075d3fcdfcc0e3e3480918fde22d16163df401ab
Author: Joseph Richard <email address hidden>
Date: Tue Jun 30 15:33:13 2020 -0400

    Use controller-0 ip for calico-node can-reach dest

    Because cluster floating IP host can change, it should not be used for
    calico node address, and doing so has been observed to cause an error
    when calico-node is rebooted on the active controller and then a swact
    is executed, causing BGP peering to be lost.

    This commit switches to using controller-0 cluster host address for
    route selection, in order to ensure a consistent route selection is
    used.

    Closes-Bug: 1885582
    Change-Id: I56c5ddf657eb557b83ce0fd3ce7beb71011d6266
    Signed-off-by: Joseph Richard <email address hidden>

Changed in starlingx:
status: In Progress → Fix Released