OVN: HA chassis group priority is different than gateway chassis priority

Bug #1995078 reported by Michal Nasiadka
This bug affects 3 people
Affects                        Status       Importance  Assigned to     Milestone
Networking ML2 Generic Switch  Triaged      High        Unassigned
neutron                        In Progress  High        Rodolfo Alonso

Bug Description

OpenStack releases affected: Wallaby, Xena and Yoga for sure
OVN version: 21.12 (from CentOS NFV SIG repos)
Host OS: CentOS Stream 8

Neutron creates external ports for bare metal instances and binds them through an ha_chassis_group.
Neutron normally assigns different chassis priorities to a router's LRP gateway chassis and to the ha_chassis_group.

I have a router with two VLANs attached - external (used for internet connectivity - SNAT or DNAT/Floating IP) and internal VLAN network hosting bare metal servers (and some Geneve networks for VMs).

If the active chassis of an external port's HA chassis group differs from the active gateway chassis (for the external VLAN network), those bare metal servers have intermittent network connectivity for any traffic going through that router.

In a thread on the ovs-discuss mailing list, Numan Siddique wrote that "it is recommended that the same controller which is actively handling the gateway traffic also handles the external ports".

More information in this thread - https://mail.openvswitch.org/pipermail/ovs-discuss/2022-October/052067.html
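
The condition described above can be sketched as a small check. This is only an illustration: the chassis names and priority values are made up, and the dictionaries stand in for the priorities stored for the router's gateway chassis and for the HA chassis group. It assumes that the chassis with the highest priority is the active one in both cases.

```python
from typing import Dict, Optional


def active_chassis(priorities: Dict[str, int]) -> Optional[str]:
    """Return the chassis with the highest priority, or None if there is none.

    On equal priorities the first chassis in dictionary order wins; real
    deployments are expected to use distinct priorities.
    """
    if not priorities:
        return None
    return max(priorities, key=lambda name: priorities[name])


def has_chassis_mismatch(gateway: Dict[str, int], ha_group: Dict[str, int]) -> bool:
    """True when the two priority lists elect different active chassis."""
    return active_chassis(gateway) != active_chassis(ha_group)


# Hypothetical priorities: same three network nodes, opposite ordering.
gateway_chassis = {"network-1": 3, "network-2": 2, "network-3": 1}
ha_chassis_group = {"network-1": 1, "network-2": 2, "network-3": 3}

print(has_chassis_mismatch(gateway_chassis, ha_chassis_group))  # prints True
```

In this example the gateway traffic is handled by network-1 while the external ports are bound on network-3, which is the situation reported here.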

Bugzilla reference:
* (OSP17): https://bugzilla.redhat.com/show_bug.cgi?id=1826364
* (OSP17): https://bugzilla.redhat.com/show_bug.cgi?id=2259161

Tags: ovn
Revision history for this message
Elvira García Ruiz (elviragr) wrote :

I don't have a deployment with baremetal servers but this looks like a legit bug.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
tags: added: ovn
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Michal:

It is quite difficult for me to create an environment with baremetal ports. Can you check what Numan is suggesting in the mail?

"That could be the issue. You can perhaps arp for the router ip from
your bare metal machine and see if you get 2 arp replies - one from
the controller which binds the external port and one from the gateway
chassis controller."

If I'm not wrong, the HA chassis group selects the highest-priority chassis in the "HA_Chassis" table and detects failovers. Other ports (not external ones) should use the same chassis. In your case you are using VLAN networks, which implies that we are explicitly sending this traffic to a centralized router port located on the chassis hosting the distributed GW port [1]. I would need to check this issue with Lucas.

Regards.

[1] https://github.com/openvswitch/ovs/commit/85706c34d53d4810f54bec1de662392a3c06a996

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello again:

The issue you are hitting is similar to [1]. For distributed VLAN traffic, you need to configure the GW chassis to define the option described in [2]. "external_ids:ovn-chassis-mac-mappings" is a list of key-value pairs: the key is the physnet and the value is the MAC address. The OVN controller will replace the local LRP MAC address with the defined one if a packet is for a distributed port. That will solve the issue you have in your deployment using VLAN networks.

Please let me know if that helped you.

Regards.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1766930
[2] https://github.com/ovn-org/ovn/blob/main/controller/ovn-controller.8.xml#L239
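
As a minimal sketch of the value format described in this comment, the helpers below build and parse a comma-separated list of "physnet:MAC" pairs. The function names, physnet names and MAC addresses are hypothetical and only serve the illustration; the format itself follows the description above and the linked ovn-controller documentation.

```python
import re
from typing import Dict

# Six colon-separated hexadecimal octets, e.g. aa:bb:cc:dd:ee:ff.
_MAC_RE = re.compile(r"^[0-9a-fA-F]{2}(:[0-9a-fA-F]{2}){5}$")


def format_chassis_mac_mappings(mappings: Dict[str, str]) -> str:
    """Build a 'physnet:mac,physnet:mac' string from a physnet -> MAC dict."""
    parts = []
    for physnet, mac in mappings.items():
        if not _MAC_RE.match(mac):
            raise ValueError(f"invalid MAC address for {physnet}: {mac}")
        parts.append(f"{physnet}:{mac}")
    return ",".join(parts)


def parse_chassis_mac_mappings(value: str) -> Dict[str, str]:
    """Split a 'physnet:mac,physnet:mac' string back into a dict."""
    result = {}
    for item in value.split(","):
        if not item:
            continue  # tolerate an empty value or a trailing comma
        # Split on the first colon only, because the MAC contains colons too.
        physnet, mac = item.split(":", 1)
        result[physnet] = mac
    return result


# Hypothetical mapping for one chassis; each chassis would use its own MACs.
mappings = {"physnet1": "aa:bb:cc:dd:ee:ff", "physnet2": "a1:b2:c3:d4:e5:f6"}
print(format_chassis_mac_mappings(mappings))
```

The resulting string would then be stored on each chassis, for example with `ovs-vsctl set open . external_ids:ovn-chassis-mac-mappings="physnet1:aa:bb:cc:dd:ee:ff"`.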

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (master)
Changed in kolla-ansible:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (master)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864510
Committed: https://opendev.org/openstack/kolla-ansible/commit/8bf8656dbad3def707eca2d8ddd2c9bfed389b86
Submitter: "Zuul (22348)"
Branch: master

commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819

Changed in kolla-ansible:
status: In Progress → Fix Released
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Michal:

Can we consider this bug in Neutron as not valid? So far, I see the problem you had was the definition of "ovn-chassis-mac-mappings" in the wrong group [1]; this is the parameter mentioned in c#4 that should be added to the GW chassis. Did that solve the issue?

Regards.

[1] https://review.opendev.org/c/openstack/kolla-ansible/+/864510

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

Hi Rodolfo,

At the moment it seems that it has fixed the issue. Basically we added that in the past not thinking about external ports.

I marked it as invalid in Neutron; if we see any issues, we'll reopen it in the future.
Thanks for your help!

Changed in neutron:
status: Confirmed → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/yoga)

Fix proposed to branch: stable/yoga
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/864482

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/864483

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla-ansible (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/kolla-ansible/+/864484

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Michal:

Nice to read that!

Regards.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/yoga)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864482
Committed: https://opendev.org/openstack/kolla-ansible/commit/86e2f2df428cc8e8157942b8776600fa9674ff99
Submitter: "Zuul (22348)"
Branch: stable/yoga

commit 86e2f2df428cc8e8157942b8776600fa9674ff99
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819
    (cherry picked from commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864483
Committed: https://opendev.org/openstack/kolla-ansible/commit/55fb10b7820fa6c5de45774ca793427bc27a3e38
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 55fb10b7820fa6c5de45774ca793427bc27a3e38
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819
    (cherry picked from commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to kolla-ansible (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/kolla-ansible/+/864484
Committed: https://opendev.org/openstack/kolla-ansible/commit/40a4240e88573b43e3b7a7b98a9949f8029a1d18
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 40a4240e88573b43e3b7a7b98a9949f8029a1d18
Author: Bartosz Bezak <email address hidden>
Date: Tue Nov 15 11:08:15 2022 +0100

    Generate ovn-chassis-mac-mappings on ovn-controller group

    Previously ovn-chassis-mac-mappings [1] has been added only to
    ovn-controller-compute group. However external ports are being
    scheduled on network nodes, therefore we need also do that there.

    Closes-Bug: 1995078

    [1] https://github.com/ovn-org/ovn/blob/v22.09.0/controller/ovn-controller.8.xml#L239

    Change-Id: Ie62e9220bad56262cad602ca1480e6ca65827819
    (cherry picked from commit 8bf8656dbad3def707eca2d8ddd2c9bfed389b86)

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

After making all those changes in kolla-ansible, we found that data traffic bandwidth for external VLAN ports dropped to around 30-100 kbytes/sec, compared to 200 megabytes/sec before.
Probably that's related to MAC flooding issues in OVN.

If we remove ovn-chassis-mac-mappings on network nodes, and the ha_chassis_group active chassis and gateway_chassis active chassis are the same, everything works fine (except HA).

Any other ideas?

Changed in neutron:
status: Invalid → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 15.0.0.0rc1

This issue was fixed in the openstack/kolla-ansible 15.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 13.7.0

This issue was fixed in the openstack/kolla-ansible 13.7.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible 14.7.0

This issue was fixed in the openstack/kolla-ansible 14.7.0 release.

no longer affects: kolla-ansible
Revision history for this message
Bartosz Bezak (bbezak) wrote :

In a centralised configuration (no DVR), this problem still persists: traffic from VLAN external ports (bare metal) does not reach the router, because the external port's HA chassis group is scheduled on a different active chassis than the gateway chassis (external VLAN network) active chassis.

Traffic flows normally once the chassis priorities are manually altered.

Tested on Yoga with OVN 22.09.
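
The manual workaround mentioned above amounts to giving the HA chassis group the same ordering as the gateway chassis. A rough sketch of that idea, with hypothetical chassis names and a hypothetical helper (this is not the Neutron fix, only an illustration of the re-prioritisation):

```python
from typing import Dict, Iterable


def align_ha_priorities(gateway: Dict[str, int],
                        ha_members: Iterable[str]) -> Dict[str, int]:
    """Assign HA chassis priorities that follow the gateway chassis ranking.

    Members that are not gateway chassis are ranked last. The highest
    priority equals the number of members and the lowest is 1.
    """
    members = list(dict.fromkeys(ha_members))  # drop duplicates, keep order
    ranked = sorted(members, key=lambda name: gateway.get(name, -1),
                    reverse=True)
    return {name: len(ranked) - index for index, name in enumerate(ranked)}


# Hypothetical gateway chassis priorities for the router's external LRP.
gateway_chassis = {"network-1": 1, "network-2": 3, "network-3": 2}

aligned = align_ha_priorities(gateway_chassis,
                              ["network-1", "network-2", "network-3"])
print(aligned)  # prints {'network-2': 3, 'network-3': 2, 'network-1': 1}
```

With these values both the gateway chassis list and the HA chassis group elect network-2 as the active chassis.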

Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/872023

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/872033

Changed in neutron:
status: Confirmed → In Progress
description: updated
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/872023
Committed: https://opendev.org/openstack/neutron/commit/ac231c817473c018dde8fa31594b1c9a78a36c13
Submitter: "Zuul (22348)"
Branch: master

commit ac231c817473c018dde8fa31594b1c9a78a36c13
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Jan 23 16:17:46 2023 +0100

    Improve "sync_ha_chassis_group" method

    The method "sync_ha_chassis_group" now creates (or retrieves) a
    HA Chassis Group register and updates the needed HA Chassis registers
    in a single transaction. That is possible using the new ovsdbapp
    release 2.2.1 (check the depends-on patch).

    Depends-On: https://review.opendev.org/c/openstack/ovsdbapp/+/871836

    Related-Bug: #1995078
    Change-Id: I936855214c635de0e89d5d13a86562f5b282633c

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/872033
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/kolla-ansible wallaby-eol

This issue was fixed in the openstack/kolla-ansible wallaby-eol release.

Bartosz Bezak (bbezak)
no longer affects: kolla-ansible/wallaby
no longer affects: kolla-ansible/xena
no longer affects: kolla-ansible/yoga
no longer affects: kolla-ansible/zed

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/2023.1)

Related fix proposed to branch: stable/2023.1
Review: https://review.opendev.org/c/openstack/neutron/+/903897

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/2023.1)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/903897
Committed: https://opendev.org/openstack/neutron/commit/8c26736027fd2c066eef6cd05c89ff2364a570c0
Submitter: "Zuul (22348)"
Branch: stable/2023.1

commit 8c26736027fd2c066eef6cd05c89ff2364a570c0
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Mon Jan 23 16:17:46 2023 +0100

    Improve "sync_ha_chassis_group" method

    The method "sync_ha_chassis_group" now creates (or retrieves) a
    HA Chassis Group register and updates the needed HA Chassis registers
    in a single transaction. That is possible using the new ovsdbapp
    release 2.2.1 (check the depends-on patch).

    Depends-On: https://review.opendev.org/c/openstack/ovsdbapp/+/871836

    Conflicts:
      neutron/common/ovn/utils.py
      neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_impl_idl.py
      neutron/tests/functional/plugins/ml2/drivers/ovn/mech_driver/ovsdb/test_maintenance.py

    Related-Bug: #1995078
    Change-Id: I936855214c635de0e89d5d13a86562f5b282633c
    (cherry picked from commit ac231c817473c018dde8fa31594b1c9a78a36c13)

Revision history for this message
Austin Cormier (acormier86) wrote :

Hi Rodolfo,

We are hitting this issue in our environment. I'm assuming anyone who attempts to use OVN with baremetal/external ports in VLAN tenant networks in an HA environment would hit this, and that the workaround here is to centralize all external traffic on a single node?

Does the following have any dependencies on other fixes?
 https://review.opendev.org/c/openstack/neutron/+/872033

We would be willing to test any patches you may have available to help.

Revision history for this message
Austin Cormier (acormier86) wrote :

It wasn't clear whether https://review.opendev.org/c/openstack/neutron/+/903897 was a blocker for this.


Changed in networking-generic-switch:
status: New → Triaged
importance: Undecided → High
description: updated