OVN: Removal of chassis results in unbalanced distribution of LRPs

Bug #2023993 reported by Ihtisham ul Haq
This bug affects 2 people
Affects: neutron
Status: In Progress
Importance: Medium
Assigned to: Rodolfo Alonso

Bug Description

Consider the following setup:

Router Priority Chassis
r_a 5 gtw06
r_a 4 gtw05
r_a 3 gtw04
r_a 2 gtw03
r_a 1 gtw02

Note that r_a does not have any priority on gtw01. Now, if we stop gtw06 (using ovn-appctl exit) for maintenance, the situation afterwards becomes:

Router Priority Chassis
r_a 5 gtw05
r_a 4 gtw04
r_a 3 gtw03
r_a 2 gtw02
r_a 1 gtw01

So basically Neutron promotes the priorities for that router when it detects that the chassis (gtw06) is down. I believe it does that to avoid moving the active LRP more than once: the router has already failed over to priority 4 (gtw05), so when gtw06 goes down Neutron only updates gtw05 to priority 5, and does the same for the other priorities below 5.

The issue arising from this is that when we have many priority 5 routers on gtw06, the rescheduling (due to the failover of the chassis) does not result in a balanced distribution of the routers. To resolve that, we currently have to run an external script that rebalances the LRPs.
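
To make the promotion concrete, here is a small, purely illustrative simulation of the behaviour described above (a hypothetical helper, not Neutron code); the chassis names follow the tables:

```
def stop_chassis(bindings, stopped, spare):
    """bindings: {priority: chassis} for one router, 5 being the active one.

    The chassis below the stopped one are each promoted one priority, and the
    freed lowest priority is handed to a chassis that had none (`spare`).
    """
    survivors = [bindings[p] for p in sorted(bindings, reverse=True)
                 if bindings[p] != stopped]
    survivors.append(spare)
    top = max(bindings)
    return {top - i: chassis for i, chassis in enumerate(survivors)}


before = {5: 'gtw06', 4: 'gtw05', 3: 'gtw04', 2: 'gtw03', 1: 'gtw02'}
print(stop_chassis(before, 'gtw06', 'gtw01'))
# {5: 'gtw05', 4: 'gtw04', 3: 'gtw03', 2: 'gtw02', 1: 'gtw01'}
```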

I am not yet sure whether this is the case by design and the operator has to make sure the routers are rebalanced manually, or whether there is a better solution here that rebalances the LRPs while keeping the number of LRP failovers to a minimum.

Neutron version: Yoga

Tags: l3-ha ovn
tags: added: l3-ha ovn
Changed in neutron:
importance: Undecided → Medium
Ihtisham ul Haq (iulhaq)
description: updated
Changed in neutron:
assignee: nobody → Rodolfo Alonso (rodolfo-alonso-hernandez)
Revision history for this message
Ihtisham ul Haq (iulhaq) wrote :

An idea to solve this would be that, after the failover, we rebalance the priorities below 5 (or below the highest priority, if the total number of chassis is less than 5), so that the next failover of a chassis potentially improves the situation instead of making it worse. Of course this does not guarantee that we will have a balanced distribution after the second (or subsequent) failover.

@RodolfoAlonso Any opinions?

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I guess you are using the default ``OVNGatewayLeastLoadedScheduler`` class. When a new router port is created, the OVN L3 scheduler creates a list of "Gateway_Chassis" registers, up to 5 (this value is hardcoded). Each "Gateway_Chassis" has a 1:1 association with a GW chassis and a priority. The scheduler first checks the number of "Gateway_Chassis" associated with each "Chassis" and then creates the list of "Gateway_Chassis" that will be assigned to the "Logical_Router_Port", in its "gateway_chassis" column.
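
As a rough illustration (a simplified sketch, not the actual OVNGatewayLeastLoadedScheduler code), the least-loaded selection can be thought of like this:

```
# Simplified illustration only; names and structures are not the real Neutron ones.
MAX_GW_CHASSIS = 5  # Neutron creates at most 5 "Gateway_Chassis" per LRP


def build_gateway_chassis_list(chassis_load, max_entries=MAX_GW_CHASSIS):
    """Return a priority-ordered chassis list for one new LRP.

    chassis_load maps a chassis name to the number of "Gateway_Chassis"
    entries already associated with it; the least loaded chassis receives
    the highest priority.
    """
    candidates = sorted(chassis_load, key=lambda c: chassis_load[c])
    selected = candidates[:max_entries]
    top = len(selected)
    return [(chassis, top - i) for i, chassis in enumerate(selected)]


# Example: gtw06 is the least loaded, so it receives priority 5.
load = {'gtw01': 13, 'gtw02': 12, 'gtw03': 11, 'gtw04': 10, 'gtw05': 9, 'gtw06': 8}
print(build_gateway_chassis_list(load))
# [('gtw06', 5), ('gtw05', 4), ('gtw04', 3), ('gtw03', 2), ('gtw02', 1)]
```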

The highest priority "Gateway_Chassis" (== "Chassis") will be used to schedule the LRP, as seen in the SB "Port_Binding" register of the cr-lrp-xxx port. If the "Chassis" fails, the internal OVN scheduler (**not Neutron**) will use the second highest "Gateway_Chassis". I would like to highlight that if a "Chassis" fails, the Neutron OVN scheduler does not rebalance the "Logical_Router_Port" again.

I've tested with up to 5 network nodes (GW chassis) and several LRPs, stopping several of these network nodes. OVN re-assigns the LRPs using the "Gateway_Chassis" correctly. Because the Neutron OVN L3 scheduler creates the "Gateway_Chassis" randomly, the LRPs affected by the network node teardown are not always re-scheduled to the same GW chassis. On the contrary, the affected LRPs are scheduled to different GW nodes.

I'm not sure I'm understanding the issue you are reporting correctly, but I can't reproduce what you are explaining. Just the opposite: when one GW node is removed, the LRPs are assigned to other GW nodes randomly, not all of the affected LRPs to the same GW node.

Some questions:
* How many LRPs do you have?
* How many GW nodes?
* How many GW nodes are stopped at the same time?
* Can you specify, in your environment, what the distribution of the affected LRPs is when you stop one GW node? In other words, when a GW node is stopped, to which GW nodes are the LRPs re-scheduled?

Regards.

Revision history for this message
Ihtisham ul Haq (iulhaq) wrote :

Hi Rodolfo,

`If the "Chassis" fails, the internal OVN scheduler (**not Neutron**) will use the second highest "Gateway_Chassis".`

Right, and at this point there is no rebalancing (but rather rescheduling), which is why we get an uneven distribution. Neither OVN nor Neutron triggers any rebalancing so that the priorities of the LRPs are balanced again across all the `Chassis`. That is the issue.

* How many LRPs do you have?
We have about 2000 LRPs

* How many GW nodes?
We have 9 GTW nodes

* How many GW nodes are stopped at the same time?
Usually we stop a single one when we are doing any maintenance on it

* Can you specify, in your environment, what the distribution of the affected LRPs is when you stop one GW node? In other words, when a GW node is stopped, to which GW nodes are the LRPs re-scheduled?
We regularly do maintenance/upgrades of our GTW nodes, and when we stop one of the nodes (**using** ovn-appctl exit), the priorities get rescheduled (promoted): e.g. LRP priority 5 (if it is on the stopped GTW node) goes away, prio. 4 of that LRP is promoted to 5, and the same happens for the other priorities. Prio. 1 is then scheduled on a new GTW node. All of that is done by OVN.

And this process creates an uneven distribution of all the priorities, because the priority 4 entries of the priority 5 LRPs (on the stopped GTW node) are all promoted to priority 5.

And if we do maintenance on several GTW nodes one by one, we end up with an even worse LRP priority distribution.

It's a bit complicated to explain, but I hope that helps. If not, I can provide a more thorough example.

Thank you.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Ihtisham:

If you have 2000 LRPs and 9 GW chassis, that means you'll have around 222 LRPs per chassis. With this number of ports, the distribution of the LRP "Gateway_Chassis" will be more uniform. That means that when one GW chassis is stopped, the LRPs should be distributed evenly among the other 8 remaining GW chassis: you should get around 28 new LRPs per GW chassis from the 222 "orphan" LRPs of the stopped chassis. The "OVNGatewayLeastLoadedScheduler" guarantees that when creating the "Gateway_Chassis" list per LRP.
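
A quick back-of-the-envelope check of those numbers:

```
lrps = 2000
gw_chassis = 9
per_chassis = lrps / gw_chassis              # ~222 active LRPs per chassis
orphaned = per_chassis                       # LRPs whose active chassis stopped
per_remaining = orphaned / (gw_chassis - 1)  # ~28 extra LRPs per surviving chassis
print(round(per_chassis), round(per_remaining))  # 222 28
```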

About my question "to which GW node the LRPs are re-scheduled": what I was asking is to which chassis the orphan LRPs were assigned. I've done some manual testing in a live environment and I can confirm what I said before: the "orphan" LRPs are evenly distributed to the other GW chassis.

In any case, if you really want a method to re-schedule the LRPs once a GW chassis is stopped, how do you suggest implementing it? Remember that "rebalancing" an LRP means assigning another chassis, and that implies a traffic disconnection; so this can't be an automatic process but a manual one.

Regards.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello, any update on this bug?

Revision history for this message
Felix Huettner (felix.huettner) wrote :

Hi Rodolfo,

I agree that for a single failover the behaviour works as expected. However, for multiple failovers (one after another) we can see routers of the same priority building up on individual chassis.

To illustrate this issue I have written the example below. It is based on 5 gw chassis and 20 LRPs, and it is split into sections that show what happens after each action taken by OVN/Neutron. Each section contains one line per priority; the columns are the individual LRPs, and each digit is the chassis the LRP is assigned to at this priority:

```
initial setup

p5 12345123451234512345
p4 23451234512345123451
p3 34512345123451234512
p2 45123451234512345123
p1 51234512345123451234

Chassis 5 breaks

p5 1234x1234x1234x1234x
p4 234x1234x1234x1234x1
p3 34x1234x1234x1234x12
p2 4x1234x1234x1234x123
p1 x1234x1234x1234x1234

Neutron promotes from lower prio and adds p1 again

p5 12341123411234112341
p4 23412234122341223412
p3 34123341233412334123
p2 41234412344123441234
p1 12341234123412341234

chassis 5 returns, nothing happens (since everything is still assigned)

chassis 4 breaks

p5 123x1123x1123x1123x1
p4 23x1223x1223x1223x12
p3 3x1233x1233x1233x123
p2 x123xx123xx123xx123x
p1 123x123x123x123x123x

Neutron promotes from lower prio

p5 12311123111231112311
p4 23122231222312223122
p3 31233312333123331233
p2 123x123x123x123x123x
p1 xxxxxxxxxxxxxxxxxxxx

neutron adds p1 again

p5 12311123111231112311
p4 23122231222312223122
p3 31233312333123331233
p2 12351235123512351235
p1 53215321532153215321
```

You can see here that the priority 5 assignments now only use 3 out of the 5 gw chassis. Also, chassis 1 is significantly overrepresented.

---

For rebalancing I would never rebalance the highest priority (p5), but only p4 to p1. This would ensure there is no traffic disconnection.

I will try to build an initial implementation and also add some tests that validate the initial issue and can check that it is fixed afterwards.
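
A rough sketch of that idea (a hypothetical helper, not the patch that was eventually proposed): keep the current priority 5 binding untouched and redistribute only priorities 4..1 across the remaining chassis, least loaded first:

```
def rebalance_lower_priorities(lrp_bindings, chassis):
    """Redistribute priorities 4..1, never touching priority 5.

    lrp_bindings: dict lrp_name -> {priority: chassis_name}
    chassis: list of currently available gateway chassis names.
    The active (prio 5) chassis of each LRP is kept as-is, so the rebalance
    itself causes no traffic interruption.
    """
    # Start the load counters with the number of prio-5 LRPs per chassis,
    # used only as a simple ordering criterion.
    load = {c: 0 for c in chassis}
    for prios in lrp_bindings.values():
        active = prios.get(5)
        if active in load:
            load[active] += 1

    rebalanced = {}
    for lrp, prios in lrp_bindings.items():
        active = prios.get(5)
        # Standby candidates: everything except the active chassis, least loaded first.
        standby = sorted((c for c in chassis if c != active), key=lambda c: load[c])
        new_prios = {5: active}
        for prio, c in zip(range(4, 0, -1), standby):
            new_prios[prio] = c
            load[c] += 1  # account for the new standby assignment
        rebalanced[lrp] = new_prios
    return rebalanced
```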

Revision history for this message
Felix Huettner (felix.huettner) wrote :

I just built a test case that reproduces this behaviour. It can be reproduced with the following code in `neutron/tests/functional/services/ovn_l3/test_plugin.py`:

```
    def test_gateway_chassis_does_not_become_unbalanced(self):
        def print_prio_by_chassis():
            chassis = self.nb_api.get_all_chassis_gateway_bindings()
            chassis_prios = {}
            for chassis_name in chassis:
                prios = {}
                for (_, prio) in chassis[chassis_name]:
                    prios[prio] = prios.setdefault(prio, 0) + 1
                chassis_prios[chassis_name] = prios
            for chassis_name in sorted(chassis_prios):
                print("%s:" % chassis_name)
                for prio, count in sorted(chassis_prios[chassis_name].items()):
                    print("\tprio %s: %s routers" % (prio, count))

        # test with 6 gw chassis and 24 lrps
        chassis_list = []
        for i in range(0, ovn_const.MAX_GW_CHASSIS + 1):
            name = 'ovs-host%s' % i
            chassis_list.append(
                self.add_fake_chassis(
                    name, physical_nets=['physnet1'], name=name,
                    other_config={'ovn-cms-options': 'enable-chassis-as-gw'}))

        ext1 = self._create_ext_network(
            'ext1', 'vlan', 'physnet1', 1, "10.0.0.1", "10.0.0.0/24")
        gw_info = {'network_id': ext1['network']['id']}
        for i in range(0, 24):
            router = self._create_router('router-%s' % i, gw_info=gw_info)
            gw_port_id = router.get('gw_port_id')
            logical_port = 'cr-lrp-%s' % gw_port_id
            self.assertTrue(self.cr_lrp_pb_event.wait(logical_port),
                            msg='lrp %s failed to bind' % logical_port)
            self.sb_api.lsp_bind(logical_port, chassis_list[0],
                                may_exist=True).execute(check_error=True)

        print("Initial setup")
        self.l3_plugin.schedule_unhosted_gateways()
        print_prio_by_chassis()
        print()
        print()

        print("Now evicting each router once")
        for i in range(0, ovn_const.MAX_GW_CHASSIS + 1):
            c = chassis_list[i]
            del chassis_list[i]
            self.del_fake_chassis(c)
            print("host %s now gone" % i)
            self.l3_plugin.schedule_unhosted_gateways()
            print_prio_by_chassis()
            print()
            print()

            name = 'ovs-host%s' % i
            chassis_list.insert(
                i,
                self.add_fake_chassis(
                    name, physical_nets=['physnet1'], name=name,
                    other_config={'ovn-cms-options': 'enable-chassis-as-gw'}))
            print("host %s now back" % i)
            self.l3_plugin.schedule_unhosted_gateways()
            print_prio_by_chassis()
            print()
            print()

        # Fail intentionally so the distributions printed above end up in the
        # captured test output.
        self.assertFalse(True)
```

---

This will output the following data, clearly showing our issue:

```
    Initial setup
ovs-host0:
        prio 1: 5 routers
        prio 2: 2 routers
        prio 3: 5 routers
        prio 4: 4 routers
        prio 5: 4 routers
ovs-host1:
        prio 1: 4 routers
        prio 2: 5 ro...


Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893653

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893654

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893655

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893656

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893657

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893659

Revision history for this message
Felix Huettner (felix.huettner) wrote :

With all the changes above we now end up in the state below. From my perspective this is significantly better and also a lot more stable across repeated runs:

```
Initial setup
ovs-host0:
        prio 1: 5 routers
        prio 2: 5 routers
        prio 3: 5 routers
        prio 4: 5 routers
        prio 5: 5 routers
ovs-host1:
        prio 1: 5 routers
        prio 2: 6 routers
        prio 3: 5 routers
        prio 4: 5 routers
        prio 5: 5 routers
ovs-host2:
        prio 1: 7 routers
        prio 2: 4 routers
        prio 3: 5 routers
        prio 4: 5 routers
        prio 5: 5 routers
ovs-host3:
        prio 1: 4 routers
        prio 2: 6 routers
        prio 3: 6 routers
        prio 4: 4 routers
        prio 5: 5 routers
ovs-host4:
        prio 1: 4 routers
        prio 2: 4 routers
        prio 3: 5 routers
        prio 4: 6 routers
        prio 5: 5 routers
ovs-host5:
        prio 1: 5 routers
        prio 2: 5 routers
        prio 3: 4 routers
        prio 4: 5 routers
        prio 5: 5 routers
Movements so far: 0

host 0 now gone
ovs-host1:
        prio 1: 5 routers
        prio 2: 4 routers
        prio 3: 9 routers
        prio 4: 6 routers
        prio 5: 6 routers
ovs-host2:
        prio 1: 6 routers
        prio 2: 6 routers
        prio 3: 5 routers
        prio 4: 8 routers
        prio 5: 5 routers
ovs-host3:
        prio 1: 6 routers
        prio 2: 9 routers
        prio 3: 6 routers
        prio 4: 3 routers
        prio 5: 6 routers
ovs-host4:
        prio 1: 8 routers
        prio 2: 5 routers
        prio 3: 4 routers
        prio 4: 6 routers
        prio 5: 7 routers
ovs-host5:
        prio 1: 5 routers
        prio 2: 6 routers
        prio 3: 6 routers
        prio 4: 7 routers
        prio 5: 6 routers
Movements so far: 64

host 0 now back
ovs-host0:
        prio 1: 5 routers
        prio 2: 4 routers
        prio 3: 5 routers
        prio 4: 11 routers
ovs-host1:
        prio 1: 4 routers
        prio 2: 4 routers
        prio 3: 8 routers
        prio 4: 3 routers
        prio 5: 6 routers
ovs-host2:
        prio 1: 6 routers
        prio 2: 6 routers
        prio 3: 3 routers
        prio 4: 5 routers
        prio 5: 5 routers
ovs-host3:
        prio 1: 6 routers
        prio 2: 9 routers
        prio 3: 6 routers
        prio 4: 2 routers
        prio 5: 6 routers
ovs-host4:
        prio 1: 5 routers
        prio 2: 4 routers
        prio 3: 3 routers
        prio 4: 4 routers
        prio 5: 7 routers
ovs-host5:
        prio 1: 4 routers
        prio 2: 3 routers
        prio 3: 5 routers
        prio 4: 5 routers
        prio 5: 6 routers
Movements so far: 89

host 1 now gone
ovs-host0:
        prio 1: 2 routers
        prio 2: 6 routers
        prio 3: 10 routers
        prio 4: 9 routers
        prio 5: 3 routers
ovs-host2:
        prio 1: 8 routers
        prio 2: 7 routers
        prio 3: 5 routers
        prio 4: 3 routers
        prio 5: 7 routers
ovs-host3:
        prio 1: 4 routers
        prio 2: 7 routers
        prio 3: 8 routers
        prio 4: 4 routers
        prio 5: 7 routers
ovs-host4:
        prio 1: 8 routers
        prio 2: 6 routers
        prio 3: 3 routers
        prio 4: 6 routers
 ...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893655
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893659
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893657
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/893653
Committed: https://opendev.org/openstack/neutron/commit/b5f5f3def3cec69558f7eb6edb04b95ede34b0d3
Submitter: "Zuul (22348)"
Branch: master

commit b5f5f3def3cec69558f7eb6edb04b95ede34b0d3
Author: Felix Huettner <email address hidden>
Date: Thu Aug 31 16:05:38 2023 +0200

    ovn-l3 scheduler: calculate load of chassis per priority

    previously we calculated the "load" of a chassis across the highest
    priority of each of the chassis. This can lead to suboptimal results in
    the following situation:
    * you have gateway chassis: hv1, hv2, hv3
    * you have routers:
       * g1: with priority 3 on hv1, priority 2 on hv2, priority 1 on hv3
       * g2: with priority 3 on hv1, priority 2 on hv2, priority 1 on hv3
       * g3: with priority 3 on hv3, priority 2 on hv2, priority 1 on hv1
       * g4: with priority 3 on hv3, priority 2 on hv2, priority 1 on hv1

    When creating a new router, the previous algorithm would have placed
    prio 3 of it either on hv1 or hv3, since their count of highest
    priorities (2 of prio 3) is lower than the count of the highest priority
    of hv2 (4 of prio 2). So it might have looked like:
    * g5: with priority 3 on hv3, priority 2 on hv1, priority 1 on hv3
    (This case has been implemented as `test_least_loaded_chassis_per_priority2`).

    However, this is actually an undesired result. In OVN the gateway chassis
    with the highest priority actually hosts the router and processes all of
    its external traffic. This means it is highly important that the highest
    priority is well balanced.

    To do this now we no longer blindly use the count of routers of the
    highest priority per chassis, but we only count the routers of the
    priority we are currently searching a chassis for. This ensures that in
    the above case we would have picked hv2 for priority 3, since it has no
    active router running.

    The algorithm implemented now is based upon the assumption that the amount
    of priorities scheduled per router is equal over all routers. This means
    it will perform suboptimally if some physical network is available on 5
    gateway chassis, while another one is only available on 2. (It is
    however unclear if the previous implementation would have been better
    there).

    In this commit we also adapt the testcases in test_l3_ovn_scheduler to match
    this assumption. Previously the distribution data used for testing
    had been unrealistic, as it mostly scheduled one gateway chassis for each
    router.

    It also fixes the previously broken priority calculation in the
    testcase, that would just assign prio 0 to all gateways.

    Partial-Bug: #2023993
    Change-Id: If2afcd546a1da9964704bcebbfa39d8348e14fe8
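
A minimal sketch of what "per priority" load counting means here (simplified illustration, not the actual Neutron code): when choosing a chassis for priority N, count only the routers already bound at that same priority on each candidate:

```
def least_loaded_for_priority(existing, candidates, priority):
    """Pick the candidate hosting the fewest routers at exactly `priority`.

    existing: dict chassis -> {priority: router count}
    """
    return min(candidates, key=lambda c: existing.get(c, {}).get(priority, 0))


# The example from the commit message: hv1 and hv3 each host 2 routers at
# prio 3, hv2 hosts 4 routers at prio 2 but none at prio 3, so hv2 is picked
# for the new router's prio 3.
existing = {
    'hv1': {3: 2, 1: 2},
    'hv2': {2: 4},
    'hv3': {3: 2, 1: 2},
}
print(least_loaded_for_priority(existing, ['hv1', 'hv2', 'hv3'], 3))  # hv2
```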

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/neutron/+/893654
Committed: https://opendev.org/openstack/neutron/commit/3d5d82a197f985ad9e08fd31a7d7548b875386c0
Submitter: "Zuul (22348)"
Branch: master

commit 3d5d82a197f985ad9e08fd31a7d7548b875386c0
Author: Felix Huettner <email address hidden>
Date: Fri Sep 1 13:00:44 2023 +0200

    ovn-l3: reschedule lower priorities

    If a gateway chassis is removed, we previously only plugged the hole it
    left in the priorities of the LRPs. This can lead to bad choices, since we
    are bound by all other currently used chassis.
    By allowing us to also reschedule the lower priorities we get
    significantly more freedom in choosing the most appropriate chassis and
    prevent overloading an individual one.

    As an example from the new testcase:
    previously we would have had all prio 2 schedules on chassis3, but with
    this change they now also distribute to chassis4.

    Partial-Bug: #2023993
    Change-Id: I786ff6c0c4d3403b79819df95f9b1d6ac5e8675f

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/906277

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/906277
Committed: https://opendev.org/openstack/neutron/commit/188fe6c9538861d0adc7bc283e56899767c7d666
Submitter: "Zuul (22348)"
Branch: master

commit 188fe6c9538861d0adc7bc283e56899767c7d666
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed Jan 17 18:20:04 2024 +0000

    [OVN] Document the OVN L3 scheduler

    This new document adds:
    * A definition of the OVN L3 scheduler
    * A description of the different OVN L3 schedulers
    * How the LRP are re-scheduled if the gateway chassis list
      changes.

    Related-Bug: #2023993
    Change-Id: Idcc0e34227e47df53a1f395c8fd163723d54b933

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893659
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893657
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/893656
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
