Not rescheduling gateways upon Chassis addition can lead to routers not being in HA

Bug #1762691 reported by Daniel Alvarez
Affects: networking-ovn
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

When we are scheduling a gateway on the available chassis, it can happen that only one node is available at that time (because the rest of the network nodes may be down for whatever reason). The router is then scheduled on that node alone, which becomes the active chassis for its gateway.

When the rest of the nodes come back up in the cluster, no rescheduling happens, so the previously scheduled router is not effectively in HA. If the active node goes down, North-South traffic to that router will be disrupted, as it won't fail over to any of the available network nodes.

The solution would be to also schedule, on the new nodes, all routers that have been scheduled on fewer than MAX_GW_CHASSIS [0] chassis (perhaps with a lower priority to avoid failovers, although this may cause some imbalance).
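
As a rough illustration of the detection step, the sketch below picks out routers hosted on fewer chassis than the limit. The function name, the shape of the `bindings` dict, and the value used for MAX_GW_CHASSIS are assumptions made for this example; none of it is taken from the networking-ovn code.

```python
# Assumed value, for illustration only; the real constant lives in
# networking-ovn (see [0]).
MAX_GW_CHASSIS = 5


def routers_missing_ha(bindings, max_gw_chassis=MAX_GW_CHASSIS):
    """Return the routers hosted on fewer than max_gw_chassis chassis.

    bindings is a hypothetical mapping of router name to
    {chassis_name: priority} for the router's gateway port.
    """
    return sorted(
        router for router, chassis in bindings.items()
        if len(chassis) < max_gw_chassis
    )


bindings = {
    "R1": {"C1": 1},           # scheduled while only C1 was up
    "R2": {"C1": 2, "C2": 1},  # already hosted on two chassis
}
# With a limit of 2, only R1 is a candidate for rescheduling.
candidates = routers_missing_ha(bindings, max_gw_chassis=2)
```

Running this check whenever a chassis joins would give the list of routers that could take the new node as an additional, lower-priority gateway chassis.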

[0] http://git.openstack.org/cgit/openstack/networking-ovn/tree/networking_ovn/l3/l3_ovn_scheduler.py?id=d40470a51314fc0c60353c9882e0d2d44c9d2aa5#n31

Changed in networking-ovn:
assignee: nobody → venkata anil (anil-venkata)
Changed in networking-ovn:
assignee: venkata anil (anil-venkata) → nobody
Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

@dalvarez did you already handle this, or was it something else?

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

Hey Miguel, I handled the scheduling but not the rescheduling.
Anil had something in mind for this. @Anil can you share it here so that someone else can take over?
Thanks!

Revision history for this message
venkata anil (anil-venkata) wrote :

When a new chassis is added, schedule_unhosted_gateways [1] is called to host unhosted gateways. In this function, we can add the new chassis to all gateways that are hosted on fewer than MAX_GW_CHASSIS [2] chassis.

1) schedule_unhosted_gateways [1] calls get_unhosted_gateways [3]. get_unhosted_gateways() should be rewritten:
   a) to look only for gateway routers (excluding routers without a gateway port)
   b) if a router's gateway is hosted on fewer than MAX_GW_CHASSIS chassis, to return that router along with its hosting chassis list (with priorities)
   c) if a gateway router is not hosted at all, to return an empty chassis list for it
   d) the return value should be, for example, {'R1': {}, 'R2': {'C1': 2, 'C2': 1}}

2) After getting the ordered chassis list (i.e. from self.scheduler.select() [4]), schedule_unhosted_gateways() should take the existing hosting list into account and give its chassis priority by moving them to the start of the chassis list before calling update_lrouter_port [5]. For example, if C3 is the newly added chassis, self.scheduler.select() may return [C2, C3, C1] as the chassis list for R2. In that case, we should change it to [C1, C2, C3] before calling update_lrouter_port().

[1] https://github.com/openstack/networking-ovn/blob/master/networking_ovn/l3/l3_ovn.py#L390
[2] https://github.com/openstack/networking-ovn/blob/master/networking_ovn/l3/l3_ovn_scheduler.py#L31
[3] https://github.com/openstack/networking-ovn/blob/master/networking_ovn/l3/l3_ovn.py#L394
[4] https://github.com/openstack/networking-ovn/blob/master/networking_ovn/l3/l3_ovn.py#L400
[5] https://github.com/openstack/networking-ovn/blob/master/networking_ovn/l3/l3_ovn.py#L405
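
The reordering in step 2 could look roughly like the sketch below. The helper names, the default limit of 5, and the reading that a higher priority number means a more preferred chassis (inferred from the R2 example above) are assumptions for illustration, not the code that was eventually merged.

```python
def prioritize_existing(existing, candidates, max_gw_chassis=5):
    """Order chassis so that current hosts keep precedence over new ones.

    existing: {chassis_name: priority}; a higher number is assumed to
        mean a more preferred chassis.
    candidates: chassis list as returned by the scheduler, in any order.
    """
    # Current hosts first, highest priority first.
    ordered = sorted(existing, key=existing.get, reverse=True)
    # Then the chassis that do not host the gateway yet, in scheduler order.
    ordered += [c for c in candidates if c not in existing]
    return ordered[:max_gw_chassis]


def assign_priorities(ordered):
    """Map an ordered chassis list to priorities, first entry highest."""
    return {chassis: len(ordered) - i for i, chassis in enumerate(ordered)}


# R2 from the example: hosted on C1 and C2, with C3 newly added.
r2_order = prioritize_existing({"C1": 2, "C2": 1}, ["C2", "C3", "C1"])
r2_priorities = assign_priorities(r2_order)
```

Because the existing hosts stay at the front, the active chassis does not change and no failover is triggered; the new chassis only joins as the lowest-priority backup.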

Changed in networking-ovn:
assignee: nobody → Reedip (reedip-banerjee)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to networking-ovn (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/653718

Changed in networking-ovn:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-ovn (master)

Fix proposed to branch: master
Review: https://review.opendev.org/657794

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on networking-ovn (master)

Change abandoned by Reedip (<email address hidden>) on branch: master
Review: https://review.opendev.org/657794

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-ovn (master)

Reviewed: https://review.opendev.org/591461
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=12070403db4851ef73011c52124cd335b6807521
Submitter: Zuul
Branch: master

commit 12070403db4851ef73011c52124cd335b6807521
Author: reedip <email address hidden>
Date: Sat May 4 03:36:07 2019 +0000

    Support for Router Scheduling on addition/removal of chassis

    The following patch provides L3HA Rescheduling of gateways when chassis
    are added/removed. It reschedules the gateway ports when a new chassis
    is added/old one removed. However, the number of chassis where a
    gateway can be hosted is limited by the constant MAX_GW_CHASSIS.

    Co-Authored-By: Maciej Józefczyk <email address hidden>

    Change-Id: I0d96efe4ceef4168039930738285c19d5c003951
    Closes-Bug: #1762691

Changed in networking-ovn:
status: In Progress → Fix Released
tags: added: networking-ovn-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn 7.0.0.0b1

This issue was fixed in the openstack/networking-ovn 7.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to networking-ovn (master)

Reviewed: https://review.opendev.org/653718
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=2aac5afcf9d850b3a2b40da4b2f0668fd464a2a7
Submitter: Zuul
Branch: master

commit 2aac5afcf9d850b3a2b40da4b2f0668fd464a2a7
Author: reedip <email address hidden>
Date: Thu Apr 18 11:00:42 2019 +0000

    Add Design documentation for L3 HA Rescheduling

    The following patch introduces a basic design for L3 HA Rescheduling
    which is being taken care in [1].

    [1]: https://review.openstack.org/591461

    Co-Authored-By: Maciej Józefczyk <email address hidden>

    Related-Bug: #1762691
    Change-Id: I123d8d44eb223f0f82ababff4f49679d5369e9dd

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-ovn (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/694362

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-ovn (stable/stein)

Reviewed: https://review.opendev.org/694362
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=156d271ae721ef219c0b64c659dd60e19c95858c
Submitter: Zuul
Branch: stable/stein

commit 156d271ae721ef219c0b64c659dd60e19c95858c
Author: reedip <email address hidden>
Date: Sat May 4 03:36:07 2019 +0000

    Support for Router Scheduling on addition/removal of chassis

    The following patch provides L3HA Rescheduling of gateways when chassis
    are added/removed. It reschedules the gateway ports when a new chassis
    is added/old one removed. However, the number of chassis where a
    gateway can be hosted is limited by the constant MAX_GW_CHASSIS.

    Co-Authored-By: Maciej Józefczyk <email address hidden>

    Change-Id: I0d96efe4ceef4168039930738285c19d5c003951
    Closes-Bug: #1762691
    (cherry picked from commit 12070403db4851ef73011c52124cd335b6807521)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-ovn (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/698252

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to networking-ovn (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/698254

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-ovn (stable/rocky)

Reviewed: https://review.opendev.org/698252
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=274cde7043fbb92fed7f595e4a6b8a0dc77c5fb3
Submitter: Zuul
Branch: stable/rocky

commit 274cde7043fbb92fed7f595e4a6b8a0dc77c5fb3
Author: reedip <email address hidden>
Date: Sat May 4 03:36:07 2019 +0000

    Support for Router Scheduling on addition/removal of chassis

    The following patch provides L3HA Rescheduling of gateways when chassis
    are added/removed. It reschedules the gateway ports when a new chassis
    is added/old one removed. However, the number of chassis where a
    gateway can be hosted is limited by the constant MAX_GW_CHASSIS.

    Co-Authored-By: Maciej Józefczyk <email address hidden>

    Change-Id: I0d96efe4ceef4168039930738285c19d5c003951
    Closes-Bug: #1762691
    (cherry picked from commit 12070403db4851ef73011c52124cd335b6807521)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to networking-ovn (stable/queens)

Reviewed: https://review.opendev.org/698254
Committed: https://git.openstack.org/cgit/openstack/networking-ovn/commit/?id=70fdad3175b9cde09a9f926a6373fea7df89c8f3
Submitter: Zuul
Branch: stable/queens

commit 70fdad3175b9cde09a9f926a6373fea7df89c8f3
Author: Maciej Józefczyk <email address hidden>
Date: Tue Jan 7 13:33:19 2020 +0000

    Support for Router Scheduling on addition/removal of chassis

    The following patch provides L3HA Rescheduling of gateways when chassis
    are added/removed. It reschedules the gateway ports when a new chassis
    is added/old one removed. However, the number of chassis where a
    gateway can be hosted is limited by the constant MAX_GW_CHASSIS.

    Co-Authored-By: Maciej Józefczyk <email address hidden>

    Conflicts:
       networking_ovn/tests/functional/base.py
       networking_ovn/tests/functional/test_router.py
       networking_ovn/l3/l3_ovn_scheduler.py

    Change-Id: I0d96efe4ceef4168039930738285c19d5c003951
    Closes-Bug: #1762691
    (cherry picked from commit 12070403db4851ef73011c52124cd335b6807521)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn 6.0.1

This issue was fixed in the openstack/networking-ovn 6.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn 5.1.0

This issue was fixed in the openstack/networking-ovn 5.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn queens-eol

This issue was fixed in the openstack/networking-ovn queens-eol release.
