[RFE] BGP Speaker peer sessions down when rabbitmq offline

Bug #2006145 reported by Maximilian Stinsky
This bug affects 3 people
Affects: neutron
Status: In Progress
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Greetings,

While testing a couple of disaster scenarios in our lab environment, we noticed that when we stop our rabbitmq cluster completely, the neutron dynamic routing BGP speaker shuts down all BGP sessions to its peers.

As a result, all announced floating IPs and subnet pools go offline.

We are running neutron wallaby (18.5.0) with the StaticScheduler for the neutron bgp part.

In my opinion the BGP speaker should continue to announce its locally cached state until the rabbitmq connection can be reestablished.
As most rabbitmq upgrades require a full downtime, it is almost impossible to upgrade rabbitmq without taking openstack offline when using neutron dynamic routing.

tags: added: l3-bgp
Dr. Jens Harbott (j-harbott) wrote :

The latest versions of rabbitmq support rolling upgrades, so I'm not sure whether that scenario will still be relevant going forward. Also, if a node loses connectivity to rabbitmq, I'm not sure that serving stale data is better than stopping announcements and hoping other nodes take over. At the very least, this change of behavior would need to be configurable and possibly include a timeout after which announcements would still be dropped, maybe 5 minutes, similar to what happens in a graceful restart scenario. So I'd suggest treating this as a feature request rather than a bug.
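
To make the timeout idea concrete, here is a minimal sketch of a configurable hold window. It is not neutron-dynamic-routing code: the class `StaleAnnouncementPolicy`, the `hold_seconds` parameter, and the injectable `clock` are all hypothetical names chosen for illustration, and the 300-second default simply mirrors the "maybe 5 minutes" mentioned above. How the agent would actually detect that the message bus is gone is left out.

```python
import time
from typing import Callable, Optional


class StaleAnnouncementPolicy:
    """Decide whether cached BGP routes may still be announced.

    Hypothetical helper for illustration only; it is not part of
    neutron-dynamic-routing.
    """

    def __init__(self, hold_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.hold_seconds = hold_seconds
        self._clock = clock
        # None while the message bus is reachable.
        self._lost_at: Optional[float] = None

    def bus_lost(self) -> None:
        # Record only the first loss, so repeated connection errors
        # do not keep extending the hold window.
        if self._lost_at is None:
            self._lost_at = self._clock()

    def bus_restored(self) -> None:
        # A working connection means the cached state can be refreshed again.
        self._lost_at = None

    def keep_announcing(self) -> bool:
        # Announce freely while connected; after a loss, keep the cached
        # routes only until the hold window has elapsed.
        if self._lost_at is None:
            return True
        return self._clock() - self._lost_at < self.hold_seconds
```

In this sketch the agent would call `bus_lost()` and `bus_restored()` from its connection handling and check `keep_announcing()` before withdrawing routes. Whether a window that is never closed (announce stale state indefinitely) should be allowed is exactly the policy question discussed in this thread.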

Maximilian Stinsky (mstinsky) wrote :

In newer versions of rabbitmq there is a high chance that rolling upgrades work, but it is still possible that some versions add new feature flags that don't support rolling upgrades.

I understand the concern about one agent serving stale data if a single node loses its connection to rabbitmq, but losing all public connectivity to the cloud because of a rabbitmq problem seems like too much impact.

In my opinion the agent should continue to announce its stale state, and an operator needs to have alerting that the agent is down because it lost its connection to rabbitmq and then act according to the situation.

Thinking about it, don't other agents such as the l3 agent and ovs-agent more or less do the same and just keep their local state? They don't just remove router IPs, floating IPs, security groups or ovs-flows when they lose their connection to rabbitmq.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello Maximilian:

I understand the inconvenience of having the BGP agent down if the MQ service is down too. However, this is how it is currently implemented in n-d-r.

This could be considered an improvement (a very good one), and this is why I'm marking this bug as an RFE [1]. We are currently short of working cycles, so if you can work on this new functionality, that would be great.

Regards.

[1] https://meetings.opendev.org/meetings/networking/2023/networking.2023-02-21-14.00.log.html#l-108

summary: - BGP Speaker peer sessions down when rabbitmq offline
+ [RFE] BGP Speaker peer sessions down when rabbitmq offline
tags: added: rfe
Brian Haley (brian-haley) wrote :

Closed https://bugs.launchpad.net/neutron/+bug/2039812 as a duplicate of this one, but it might have some info that is useful when proposing an RFE.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron-specs (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-specs/+/899209

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron-dynamic-routing (master)
Changed in neutron:
status: New → In Progress
tags: added: rfe-approved
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron-specs (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-specs/+/899209
Committed: https://opendev.org/openstack/neutron-specs/commit/9623a1c850b139ee42f2facaa31b61aed18c0bbc
Submitter: "Zuul (22348)"
Branch: master

commit 9623a1c850b139ee42f2facaa31b61aed18c0bbc
Author: Roberto Bartzen Acosta <email address hidden>
Date: Tue Oct 24 17:05:27 2023 -0300

    Add spec for BGP speaker peer sessions resilient - RFE

    Depends-On: https://review.opendev.org/c/openstack/neutron-specs/+/914043
    Related-bug: #2006145
    Change-Id: Ib365b9641dd5e932df705bb263bad9e0f73c508b
