neutron

[RFE] BGP Speaker peer sessions down when rabbitmq offline

Bug #2006145 reported by Maximilian Stinsky on 2023-02-06

This bug affects 3 people

Affects	Status	Importance	Assigned to	Milestone
neutron	In Progress	Undecided	Unassigned

Bug Description

Greetings,

While testing several disaster scenarios in our lab environment, we noticed that when we stop our RabbitMQ cluster completely, the neutron-dynamic-routing BGP speaker shuts down all BGP sessions to its peers.

As a result, all announced floating IPs and subnet pools go offline.

We are running Neutron Wallaby (18.5.0) with the StaticScheduler for the Neutron BGP component.

In my opinion, the BGP speaker should continue to announce its locally cached state until the RabbitMQ connection is re-established.
Because most RabbitMQ upgrades require full downtime, it is almost impossible to upgrade RabbitMQ without taking OpenStack offline when using neutron-dynamic-routing.

Elvira García Ruiz (elviragr) on 2023-02-07

tags:	added: l3-bgp

Dr. Jens Harbott (j-harbott) wrote on 2023-02-16:

The latest versions of RabbitMQ support rolling upgrades, so I'm not sure whether that scenario will still be relevant going forward. Also, if a node loses connectivity to RabbitMQ, I'm not sure that serving stale data is better than stopping announcements and hoping other nodes take over. At the very least, this change of behavior would need to be configurable and possibly include a timeout after which announcements would still be dropped, maybe 5 minutes, similar to what would happen in a graceful restart scenario. So I'd suggest treating this as a feature request rather than a bug.
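
To make the configurable-timeout idea concrete, here is a minimal Python sketch of a hold timer that lets an agent keep announcing cached routes for a bounded period after it loses the message queue. It is not taken from neutron-dynamic-routing: the class name `StaleRouteHoldTimer`, the `hold_seconds` parameter, and the 300-second default are all hypothetical and only mirror the "maybe 5 minutes" figure mentioned here.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StaleRouteHoldTimer:
    """Hypothetical helper (not part of n-d-r): decides whether cached
    routes may still be announced while the message queue is unreachable."""

    # Assumed default of 5 minutes; 0 means "withdraw as soon as MQ is lost".
    hold_seconds: float = 300.0
    # Injectable clock so the logic can be exercised without sleeping.
    clock: Callable[[], float] = time.monotonic
    _lost_at: Optional[float] = field(default=None, init=False)

    def mq_lost(self) -> None:
        """Record the moment the MQ connection was first seen as down."""
        if self._lost_at is None:
            self._lost_at = self.clock()

    def mq_restored(self) -> None:
        """Clear the outage marker once the MQ connection is back."""
        self._lost_at = None

    def should_announce(self) -> bool:
        """True while MQ is up, or while the outage is shorter than the hold."""
        if self._lost_at is None:
            return True
        return self.clock() - self._lost_at < self.hold_seconds


if __name__ == "__main__":
    timer = StaleRouteHoldTimer(hold_seconds=300.0)
    timer.mq_lost()
    # Immediately after the outage starts, cached routes are still announced.
    print(timer.should_announce())
```

With `hold_seconds=0` the sketch withdraws announcements as soon as the connection is lost, which corresponds to the behavior described in this report, so a setting like this could keep the existing behavior available to operators who prefer it.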

Maximilian Stinsky (mstinsky) wrote on 2023-02-17:

In newer versions of RabbitMQ there is a good chance that rolling upgrades work, but it is still possible that some versions add new feature flags that do not support rolling upgrades.

I understand the concern about a single agent serving stale data if one node loses its connection to RabbitMQ, but losing all public connectivity to the cloud because of a RabbitMQ problem seems like too much impact.

In my opinion, the agent should continue to announce its stale state, and operators need alerting that tells them the agent is down because it lost its connection to RabbitMQ, so they can act according to the situation.

When thinking about it, don't other agents, such as the L3 agent and the OVS agent, more or less do the same and just keep their local state? They don't remove router IPs, floating IPs, security groups, or OVS flows when they lose their connection to RabbitMQ.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote on 2023-02-21:

Hello Maximilian:

I understand the inconvenience of having the BGP agent down whenever the MQ service is down. However, this is how it is currently implemented in n-d-r.

This could be considered an improvement (a very good one), which is why I'm marking this bug as an RFE [1]. We are currently short on working cycles, so if you can work on this new functionality, that would be great.

Regards.

[1] https://meetings.opendev.org/meetings/networking/2023/networking.2023-02-21-14.00.log.html#l-108

summary:	- BGP Speaker peer sessions down when rabbitmq offline
	+ [RFE] BGP Speaker peer sessions down when rabbitmq offline
tags:	added: rfe

Brian Haley (brian-haley) wrote on 2023-10-19:

Closed https://bugs.launchpad.net/neutron/+bug/2039812 as a duplicate of this, but it might have some info that is useful when proposing an RFE.

OpenStack Infra (hudson-openstack) wrote on 2023-10-24: Related fix proposed to neutron-specs (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-specs/+/899209

OpenStack Infra (hudson-openstack) wrote on 2023-11-06: Fix proposed to neutron-dynamic-routing (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron-dynamic-routing/+/900246

Changed in neutron:
status:	New → In Progress

Brian Haley (brian-haley) on 2023-11-10

tags:	added: rfe-approved

OpenStack Infra (hudson-openstack) wrote on 2024-04-05: Related fix merged to neutron-specs (master)

Reviewed: https://review.opendev.org/c/openstack/neutron-specs/+/899209
Committed: https://opendev.org/openstack/neutron-specs/commit/9623a1c850b139ee42f2facaa31b61aed18c0bbc
Submitter: "Zuul (22348)"
Branch: master

commit 9623a1c850b139ee42f2facaa31b61aed18c0bbc
Author: Roberto Bartzen Acosta <email address hidden>
Date: Tue Oct 24 17:05:27 2023 -0300

    Add spec for BGP speaker peer sessions resilient - RFE

    Depends-On: https://review.opendev.org/c/openstack/neutron-specs/+/914043
    Related-bug: #2006145
    Change-Id: Ib365b9641dd5e932df705bb263bad9e0f73c508b
