[ndr] neutron-bgp-dragent is racy when a service restart is made just before a speaker is added

Bug #2024481 reported by Dmitrii Shcherbakov
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
New
Undecided
Unassigned
neutron-dynamic-routing (Ubuntu)
New
Undecided
Unassigned

Bug Description

Hit a race with the Antelope (22.0.0) version of NDR in one of our functional test runs:

1) neutron-bgp-dragent got restarted right before creating a speaker and adding an external network and tenant network to it;
2) As can be seen in the service log below, just after neutron-bgp-dragent started, it tried to advertise a route (00:03:21.766) before a speaker got added to it (00:03:22.251) - which it failed to do with the `BgpSpeakerNotAdded` exception:
https://github.com/openstack/neutron-dynamic-routing/blob/13e0d8a63dbdbd9e1a863144999794d4fc9af22d/neutron_dynamic_routing/services/bgp/agent/driver/os_ken/driver.py#L150-L154

3) As a result, the peer (FRR in our case) only got a floating IP route (/32) in the test result in the tenant network route (/24) was never advertised.

Test steps (downstream) that generated the log lines: https://github.com/openstack-charmers/zaza-openstack-tests/blob/edd7717dc2ca300cfb94729d9d6bb7021787906c/zaza/openstack/configure/bgp_speaker.py#L65-L100
The service restart is done prior to calling the test code above (notably, it was done as a workaround for something else but inadvertently helped to trigger this edge case):
https://github.com/openstack-charmers/zaza-openstack-tests/blob/edd7717dc2ca300cfb94729d9d6bb7021787906c/zaza/openstack/charm_tests/dragent/configure.py#L92-L103

The lack of a route at the peer side can be seen at 2023-06-19 00:03:32 here:
https://openstack-ci-reports.ubuntu.com/artifacts/e4c/886157/8/check/jammy-antelope-ovn/e4c9b5d/job-output.txt
2023-06-19 00:03:32.346994 | focal-medium |
2023-06-19 00:03:32.347012 | focal-medium | B>* 100.64.0.144/32 [20/0] via 172.16.27.207, ens3, weight 1, 00:00:07
2023-06-19 00:03:32.347045 | focal-medium |

Summary: It looks like neutron-bgp-dragent may try to advertise routes it gets from a DB before a speaker is added by it. It should properly make sure a speaker is present before trying to advertise routes. If speakers aren't scheduled to it yet, it should attempt to advertise as soon as one is present on it.

-------

Functional test log:

2023-06-19 00:03:19.709430 | focal-medium | 2023-06-19 00:03:19 [INFO] Setting up BGP speaker
2023-06-19 00:03:20.307141 | focal-medium | 2023-06-19 00:03:20 [INFO] Creating BGP Speaker
2023-06-19 00:03:20.434428 | focal-medium | 2023-06-19 00:03:20 [INFO] Advertising BGP routes
2023-06-19 00:03:20.678231 | focal-medium | 2023-06-19 00:03:20 [INFO] Advertising ext_net network on BGP Speaker bgp-speaker
2023-06-19 00:03:20.919232 | focal-medium | 2023-06-19 00:03:20 [INFO] Advertising private network on BGP Speaker bgp-speaker
2023-06-19 00:03:21.155337 | focal-medium | 2023-06-19 00:03:21 [INFO] Setting up BGP peer
2023-06-19 00:03:22.099859 | focal-medium | 2023-06-19 00:03:22 [INFO] Creating BGP Peer
2023-06-19 00:03:22.142524 | focal-medium | 2023-06-19 00:03:22 [INFO] Adding BGP peer to BGP speaker
2023-06-19 00:03:22.143374 | focal-medium | 2023-06-19 00:03:22 [INFO] Adding peer osci-frr on BGP Speaker bgp-speaker
2023-06-19 00:03:22.208265 | focal-medium | 2023-06-19 00:03:22 [INFO] Creating floating IP to advertise
2023-06-19 00:03:22.301280 | focal-medium | 2023-06-19 00:03:22 [INFO] Creating port: NDR_TEST_FIP
2023-06-19 00:03:23.599942 | focal-medium | 2023-06-19 00:03:23 [INFO] Creating floatingip
2023-06-19 00:03:26.351808 | focal-medium | 2023-06-19 00:03:26 [INFO] Advertised floating IP: 100.64.0.144

neutron-bgp-dragent.log:

2023-06-19 00:03:20.751 26428 INFO neutron.common.config [-] Logging enabled!
2023-06-19 00:03:20.751 26428 INFO neutron.common.config [-] /usr/bin/neutron-bgp-dragent version 22.0.0
2023-06-19 00:03:21.533 26428 INFO neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver [-] Initializing os-ken driver for BGP functionality.
2023-06-19 00:03:21.533 26428 INFO neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver [-] Initialized os-ken BGP Speaker driver interface with bgp_router_id=172.16.0.46
2023-06-19 00:03:21.578 26428 INFO neutron_dynamic_routing.services.bgp.agent.bgp_dragent [-] BGP dynamic routing agent started
2023-06-19 00:03:21.748 26428 INFO bgpspeaker.api.base [None req-3e563ce5-7b78-46d2-9dd3-02067da4e197 - - - - - -] API method core.start called with args: {'waiter': <os_ken.lib.hub.Event object at 0x7fb16c0c6950>, 'local_as': 4279238701, 'router_id': '172.16.0.46', 'bgp_server_hosts': ('0.0.0.0', '::'), 'bgp_server_port': 0, 'refresh_stalepath_time': 0, 'refresh_max_eor_time': 0, 'label_range': (100, 100000), 'allow_local_as_in_count': 0, 'cluster_id': None, 'local_pref': 100}
2023-06-19 00:03:21.766 26428 ERROR neutron_dynamic_routing.services.bgp.agent.bgp_dragent [None req-f082fe6d-cb70-4761-b02d-19f38bda7ae2 - - - - - -] Call to driver for BGP Speaker 04d9b59c-e4b9-4756-92b3-df364fa7bd0d advertise_route has failed with exception BGP Speaker for local_as=4279238701 with router_id=172.16.0.46 not added yet..: neutron_dynamic_routing.services.bgp.agent.driver.exceptions.BgpSpeakerNotAdded: BGP Speaker for local_as=4279238701 with router_id=172.16.0.46 not added yet.
2023-06-19 00:03:21.768 26428 INFO neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver [None req-3e563ce5-7b78-46d2-9dd3-02067da4e197 - - - - - -] Added BGP Speaker for local_as=4279238701 with router_id= 172.16.0.46.
2023-06-19 00:03:22.249 26428 INFO bgpspeaker.api.base [None req-eac9e066-f8a0-417e-a2cb-9c5995f1011d 166ebaf50413428c83ab07b94245479d 61efb6f767ed4ac39728f0e96f87a7f6 - - 729a6868d7ef40cea300ac221475d98e 729a6868d7ef40cea300ac221475d98e] API method neighbor.create called with args: {'connect_mode': 'active', 'is_next_hop_self': False, 'cap_mbgp_ipv6': True, 'is_route_server_client': False, 'remote_as': 4279238721, 'is_route_reflector_client': False, 'cap_enhanced_refresh': False, 'cap_mbgp_l2vpnfs': False, 'cap_mbgp_ipv4fs': False, 'cap_mbgp_vpnv4': False, 'cap_mbgp_ipv6fs': False, 'cap_mbgp_vpnv6fs': False, 'peer_next_hop': None, 'cap_mbgp_ipv4': True, 'cap_mbgp_vpnv4fs': False, 'cap_four_octet_as_number': True, 'ip_address': '172.16.0.66', 'cap_mbgp_vpnv6': False, 'remote_port': 179, 'cap_mbgp_evpn': False}
2023-06-19 00:03:22.251 26428 INFO neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver [None req-eac9e066-f8a0-417e-a2cb-9c5995f1011d 166ebaf50413428c83ab07b94245479d 61efb6f767ed4ac39728f0e96f87a7f6 - - 729a6868d7ef40cea300ac221475d98e 729a6868d7ef40cea300ac221475d98e] Added BGP Peer 172.16.0.66 for remote_as=4279238721 to BGP Speaker running for local_as=4279238701.
2023-06-19 00:03:23.261 26428 INFO bgpspeaker.peer [-] Connection to peer: 172.16.0.66 established
2023-06-19 00:03:23.263 26428 INFO neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver [-] BGP Peer 172.16.0.66 for remote_as=23456 is UP.
2023-06-19 00:03:24.370 26428 INFO neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver [-] Best path change observed. cidr=172.16.0.0/16, nexthop=172.16.0.66, remote_as=4279238721, is_withdraw=False
2023-06-19 00:03:25.450 26428 INFO bgpspeaker.api.base [None req-af11f291-7671-4ff9-bce2-d7a8dedd5037 - - - - - -] API method network.add called with args: {'prefix': '100.64.0.144/32', 'next_hop': '172.16.27.207'}
2023-06-19 00:03:25.451 26428 INFO neutron_dynamic_routing.services.bgp.agent.driver.os_ken.driver [None req-af11f291-7671-4ff9-bce2-d7a8dedd5037 - - - - - -] Route cidr=100.64.0.144/32, nexthop=172.16.27.207 is advertised for BGP Speaker running for local_as=4279238701.
2023-06-19 00:03:25.507 26428 ERROR bgpspeaker.peer [-] AS_PATH on UPDATE message has loops. Ignoring this message: BGPUpdate(len=53,nlri=[BGPNLRI(addr='100.64.0.144',length=32)],path_attributes=[BGPPathAttributeOrigin(flags=64,length=1,type=1,value=0), BGPPathAttributeAsPath(flags=80,length=10,type=2,value=[[4279238721, 4279238701]]), BGPPathAttributeNextHop(flags=64,length=4,type=3,value='172.16.0.66')],total_path_attribute_len=25,type=2,withdrawn_routes=[],withdrawn_routes_len=0)

description: updated
description: updated
description: updated
no longer affects: neutron (Ubuntu)
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Not sure if related or not, but the final error "ERROR bgpspeaker.peer [-] AS_PATH on UPDATE message has loops." indicates that you have a broken BGP peer configuration, sending prefixes back to the bgp-agent, which shouldn't happen. In another recent bug, fixing that configuration also fixed other issues, so I would start with repairing that and see if you still can reproduce the issue afterwards.

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Actually thinking about this a bit more, that error proves that n-d-r was actually announcing the prefix in question at that time, so not seeing it in your BGP peer output must have a different reason than what you are suggesting.

Changed in neutron:
status: New → Incomplete
Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote (last edit ):

Well, the prefix you are talking about is for a floating ip (the /32) one. The one that's missing and causing the func test to fail is for the tenant network subnet 192.168.0.0/24:

2023-06-19 00:09:58.409988 | focal-medium | 2023-06-19 00:09:58 [INFO] AssertionError: CIDR, 192.168.0.0/24, not found in BGP peer's routing table: Codes: K - kernel route, C - connected, S - static, R - RIP,

vs 100.64.0.172/32 that is present but is causing is as you point out:

2023-06-19 00:09:46.067 26428 ERROR bgpspeaker.peer [-] AS_PATH on UPDATE message has loops. Ignoring this message: BGPUpdate(len=53,nlri=[BGPNLRI(addr='100.64.0.172',length=32)],path_attributes=[BGPPathAttributeOrigin(flags=64,length=1,type=1,value=0), BGPPathAttributeAsPath(flags=80,length=10,type=2,value=[[4279238721, 4279238701]]), BGPPathAttributeNextHop(flags=64,length=4,type=3,value='172.16.0.66')],total_path_attribute_len=25,type=2,withdrawn_routes=[],withdrawn_routes_len=0)

I agree this error should be fixed by adjusting the peer config but I still think the error above it is what's causing the lack of a /24 route advertisement:

 2023-06-19 00:03:21.766 26428 ERROR neutron_dynamic_routing.services.bgp.agent.bgp_dragent [None req-f082fe6d-cb70-4761-b02d-19f38bda7ae2 - - - - - -] Call to driver for BGP Speaker 04d9b59c-e4b9-4756-92b3-df364fa7bd0d advertise_route has failed with exception BGP Speaker for local_as=4279238701 with router_id=172.16.0.46 not added yet..: neutron_dynamic_routing.services.bgp.agent.driver.exceptions.BgpSpeakerNotAdded: BGP Speaker for local_as=4279238701 with router_id=172.16.0.46 not added yet.

In the func test we retry with tenacity several times checking if the tenant network subnet is present at the other side whereas neutron-bgp-agent doesn't even attempt to advertise it in this case while a floating IP is created after networks are added to the speaker.

So the sequence of events is:

1) neutron-bgp-agent restart
2) bgp speaker gets added to the DB
3) external and tenant networks get added to the speaker in the DB
4) the advertise_route error happens
5) a peer is added to the BGP speaker
6) a floating IP gets created
7) a floating IP gets advertised
8) the test loops with tenacity to assess the advertised routes

Changed in neutron:
status: Incomplete → New
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Ah right, I mixed those routes up, sorry. So the next question is: how long does your test wait? IIUC usually such a situation should be cleaned up by a periodic_resync task, which by default runs every 40s, can you check that in your log? If 40s is too long for you, you could try to reduce that interval in the neutron config.

If the resync does not fix the announcement, that would be the first location that I would see where a fix could be applied.

Also, just out of curiosity, do you plan to work on a fix yourself?

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.