Neutron Dynamic Routing : Connection to peers lost

Bug #2020349 reported by Yusuf Güngör
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Hi everyone, we are seeing random connection losses for ALL BGP peers. Logs are attached to this issue.

May 20, 2023 @ 12:26:51.276 controller-04 BGP Peer 10.210.48.2 for remote_as=64611 went DOWN.
May 20, 2023 @ 12:26:51.279 controller-04 Connection to peer 10.210.48.3 lost, reason: unpack_from requires a buffer of at least 1 bytes for unpacking 1 bytes at offset 0 (actual buffer size is 0) Resetting retry connect loop: False

Do you have any idea? Thanks.

Environment Details:
 OpenStack Version: Wallaby (cluster installed via kolla-ansible)
 OS Version: Ubuntu 20.04.2 LTS hosts (kernel: 5.4.0-90-generic)
 Neutron Version: 18.1.2.dev118 ["neutron-server", "neutron-dhcp-agent", "neutron-openvswitch-agent", "neutron-l3-agent", "neutron-bgp-dragent", "neutron-metadata-agent"]
 There are 5 combined controller+network nodes.
 Open vSwitch is used in DVR mode and router HA is disabled (l3_ha = false).
 We are using a single centralized neutron router to connect all tenant networks to the provider network.
 We are using bgp_dragent to announce unique tenant networks.
 Tenant network type: vxlan
 External network type: vlan

Tags: l3-bgp
yatin (yatinkarel)
tags: added: l3-bgp
Revision history for this message
yatin (yatinkarel) wrote :

@tobias, @jens, can you please check whether you have any idea what could cause these random disconnections?

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

What type of router are you using as BGP peer? Are there logs on that router? Also, is this a new installation or was this working before and only started to fail now? Please also show the output of "openstack bgp speaker list" and "openstack bgp peer list".

Changed in neutron:
status: New → Incomplete
Revision history for this message
Yusuf Güngör (yusuf2) wrote (last edit ):

Hi Jens,

Our BGP peers are VMware NSX-T edges. We are not permitted to see the logs on the NSX edge side, but we have requested them so we can share them with you. This is an existing installation that has only now started to fail. We are running tests to reproduce the failure scenario, but with no success yet.

```
$ openstack bgp speaker list
+--------------------------------------+-------------+----------+------------+
| ID                                   | Name        | Local AS | IP Version |
+--------------------------------------+-------------+----------+------------+
| f20c5d17-c3ef-4150-a6b3-351fa97f79dc | bgp-speaker | 64621    | 4          |
+--------------------------------------+-------------+----------+------------+

$ openstack bgp peer list
+--------------------------------------+-------------------+-------------+-----------+
| ID                                   | Name              | Peer IP     | Remote AS |
+--------------------------------------+-------------------+-------------+-----------+
| 63e1dca6-0237-451e-b93f-578e2feffc20 | as-nsx-bm-edge-03 | 10.210.48.3 | 64611     |
| 7ad475d9-cf06-4070-a7fc-d29f3c205697 | as-nsx-bm-edge-01 | 10.210.48.1 | 64611     |
| bda2c485-07b6-43d0-9e40-87b061562ac4 | as-nsx-bm-edge-02 | 10.210.48.2 | 64611     |
| cea01bd1-1618-4f31-ad18-eaeac59db981 | as-nsx-bm-edge-04 | 10.210.48.4 | 64611     |
+--------------------------------------+-------------------+-------------+-----------+
```

Yusuf Güngör (yusuf2)
description: updated
Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

You can try to get more detailed logs from neutron by setting debug=true in neutron.conf, though I'm not sure whether os-ken will log enough details. Another option would be to run tcpdump on the BGP session; my guess is still that this is happening because the remote is sending some bad data.
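
A minimal sketch of both suggestions, assuming a kolla-ansible layout (the config path and container name below are assumptions, adjust to your deployment; the peer subnet is taken from the report):

```
# Hedged sketch: turn on debug logging for the dragent, then restart it.
$ sudo crudini --set /etc/kolla/neutron-bgp-dragent/neutron.conf DEFAULT debug True
$ sudo docker restart neutron_bgp_dragent

# BGP sessions run over TCP port 179; capture traffic to/from the peers.
$ sudo tcpdump -i any -w bgp-peers.pcap 'tcp port 179 and net 10.210.48.0/24'
```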

You could also add another peer running something like bird or frr in order to see whether that session would show the same issue or not.
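
For illustration, registering an extra test peer on the neutron side could look like the following sketch (the peer IP 10.210.48.10, the peer name, and CONTROLLER_IP are illustrative placeholders; the AS numbers come from the report):

```
# Hedged sketch: register an additional FRR/bird test peer with the existing
# speaker; 10.210.48.10 and "frr-test-peer" are placeholders.
$ openstack bgp peer create --peer-ip 10.210.48.10 --remote-as 64611 frr-test-peer
$ openstack bgp speaker add peer bgp-speaker frr-test-peer

# On the FRR test box, the matching neighbor (CONTROLLER_IP stands in for the
# address the dragent connects from):
$ sudo vtysh -c 'configure terminal' \
             -c 'router bgp 64611' \
             -c 'neighbor CONTROLLER_IP remote-as 64621'
```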

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi Jens, thanks for your reply. We can reproduce the issue on two different production clusters: when we create more than 10 instances concurrently, we see more than 1K messages on RabbitMQ, and then the BGP connections restart. We have noticed that usually all peers go down except one.

We have 5 controller nodes, and every controller node creates BGP connections to the same 4 remote peers (VMware NSX-T edge nodes), i.e. 5x4=20 BGP connections in total, all of which disconnect except one. In the last test, however, we saw that the BGP connections from one of the controllers were not lost, while the connections from the other 4 controllers restarted.

Our BGP peers (VMware NSX-T edges) are FRR-based, and the FRR logs show "Connection reset by peer". The NSX-T edges also have BGP connections to firewalls etc., but those connections are not reset. We think there is no problem on the NSX-T edge side; only the BGP connections to the neutron side are dropped. Do you think we still need to add another peer like bird or FRR?

I am attaching the BGP dragent debug logs for every controller. Since we can reproduce the issue, if you still want a tcpdump of the BGP sessions, we can send that too.

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

After all the BGP connections restart except one, the one that is not restarted does not send the withdrawal requests. We have noticed this issue with VIP IP addresses. In our scenario, a VIP IP address was used by a nova instance in the past; the instance was deleted, and Octavia then reused that IP address as a VIP. Because one peer still advertises the stale /32 route for that IP address, we cannot access the Octavia LB.

LB VIP IP addresses are not advertised as /32 routes, and that stale /32 route overrides the advertised subnet route. Details: Neutron Dynamic Routing : vip is not advertised via BGP - https://bugs.launchpad.net/neutron/+bug/2020001

I think this issue may be related to these:
 - [neutron-bgp-dragent] passive agents send wrong number of routes - https://bugs.launchpad.net/neutron/+bug/1862932
 - [RFE] BGP Speaker peer sessions down when rabbitmq offline - https://bugs.launchpad.net/neutron/+bug/2006145

Revision history for this message
Dr. Jens Harbott (j-harbott) wrote :

Looking at the logs that you posted, I am seeing a configuration error:

2023-05-24 03:22:20.979 7 ERROR bgpspeaker.peer [-] AS_PATH on UPDATE message has loops. Ignoring this message: BGPUpdate(len=58,nlri=[BGPNLRI(addr='10.209.39.204',length=32), BGPNLRI(addr='10.209.39.201',length=32)],path_attributes=[BGPPathAttributeOrigin(flags=64,length=1,type=1,value=0), BGPPathAttributeAsPath(flags=80,length=10,type=2,value=[[64611, 64622]]), BGPPathAttributeNextHop(flags=64,length=4,type=3,value='10.208.48.4')],total_path_attribute_len=25,type=2,withdrawn_routes=[],withdrawn_routes_len=0)

This indicates that the BGP peer is sending prefixes back to the neutron agent. This is a misconfiguration on the peer; neutron should never receive any prefixes, as it will not use them for routing anyway. I'm not sure whether this is related to the actual issue, but if you could fix it, that may be helpful. It would actually be consistent with the observation that one last peer doesn't get disconnected. Maybe you can also double-check whether everything is stable if there is only one peer in total?

Otherwise I still would be interested in seeing a comparison with FRR or bird as peer.
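
Since the NSX-T edges are reported to be FRR-based, a minimal sketch of what "not sending prefixes back" looks like in plain FRR terms follows (NSX-T wraps this in its own UI/API, and CONTROLLER_IP is a placeholder for the address the dragent peers from):

```
# Hedged sketch, plain FRR syntax: filter ALL outbound prefixes toward the
# neutron dragent so it never receives routes it would ignore anyway.
$ sudo vtysh -c 'configure terminal' \
             -c 'ip prefix-list NO-EXPORT-TO-NEUTRON seq 5 deny any' \
             -c 'router bgp 64611' \
             -c 'address-family ipv4 unicast' \
             -c 'neighbor CONTROLLER_IP prefix-list NO-EXPORT-TO-NEUTRON out'
```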

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Thanks Jens, we will try to configure the BGP peers not to send prefixes back and will test the scenario where only one peer exists.

Revision history for this message
Yusuf Güngör (yusuf2) wrote :

Hi Jens, we have configured the BGP peers not to send prefixes back, and it seems that the problem is solved; no more connection resets occur. We are still testing some scenarios to confirm that the problem is resolved permanently.

That said, I have received feedback that this shouldn't cause a BGP connection reset even if a peer sends the same prefixes back. The "AS_PATH on UPDATE message has loops" log is printed on the first connection after routes are received, but the issue only happens when lots of updates are sent (creating many instances concurrently, etc.). If that is acceptable, we can close the bug report.
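
As one verification step (assuming the neutron-dynamic-routing OSC plugin is installed), the routes the speaker advertises can be inspected to confirm that the stale /32 VIP routes described above are gone:

```
# List everything the speaker currently advertises.
$ openstack bgp speaker list advertised routes bgp-speaker
```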

Thank you very much for your support.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Thank you Jens for the support on this bug. I'm closing it.

Changed in neutron:
status: Incomplete → Invalid