Floating IPs attached to Octavia LB VIPs are not reachable across routers using different external subnets (OVN + Octavia)

Bug #2110225 reported by Sami Yessou
This bug affects 4 people
Affects | Status | Importance | Assigned to | Milestone
neutron | Confirmed | Undecided | Miro Tomaska |

Bug Description

Hi Everyone!

We ran into an issue when extending the OpenStack public network range by adding a new subnet. According to the comments on this Launchpad bug:

https://bugs.launchpad.net/neutron/+bug/1605343
the way to do it seems to be adding a new subnet, but doing so we hit the following issue:

When using OpenStack with OVN networking and Octavia LBaaS, creating routers with different external gateway subnets (e.g. public-subnet and public-subnet-2) causes load balancer VIPs to become unreachable across subnets from VMs on the same public-subnet.

Reproduction Steps:
1) Create two public subnets (public-subnet, public-subnet-2) on the same external network

2) Create two routers:

   router1 with external gateway set to public-subnet-2
   router2 with external gateway set to public-subnet

3) Attach internal networks (private, private-2) to each router

4) Launch VMs in each subnet and deploy web servers

5) Create two Octavia load balancers:

   subnet1_lb1 in private network
   subnet2_lb1 in private-2 network

   Assign floating IPs to the VIP ports of both load balancers

6) From each VM, try accessing subnet1_lb1 and subnet2_lb1
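
The steps above can be sketched as a small Python helper that only builds the `openstack` CLI commands in order. This is not the reporter's attached script. It assumes the external network is named `public` and already carries `public-subnet` (as on a default Devstack), that the second range is 172.24.5.0/24, and that the internal subnets are named `private-subnet` and `private-2-subnet` (hypothetical names). Step 4 (VMs and web servers) and the listeners/pools of the load balancers are left out.

```python
def repro_commands(ext_net="public", second_cidr="172.24.5.0/24"):
    """Return the CLI commands for steps 1-3 and 5 as a list of strings.

    All names passed as defaults or listed in `topology` are assumptions
    for illustration, except public-subnet, public-subnet-2 and the LB names.
    """
    cmds = [
        # Step 1: add the second subnet to the same external network.
        f"openstack subnet create --network {ext_net} "
        f"--subnet-range {second_cidr} public-subnet-2",
    ]
    # (router, gateway subnet, internal subnet, LB name, floating IP subnet)
    topology = [
        ("router1", "public-subnet-2", "private-subnet", "subnet1_lb1", "public-subnet"),
        ("router2", "public-subnet", "private-2-subnet", "subnet2_lb1", "public-subnet-2"),
    ]
    for router, gw_subnet, internal, lb, fip_subnet in topology:
        cmds += [
            # Step 2: router with its gateway pinned to one public subnet.
            f"openstack router create {router}",
            f"openstack router set --external-gateway {ext_net} "
            f"--fixed-ip subnet={gw_subnet} {router}",
            # Step 3: attach the internal subnet.
            f"openstack router add subnet {router} {internal}",
            # Step 5: load balancer plus a floating IP on its VIP port.
            f"openstack loadbalancer create --name {lb} --vip-subnet-id {internal}",
            f"openstack floating ip create --subnet {fip_subnet} "
            f"--port <VIP_PORT_ID_{lb}> {ext_net}",
        ]
    return cmds


if __name__ == "__main__":
    print("\n".join(repro_commands()))
```

The `<VIP_PORT_ID_...>` placeholders have to be filled in by hand, for example from `openstack loadbalancer show <lb> -f value -c vip_port_id`.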

Observed Behavior:
From subnet1_node1 (172.24.4.100) behind router R1 (172.24.5.X subnet) we are able to reach the Octavia LB subnet1_lb1 (172.24.4.200) ✅
From subnet1_node1 (172.24.4.100) behind router R1 (172.24.5.X subnet) we are not able to reach the Octavia LB subnet2_lb1 (172.24.5.200) ❌

From subnet2_node1 (172.24.5.100) behind router R2 (172.24.4.X subnet) we are able to reach the Octavia LB subnet2_lb1 (172.24.5.200) ✅
From subnet2_node1 (172.24.5.100) behind router R2 (172.24.4.X subnet) we are not able to reach the Octavia LB subnet1_lb1 (172.24.4.200) ❌

If we put those in a matrix:
Source VM | Router (Gateway Subnet) | Target LB VIP | Expected | Observed | Result
subnet1_node1 | R1 (public-subnet-2 / 172.24.5.X) | subnet1_lb1 (172.24.4.200) | Reachable | 200 | ✅
subnet1_node1 | R1 (public-subnet-2 / 172.24.5.X) | subnet2_lb1 (172.24.5.200) | Reachable | Timeout | ❌
subnet2_node1 | R2 (public-subnet / 172.24.4.X) | subnet2_lb1 (172.24.5.200) | Reachable | 200 | ✅
subnet2_node1 | R2 (public-subnet / 172.24.4.X) | subnet1_lb1 (172.24.4.200) | Reachable | Timeout | ❌

We clearly see that when a VM from a different subnet tries to reach a load balancer via the LB's floating IP, the LB is not reachable if the external IP of the router (connecting the LB to the public network) is in a different range than the LB floating IP.
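
One way to read the four cases is to check, for each one, whether the target floating IP falls inside the gateway subnet of the source VM's router. The sketch below does only that arithmetic with Python's `ipaddress` module. The /24 prefix lengths are an assumption (the report only gives 172.24.4.X and 172.24.5.X), and the check restates the observed pattern rather than claiming a root cause.

```python
import ipaddress

# Assumed /24 prefixes; the report only states 172.24.4.X and 172.24.5.X.
GATEWAY_SUBNET = {
    "R1": ipaddress.ip_network("172.24.5.0/24"),  # public-subnet-2
    "R2": ipaddress.ip_network("172.24.4.0/24"),  # public-subnet
}

# (source VM, source router, target LB, target floating IP, observed result)
CASES = [
    ("subnet1_node1", "R1", "subnet1_lb1", "172.24.4.200", "200"),
    ("subnet1_node1", "R1", "subnet2_lb1", "172.24.5.200", "Timeout"),
    ("subnet2_node1", "R2", "subnet2_lb1", "172.24.5.200", "200"),
    ("subnet2_node1", "R2", "subnet1_lb1", "172.24.4.200", "Timeout"),
]


def fip_in_gateway_subnet(router, fip):
    """True if the floating IP lies in the router's own gateway subnet."""
    return ipaddress.ip_address(fip) in GATEWAY_SUBNET[router]


if __name__ == "__main__":
    for vm, router, lb, fip, observed in CASES:
        same = fip_in_gateway_subnet(router, fip)
        print(f"{vm} via {router} -> {lb} ({fip}): "
              f"FIP in gateway subnet={same}, observed={observed}")
```

With these assumed prefixes, the two Timeout cases are exactly the ones where the floating IP is in the source router's gateway subnet.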

Expected Behavior:
All VMs, regardless of which subnet/router they are connected to, should be able to reach any VIP exposed via floating IP

Environment:
OpenStack with OVN (replicable on latest version with Devstack)

I'll attach below the scripts to reproduce it on a freshly created Devstack instance.

Feel free to reach out for any further details.

Sami Yessou (yessou-sami) wrote :
Michael Johnson (johnsom) wrote :

I am changing the project on this bug from octavia to neutron as it appears to be a neutron issue and not related to octavia.

affects: octavia → neutron
Miro Tomaska (mtomaska)
Changed in neutron:
status: New → Opinion
Miro Tomaska (mtomaska)
Changed in neutron:
assignee: nobody → Miro Tomaska (mtomaska)
alisafari (alisafar1212) wrote :

We are facing the same issue in 2023.1

Mohammed Naser (mnaser) wrote :

I don't think this is an opinion, I think this is actually a bug since this means if you add a second subnet to your primary "public" external network, you may have scenarios where load balancers are not reachable from one subnet or the other.

This could also be caused by something inside the OVN provider for Octavia.

Changed in neutron:
status: Opinion → Confirmed
tags: added: ovn-octavia-provider
tags: added: ovn
Miro Tomaska (mtomaska) wrote :

I set the status to Opinion so we can discuss this in the Neutron meeting.

Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello all:

I had a quick talk with Miro (who has a diagram prepared for this bug). The issue here is who is routing the traffic between subnet1 and subnet2. Neutron routers (router1 and router2) have information about their corresponding gateway subnets (subnet1<-router1, subnet2<-router2). But these routers will send all the external network traffic to their corresponding GW IPs (in the range of their subnets).

So in this scenario something else, external to Neutron, is needed to route the traffic between subnet1 and subnet2.

Regards.

Miro Tomaska (mtomaska) wrote :

Network diagram LP2110225

Cristian Contescu (ckristi) wrote :

Please find attached a diagram which we used internally to describe the problem. The following description of the problem is based on the attached diagram.

If we try to connect from VM1 in tenant subnet1 (network1 behind vRouter1) to the LB in tenant subnet2 (network2 behind vRouter2), the connection hangs. Running tcpdump on the hypervisor hosting either of the two virtual routers, we see several ARP requests from vRouter1 with no replies (which to us indicates that nobody answers vRouter1's ARP requests):

09:49:46.107504 fa:16:3e:1e:ae:60 > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 101, p 0, ethertype ARP (0x0806), Request who-has 10.10.64.144 tell 10.10.64.200, length 28

What seems to be happening is that vRouter1 does not actually try to route the traffic. As the diagram shows, its external IP address (10.10.64.200) is in the same public subnet as the floating IP attached to the LB VIP (10.10.64.144), so vRouter1 treats the floating IP as being on a directly connected network and tries to reach it over L2 (hence the ARP request). This is where things seem to break, as no one replies to the ARP request.
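
As a rough cross-check of that reading, the capture line can be parsed to pull out the ARP target and sender and confirm they would share a subnet. The sketch below uses only the Python standard library. The /24 prefix length is an assumption, since the comment does not state the size of the public subnet, and the regular expression is written for this one tcpdump line rather than for tcpdump output in general.

```python
import ipaddress
import re

# The ARP request quoted in the comment above.
LINE = (
    "09:49:46.107504 fa:16:3e:1e:ae:60 > ff:ff:ff:ff:ff:ff, "
    "ethertype 802.1Q (0x8100), length 46: vlan 101, p 0, "
    "ethertype ARP (0x0806), Request who-has 10.10.64.144 "
    "tell 10.10.64.200, length 28"
)

ARP_REQUEST = re.compile(
    r"Request who-has (?P<target>[\d.]+) tell (?P<sender>[\d.]+)"
)


def parse_arp_request(line):
    """Return (target, sender) as IP address objects, or None if no match."""
    match = ARP_REQUEST.search(line)
    if match is None:
        return None
    return (ipaddress.ip_address(match["target"]),
            ipaddress.ip_address(match["sender"]))


def same_subnet(a, b, prefixlen=24):
    """True if b lies in the network containing a (prefixlen is assumed)."""
    network = ipaddress.ip_network(f"{a}/{prefixlen}", strict=False)
    return b in network


if __name__ == "__main__":
    target, sender = parse_arp_request(LINE)
    print(f"{sender} asks for {target}; "
          f"same /24: {same_subnet(sender, target)}")
```

If the sender (vRouter1's external IP) and the target (the LB floating IP) share a subnet, a router would normally resolve the target with ARP instead of forwarding to its gateway, which matches the capture.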

The load balancer behind network2/tenant subnet2 is reachable from the ‘Internet/remote PC’ without an issue. Together with our network provider we did some debugging and saw that an ARP request sent from the PE router does receive a reply from vRouter2, which is not the case when the ARP request comes from vRouter1.

Hope this helps,
Cristi

Sami Yessou (yessou-sami) wrote :

Hi Miro
I am going to attach to this bug the output of `ovn-nbctl show` and the details of our Devstack environment, where we reproduce the issue in an automated way.

In short, it seems the traffic reaches the virtual router and then stops there when going to the other public network.
