Bug #2052485 “Instances on same flat/vlan network but different ...” : Bugs : ovn-bgp-agent

Revision history for this message

Eduardo Olivares (eolivare) wrote on 2024-02-06:

#1

Hello,

We cover some cases like this on our downstream environments and they don't fail.
I'd like to understand what is different between our setups.

The tests consists in two VMs connected to the external provider flat network running on two different computes, connected between them with routers, advertising routes via BGP.

I am wondering if the problem in your setup is the existence of any route to the public subnet 172.24.4.0/24 on the computes.

In my case, the public subnet is 172.16.100.0/24. One of the VMs sends a ping to another one. That ping goes through this path when it leaves the source compute:
- tap interface (src and dest MACs correspond with the src and dest VM MAC addresses)
- br-ex (dest MAC modified to br-ex MAC due to br-ex flows)
- enp2s0 nic (src MAC is the enp2s0 MAC, dest MAC is the router's MAC)

That compute doesn't have any route to any IP within 172.16.100.0/24. It uses the default route that it obtains via BGP to send the traffic to the destination compute.
$ ip r
default nhid 34 proto bgp src 172.30.1.3 metric 20
nexthop via 100.64.0.9 dev enp3s0 weight 1
nexthop via 100.65.1.9 dev enp2s0 weight 1
100.64.0.8/30 dev enp3s0 proto kernel scope link src 100.64.0.10
100.65.1.8/30 dev enp2s0 proto kernel scope link src 100.65.1.10
192.168.1.0/24 dev enp1s0 proto kernel scope link src 192.168.1.152
192.168.2.0/24 via 192.168.1.1 dev enp1s0
192.168.3.0/24 via 192.168.1.1 dev enp1s0
192.168.4.0/24 via 192.168.1.1 dev enp1s0

The intermediate routers obtained the route to the destination compute via BGP.

>What is the supported configuration that will allow instances on the same provider network (not tenant network) to communicate with each other while only having L3 connectivity to the nodes?
So, answering your question, from the compute's perspective, this is pure L3 routing. The compute doesn't know about the external subnet at all.

Revision history for this message

Dan Sneddon (dsneddon) wrote on 2024-02-06:

#2

This scenario works outside Devstack, so it may be that the topology with the Cumulus switch-router or something in Devstack itself is getting in the way. It would help to know your topology.

* What interfaces are on the Cumulus virtual switch (both IP interfaces and patches to bridges)?

* What are the IP addresses/subnets on those interfaces?

* What does the routing table look like on the Cumulus router? (“show ip route”, “show ip route 172.24.4.x”)

* What does the BGP table look like on the Cumulus switch?
vtysh “show bgp”, “show ip bgp”, “show ip bgp summary”, and “show ip bgp 172.24.4.x/32” for the individual VM IPs

* What do the neighbors look like? (“show ip bgp neighbors”)

* What do the routing tables look like on the VM hosts themselves?

* Does it help to turn on proxy-arp on the Cumulus switch on the interface(s) facing the VMs?

Revision history for this message

Dan Sneddon (dsneddon) wrote on 2024-02-06:

#3

A note about my previous comment. It *should not* help if proxy-arp is enabled (it shouldn’t be needed on the switch). That’s just a troubleshooting step for testing to make sure that the topology isn’t relying on L2 to forward packets.

I would expect that there should be a default route on the compute hosts pointing toward the cumulus switch, and there should not be a route to the 172.24.4.0/24 anywhere (the routes in the Cumulus switch should be /32 routes pointing to the compute hosts).

You may want to take packet captures to see what’s happening, check counters, or dump flows to see where the pings are traversing and where they are missing from the path between the VMs.

Revision history for this message

Andy Litzinger (alitzinger) wrote on 2024-02-06:

#4

Hi eduardo,
thank you for your reply.
>We cover some cases like this on our downstream environments and they don't fail.
awesome, I will try and look and see if I can understand how the tests are configured

>I am wondering if the problem in your setup is the existence of any route to the public subnet 172.24.4.0/24 on the computes.

Do you wonder if such a route exists and it's next hop isn't the upstream switch? No, there is not. As long as any route that existed does point properly to the upstream switch then that would still accomplish the same result as following the default route to the upstream switch, right? Or am I missing something that would explain why it has to be a default route and not a more specific route?

>In my case, the public subnet is 172.16.100.0/24. One of the VMs sends a ping to another one. That ping >goes through this path when it leaves the source compute:
>- tap interface (src and dest MACs correspond with the src and dest VM MAC addresses)
I observe the same at my tap interfaces. I am a little confused as to how my two vms on separate nodes have learned the proper mac addresses for each other, but don't have connectivity. What responded to their respective arp requests? I managed to capture an arp request on the tap interface and it looked like the two vms were happily L2 adjacent- the 2nd vm responded to the arp:
22:52:05.554413 fa:16:3e:26:77:6a > fa:16:3e:dd:6b:48, ethertype ARP (0x0806), length 42: Request who-has 172.24.4.56 tell 172.24.4.93, length 28
22:52:05.554773 fa:16:3e:dd:6b:48 > fa:16:3e:26:77:6a, ethertype ARP (0x0806), length 42: Reply 172.24.4.56 is-at fa:16:3e:dd:6b:48, length 28

>- br-ex (dest MAC modified to br-ex MAC due to br-ex flows)
Can you explain why/how this happens? How can I validate the mac rewrite? Does br-ex perform some kind of source NAT on the packet too? Otherwise I'm really struggling to understand how routing even comes into play. You route when the src and dst IP are on different subnets. If they are on the same subnet, you bridge. But I suppose there could be some ovn/ovs magic here that causes br-ex to re-write only the mac addrs and still hand off the packet to the kernel. And perhaps the Linux kernel routing consults it's routing table for this flow despite the fact that it should be an bridged flow?

I will describe my set up in more detail in another comment shortly.

Hi eduardo,
  thank you for your reply.
>We cover some cases like this on our downstream environments and they don't fail.
awesome, I will try and look and see if I can understand how the tests are configured

>I am wondering if the problem in your setup is the existence of any route to the public subnet 172.24.4.0/24 on the computes.

Do you wonder if such a route exists and it's next hop isn't the upstream switch? No, there is not.  As long as any route that existed does point properly to the upstream switch then that would still accomplish the same result as following the default route to the upstream switch, right?  Or am I missing something that would explain why it has to be a default route and not a more specific route?

>In my case, the public subnet is 172.16.100.0/24. One of the VMs sends a ping to another one. That ping >goes through this path when it leaves the source compute:
>- tap interface (src and dest MACs correspond with the src and dest VM MAC addresses)
I observe the same at my tap interfaces.  I am a little confused as to how my two vms on separate nodes have learned the proper mac addresses for each other, but don't have connectivity.  What responded to their respective arp requests?  I managed to capture an arp request on the tap interface and it looked like the two vms were happily L2 adjacent- the 2nd vm responded to the arp:
22:52:05.554413 fa:16:3e:26:77:6a > fa:16:3e:dd:6b:48, ethertype ARP (0x0806), length 42: Request who-has 172.24.4.56 tell 172.24.4.93, length 28
22:52:05.554773 fa:16:3e:dd:6b:48 > fa:16:3e:26:77:6a, ethertype ARP (0x0806), length 42: Reply 172.24.4.56 is-at fa:16:3e:dd:6b:48, length 28

>- br-ex (dest MAC modified to br-ex MAC due to br-ex flows)
Can you explain why/how this happens?  How can I validate the mac rewrite?  Does br-ex perform some kind of source NAT on the packet too?  Otherwise I'm really struggling to understand how routing even comes into play. You route when the src and dst IP are on different subnets. If they are on the same subnet, you bridge. But I suppose there could be some ovn/ovs magic here that causes br-ex to re-write only the mac addrs and still hand off the packet to the kernel. And perhaps the Linux kernel routing consults it's routing table for this flow despite the fact that it should be an bridged flow?

I will describe my set up in more detail in another comment shortly.

Revision history for this message

Andy Litzinger (alitzinger) wrote on 2024-02-07:

#5

Download full text (11.9 KiB)

Hi Eduardo and Dan,
Here is a more detailed look at my setup:

I am running the entire "lab" on a remote virtual machine. The host vm is Ubuntu 22.04. I am running vagrant on this vm to stand up 3 vms (what vagrant calls "boxes"). Two are Ubuntu 22.04 named rack-1-host-1 and rack-1-host-2 and the third is a cumulus-vx switch running cumulus linux v5.6.0 named rack-1-leaf-1

Note that I am leaning heavily on the Vagrantfile That Luis refers to in his blog post: https://luis5tb.github.io/bgp/2021/02/04/ovn-bgp-agent-testing-setup.html, but with some modifications

rack-1-host-1 uses eth1 100.65.1.2/30 to connect to port swp1 on rack-1-leaf-1 100.65.1.1/30. It also has a loopback address of 99.99.1.1 was set as IP_HOST in the devstack local.conf which at a minimum means that it uses this IP for geneve tunnel endpoint.

rack-1-host-2 uses eth1 100.65.1.6/30 to connect to port swp2 on rack-1-leaf-1 100.65.1.5/30. It also has a loopback address of 99.99.1.2 was set as IP_HOST in the devstack local.conf which at a minimum means that it uses this IP for geneve tunnel endpoint. It hosts vm1-provider, 172.24.4.

Note that in order to get the geneve tunnel to come up (pass BFD probes) I had to add an iptables NAT rule on each compute node that would src packets from the loopback IP when the destination was the remote loopback IP.

The default route on all three vagrant vms is out of the vagrant interface. This needs to stay in place in order to provide connectivity to endpoints outside the vagrant lab (e.g. package repos). Although Luis's blog indicates (and hist default frr config on the hosts supports) that the leaf switch should provide a default route to the hosts, in my lab I have adjusted the routing policy applied to BGP on the hosts to allow the installation of /32 routes from BGP. While you might not choose to do this in production because it could lead to the installation of thousands of /32 routes on each host, in this mini lab it is perfectly acceptable as I'll never have more than ten or so instances yielding /32 routes.

As I've mentioned devstack spun up a provider network named public with ipv4 subnet 172.24.4.0/24. Here are the route tables from the host and the leaf switch:

rack-1-leaf-1# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, A - Babel, D - SHARP, F - PBR, f - OpenFabric,
       Z - FRR,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 10.255.1.1, vagrant, 02w4d20h
C>* 10.255.1.0/24 is directly connected, vagrant, 02w4d20h
C>* 99.98.1.1/32 is directly connected, lo, 02w4d20h
B>* 99.99.1.1/32 [200/0] via 100.65.1.2, swp1, weight 1, 02w2d00h
B>* 99.99.1.2/32 [200/0] via 100.65.1.6, swp2, weight 1, 01w3d22h
C>* 100.65.1.0/30 is directly connected, swp1, 02w4d20h
C>* 100.65.1.4/30 is directly connected, swp2, 02w4d20h
B>* 172.24.4.18/32 [200/0] via 100.65.1.2, swp1, weight 1, 23:28:11
B>* 172.24.4.31/32 [200/0] via 100.65.1.2, swp1, weight 1, 5d05h53m
B>* 172.24.4.56/32 [200/0] via 100.65.1.6, swp2, weight 1, 00:26:55
B>* 172...

Hi Eduardo and Dan,
Here is a more detailed look at my setup:

I am running the entire "lab" on a remote virtual machine.  The host vm is Ubuntu 22.04.  I am running vagrant on this vm to stand up 3 vms (what vagrant calls "boxes").  Two are Ubuntu 22.04 named rack-1-host-1 and rack-1-host-2 and the third is a cumulus-vx switch running cumulus linux v5.6.0 named rack-1-leaf-1

Note that I am leaning heavily on the Vagrantfile That Luis refers to in his blog post: https://luis5tb.github.io/bgp/2021/02/04/ovn-bgp-agent-testing-setup.html, but with some modifications

rack-1-host-1 uses eth1 100.65.1.2/30 to connect to port swp1 on rack-1-leaf-1 100.65.1.1/30.  It also has a loopback address of 99.99.1.1 was set as IP_HOST in the devstack local.conf which at a minimum means that it uses this IP for geneve tunnel endpoint.

rack-1-host-2 uses eth1 100.65.1.6/30 to connect to port swp2 on rack-1-leaf-1 100.65.1.5/30.  It also has a loopback address of 99.99.1.2 was set as IP_HOST in the devstack local.conf which at a minimum means that it uses this IP for geneve tunnel endpoint. It hosts vm1-provider, 172.24.4.

Note that in order to get the geneve tunnel to come up (pass BFD probes) I had to add an iptables NAT rule on each compute node that would src packets from the loopback IP when the destination was the remote loopback IP.

The default route on all three vagrant vms is out of the vagrant interface. This needs to stay in place in order to provide connectivity to endpoints outside the vagrant lab (e.g. package repos).  Although Luis's blog indicates (and hist default frr config on the hosts supports) that the leaf switch should provide a default route to the hosts, in my lab I have adjusted the routing policy applied to BGP on the hosts to allow the installation of /32 routes from BGP.  While you might not choose to do this in production because it could lead to the installation of thousands of /32 routes on each host, in this mini lab it is perfectly acceptable as I'll never have more than ten or so instances yielding /32 routes.

As I've mentioned devstack spun up a provider network named public with ipv4 subnet 172.24.4.0/24.  Here are the route tables from the host and the leaf switch:

rack-1-leaf-1# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, A - Babel, D - SHARP, F - PBR, f - OpenFabric,
       Z - FRR,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

K>* 0.0.0.0/0 [0/0] via 10.255.1.1, vagrant, 02w4d20h
C>* 10.255.1.0/24 is directly connected, vagrant, 02w4d20h
C>* 99.98.1.1/32 is directly connected, lo, 02w4d20h
B>* 99.99.1.1/32 [200/0] via 100.65.1.2, swp1, weight 1, 02w2d00h
B>* 99.99.1.2/32 [200/0] via 100.65.1.6, swp2, weight 1, 01w3d22h
C>* 100.65.1.0/30 is directly connected, swp1, 02w4d20h
C>* 100.65.1.4/30 is directly connected, swp2, 02w4d20h
B>* 172.24.4.18/32 [200/0] via 100.65.1.2, swp1, weight 1, 23:28:11
B>* 172.24.4.31/32 [200/0] via 100.65.1.2, swp1, weight 1, 5d05h53m
B>* 172.24.4.56/32 [200/0] via 100.65.1.6, swp2, weight 1, 00:26:55
B>* 172.24.4.93/32 [200/0] via 100.65.1.2, swp1, weight 1, 02w1d00h
B>* 172.24.4.105/32 [200/0] via 100.65.1.6, swp2, weight 1, 00:26:55
B>* 172.24.6.148/32 [200/0] via 100.65.1.2, swp1, weight 1, 4d22h55m

vagrant@rack-1-host-1:/opt/stack/devstack$ ip r
default via 10.255.1.1 dev vagrant 
10.255.1.0/24 dev vagrant proto kernel scope link src 10.255.1.130 
99.98.1.1 nhid 316 via 100.65.1.1 dev eth1 proto bgp src 99.99.1.1 metric 20 
99.99.1.2 nhid 316 via 100.65.1.1 dev eth1 proto bgp src 99.99.1.1 metric 20 
100.64.1.0/30 dev eth2 proto kernel scope link src 100.64.1.2 
100.65.1.0/30 dev eth1 proto kernel scope link src 100.65.1.2 
100.65.1.6 via 100.65.1.1 dev eth1 
172.24.4.0/24 dev br-ex proto kernel scope link src 172.24.4.1 
172.24.4.18 nhid 327 dev bgp-nic proto bgp metric 20 
172.24.4.31 nhid 327 dev bgp-nic proto bgp metric 20 
172.24.4.56 nhid 316 via 100.65.1.1 dev eth1 proto bgp src 99.99.1.1 metric 20 
172.24.4.93 nhid 327 dev bgp-nic proto bgp metric 20 
172.24.4.105 nhid 316 via 100.65.1.1 dev eth1 proto bgp src 99.99.1.1 metric 20 
172.24.6.148 nhid 327 dev bgp-nic proto bgp metric 20 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown

vagrant@rack-1-host-2:~$ ip r
default via 10.255.1.1 dev vagrant 
10.255.1.0/24 dev vagrant proto kernel scope link src 10.255.1.235 
99.98.1.1 nhid 103 via 100.65.1.5 dev eth1 proto bgp src 99.99.1.2 metric 20 
99.99.1.1 via 100.65.1.5 dev eth1 proto static 
100.64.1.4/30 dev eth2 proto kernel scope link src 100.64.1.6 
100.65.1.2 via 100.65.1.5 dev eth1 proto static 
100.65.1.4/30 dev eth1 proto kernel scope link src 100.65.1.6 
172.24.4.18 nhid 103 via 100.65.1.5 dev eth1 proto bgp src 99.99.1.2 metric 20 
172.24.4.31 nhid 103 via 100.65.1.5 dev eth1 proto bgp src 99.99.1.2 metric 20 
172.24.4.56 nhid 101 dev bgp-nic proto bgp metric 20 
172.24.4.93 nhid 103 via 100.65.1.5 dev eth1 proto bgp src 99.99.1.2 metric 20 
172.24.4.105 nhid 101 dev bgp-nic proto bgp metric 20 
172.24.6.148 nhid 103 via 100.65.1.5 dev eth1 proto bgp src 99.99.1.2 metric 20 
192.168.122.0/24 dev virbr0 proto kernel scope link src 192.168.122.1 linkdown

The test instances are cirros vms:

vm2-provider, eth0: 172.24.4.93/24, fa:16:3e:26:77:6a and lives on rack-1-host-1.  It's openstack port ID is: 61318c8a-c449-4180-89da-791628576a0d with name vm2-port

vm1-provider, etho: 172.24.4.56/24, fa:16:3e:dd:6b:48 and lives on rack-1-host-2.  It's openstack port ID is: 80da0b47-f403-4144-b172-a2ae160252ed with name vm1-port

Security groups are wide open:

ovn-trace seems to think the two vms are L2 adjacent if IIUC:

vagrant@rack-1-host-1:/opt/stack/devstack$ sudo ovn-trace --summary public 'ip4.src == 172.24.4.93 && ip4.dst == 172.24.4.56 && ip.ttl == 64 && icmp4 && inport == "vm2-port" && eth.src == '$VM2PROVIDER_MAC' && eth.dst =='$VM1PROVIDER_MAC
# icmp,reg14=0x5,vlan_tci=0x0000,dl_src=fa:16:3e:26:77:6a,dl_dst=fa:16:3e:dd:6b:48,nw_src=172.24.4.93,nw_dst=172.24.4.56,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,icmp_type=0,icmp_code=0
ingress(dp="public", inport="vm2-port") {
    reg0[15] = check_in_port_sec();
    next;
    reg0[0] = 1;
    next;
    ct_next;
    ct_next(ct_state=est|trk /* default (use --ct to customize) */) {
        reg0[8] = 1;
        reg0[10] = 1;
        next;
        reg8[16] = 1;
        next;
        reg8[16] = 0;
        reg8[17] = 0;
        reg8[18] = 0;
        next;
        reg8[16] = 0;
        reg8[17] = 0;
        reg8[18] = 0;
        next;
        outport = "vm1-port";
        output;
        egress(dp="public", inport="vm2-port", outport="vm1-port") {
            reg0[0] = 1;
            next;
            ct_next;
            ct_next(ct_state=est|trk /* default (use --ct to customize) */) {
                reg0[8] = 1;
                reg0[10] = 1;
                next;
                reg8[16] = 1;
                next;
                reg8[16] = 0;
                reg8[17] = 0;
                reg8[18] = 0;
                next;
                reg0[15] = check_out_port_sec();
                next;
                output;
                /* output to "vm1-port", type "" */;
            };
        };
    };
};

Same for the opposite direction:

sudo ovn-trace --summary public 'ip4.src == 172.24.4.56 && ip4.dst == 172.24.4.93 && ip.ttl == 64 && icmp4 && inport == "vm1-port" && eth.src == '$VM1PROVIDER_MAC' && eth.dst =='$VM2PROVIDER_MAC
# icmp,reg14=0x4,vlan_tci=0x0000,dl_src=fa:16:3e:dd:6b:48,dl_dst=fa:16:3e:26:77:6a,nw_src=172.24.4.56,nw_dst=172.24.4.93,nw_tos=0,nw_ecn=0,nw_ttl=64,nw_frag=no,icmp_type=0,icmp_code=0
ingress(dp="public", inport="vm1-port") {
    reg0[15] = check_in_port_sec();
    next;
    reg0[0] = 1;
    next;
    ct_next;
    ct_next(ct_state=est|trk /* default (use --ct to customize) */) {
        reg0[8] = 1;
        reg0[10] = 1;
        next;
        reg8[16] = 1;
        next;
        reg8[16] = 0;
        reg8[17] = 0;
        reg8[18] = 0;
        next;
        reg8[16] = 0;
        reg8[17] = 0;
        reg8[18] = 0;
        next;
        outport = "vm2-port";
        output;
        egress(dp="public", inport="vm1-port", outport="vm2-port") {
            reg0[0] = 1;
            next;
            ct_next;
            ct_next(ct_state=est|trk /* default (use --ct to customize) */) {
                reg0[8] = 1;
                reg0[10] = 1;
                next;
                reg8[16] = 1;
                next;
                reg8[16] = 0;
                reg8[17] = 0;
                reg8[18] = 0;
                next;
                reg0[15] = check_out_port_sec();
                next;
                output;
                /* output to "vm2-port", type "" */;
            };
        };
    };
};

But connectivity between the vms does not work:

$ hostname
vm2-provider
$ ping 172.24.4.56
PING 172.24.4.56 (172.24.4.56) 56(84) bytes of data.
^C
--- 172.24.4.56 ping statistics ---
11 packets transmitted, 0 received, 100% packet loss, time 10248ms

$ hostname
vm1-provider
$ ping 172.24.4.93
PING 172.24.4.93 (172.24.4.93) 56(84) bytes of data.

--- 172.24.4.93 ping statistics ---
8 packets transmitted, 0 received, 100% packet loss, time 7169ms

However, connectivity from the compute nodes to the vms does work (i.e. routing works):

vagrant@rack-1-host-1:/opt/stack/devstack$ ping 172.24.4.93 -c 1
PING 172.24.4.93 (172.24.4.93) 56(84) bytes of data.
64 bytes from 172.24.4.93: icmp_seq=1 ttl=64 time=0.352 ms

--- 172.24.4.93 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.352/0.352/0.352/0.000 ms
vagrant@rack-1-host-1:/opt/stack/devstack$ ping 172.24.4.56 -c 1
PING 172.24.4.56 (172.24.4.56) 56(84) bytes of data.
64 bytes from 172.24.4.56: icmp_seq=1 ttl=62 time=2.21 ms

--- 172.24.4.56 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.206/2.206/2.206/0.000 ms

vagrant@rack-1-host-2:~$ ping 172.24.4.93 -c 1
PING 172.24.4.93 (172.24.4.93) 56(84) bytes of data.
64 bytes from 172.24.4.93: icmp_seq=1 ttl=62 time=11.9 ms

--- 172.24.4.93 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 11.910/11.910/11.910/0.000 ms
vagrant@rack-1-host-2:~$ ping 172.24.4.56 -c 1
PING 172.24.4.56 (172.24.4.56) 56(84) bytes of data.
64 bytes from 172.24.4.56: icmp_seq=1 ttl=64 time=0.739 ms

--- 172.24.4.56 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.739/0.739/0.739/0.000 ms

FWIW I have also tested this when not adjusting the local node BGP policies to accept /32 routes and the connectivity still fails between instances.

What else would you like to know/see?

Revision history for this message

Andy Litzinger (alitzinger) wrote on 2024-02-07:

#6

Download full text (3.2 KiB)

Eduardo, I may have found something that answers one of my questions to you and may explain the issue

>>- br-ex (dest MAC modified to br-ex MAC due to br-ex flows)
>Can you explain why/how this happens?

One of my colleagues turned me on to this Redhat blog post from 2022 which is clearly heavily influenced or done in conjunction with Luis Tomas's blog posts around the same time. https://developers.redhat.com/articles/2022/09/22/learn-about-new-bgp-capabilities-red-hat-openstack-17#data_plane

The post contains this statement refering to one of the tasks that ovn-bgp-agent is supposed to complete:
"For egress traffic, add flows that change the destination MAC address to that of the provider bridge, so that the kernel will forward the traffic using the default outgoing ECMP routes:"

It then goes on to use this command 'sudo ovs-ofctl dump-flows br-ex' to show that flow rules have indeed been added to rewrite the mac addr for v4 and v6 traffic egressing from the br-int bridge to the br-ex bridge via the provnet patch.

I have repeated this same command on my two hypervisors and I do not have those rules in place, only the statistics summary:
vagrant@rack-1-host-1:~$ sudo ovs-ofctl --protocols=OpenFlow15 dump-flows br-ex
cookie=0x0, duration=1725926.213s, table=0, n_packets=111160, n_bytes=21586664, idle_age=17, priority=0 actions=NORMAL

vagrant@rack-1-host-2:~$ sudo ovs-ofctl --protocols=OpenFlow15 dump-flows br-ex
cookie=0x0, duration=1361525.650s, table=0, n_packets=51292, n_bytes=9421634, idle_age=14, priority=0 actions=NORMAL

Now you may notice that my command differs from the one in blog as I've added the --protocols=OpenFlow15 flag. If I don't do this, i get the following error:
vagrant@rack-1-host-1:~$ sudo ovs-ofctl dump-flows br-ex
2024-02-07T20:42:29Z|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
ovs-ofctl: br-ex: failed to connect to socket (Protocol error)

Which made me wonder if ovn-bgp-agent is hitting the same error. It is logging a very similar error:

vagrant@rack-1-host-1:~$ sudo journalctl -u devstack@ovn-bgp-agent | grep negotiation
<snipped lots of matches, only showing some of last set>
Feb 07 20:45:11 rack-1-host-1 ovs-ofctl[3426804]: ovs|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
Feb 07 20:45:11 rack-1-host-1 ovs-ofctl[3426807]: ovs|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
Feb 07 20:45:11 rack-1-host-1 ovs-ofctl[3426809]: ovs|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
<snip>
vagrant@rack-1-host-1:~$ date
Wed Feb 7 08:46:14 PM UTC 2024

Every five minutes this message repeats about 10 total times with what I presume are different thread ids.
Is there something in the way that ovn-bgp-agent is using ovs-ofctl (or all ovs-*tools) to fail because of version mismatch?

FWIW from my com...

Eduardo, I may have found something that answers one of my questions to you and may explain the issue

>>- br-ex (dest MAC modified to br-ex MAC due to br-ex flows)
>Can you explain why/how this happens?

One of my colleagues turned me on to this Redhat blog post from 2022 which is clearly heavily influenced or done in conjunction with Luis Tomas's blog posts around the same time. https://developers.redhat.com/articles/2022/09/22/learn-about-new-bgp-capabilities-red-hat-openstack-17#data_plane

The post contains this statement refering to one of the tasks that ovn-bgp-agent is supposed to complete:
"For egress traffic, add flows that change the destination MAC address to that of the provider bridge, so that the kernel will forward the traffic using the default outgoing ECMP routes:"

It then goes on to use this command 'sudo ovs-ofctl dump-flows br-ex' to show that flow rules have indeed been added to rewrite the mac addr for v4 and v6 traffic egressing from the br-int bridge to the br-ex bridge via the provnet patch.

I have repeated this same command on my two hypervisors and I do not have those rules in place, only the statistics summary:
vagrant@rack-1-host-1:~$ sudo ovs-ofctl --protocols=OpenFlow15 dump-flows br-ex
 cookie=0x0, duration=1725926.213s, table=0, n_packets=111160, n_bytes=21586664, idle_age=17, priority=0 actions=NORMAL

vagrant@rack-1-host-2:~$ sudo ovs-ofctl --protocols=OpenFlow15 dump-flows br-ex
 cookie=0x0, duration=1361525.650s, table=0, n_packets=51292, n_bytes=9421634, idle_age=14, priority=0 actions=NORMAL

Now you may notice that my command differs from the one in blog as I've added the --protocols=OpenFlow15 flag.  If I don't do this, i get the following error:
vagrant@rack-1-host-1:~$ sudo ovs-ofctl dump-flows br-ex
2024-02-07T20:42:29Z|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
ovs-ofctl: br-ex: failed to connect to socket (Protocol error)

Which made me wonder if ovn-bgp-agent is hitting the same error.  It is logging a very similar error:

vagrant@rack-1-host-1:~$ sudo journalctl -u devstack@ovn-bgp-agent | grep negotiation
<snipped lots of matches, only showing some of last set>
Feb 07 20:45:11 rack-1-host-1 ovs-ofctl[3426804]: ovs|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
Feb 07 20:45:11 rack-1-host-1 ovs-ofctl[3426807]: ovs|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
Feb 07 20:45:11 rack-1-host-1 ovs-ofctl[3426809]: ovs|00001|vconn|WARN|unix:/usr/local/var/run/openvswitch/br-ex.mgmt: version negotiation failed (we support version 0x01, peer supports versions 0x04, 0x06)
<snip>
vagrant@rack-1-host-1:~$ date
Wed Feb  7 08:46:14 PM UTC 2024

Every five minutes this message repeats about 10 total times with what I presume are different thread ids.
Is there something in the way that ovn-bgp-agent is using ovs-ofctl (or all ovs-*tools) to fail because of version mismatch?

FWIW from my command line, only specifying --protocols=OpenFlow=13 or OpenFlow=15 work correctly.

Revision history for this message

Andy Litzinger (alitzinger) wrote on 2024-02-13:

#7

I have confirmed the missing flows are definitely the source of my issue. If I manually install the flows:

on rack-1-host-1
sudo ovs-ofctl --protocols=OpenFlow13 add-flow br-ex ip,priority=900,in_port="patch-provnet-2",actions=mod_dl_dst:<br-ex mac addr>,NORMAL

on rack-1-host-2:
sudo ovs-ofctl --protocols=OpenFlow13 add-flow br-ex ip,priority=900,in_port="patch-provnet-2",actions=mod_dl_dst:<br-ex mac addr>,NORMAL

Then my two vms on the same network can communicate with each other. For some reason if I stop my traffic tests then after a few minutes the flows seem to time out and get removed from the flow table. Not sure why that happens as the man page for ovs-ofctl add-flow seems to indicate that the default idle time is infinite.

I looked into the ovn-bgp-agent code and I can see where it's calling ovs-ofctl to create the flows, but isn't specifying an OpenFlow version. Based on the man page for ovs-ofcts it seems like ovn-bgp-agent should be specifying a value for --protocols. Only I'm not sure how it is working for the tests in the repro or other actual set ups you and Dan have presumably built.

One last note- in my local.conf for devstack I am specifying an OVS_BRANCH value:
# Pin OVN past bug
# https://bugs.launchpad.net/neutron/+bug/2049488
OVS_BRANCH=4102674b3ecadb0e20e512cc661cddbbc4b3d1f6

But I wouldn't think that matters since by default devstack would just build the latest each time. Plus the build of OVS is also what provides the ovs-ofctl tool. The man page for ovs-ofctl tool says that it defaults to an older version of OpenFlow, OpenFlow 1.0 so it seems like it would be required to included the --protocols flag when operating with an up to date OVS.

This seems like a bug, do you agree?

Revision history for this message

Luis Tomas Bolivar (ltomasbo) wrote on 2024-03-01:

#8

Sorry I missed this, for some reason I was not getting notified about the replies.

When running the ovs command we check if there is a processexecutionerror and then try with specifying OpenFlow13, see this: https://opendev.org/openstack/ovn-bgp-agent/src/branch/master/ovn_bgp_agent/privileged/ovs_vsctl.py#L32,

And based on your last comment, you were also using OpenFlow13 to fix this, right? Perhaps for some reason you have an older version of processutils and it is not raising ProcessExecutionError and therefore not trying with OpenFlow13?

Revision history for this message

Andy Litzinger (alitzinger) wrote on 2024-03-01:

#9

Hi Luis,
Is there an easy way to tell what version of processutils / oslo_concurrency is installed?

I am attempting to pin to the 2023.2 stable branch and OVN past a bug via these statements in my devstack local.conf:

enable_plugin neutron https://opendev.org/openstack/neutron stable/2023.2

OVN_BUILD_FROM_SOURCE=true
OVN_BRANCH=main
#OVS_BRANCH=branch-3.2
# Pin OVN past bug
# https://bugs.launchpad.net/neutron/+bug/2049488
OVS_BRANCH=4102674b3ecadb0e20e512cc661cddbbc4b3d1f6

Both of these are recent enough that it feels like it shouldn't be the issue, right?

Revision history for this message

Andy Litzinger (alitzinger) wrote on 2024-03-01:

#10

assuming this is accurate, I activated the virtual environment and listed the python modules installed:

vagrant@rack-1-host-1:/opt/stack/devstack$ source /opt/stack/data/venv/bin/activate
(venv) vagrant@rack-1-host-1:/opt/stack/devstack$ pip list --format=freeze
<snip>
oslo.concurrency==5.2.0
<snip>
ovn-bgp-agent==1.1.0.dev50
ovs==2.17.1.post1
<snip>

Revision history for this message

Luis Tomas Bolivar (ltomasbo) wrote on 2024-03-04:

#11

interesting, that looks good (I btw sent a fix to bump ovs_branch to 3.3 https://review.opendev.org/c/openstack/ovn-bgp-agent/+/910491)

If you can reproduce this all the time, perhaps you can add some modifications to the LOG.exception messages in here https://opendev.org/openstack/ovn-bgp-agent/src/branch/master/ovn_bgp_agent/privileged/ovs_vsctl.py, to know if the one in line 36 is executed, or the one in line 40 directly

Revision history for this message

Andy Litzinger (alitzinger) wrote on 2024-03-04 (last edit on 2024-03-04):

#12

Download full text (5.8 KiB)

Should the log messages from ovs_vsctl.py end up in the ovn-bgp-agent logs? If so, I don't have any messages that match the error text from lines 36 or 40:

vagrant@rack-1-host-1:~$ sudo journalctl -u devstack@ovn-bgp-agent | grep Exception
vagrant@rack-1-host-1:~$

vagrant@rack-1-host-2:~$ sudo journalctl -u devstack@ovn-bgp-agent | grep Exception
vagrant@rack-1-host-2:~$

There is a slight difference in the errors ovn-bgp-agent is logging on my two nodes.
Node rack-1-host-1 is the controller + compute node. It has the config line to pin the neutron version and the pin to the OVS version in its local.conf . I also edited ovn_bgp_agent/drivers/openstack/utils/ovs.py so that each call for ovs-ofctl passes in the --protocols=OpenFlow13, e.g.:

--- a/ovn_bgp_agent/drivers/openstack/utils/ovs.py
+++ b/ovn_bgp_agent/drivers/openstack/utils/ovs.py
@@ -42,7 +42,7 @@ def _find_ovs_port(bridge):

def get_bridge_flows(bridge, filter_=None):
- args = ['dump-flows', bridge]
+ args = ['--protocols=OpenFlow13', 'dump-flows', bridge]
     if filter_ is not None:
         args.append(filter_)
     return ovn_bgp_agent.privileged.ovs_vsctl.ovs_cmd(
@@ -114,10 +114,10 @@ def ensure_mac_tweak_flows(bridge, mac, ports, cookie):

         if not exist_flow:
             ovn_bgp_agent.privileged.ovs_vsctl.ovs_cmd(
- 'ovs-ofctl', ['add-flow', bridge, flow])
+ 'ovs-ofctl', ['--protocols=OpenFlow13', 'add-flow', bridge, flow])
         if not exist_flow_v6:
             ovn_bgp_agent.privileged.ovs_vsctl.ovs_cmd(
- 'ovs-ofctl', ['add-flow', bridge, flow_v6])
+ 'ovs-ofctl', ['--protocols=OpenFlow13', 'add-flow', bridge, flow_v6])

<snip>

Node rack-1-host-2 is the compute node. It does NOT have the config line to pin the neutron version in its local.conf. It does have the config line to pin OVS version. It does NOT have any updates to it's ovs.py

node rack-1-host-1 is not specifically logging the ofctl version error, presumably because of my changes to ovs.py. Despite not logging the 'version negotiation failed' errors, it is not adding the flow to br-ex. Every 5 minutes it is logging these errors which I am not sure what they relate to:
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: 2024-03-04 19:37:29.522 46187 DEBUG ovn_bgp_agent.drivers.openstack.ovn_bgp_driver [-] Added BGP route for logical por>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: fail <class 'int'>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: fail <class 'str'>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: fail <class 'str'>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: fail <class 'str'>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: fail <class 'int'>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: fail <class 'int'>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: recursion
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: fail <class 'int'>
Mar 04 19:37:29 rack-1-host-1 ovn-bgp-agent[46187]: {'attrs': [('RTA_TABLE', 232), ('RTA_DST', '2001:db8::192'), ('RTA_SRC', '::'), ('RTA_PREFSRC', 'fd53:d91e:400:7f17::1>

rack-1-host-2 is logging the ofctl version mismatch error every ...

ovn-bgp-agent

Instances on same flat/vlan network but different nodes cannot communicate

Bug Description

Other bug subscribers

Remote bug watches