[RFE] Layer 2 VNI support in EVPN driver

Bug #2017890 reported by Luis Tomas Bolivar
This bug affects 3 people
Affects: ovn-bgp-agent
Status: New
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Using EVPN, it is possible to extend layer-2 segments across a layer-3 fabric. This is done using L2VNIs and Type-2 EVPN routes a.k.a. MAC/IP routes. This functionality makes it possible to use layer-2 provider networks without any trunking of VLANs throughout the data centre network and onwards to the hypervisors, which is highly desirable in data centre networks at scale.

This makes it possible for an OpenStack tenant to create a provider network that, say, connects a VM running in OpenStack with some physical firewall device connected to some leaf switch in a different rack, even though the topology between the hypervisor and the leaf switch is a pure layer-3 IP fabric with EVPN signaling. This is not possible with OVN today, as far as I can tell.

It would be awesome if ovn-bgp-agent had this feature. The way I can imagine it being set up goes something like this:
1. The admin creates a provider network with a dummy VLAN tag. The network should not be trunked to any physical interface on the compute node (using exactly the same trick as with the external network used with the ovn-bgp-agent's BGP driver, in other words).
2. Some attribute is added to the network/switch object in the OVN DB that binds it to a L2VNI (say neutron_bgpvpn:vni=99)
3. ovn-bgp-agent reacts to this and creates a corresponding bridge device in the Linux kernel (say br-99) and a corresponding VXLAN device (say vxlan-99)
4. ovn-bgp-agent plugs both the OVS-connected netdev (br-ex.99) and vxlan-99 into br-99.

Assuming FRR is running with the EVPN SAFI enabled on VNI 99, this ought to ensure that the MAC addresses of local ports (instances, router ports, floating addresses, etc.) will be advertised to the remote (physical) VTEPs, and that the MAC addresses behind the remote VTEPs will be installed locally.
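Step 2 above could, for example, be modelled as an external_ids entry on the provider network's Logical_Switch row in the OVN northbound DB. This is only a sketch of how the binding might look; the switch name "neutron-provnet" and the key name are illustrative, not an existing convention:

```shell
# Hypothetical: bind the provider network's Logical_Switch to L2VNI 99.
# Switch name and key name are made up for illustration.
ovn-nbctl set Logical_Switch neutron-provnet \
    external_ids:"neutron_bgpvpn\:vni"=99

# ovn-bgp-agent could then watch for this key and create br-99/vxlan-99
# in response; inspect the column to confirm the value was stored:
ovn-nbctl --columns=external_ids list Logical_Switch neutron-provnet
```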

I tried to do a little PoC, and I have almost gotten it to work, but not quite (see below). I used ovn-bgp-agent in BGP mode to set up the dummy network (i.e., step #1). What I did was:

# First start ovn-bgp-agent in BGP mode so that it can do its magic to make br-ex.99 show up as a netdev
systemctl start ovn-bgp-agent

# Then stop it again so that it doesn't stomp over our manual PoC
systemctl stop ovn-bgp-agent

# Remove layer 3 stuff set up by ovn-bgp-agent which should not be necessary for an L2VNI to work (and possibly conflict too, who knows)
echo 0 > /proc/sys/net/ipv4/conf/br-ex.99/proxy_arp
vtysh -c 'conf t' -c 'no router bgp 64999 vrf bgp-vrf'
ip -4 a flush dev bgp-nic

# Create the glue between VNI 99 and br-ex.99 (10.27.24.10 is the loopback address)
ip link add vxlan-99 type vxlan id 99 dstport 4789 local 10.27.24.10 nolearning
brctl addbr br-99
brctl addif br-99 vxlan-99
brctl addif br-99 br-ex.99
ip link set dev br-99 up
ip link set dev vxlan-99 up
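To sanity-check the resulting plumbing (a sketch, assuming the device names used above), one can verify that both netdevs are enslaved to br-99 and that the VXLAN parameters took effect:

```shell
# Both vxlan-99 and br-ex.99 should be listed as ports of br-99
ip link show master br-99

# Confirm the VNI, destination port and local tunnel address on the
# VXLAN device (the -d flag prints type-specific details)
ip -d link show vxlan-99
```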
For what it is worth, the FRR running configuration now contains:

!
frr version 8.2.2
frr defaults datacenter
hostname openstack-compute
log syslog
no ipv6 forwarding
!
interface lo
 ip address 10.27.24.10/32
exit
!
router bgp 64999
 no bgp default ipv4-unicast
 bgp bestpath as-path multipath-relax
 neighbor eth0 interface remote-as external
 neighbor eth1 interface remote-as external
 !
 address-family ipv4 unicast
  redistribute connected
  neighbor eth0 activate
  neighbor eth0 prefix-list only-default in
  neighbor eth0 prefix-list only-host-prefixes out
  neighbor eth1 activate
  neighbor eth1 prefix-list only-default in
  neighbor eth1 prefix-list only-host-prefixes out
  import vrf bgp-vrf
 exit-address-family
 !
 address-family ipv6 unicast
  neighbor eth0 activate
  neighbor eth0 prefix-list only-default in
  neighbor eth0 prefix-list only-host-prefixes out
  neighbor eth1 activate
  neighbor eth1 prefix-list only-default in
  neighbor eth1 prefix-list only-host-prefixes out
  import vrf bgp-vrf
 exit-address-family
 !
 address-family l2vpn evpn
  neighbor eth0 activate
  neighbor eth1 activate
  advertise-all-vni
 exit-address-family
exit
!
ip prefix-list only-default seq 5 permit 0.0.0.0/0
ip prefix-list only-host-prefixes seq 5 permit 0.0.0.0/0 ge 32
!
ipv6 prefix-list only-default seq 5 permit ::/0
ipv6 prefix-list only-host-prefixes seq 5 permit ::/0 ge 128
!
ip nht resolve-via-default
!
end
At this point, the L2VNI does look good. I can now see that the MAC addresses of the floating IP and router ports on the compute node show up as remote on the remote VTEP (a switch running Cumulus Linux):

cumulus-switch# show evpn mac vni 99 vtep 10.27.24.10

VNI 99

MAC               Type   Flags Intf/Remote ES/VTEP VLAN Seq #'s
fa:16:3e:ea:b3:1b remote 10.27.24.10 1/0
fa:16:3e:ec:b2:24 remote 10.27.24.10 0/0
fa:16:3e:69:92:4f remote 10.27.24.10 0/0
And the BGP advertisements:

cumulus-switch# show bgp l2vpn evpn rd 10.27.24.10:2
BGP table version is 34, local router ID is 192.0.2.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.27.24.10:2
* i[2]:[0]:[48]:[fa:16:3e:69:92:4f]
                    10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* 10.27.24.10 0 64902 64903 64999 i
                    RT:64999:99 ET:8
*> 10.27.24.10 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i[2]:[0]:[48]:[fa:16:3e:ea:b3:1b]
                    10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
*> 10.27.24.10 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* 10.27.24.10 0 64902 64903 64999 i
                    RT:64999:99 ET:8
* i[2]:[0]:[48]:[fa:16:3e:ec:b2:24]
                    10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* 10.27.24.10 0 64902 64903 64999 i
                    RT:64999:99 ET:8
*> 10.27.24.10 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i[3]:[0]:[32]:[10.27.24.10]
                    10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* i 10.27.24.10 100 0 64901 64903 64999 i
                    RT:64999:99 ET:8
* 10.27.24.10 0 64902 64903 64999 i
                    RT:64999:99 ET:8
*> 10.27.24.10 0 64901 64903 64999 i
                    RT:64999:99 ET:8

Displayed 4 out of 20 total prefixes
Note also the Type-3 route, which tells the remote VTEP to use head-end replication to flood BUM traffic to us; this is necessary for ARP etc. to work across the L2VNI. So all looks normal on the remote VTEP.
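On the receiving side, zebra implements head-end replication by installing an all-zeroes FDB entry on the VXLAN device for each remote VTEP that advertised a Type-3 route. Assuming the device names from the PoC above, this can be checked on the compute node:

```shell
# Each remote VTEP with a Type-3 route for VNI 99 should appear as a
# 00:00:00:00:00:00 flood entry on the VXLAN device; BUM frames are
# unicast-replicated to every dst listed here
bridge fdb show dev vxlan-99 | grep 00:00:00:00:00:00
```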

It looks good on the openstack compute node as well:

openstack-compute# show evpn mac vni 99
Number of MACs (local and remote) known for this VNI: 15
Flags: N=sync-neighs, I=local-inactive, P=peer-active, X=peer-proxy
MAC Type Flags Intf/Remote ES/VTEP VLAN Seq #'s
80:a2:35:02:3d:77 remote 192.0.2.0 0/0
00:00:5e:00:01:00 remote 192.0.2.35 0/0
fa:16:3e:ea:b3:1b local br-ex.99 0/0
fa:16:3e:9e:a0:e5 remote 192.0.2.33 0/18
(…)
So what I can now do is send ARPs from the remote VTEP to a floating IP; they are responded to and everything is fine:

cumulus-switch$ sudo arping -I vlan99 87.238.55.12
ARPING 87.238.55.12
58 bytes from fa:16:3e:ec:b2:24 (87.238.55.12): index=0 time=474.901 usec
58 bytes from fa:16:3e:ec:b2:24 (87.238.55.12): index=1 time=412.797 usec
I see them passing through the br-99 I made on the compute node as well:

openstack-compute# tcpdump -i br-99 icmp or 'host 87.238.55.12' -e -c2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on br-99, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:55:24.998600 64:9d:99:3a:34:58 (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 58: Request who-has 87.238.55.12 tell 87.238.54.5, length 44
13:55:24.998838 fa:16:3e:ec:b2:24 (oui Unknown) > 64:9d:99:3a:34:58 (oui Unknown), ethertype ARP (0x0806), length 58: Reply 87.238.55.12 is-at fa:16:3e:ec:b2:24 (oui Unknown), length 44
However, it is when I start to ping that I get into trouble. While the pings also pass through br-99, the responses from the VM have the wrong destination MAC instead of the MAC of the pinging host:

openstack-compute# tcpdump -i br-99 icmp or 'host 87.238.55.12' -e -c2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on br-99, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:57:03.612631 64:9d:99:3a:34:58 (oui Unknown) > fa:16:3e:ec:b2:24 (oui Unknown), ethertype IPv4 (0x0800), length 98: 87.238.54.5 > 87.238.55.12: ICMP echo request, id 14979, seq 4, length 64
13:57:03.612722 fa:16:3e:ec:b2:24 (oui Unknown) > 0a:c3:9f:2c:19:40 (oui Unknown), ethertype IPv4 (0x0800), length 98: 87.238.55.12 > 87.238.54.5: ICMP echo reply, id 14979, seq 4, length 64
The destination MAC 0a:c3:9f:2c:19:40 is local to the compute node (it resides on br-ex), which causes the replies not to be transmitted out vxlan-99 (if I tcpdump on vxlan-99, I only see the ping requests).

Also I can see that on the tap interface connected to the VM with the floating IP assigned, yet another MAC address is in use:

openstack-compute$ sudo tcpdump -i tap62ed80fd-e1 -e -c2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap62ed80fd-e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:02:08.765034 fa:16:3e:aa:10:3e (oui Unknown) > fa:16:3e:46:bc:4a (oui Unknown), ethertype IPv4 (0x0800), length 98: 87.238.54.5 > 10.0.0.103: ICMP echo request, id 14979, seq 302, length 64
14:02:08.765108 fa:16:3e:46:bc:4a (oui Unknown) > fa:16:3e:aa:10:3e (oui Unknown), ethertype IPv4 (0x0800), length 98: 10.0.0.103 > 87.238.54.5: ICMP echo reply, id 14979, seq 302, length 64
fa:16:3e:aa:10:3e here belongs to the LRP port on the tenant network (i.e. where the 10.0.0.1 gateway address is), which makes sense.

So that is the missing piece of the puzzle. There must be some kind of floating-IP NAT flow magic inside OVN/OVS that causes the destination MAC address to be rewritten, perhaps set up by ovn-bgp-agent initially (a local DMAC makes sense in its intended operation mode, after all). I don't know, but I assume that someone with more experience would be able to figure it out. I'm pretty sure that if those ping replies had DMAC 64:9d:99:3a:34:58 once they entered br-99, everything would have worked perfectly and I would have been able to get IP traffic forwarded across the L2VNI.

Revision history for this message
Luis Tomas Bolivar (ltomasbo) wrote :

From Tore Anderson:
I forgot to mention one thing: an EVPN Type-2 (MAC/IP) route can optionally include an IP address, as exemplified by the advertisements from the remote VTEP here:

openstack-compute$ sudo vtysh -c 'sh bgp l2vpn evpn rd 192.0.2.4:2'
BGP table version is 7, local router ID is 192.0.2.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]:[Frag-id]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]

   Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 192.0.2.4:2
* [2]:[0]:[48]:[00:00:5e:00:01:00]:[32]:[87.238.54.1]
                    192.0.2.35 0 64902 39029 i
                    RT:39029:99 ET:8 Default Gateway
*> 192.0.2.35 0 64901 39029 i
                    RT:39029:99 ET:8 Default Gateway
(…)
This information is useful when enabling ARP/ND suppression. The OpenStack compute node could, using this IP/MAC coupling information, answer ARP requests for 87.238.54.1 with the correct MAC locally, preventing the need to flood the ARP request packet across the L2VNI to all remote VTEPs. (Same for IPv6 Neighbour Solicitations.)

In a similar fashion, ovn-bgp-agent and/or FRR could include the IP addresses when advertising the local MAC addresses to the remote VTEPs. The MAC/IP bindings are of course known in advance, since they are part of the port object in OpenStack. Advertising them would allow the remote VTEPs to suppress ARPs for VMs/floating IPs/etc., answering them locally instead.
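On the Linux side, ARP/ND suppression is a per-bridge-port toggle, so a future driver could simply enable it when plugging the VXLAN device into the bridge. A sketch, assuming the vxlan-99 device from the PoC above:

```shell
# Ask the kernel to answer ARP requests / Neighbour Solicitations
# locally from the neighbour table (which zebra populates from
# received MAC/IP Type-2 routes) instead of flooding them across
# the L2VNI to all remote VTEPs
bridge link set dev vxlan-99 neigh_suppress on

# Verify the flag is set on the bridge port
bridge -d link show dev vxlan-99
```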

Of course, this is just a "nice-to-have" / v2.0 feature. After all, the current VLAN-based provider networks can only do flooding, so having flooding with L2VNI-based provider networks is definitely not a regression.

Revision history for this message
Luis Tomas Bolivar (ltomasbo) wrote :

From Tore Anderson:
So for the record, @ltomasbo identified the missing piece of the puzzle immediately - the following flows were set up by ovn-bgp-agent at startup and are responsible for rewriting the destination MAC of the responses:
openstack-compute$ sudo ovs-ofctl dump-flows br-ex
 cookie=0x3e7, duration=11052.208s, table=0, n_packets=10976, n_bytes=1045131, priority=900,ip,in_port="patch-provnet-2" actions=mod_dl_dst:0a:c3:9f:2c:19:40,NORMAL
 cookie=0x3e7, duration=11052.200s, table=0, n_packets=0, n_bytes=0, priority=900,ipv6,in_port="patch-provnet-2" actions=mod_dl_dst:0a:c3:9f:2c:19:40,NORMAL
 cookie=0x0, duration=11455.756s, table=0, n_packets=2184469, n_bytes=101125352, priority=0 actions=NORMAL
Got rid of them with ovs-ofctl del-flows br-ex cookie=0x3e7/-1, and the traffic now flows fine through the L2VNI! :-)

Revision history for this message
Luis Tomas Bolivar (ltomasbo) wrote :

From Tore Anderson:
After some more testing I think I have an even better and simpler PoC than the one described initially, one that does not need me to run ovn-bgp-agent first. This is all it takes:

### The following needs to be done ONCE
# 1) Create and enable a VLAN-aware bridge device
ip link add br-vxlan type bridge
echo 1 > /sys/class/net/br-vxlan/bridge/vlan_filtering
ip link set dev br-vxlan up

# 2) Connect the OVS provider net bridge to the Linux bridge
ip link set br-ex master br-vxlan

# 3) Optionally disallow untagged traffic to the Linux bridge
bridge vlan del vid 1 dev br-ex

### The following needs to be done for each provider network to be glued to a L2VNI:
# 1) Allow traffic on the provider VLAN to flow into the Linux bridge from the OVS bridge
bridge vlan add vid 99 dev br-ex

# 2) Create and enable a VXLAN device for target L2VNI (note how VNI value does not need to be equal to VLAN tag value)
ip link add vxlan-1234 type vxlan id 1234 dstport 4789 local <loopback-address> nolearning
ip link set dev vxlan-1234 up

# 3) Connect the VXLAN device to the Linux bridge and move it to the correct VLAN
ip link set vxlan-1234 master br-vxlan
bridge vlan del vid 1 dev vxlan-1234
bridge vlan add vid 99 dev vxlan-1234 pvid untagged
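To confirm the VLAN-to-VNI glue after running the commands above, the bridge's VLAN membership can be inspected (a sketch; exact output formatting varies between iproute2 versions):

```shell
# br-ex should carry VLAN 99 tagged, so provider traffic flows into
# the Linux bridge; vxlan-1234 should have VLAN 99 as untagged PVID,
# which is what maps the VNI onto the provider VLAN
bridge vlan show dev br-ex
bridge vlan show dev vxlan-1234
```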
I also tried something even simpler, essentially plugging the vxlan device directly into the OVS bridge. While the commands succeed, it does not work with EVPN. Presumably this is because FRR does not know how to communicate with OVS bridges, which it needs to do in order to find out which MAC addresses to advertise in BGP and so on. Therefore the intermediate Linux bridge seems necessary, unless OVS bridge support can somehow be enabled in FRR. In any case, this is what I tried:

# Create and enable a VXLAN device for target L2VNI (same as above)
ip link add vxlan-1234 type vxlan id 1234 dstport 4789 local <loopback-address> nolearning
ip link set dev vxlan-1234 up

# Connect VXLAN device to OVS provider bridge (but it does NOT work)
ovs-vsctl add-port br-ex vxlan-1234 tag=99

Revision history for this message
Maximilian Sesterhenn (msnatepg) wrote :

Sorry for the change of information type some seconds ago. I reverted it.
Not sure why I have the rights for this... ;)

information type: Public → Public Security
information type: Public Security → Public
Changed in ovn-bgp-agent:
importance: Undecided → Wishlist