[RFE] Layer 2 VNI support in EVPN driver
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
ovn-bgp-agent | New | Wishlist | Unassigned |
Bug Description
Using EVPN, it is possible to extend layer-2 segments across a layer-3 fabric. This is done using L2VNIs and Type-2 EVPN routes a.k.a. MAC/IP routes. This functionality makes it possible to use layer-2 provider networks without any trunking of VLANs throughout the data centre network and onwards to the hypervisors, which is highly desirable in data centre networks at scale.
This makes it possible for an OpenStack tenant to create a provider network that, say, connects a VM running in OpenStack with some physical firewall device connected to some leaf switch in a different rack, even though the topology between the hypervisor and the leaf switch is a pure layer-3 IP fabric with EVPN signaling. This is not possible with OVN today, as far as I can tell.
It would be awesome if ovn-bgp-agent had this feature. The way I imagine it being set up goes something like this:
1. The admin creates a provider network with a dummy VLAN tag. The network should not be trunked to any physical interface on the compute node (using exactly the same trick as with the external network used with the ovn-bgp-agent's BGP driver, in other words).
2. Some attribute is added to the network/switch object in the OVN DB that binds it to a L2VNI (say neutron_
3. ovn-bgp-agent reacts to this and creates a corresponding bridge device in the Linux kernel (say br-99) and a corresponding VXLAN device (say vxlan-99)
4. ovn-bgp-agent plugs both the OVS-connected netdev (br-ex.99) and vxlan-99 into br-99.
Assuming FRR is running with the EVPN SAFI enabled on VNI 99, this ought to ensure that the MAC addresses of local ports (instances, router ports, floating addresses, etc.) will be advertised to the remote (physical) VTEPs, and MAC addresses behind the remote VTEPs will be installed locally.
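As a rough sketch, steps 1 and 2 might look something like this from the operator's side. Note that the `l2vni` external-ids key is purely hypothetical (no such binding exists in OVN or ovn-bgp-agent today); the `openstack network create` flags are the standard provider-network ones:

```shell
# Step 1: create a provider network with a dummy VLAN tag (99), on a physical
# network that is deliberately not trunked to any NIC on the compute nodes:
openstack network create --external --provider-network-type vlan \
    --provider-physical-network physnet-evpn --provider-segment 99 net-l2vni

# Step 2 (hypothetical): bind the resulting OVN logical switch to L2VNI 99
# via an external_ids key that ovn-bgp-agent would watch for:
ovn-nbctl set Logical_Switch neutron-<network-uuid> external_ids:l2vni=99
```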
I tried a little PoC and almost got it to work, but not quite (see below). I used ovn-bgp-agent in BGP mode to set up the dummy network (i.e., step #1). What I did was:
# First start ovn-bgp-agent in BGP mode so that it can do its magic to make br-ex.99 show up as a netdev
systemctl start ovn-bgp-agent
# Then stop it again so that it doesn't stomp over our manual PoC
systemctl stop ovn-bgp-agent
# Remove layer 3 stuff set up by ovn-bgp-agent which should not be necessary for an L2VNI to work (and possibly conflict too, who knows)
echo 0 > /proc/sys/
vtysh -c 'conf t' -c 'no router bgp 64999 vrf bgp-vrf'
ip -4 a flush dev bgp-nic
# Create the glue between VNI 99 and br-ex.99 (10.27.24.10 is the loopback address)
ip link add vxlan-99 type vxlan id 99 dstport 4789 local 10.27.24.10 nolearning
brctl addbr br-99
brctl addif br-99 vxlan-99
brctl addif br-99 br-ex.99
ip link set dev br-99 up
ip link set dev vxlan-99 up
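Assuming the devices were created as above, the wiring can be double-checked with standard iproute2 tooling (a sketch; output will of course vary per host):

```shell
# Show the VXLAN parameters (VNI, local tunnel endpoint, learning disabled):
ip -d link show vxlan-99

# Confirm that both vxlan-99 and br-ex.99 are enslaved to br-99:
bridge link show | grep br-99

# Watch the forwarding entries installed for remote MACs learned via EVPN;
# these point at the remote VTEP IPs:
bridge fdb show dev vxlan-99
```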
For what it is worth, the FRR running configuration now contains:
!
frr version 8.2.2
frr defaults datacenter
hostname openstack-compute
log syslog
no ipv6 forwarding
!
interface lo
ip address 10.27.24.10/32
exit
!
router bgp 64999
no bgp default ipv4-unicast
bgp bestpath as-path multipath-relax
neighbor eth0 interface remote-as external
neighbor eth1 interface remote-as external
!
address-family ipv4 unicast
redistribute connected
neighbor eth0 activate
neighbor eth0 prefix-list only-default in
neighbor eth0 prefix-list only-host-prefixes out
neighbor eth1 activate
neighbor eth1 prefix-list only-default in
neighbor eth1 prefix-list only-host-prefixes out
import vrf bgp-vrf
exit-address-family
!
address-family ipv6 unicast
neighbor eth0 activate
neighbor eth0 prefix-list only-default in
neighbor eth0 prefix-list only-host-prefixes out
neighbor eth1 activate
neighbor eth1 prefix-list only-default in
neighbor eth1 prefix-list only-host-prefixes out
import vrf bgp-vrf
exit-address-family
!
address-family l2vpn evpn
neighbor eth0 activate
neighbor eth1 activate
advertise-all-vni
exit-address-family
exit
!
ip prefix-list only-default seq 5 permit 0.0.0.0/0
ip prefix-list only-host-prefixes seq 5 permit 0.0.0.0/0 ge 32
!
ipv6 prefix-list only-default seq 5 permit ::/0
ipv6 prefix-list only-host-prefixes seq 5 permit ::/0 ge 128
!
ip nht resolve-via-default
!
end
At this point, the L2VNI does look good. I can now see that the MAC addresses of the floating IP and router ports on the compute node show up on the remote VTEP (a switch running Cumulus Linux):
cumulus-switch# show evpn mac vni 99 vtep 10.27.24.10
VNI 99
MAC Type Flags Intf/Remote ES/VTEP VLAN Seq #'s
fa:16:3e:ea:b3:1b remote 10.27.24.10 1/0
fa:16:3e:ec:b2:24 remote 10.27.24.10 0/0
fa:16:3e:69:92:4f remote 10.27.24.10 0/0
And the BGP advertisements:
cumulus-switch# show bgp l2vpn evpn rd 10.27.24.10:2
BGP table version is 34, local router ID is 192.0.2.4
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]:[Frag-id]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
Route Distinguisher: 10.27.24.10:2
* i[2]:[0]
* i 10.27.24.10 100 0 64901 64903 64999 i
* i 10.27.24.10 100 0 64901 64903 64999 i
* 10.27.24.10 0 64902 64903 64999 i
*> 10.27.24.10 0 64901 64903 64999 i
* i[2]:[0]
* i 10.27.24.10 100 0 64901 64903 64999 i
* i 10.27.24.10 100 0 64901 64903 64999 i
*> 10.27.24.10 0 64901 64903 64999 i
* 10.27.24.10 0 64902 64903 64999 i
* i[2]:[0]
* i 10.27.24.10 100 0 64901 64903 64999 i
* i 10.27.24.10 100 0 64901 64903 64999 i
* 10.27.24.10 0 64902 64903 64999 i
*> 10.27.24.10 0 64901 64903 64999 i
* i[3]:[0]
* i 10.27.24.10 100 0 64901 64903 64999 i
* i 10.27.24.10 100 0 64901 64903 64999 i
* 10.27.24.10 0 64902 64903 64999 i
*> 10.27.24.10 0 64901 64903 64999 i
Displayed 4 out of 20 total prefixes
Note also the Type-3 route, which tells the remote VTEP to use head-end replication for BUM traffic.
It looks good on the OpenStack compute node as well:
openstack-compute# show evpn mac vni 99
Number of MACs (local and remote) known for this VNI: 15
Flags: N=sync-neighs, I=local-inactive, P=peer-active, X=peer-proxy
MAC Type Flags Intf/Remote ES/VTEP VLAN Seq #'s
80:a2:35:02:3d:77 remote 192.0.2.0 0/0
00:00:5e:00:01:00 remote 192.0.2.35 0/0
fa:16:3e:ea:b3:1b local br-ex.99 0/0
fa:16:3e:9e:a0:e5 remote 192.0.2.33 0/18
(…)
I can now send ARPs from the remote VTEP to a floating IP; they are responded to, and everything is fine:
cumulus-switch$ sudo arping -I vlan99 87.238.55.12
ARPING 87.238.55.12
58 bytes from fa:16:3e:ec:b2:24 (87.238.55.12): index=0 time=474.901 usec
58 bytes from fa:16:3e:ec:b2:24 (87.238.55.12): index=1 time=412.797 usec
I see them passing through the br-99 I made on the compute node as well:
openstack-compute# tcpdump -i br-99 icmp or 'host 87.238.55.12' -e -c2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on br-99, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:55:24.998600 64:9d:99:3a:34:58 (oui Unknown) > Broadcast, ethertype ARP (0x0806), length 58: Request who-has 87.238.55.12 tell 87.238.54.5, length 44
13:55:24.998838 fa:16:3e:ec:b2:24 (oui Unknown) > 64:9d:99:3a:34:58 (oui Unknown), ethertype ARP (0x0806), length 58: Reply 87.238.55.12 is-at fa:16:3e:ec:b2:24 (oui Unknown), length 44
However, when I start to ping, I get into trouble. While the pings also pass through br-99, the responses from the VM have the wrong destination MAC instead of the MAC of the pinging host:
openstack-compute# tcpdump -i br-99 icmp or 'host 87.238.55.12' -e -c2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on br-99, link-type EN10MB (Ethernet), snapshot length 262144 bytes
13:57:03.612631 64:9d:99:3a:34:58 (oui Unknown) > fa:16:3e:ec:b2:24 (oui Unknown), ethertype IPv4 (0x0800), length 98: 87.238.54.5 > 87.238.55.12: ICMP echo request, id 14979, seq 4, length 64
13:57:03.612722 fa:16:3e:ec:b2:24 (oui Unknown) > 0a:c3:9f:2c:19:40 (oui Unknown), ethertype IPv4 (0x0800), length 98: 87.238.55.12 > 87.238.54.5: ICMP echo reply, id 14979, seq 4, length 64
The destination MAC 0a:c3:9f:2c:19:40 is local to the compute node (it resides on br-ex), which causes the replies not to be transmitted out vxlan-99 (if I tcpdump on vxlan-99, I only see the ping requests).
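A quick way to confirm that the rogue destination MAC is indeed local is to check br-ex and the kernel bridge's own forwarding table (a sketch; MACs and device names obviously differ per deployment):

```shell
# If 0a:c3:9f:2c:19:40 shows up here, it is the MAC of br-ex itself:
ip link show br-ex

# The FDB of br-99 shows which port it considers the MAC to be behind,
# i.e. why br-99 never sends the replies out vxlan-99:
bridge fdb show br br-99 | grep 0a:c3:9f:2c:19:40
```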
Also I can see that on the tap interface connected to the VM with the floating IP assigned, yet another MAC address is in use:
openstack-compute$ sudo tcpdump -i tap62ed80fd-e1 -e -c2
dropped privs to tcpdump
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on tap62ed80fd-e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
14:02:08.765034 fa:16:3e:aa:10:3e (oui Unknown) > fa:16:3e:46:bc:4a (oui Unknown), ethertype IPv4 (0x0800), length 98: 87.238.54.5 > 10.0.0.103: ICMP echo request, id 14979, seq 302, length 64
14:02:08.765108 fa:16:3e:46:bc:4a (oui Unknown) > fa:16:3e:aa:10:3e (oui Unknown), ethertype IPv4 (0x0800), length 98: 10.0.0.103 > 87.238.54.5: ICMP echo reply, id 14979, seq 302, length 64
fa:16:3e:aa:10:3e here belongs to the LRP on the tenant network (i.e., where the 10.0.0.1 gateway address is), which makes sense.
So that is the missing piece of the puzzle. There must be some kind of NAT/floating-IP flow magic inside OVN/OVS that causes the destination MAC address to come out wrong, perhaps set up by ovn-bgp-agent initially (a local DMAC makes sense in its intended operation mode, after all). I don't know, but I assume that someone with more experience would be able to figure it out. I'm pretty sure that if those ping replies had DMAC 64:9d:99:3a:34:58 once they entered br-99, everything would have worked perfectly and I would have been able to get IP traffic forwarded across the L2VNI.
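One place to start digging would be the NAT entries on the tenant router: OVN's dnat_and_snat entries can carry an external MAC, and how that interacts with the flows ovn-bgp-agent installs might explain the rewritten DMAC. A diagnostic sketch (the router name is illustrative):

```shell
# List NAT rules on the tenant router; dnat_and_snat entries may have an
# external MAC and logical port associated with the floating IP:
ovn-nbctl lr-nat-list neutron-<router-uuid>

# Dump the OpenFlow rules on br-ex, including any that ovn-bgp-agent
# installed to rewrite MACs in its intended (routed) mode of operation:
ovs-ofctl dump-flows br-ex
```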
Changed in ovn-bgp-agent:
importance: Undecided → Wishlist
From Tore Anderson:
I forgot to mention one thing: an EVPN Type-2 MAC/IP route can optionally include an IP address, as exemplified by the advertisement from the remote VTEP here:
openstack-compute$ sudo vtysh -c 'sh bgp l2vpn evpn rd 192.0.2.4:2'
BGP table version is 7, local router ID is 192.0.2.1
Status codes: s suppressed, d damped, h history, * valid, > best, i - internal
Origin codes: i - IGP, e - EGP, ? - incomplete
EVPN type-1 prefix: [1]:[ESI]:[EthTag]:[IPlen]:[VTEP-IP]:[Frag-id]
EVPN type-2 prefix: [2]:[EthTag]:[MAClen]:[MAC]:[IPlen]:[IP]
EVPN type-3 prefix: [3]:[EthTag]:[IPlen]:[OrigIP]
EVPN type-4 prefix: [4]:[ESI]:[IPlen]:[OrigIP]
EVPN type-5 prefix: [5]:[EthTag]:[IPlen]:[IP]
Network Next Hop Metric LocPrf Weight Path
[2]:[0]:[48]:[00:00:5e:00:01:00]:[32]:[87.238.54.1]
192.0.2.35 0 64902 39029 i
RT:39029:99 ET:8 Default Gateway
Route Distinguisher: 192.0.2.4:2
* [2]:[0]
*> 192.0.2.35 0 64901 39029 i
(…)
This information is useful when enabling ARP/ND suppression. The OpenStack compute node could, using this IP/MAC coupling information, answer ARP requests for 87.238.54.1 with the correct MAC locally, preventing the need to flood the ARP request packet across the L2VNI to all remote VTEPs. (Same for IPv6 Neighbour Solicitations.)
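For what it's worth, the Linux kernel already supports this once FRR has the MAC/IP information: ARP/ND suppression is enabled per bridge port, after which the bridge answers ARP requests and Neighbour Solicitations locally from the neighbour entries FRR installs. A sketch against the PoC devices from earlier:

```shell
# Enable ARP/ND suppression on the VXLAN bridge port; the kernel then
# answers ARP/NS from its neighbour table instead of flooding the request
# across the L2VNI (requires the IP part of the Type-2 routes):
bridge link set dev vxlan-99 neigh_suppress on

# Inspect the neighbour entries installed from remote Type-2 MAC/IP routes:
ip neigh show dev br-99
```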
In a similar fashion, ovn-bgp-agent and/or FRR could include the IP addresses when advertising the local MAC addresses to the remote VTEPs. The MAC/IP bindings are of course known in advance, since they are part of the port object in OpenStack. Advertising them would allow the remote VTEPs to suppress ARPs for VM IPs, floating IPs, etc., answering them locally instead.
Of course, this is just a "nice-to-have" / v2.0 feature - after all, the current VLAN-based provider networks can only do flooding, so having flooding with L2VNI-based provider networks is definitely not a regression.