L2pop flows are lost after OVS agent restart
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
neutron | Confirmed | Undecided | Unassigned | |
Bug Description
In OVS agent, there is a race condition between l2pop's add_fdb_entries notification and provision_
After this the flows are lost semi-permanently, because the l2pop mechanism driver only sends the full list of fdb entries after a port_update_up for the first port on an agent, or after an OVS reboot (where we hit the same race condition again, or the flows are only partially restored).
Legacy testbed w/ 3 nodes. 4 tenant networks:
1. The add_fdb_entries code path creates the tunnel port(s) in add_fdb_tun, then invokes add_fdb_flow to add the BC/UC l2pop flows, but only if it can get a vlanmanager mapping:
def fdb_add(self, context, fdb_entries):
for lvm, agent_ports in self.get_
if len(agent_ports):
if not self.enable_
def get_agent_
For each known (i.e found in VLAN manager) network in
:param fdb_entries: l2pop fdb entries
:param local_vlan_map: Deprecated.
"""
lvm_getter = self._get_
for network_id, values in fdb_entries.
try:
lvm = lvm_getter(
except vlanmanager.
yield (lvm, agent_ports)
2. If the vlan mapping isn't found, the tunnel port creation is skipped, as are flows.
3. When we create VLAN mapping in provision_
def provision_
...
if network_type in constants.
if self.enable_
# outbound broadcast/multicast
if ofports:
# inbound from tunnels: set lvid in the right table
# and resubmit to Table LEARN_FROM_TUN for mac learning
4. Finally, the cleanup stale flows logic removes all old flows. At this point br-tun is left with missing flooding and/or unicast flows.
5. If #3 always happens first for all networks, we are good. Otherwise flows are lost:
Only unicast flows are missing (flood flows are added) if:
- The network's vlanmanager mapping is allocated *after* its add_fdb_entries, but some other network has already set up tunnel ports on br-tun
Broadcast AND unicast flows are missing if:
- A network tries to add fdb flows before its vlanmanager mapping is allocated, and no other network has created the tunnel ports/ofports on br-tun yet.
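Steps 1 and 2 hinge on the fdb generator silently skipping any network that has no mapping yet. The excerpts above are truncated, so the following is only a minimal, self-contained sketch of that pattern. `MappingNotFound`, `lookup_mapping`, `iter_known_networks` and the dict-backed mapping table are hypothetical stand-ins, not the agent's real names.

```python
class MappingNotFound(Exception):
    """Hypothetical stand-in for the 'no local VLAN mapping' error."""


def lookup_mapping(mappings, network_id):
    """Return the mapping for a network or raise MappingNotFound."""
    try:
        return mappings[network_id]
    except KeyError:
        raise MappingNotFound(network_id)


def iter_known_networks(fdb_entries, mappings):
    """Yield (mapping, agent_ports) only for networks with a mapping.

    Networks without a mapping are skipped without any error, which is
    why their tunnel ports and flows are never created.
    """
    for network_id, values in fdb_entries.items():
        try:
            lvm = lookup_mapping(mappings, network_id)
        except MappingNotFound:
            continue  # skipped silently: nothing is queued for later
        yield lvm, values.get("ports", {})


# Assumed shape of an fdb notification, for illustration only.
fdb = {
    "net-1": {"ports": {"10.0.0.2": ["fa:16:3e:00:00:01"]}},
    "net-2": {"ports": {"10.0.0.3": ["fa:16:3e:00:00:02"]}},
}
known = list(iter_known_networks(fdb, {"net-2": 5}))
```

Here only `net-2` is yielded; the entries for `net-1` are dropped, and nothing replays them once its mapping appears.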
Example with 3 tenant networks:
1. add_fdb_entries for network 1 and 2 - no LVM yet, so flow and tunnel ports not created yet
2. LVM created for network 2, but flood not installed because no ofports
3. LVM created for network 3
4. add_fdb_entries for network 3, here it properly finds the LVM, and creates tunnel ports/flows
5. LVM created for network 1, tunnel ofports created, so flood installed - but unicast missing
After this point, network 3 would be fine, network 2 would be missing all flows, and network 1 would have flood but not unicast.
The ordering seems to vary wildly depending on # of tunnel ports, # of networks, ports per network, how ports are distributed, network speed, etc...
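The three-network example can be replayed with a small model of the rules described in this report. This is a simplified sketch, not the agent's code: the event names `"fdb"` and `"lvm"`, the `replay` function and the single `tunnel_ports` flag are assumptions made for illustration.

```python
def replay(events):
    """Replay ("fdb", net) and ("lvm", net) events in order.

    Returns, per network, which flow types end up installed, following
    the rules stated in the report:
    - an fdb event creates tunnel ports and flood + unicast flows, but
      only if the network already has a mapping;
    - an lvm event creates the mapping and installs the flood flow only
      if tunnel ofports already exist.
    """
    mapped = set()        # networks with a local VLAN mapping
    tunnel_ports = False  # True once any network created tunnel ofports
    flows = {}
    for kind, net in events:
        state = flows.setdefault(net, {"flood": False, "unicast": False})
        if kind == "fdb":
            if net not in mapped:
                continue  # no mapping yet: ports and flows are skipped
            tunnel_ports = True
            state["flood"] = state["unicast"] = True
        elif kind == "lvm":
            mapped.add(net)
            if tunnel_ports:
                state["flood"] = True
        else:
            raise ValueError("unknown event kind: %r" % (kind,))
    return flows


# The ordering from the example with 3 tenant networks.
example = [
    ("fdb", 1), ("fdb", 2),  # no LVM yet for either network
    ("lvm", 2),              # no ofports yet, so no flood flow
    ("lvm", 3),
    ("fdb", 3),              # LVM found: tunnel ports and flows created
    ("lvm", 1),              # ofports exist: flood only
]
result = replay(example)
```

With this ordering the model leaves network 3 with both flow types, network 2 with none, and network 1 with flood only. Reordering the events so that every `"lvm"` precedes the matching `"fdb"` gives all networks both flow types.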
tags: added: l2-pop ovs
This test has 4 networks, with a VM from each network on both of the 2 compute nodes. The OVS agent was restarted on the network node (legacy deployment). Tunnel 0x75b34 is missing unicast flows. Tunnel 0x75b1d is missing both.
BEFORE:
cookie=0x9f710356180d5375, duration=872.577s, table=0, n_packets=262, n_bytes=27320, idle_age=83, priority=1,in_port=1 actions=resubmit(,2)
cookie=0x9f710356180d5375, duration=526.306s, table=0, n_packets=231, n_bytes=22502, idle_age=180, priority=1,in_port=7 actions=resubmit(,4)
cookie=0x9f710356180d5375, duration=422.125s, table=0, n_packets=106, n_bytes=10221, idle_age=140, priority=1,in_port=8 actions=resubmit(,4)
cookie=0x9f710356180d5375, duration=872.574s, table=0, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop
cookie=0x9f710356180d5375, duration=872.573s, table=2, n_packets=195, n_bytes=22750, idle_age=122, priority=0,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,20)
cookie=0x9f710356180d5375, duration=872.572s, table=2, n_packets=67, n_bytes=4570, idle_age=83, priority=0,dl_dst=01:00:00:00:00:00/01:00:00:00:00:00 actions=resubmit(,22)
cookie=0x9f710356180d5375, duration=872.569s, table=3, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop
cookie=0x9f710356180d5375, duration=863.798s, table=4, n_packets=27, n_bytes=4311, idle_age=197, priority=1,tun_id=0x75ae7 actions=mod_vlan_vid:2,resubmit(,10)
cookie=0x9f710356180d5375, duration=861.078s, table=4, n_packets=202, n_bytes=18087, idle_age=140, priority=1,tun_id=0x75b10 actions=mod_vlan_vid:1,resubmit(,10)
cookie=0x9f710356180d5375, duration=770.920s, table=4, n_packets=18, n_bytes=2874, idle_age=282, priority=1,tun_id=0x75b1d actions=mod_vlan_vid:4,resubmit(,10)
cookie=0x9f710356180d5375, duration=756.222s, table=4, n_packets=90, n_bytes=7451, idle_age=346, priority=1,tun_id=0x75b34 actions=mod_vlan_vid:5,resubmit(,10)
cookie=0x9f710356180d5375, duration=872.568s, table=4, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop
cookie=0x9f710356180d5375, duration=872.567s, table=6, n_packets=0, n_bytes=0, idle_age=65534, priority=0 actions=drop
cookie=0x9f710356180d5375, duration=872.566s, table=10, n_packets=337, n_bytes=32723, idle_age=140, priority=1 actions=learn(table=20,hard_timeout=300,priority=1,cookie=0x9f710356180d5375,NXM_OF_VLAN_TCI[0..11],NXM_OF_ETH_DST[]=NXM_OF_ETH_SRC[],load:0->NXM_OF_VLAN_TCI[],load:NXM_NX_TUN_ID[]->NXM_NX_TUN_ID[],output:NXM_OF_IN_PORT[]),output:1
cookie=0x9f710356180d5375, duration=526.301s, table=20, n_packets=31, n_bytes=3520, idle_age=458, priority=2,dl_vlan=5,dl_dst=fa:16:3e:5f:25:2e actions=strip_vlan,set_tunnel:0x75b34,output:7
cookie=0x9f710356180d5375, duration=513.346s, table=20, n_packets=0, n_bytes=0, idle_age=513, priority=2,dl_vlan=4,dl_dst=fa:16:3e:51:92:0d actions=strip_vlan,set_tunnel:0x75b1d,output:7
cookie=0x9f710356180d5375, duration=422.479s, table=20, n_packets=35, n_bytes=3964, idle_age=346, priority=2,dl_vlan=5,dl_dst=fa:16:3e:39:05:d8 actions=strip_vlan,set_tunnel:0x75b34,output:8
cookie=0x9f710356180d5375, duration=422.125s, table=20, n_packets=0, n_bytes=0, idle_age=422, prior...
[root@123-1449 ~]# ovs-ofctl dump-flows br-tun
NXST_FLOW reply (xid=0x4):
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
cookie=
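One way to spot affected tunnels in a br-tun dump is to compare the tunnel IDs matched in table 4 with the tunnel IDs set by flows in tables 20 and 22. The sketch below assumes one flow per line, and assumes table 20 holds unicast flows and table 22 holds flood flows (in the BEFORE dump, table 2 resubmits unicast destinations to 20 and multicast/broadcast to 22). It only looks at `set_tunnel` actions, so flows that set the tunnel ID another way are not counted. The function name and the sample lines are illustrative, not taken from a real tool.

```python
import re

TABLE_RE = re.compile(r"table=\s*(\d+)")
TUN_MATCH_RE = re.compile(r"tun_id=\s*(0x[0-9a-fA-F]+)")
SET_TUNNEL_RE = re.compile(r"set_tunnel:\s*(0x[0-9a-fA-F]+)")


def find_missing_tunnel_flows(lines):
    """Return tunnel IDs seen inbound (table 4) but lacking outbound flows.

    The first "table=" on a line is taken as the flow's own table, since
    it precedes the actions in dump-flows output.
    """
    inbound, unicast, flood = set(), set(), set()
    for line in lines:
        table = TABLE_RE.search(line)
        if not table:
            continue  # not a flow line (e.g. the NXST_FLOW header)
        number = int(table.group(1))
        if number == 4:
            inbound.update(TUN_MATCH_RE.findall(line))
        elif number == 20:
            unicast.update(SET_TUNNEL_RE.findall(line))
        elif number == 22:
            flood.update(SET_TUNNEL_RE.findall(line))
    return {
        "unicast": sorted(inbound - unicast),
        "flood": sorted(inbound - flood),
    }


# Shortened, illustrative flow lines (not a real capture).
sample = [
    "NXST_FLOW reply (xid=0x4):",
    "table=4, priority=1,tun_id=0x75b1d actions=mod_vlan_vid:4,resubmit(,10)",
    "table=4, priority=1,tun_id=0x75b34 actions=mod_vlan_vid:5,resubmit(,10)",
    "table=22, priority=1,dl_vlan=5 actions=strip_vlan,set_tunnel:0x75b34,output:7",
]
missing = find_missing_tunnel_flows(sample)
```

For this sample, `missing["unicast"]` lists both tunnels and `missing["flood"]` lists only `0x75b1d`, which mirrors the symptom described for tunnels 0x75b34 and 0x75b1d.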