ENV: stable/victoria
In the following scenario (especially in large-scale deployments, when many ovs-agents are restarted at the same time), OpenFlow flows go missing and are never recovered automatically.
As a simple example, restart two ovs-agents at the same time:
```
network.local_ip=30.0.1.6,output="vxlan-1e000106"
compute1.local_ip=30.0.1.7,output="vxlan-1e000107"
compute2.local_ip=30.0.1.8,output="vxlan-1e000108"
network.port=('192.168.1.2')
compute1.port=('192.168.1.11')
compute2.port=('192.168.1.141')
// iter_num=0 of compute1
DEBUG neutron.plugins.ml2.db [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] For port cb7fad87-7dc7-4008-a349-3a17e3b8be71, host compute1, got binding levels [PortBindingLevel(driver='openvswitch',host='compute1',level=0,port_id=cb7fad87-7dc7-4008-a349-3a17e3b8be71,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] host: compute1, agent_active_ports: 3, refresh_tunnels: True update_port_up
// rpc-1
Notify l2population agent compute1 at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.6': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:db:75:11', ip_address='192.168.1.2')], '30.0.1.8': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:45:eb:6a', ip_address='192.168.1.141')]}}} _notification_host
// rpc-2
Fanout notify l2population agents at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.7': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:21:34:43', ip_address='192.168.1.11')]}}} _notification_fanout
// iter_num>0 of compute1
DEBUG neutron.plugins.ml2.db [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] For port cb7fad87-7dc7-4008-a349-3a17e3b8be71, host compute1, got binding levels [PortBindingLevel(driver='openvswitch',host='compute1',level=0,port_id=cb7fad87-7dc7-4008-a349-3a17e3b8be71,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
2022-06-09 17:45:39.546 833566 DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] host: compute1, agent_active_ports: 3, refresh_tunnels: False update_port_up
...
// iter_num=0 of compute2
DEBUG neutron.plugins.ml2.db [req-2e977b20-4438-4928-85bb-59de4c7389f6 - - - - -] For port ccca9701-19c0-4590-92d0-5fbd909d4eeb, host compute2, got binding levels [PortBindingLevel(driver='openvswitch',host='compute2',level=0,port_id=ccca9701-19c0-4590-92d0-5fbd909d4eeb,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-2e977b20-4438-4928-85bb-59de4c7389f6 - - - - -] host: compute2, agent_active_ports: 3, refresh_tunnels: True update_port_up
// rpc-3
Notify l2population agent compute2 at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.6': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:db:75:11', ip_address='192.168.1.2')], '30.0.1.7': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:21:34:43', ip_address='192.168.1.11')]}}} _notification_host
// rpc-4
Fanout notify l2population agents at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.8': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:45:eb:6a', ip_address='192.168.1.141')]}}} _notification_fanout
```
1. After iter_num=0, cleanup_stale_flows removes the stale OpenFlow flows from table=21 and table=22.
2. If compute1 receives rpc-4 first, tunnels_missing=False.
3. rpc-1 times out and is never received by compute1.
4. As a result, the table=22,priority=1 flow is missing output="vxlan-1e000106", and table=21,priority=1 is missing the ARP responder flow for 192.168.1.2.
5. The missing flows are never restored, so VMs on this network cannot communicate at layer 2 with VMs on the network node.
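To make the sequence above easier to follow, here is a minimal toy model of what happens on compute1. It is not Neutron code: `ToyAgent`, `simulate`, and the "old"/"new" cookie labels are hypothetical names, and the model assumes that cleanup_stale_flows keeps only flows re-installed since the restart. It does not model tunnels_missing or refresh_tunnels. The payloads are copied from rpc-1 and rpc-4 in the log.

```python
# Toy model of the race on compute1. All names here are hypothetical.
FLOODING_ENTRY = ("00:00:00:00:00:00", "0.0.0.0")

# Payloads taken from the log above (network 8883e077-..., segment 22).
RPC_1 = {
    "30.0.1.6": [FLOODING_ENTRY, ("fa:16:3e:db:75:11", "192.168.1.2")],
    "30.0.1.8": [FLOODING_ENTRY, ("fa:16:3e:45:eb:6a", "192.168.1.141")],
}
RPC_4 = {
    "30.0.1.8": [FLOODING_ENTRY, ("fa:16:3e:45:eb:6a", "192.168.1.141")],
}


class ToyAgent:
    def __init__(self, flood, arp):
        # Flows that survived the restart still carry the old cookie.
        self.flood = {ip: "old" for ip in flood}  # stands in for table=22
        self.arp = {ip: "old" for ip in arp}      # stands in for table=21

    def add_fdb_entries(self, ports):
        # Re-installing a flow stamps it with the new cookie.
        for tunnel_ip, entries in ports.items():
            for mac, ip in entries:
                if (mac, ip) == FLOODING_ENTRY:
                    self.flood[tunnel_ip] = "new"
                else:
                    self.arp[ip] = "new"

    def cleanup_stale_flows(self):
        # Assumption: only flows carrying the new cookie are kept.
        self.flood = {k: v for k, v in self.flood.items() if v == "new"}
        self.arp = {k: v for k, v in self.arp.items() if v == "new"}


def simulate(deliver_rpc1):
    agent = ToyAgent(
        flood=["30.0.1.6", "30.0.1.8"],
        arp=["192.168.1.2", "192.168.1.141"],
    )
    agent.add_fdb_entries(RPC_4)      # fanout from compute2 arrives first
    if deliver_rpc1:
        agent.add_fdb_entries(RPC_1)  # unicast to compute1; lost in the bug
    agent.cleanup_stale_flows()
    return set(agent.flood), set(agent.arp)
```

In this model, `simulate(False)` leaves only the 30.0.1.8 flood output and the 192.168.1.141 ARP entry, which matches step 4: the network node's tunnel (30.0.1.6) and its port (192.168.1.2) are gone. `simulate(True)` keeps both sets complete.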
Hi Liu:
I understand what the problem is, but I don't know how it is happening. Can you describe what the OVS agent receives, in what order, and the actions it takes? This is not clear to me from your description.
Can you also post the full logs of the OVS agents including the timestamps? If possible, of course.
Regards.