Hi Rodolfo:

About how the problem happened:
I restarted hundreds of ovs-agents and l3-agents one after another, and soon the load on both MySQL and RabbitMQ rose, which meant that the same API calls took longer and a lot of messages piled up in the message queue.
As long as the RPC sent through _notification_host(host=host.1) is lost to this timeout, and another host then triggers _notification_fanout (fdb_entries with FLOODING_ENTRY), the problem can be reproduced.
```
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/rpc.py

def add_fdb_entries(self, context, fdb_entries, host=None):
    if fdb_entries:
        if host:
            self._notification_host(context, 'add_fdb_entries',
                                    fdb_entries, host)
        else:
            self._notification_fanout(context, 'add_fdb_entries',
                                      fdb_entries)
```
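To make the failure mode concrete, here is a minimal, self-contained sketch (plain Python with hypothetical names, not Neutron code) of how a dropped host-targeted add_fdb_entries leaves the restarted agent with only the flooding entry:

```python
# Minimal model of l2pop FDB delivery (hypothetical names, not Neutron code).
# A restarted agent expects a host-targeted add_fdb_entries carrying the full
# port list; if that unicast RPC is lost, only fanout updates ever arrive.

FLOODING_ENTRY = ("00:00:00:00:00:00", "0.0.0.0")

class FakeAgent:
    def __init__(self):
        self.fdb = set()

    def add_fdb_entries(self, entries):
        self.fdb.update(entries)

def notify(agents, entries, host=None, lost_hosts=()):
    """Mimic rpc.py: unicast when host is set, otherwise fanout."""
    if host is not None:
        if host not in lost_hosts:        # the unicast may time out / be lost
            agents[host].add_fdb_entries(entries)
    else:
        for agent in agents.values():     # fanout reaches every agent
            agent.add_fdb_entries(entries)

agents = {"host.1": FakeAgent(), "host.2": FakeAgent()}

# The full sync targeted at the restarted host.1 is lost (RPC timeout):
notify(agents, {("fa:16:3e:aa:bb:cc", "10.0.0.5"), FLOODING_ENTRY},
       host="host.1", lost_hosts={"host.1"})
# Later, a port-up on another host triggers a fanout with FLOODING_ENTRY only:
notify(agents, {FLOODING_ENTRY})

# host.1 ends up with only the flooding entry; the full port list is gone.
```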
The ovs-agent logs no error, but the following entries appear in the ovs-agent.log of host.1:

```
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Cleaning stale br-tun flows
INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Reserved cookies for br-tun: ['0xd7bd3e42fd184ecb', '0x92b0e246d365bd08']
WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Deleting flow with cookie 0xae03f1819bf6e6d5
WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Deleting flow with cookie 0xae03f1819bf6e6d5
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Agent rpc_loop - iteration:0 - cleanup stale flows. Elapsed:3.764
```
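The "Deleting flow with cookie" lines matter because flows left over from the previous agent run carry an old cookie: at iteration 0, everything not re-installed under the newly reserved cookies is wiped. A minimal sketch of that cookie-based cleanup (illustrative only, not the ofswitch implementation):

```python
# Sketch of cookie-based stale-flow cleanup (illustrative, not ofswitch code).
# Cookies reserved by the current agent run, as in the log above:
reserved = {0xd7bd3e42fd184ecb, 0x92b0e246d365bd08}

flows = [
    {"cookie": 0xd7bd3e42fd184ecb, "match": "tun_id=0x10"},  # current run
    {"cookie": 0xae03f1819bf6e6d5, "match": "flood"},        # previous run
]

kept = [f for f in flows if f["cookie"] in reserved]
stale = [f for f in flows if f["cookie"] not in reserved]
# The stale flows are deleted.  If the add_fdb_entries full sync was lost,
# the flood flow is removed here and never re-installed, so traffic over
# the tunnels is black-holed until something else triggers a refresh.
```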
```
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py #890

def add_fdb_flow(self, br, port_info, remote_ip, lvm, ofport):
    if port_info == n_const.FLOODING_ENTRY:
        lvm.tun_ofports.add(ofport)
        ...

https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py #1246

refresh_tunnels = (self.iter_num == 0) or tunnels_missing
```
As in the code above, I think the problem is that the condition for refresh_tunnels is incomplete: the current tunnels_missing check cannot guarantee that the update_port_up with refresh_tunnels, sent by host.1 in the `iter_num == 0` stage, was received and processed correctly — that is, whether the add_fdb_entries RPC that should carry the full set of port information for that network successfully made it back to host.1.
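One possible direction, sketched under assumptions (the attribute names are hypothetical; this is not an actual Neutron patch): keep requesting a full FDB refresh until a full sync has actually been processed, instead of only on the first iteration:

```python
# Hypothetical sketch of a more robust refresh_tunnels condition
# (attribute names are illustrative; this is not an actual Neutron patch).
class AgentLoopState:
    def __init__(self):
        self.iter_num = 0
        self.fdb_synced = False   # would be set True only after the
                                  # host-targeted add_fdb_entries full
                                  # sync has been processed

    def refresh_tunnels(self, tunnels_missing):
        # Current upstream condition: (self.iter_num == 0) or tunnels_missing.
        # Sketch: keep asking for a refresh until a full sync was confirmed,
        # so a unicast RPC lost during iteration 0 is retried later.
        return (not self.fdb_synced) or tunnels_missing

state = AgentLoopState()
state.refresh_tunnels(tunnels_missing=False)   # True: still unsynced, retry
state.fdb_synced = True
state.refresh_tunnels(tunnels_missing=False)   # False: sync confirmed
```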