Comment 2 for bug 1978088

liujinxin (scilla) wrote:

Hi Rodolfo:

About how the problem happened:
I restarted several hundred ovs-agents and l3-agents one after another, and soon the load on both MySQL and RabbitMQ rose sharply. The same API calls took much longer, and a large number of messages piled up in the message queue.

As long as the RPC sent via `_notification_host` to host.1 times out and the message is lost, and another host then triggers `_notification_fanout` with fdb_entries containing the FLOODING_ENTRY, the problem can be reproduced.
```
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/rpc.py #
    def add_fdb_entries(self, context, fdb_entries, host=None):
        if fdb_entries:
            if host:
                self._notification_host(context, 'add_fdb_entries',
                                        fdb_entries, host)
            else:
                self._notification_fanout(context, 'add_fdb_entries',
                                          fdb_entries)
```
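
Note that `_notification_host` delivers the entries as a one-way cast, so a message lost in a congested queue is never acknowledged or retried. A simplified sketch of that unicast path, paraphrased from the upstream rpc.py (logging omitted, details may differ between releases):
```
    def _notification_host(self, context, method, fdb_entries, host):
        # One-way cast to a single agent: there is no reply and no
        # delivery confirmation, so a message dropped under load is
        # silently lost and never resent.
        cctxt = self.client.prepare(topic=self.topic_l2pop_update,
                                    server=host)
        cctxt.cast(context, method, fdb_entries=fdb_entries)
```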

ovs-agent.log of host.1 (the ovs-agent logs no errors, but the following entries do appear):
```
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Cleaning stale br-tun flows
INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Reserved cookies for br-tun: ['0xd7bd3e42fd184ecb', '0x92b0e246d365bd08']
WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Deleting flow with cookie 0xae03f1819bf6e6d5
WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Deleting flow with cookie 0xae03f1819bf6e6d5
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Agent rpc_loop - iteration:0 - cleanup stale flows. Elapsed:3.764
```

```
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py #890
    def add_fdb_flow(self, br, port_info, remote_ip, lvm, ofport):
        if port_info == n_const.FLOODING_ENTRY:
            lvm.tun_ofports.add(ofport)
and
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py #1246
            refresh_tunnels = (self.iter_num == 0) or tunnels_missing
```

As shown in the code above, I think the problem is that the condition for refresh_tunnels is incomplete. tunnels_missing cannot guarantee that the add_fdb_entries RPC triggered by the update_port_up (with refresh_tunnels set) that host.1 sent during the `iter_num == 0` iteration, which should carry the full set of port information for the network, was actually received and processed by host.1's ovs-agent. Worse, a later fanout whose fdb_entries contain the FLOODING_ENTRY makes add_fdb_flow add that tunnel ofport anyway, so the tunnel port exists, tunnels_missing stays False, and the lost full sync is never requested again.
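
If that is the right reading, one possible direction is to keep requesting a refresh until the agent knows the full sync actually arrived. This is a minimal sketch only; `fdb_full_sync_confirmed` is an invented flag used to illustrate the missing part of the condition, not an existing attribute in neutron:
```
            # Hypothetical: stay in refresh mode until the full
            # add_fdb_entries sync requested at iter_num == 0 is
            # confirmed as received and processed.
            refresh_tunnels = ((self.iter_num == 0) or tunnels_missing
                               or not self.fdb_full_sync_confirmed)
```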