After ovs-agent restart, table=21 and table=22 OpenFlow flows on br-tun are missing

Bug #1978088 reported by liujinxin
This bug affects 2 people
Affects: neutron
Status: In Progress
Importance: Undecided
Assigned to: Arnaud Morin

Bug Description

ENV: stable/victoria

In the following scenario (especially at large scale, when many ovs-agents are restarted at the same time), OpenFlow flows go missing and are never recovered automatically.

As a simple example, restart two ovs-agents at the same time:
```
network.local_ip=30.0.1.6,output="vxlan-1e000106"
compute1.local_ip=30.0.1.7,output="vxlan-1e000107"
compute2.local_ip=30.0.1.8,output="vxlan-1e000108"

network.port=('192.168.1.2')
compute1.port=('192.168.1.11')
compute2.port=('192.168.1.141')

// iter_num=0 of compute1
DEBUG neutron.plugins.ml2.db [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] For port cb7fad87-7dc7-4008-a349-3a17e3b8be71, host compute1, got binding levels [PortBindingLevel(driver='openvswitch',host='compute1',level=0,port_id=cb7fad87-7dc7-4008-a349-3a17e3b8be71,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] host: compute1, agent_active_ports: 3, refresh_tunnels: True update_port_up

// rpc-1
Notify l2population agent compute1 at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.6': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:db:75:11', ip_address='192.168.1.2')], '30.0.1.8': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:45:eb:6a', ip_address='192.168.1.141')]}}} _notification_host

// rpc-2
Fanout notify l2population agents at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.7': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:21:34:43', ip_address='192.168.1.11')]}}} _notification_fanout

// iter_num>0 of compute1
DEBUG neutron.plugins.ml2.db [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] For port cb7fad87-7dc7-4008-a349-3a17e3b8be71, host compute1, got binding levels [PortBindingLevel(driver='openvswitch',host='compute1',level=0,port_id=cb7fad87-7dc7-4008-a349-3a17e3b8be71,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
2022-06-09 17:45:39.546 833566 DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-f8093da8-9f1a-4da2-a27f-03f1b4d50dfd - - - - -] host: compute1, agent_active_ports: 3, refresh_tunnels: False update_port_up

...

// iter_num=0 of compute2
DEBUG neutron.plugins.ml2.db [req-2e977b20-4438-4928-85bb-59de4c7389f6 - - - - -] For port ccca9701-19c0-4590-92d0-5fbd909d4eeb, host compute2, got binding levels [PortBindingLevel(driver='openvswitch',host='compute2',level=0,port_id=ccca9701-19c0-4590-92d0-5fbd909d4eeb,segment=NetworkSegment(0bcd776d-92cd-4d96-9e54-92350700c4ca),segment_id=0bcd776d-92cd-4d96-9e54-92350700c4ca)] get_binding_level_objs /usr/lib/python3.6/site-packages/neutron/plugins/ml2/db.py:78
DEBUG neutron.plugins.ml2.drivers.l2pop.mech_driver [req-2e977b20-4438-4928-85bb-59de4c7389f6 - - - - -] host: compute2, agent_active_ports: 3, refresh_tunnels: True update_port_up

// rpc-3
Notify l2population agent compute2 at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.6': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:db:75:11', ip_address='192.168.1.2')], '30.0.1.7': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:21:34:43', ip_address='192.168.1.11')]}}} _notification_host

// rpc-4
Fanout notify l2population agents at q-agent-notifier the message add_fdb_entries with {'8883e077-aadb-4b79-9315-3c029e94a857': {'segment_id': 22, 'network_type': 'vxlan', 'ports': {'30.0.1.8': [('00:00:00:00:00:00', '0.0.0.0'), PortInfo(mac_address='fa:16:3e:45:eb:6a', ip_address='192.168.1.141')]}}} _notification_fanout

```

1. After iter_num=0, cleanup_stale_flows removes the stale flows from table=21 and table=22.
2. If compute1 receives rpc-4 (the fanout) first, tunnels_missing stays False.
3. rpc-1 times out and is never received by compute1.
4. As a result, the table=22,priority=1 flood flow is missing output:"vxlan-1e000106", and table=21,priority=1 is missing the ARP responder entry for 192.168.1.2.
5. The missing flows are never re-added, so VMs on this network cannot communicate at layer 2 with the ports on the network node (a toy model of this race is sketched below).
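
To make the failure mode concrete, here is a small, self-contained toy model of the race (made-up class and variable names, not Neutron code): when only the fanout message (rpc-4) is applied and the host-targeted message (rpc-1) is lost, compute1 never gets the flood output and ARP responder for the network node.

```
# Toy model of the race above (made-up names, not Neutron code).
FLOODING_ENTRY = ('00:00:00:00:00:00', '0.0.0.0')

class FakeAgent:
    def __init__(self):
        self.tun_ofports = set()     # outputs referenced by the table=22 flood flow
        self.arp_responders = set()  # IPs answered by table=21

    def add_fdb_entries(self, fdb):
        for remote_ip, entries in fdb['ports'].items():
            for entry in entries:
                if tuple(entry) == FLOODING_ENTRY:
                    self.tun_ofports.add('vxlan-%s' % remote_ip)
                else:
                    mac, ip = entry
                    self.arp_responders.add(ip)

compute1 = FakeAgent()

# rpc-1 (host-targeted, full FDB including 30.0.1.6 = network node) times out
# and is never applied.  Only rpc-4 (fanout, compute2's port) arrives:
compute1.add_fdb_entries(
    {'ports': {'30.0.1.8': [FLOODING_ENTRY,
                            ('fa:16:3e:45:eb:6a', '192.168.1.141')]}})

print(compute1.tun_ofports)                      # {'vxlan-30.0.1.8'}: no output to 30.0.1.6
print('192.168.1.2' in compute1.arp_responders)  # False: no ARP responder for the network node port
```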

Tags: ovs
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Liu:

I understand what the problem is but I don't know how it is happening. Can you describe what the OVS agent receives, in what order, and what actions it takes? This is not clear to me from your description.

Can you also post the full logs of the OVS agents including the timestamps? If possible, of course.

Regards.

tags: added: ovs
Revision history for this message
liujinxin (scilla) wrote :

Hi Rodolfo:

About how the problem happened:
I restarted hundreds of ovs-agents and l3-agents one after another, and the load on MySQL and RabbitMQ soon went up, which meant that the same calls took more time and a lot of messages piled up in the message queue.

As long as the _notification_host(host=host.1) RPC times out and is lost, and another host triggers _notification_fanout (fdb_entries with the FLOODING_ENTRY), the problem can be reproduced:
```
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/l2pop/rpc.py #
    def add_fdb_entries(self, context, fdb_entries, host=None):
        if fdb_entries:
            if host:
                self._notification_host(context, 'add_fdb_entries',
                                        fdb_entries, host)
            else:
                self._notification_fanout(context, 'add_fdb_entries',
                                          fdb_entries)
```

```ovs-agent.log of host.1
The ovs-agent does not log any error, but the following lines can be seen:
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Cleaning stale br-tun flows
INFO neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Reserved cookies for br-tun: ['0xd7bd3e42fd184ecb', '0x92b0e246d365bd08']
WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Deleting flow with cookie 0xae03f1819bf6e6d5
WARNING neutron.plugins.ml2.drivers.openvswitch.agent.openflow.native.ofswitch [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Deleting flow with cookie 0xae03f1819bf6e6d5
INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-5aa9dccf-4847-42a5-aae0-1a31d444be8d - - - - -] Agent rpc_loop - iteration:0 - cleanup stale flows. Elapsed:3.764
```
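
For context, the "Reserved cookies" / "Deleting flow with cookie" lines above come from the agent's cookie-based stale-flow cleanup at the end of iteration 0. A minimal sketch of the idea, illustrative only and not the actual ofswitch.py implementation:

```
# Illustrative sketch of cookie-based cleanup (not the real ofswitch.py code):
# any br-tun flow whose cookie is not in the reserved set is considered stale
# and deleted -- including l2pop flows that have not been re-added yet.
reserved_cookies = {0xd7bd3e42fd184ecb, 0x92b0e246d365bd08}

existing_flows = [
    {'cookie': 0xd7bd3e42fd184ecb, 'table': 0,  'match': 'in_port="patch-int"'},
    {'cookie': 0xae03f1819bf6e6d5, 'table': 22, 'match': 'dl_vlan=1'},                # old flood flow
    {'cookie': 0xae03f1819bf6e6d5, 'table': 21, 'match': 'arp,arp_tpa=192.168.1.2'},  # old ARP responder
]

for flow in existing_flows:
    if flow['cookie'] not in reserved_cookies:
        print('Deleting flow with cookie %#x (table=%d)' % (flow['cookie'], flow['table']))
# If the add_fdb_entries RPC that should recreate the table=21/22 flows never
# arrives, nothing puts them back after this cleanup.
```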

```
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py #890
    def add_fdb_flow(self, br, port_info, remote_ip, lvm, ofport):
        if port_info == n_const.FLOODING_ENTRY:
            lvm.tun_ofports.add(ofport)
and
https://github.com/openstack/neutron/blob/master/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py #1246
            refresh_tunnels = (self.iter_num == 0) or tunnels_missing
```

As in the code above, I think the problem is that the condition for refresh_tunnels is incomplete. tunnels_missing cannot tell whether the add_fdb_entries RPC carrying the full set of port information for the network, triggered by the update_port_up that host.1 sends during the `iter_num == 0` stage, was actually received and processed by the ovs-agent (see the toy timeline below).
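
A tiny, self-contained timeline (hypothetical helper names, not Neutron code) of why the loss is permanent under the current condition:

```
# Toy timeline (hypothetical names, not Neutron code) of the argument above:
# refresh_tunnels is only True in the first rpc_loop iteration (or when a
# tunnel port is found missing), so if the full-FDB add_fdb_entries triggered
# during iteration 0 is lost, no later iteration ever requests it again.
def refresh_tunnels(iter_num, tunnels_missing):
    # mirrors: refresh_tunnels = (self.iter_num == 0) or tunnels_missing
    return iter_num == 0 or tunnels_missing

full_fdb_applied = False   # rpc-1 timed out, the full FDB never reached the agent
tunnels_missing = False    # rpc-4 (fanout) already created the tunnel port, so
                           # no tunnel is ever reported missing afterwards

for iter_num in range(4):
    if refresh_tunnels(iter_num, tunnels_missing):
        print('iter %d: agent asks the server for a full FDB refresh' % iter_num)
    else:
        print('iter %d: no refresh requested; table=21/22 flows still missing: %s'
              % (iter_num, not full_fdb_applied))
```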

Revision history for this message
LIU Yulong (dragon889) wrote :

I guess you have hit the Neutron scale issue. Bug https://bugs.launchpad.net/neutron/+bug/1813703 is a summary that includes many issues seen when you have many compute nodes or many resources on one host. From my personal experience, the problem can be worked around by restarting the ovs-agent and L3-agent on fewer than 20 hosts at a time. The neutron-server side has many time-consuming DB queries, which should be the main cause of this. So please try restarting agents on no more than 20 hosts at once during your agent-restart procedure.
Another point is that I'm not sure why the L2 pop notifications go missing or arrive out of order during the ovs-agent restart. RPC loop > 0 will not start until the neutron-server side has sent all the l2pop information, because update_device_list is an RPC "call" that returns the information about devices_up, devices_down and so on, and notify_l2pop_port_wiring is invoked inside it. So maybe check why the DB query lost the required information.

liujinxin (scilla)
description: updated
Revision history for this message
liujinxin (scilla) wrote :

ENV: stable/victoria

We can get around this problem by only restarting a few ovs-agents at a time, but the problem still exists when a large number of ovs-agents are restarted at the same time.

Revision history for this message
Arnaud Morin (arnaud-morin) wrote (last edit ):

I am affected by this as well (running a Stein ovs-agent with the native OpenFlow interface at large scale).

When the agents are restarted, they receive the RPC messages that eventually call the add_fdb_flow function.

BUT the first RPC loop (iteration 0) may finish before all the add_fdb_flow calls have been executed.

The problematic flows are the "flood" flows (table 22), whose entries are sent using "cast", not "call", on the Neutron RPC side [1].

[1] https://github.com/openstack/neutron/blob/5a748b7c29cb846d0de3c5e8ea3868873348c0d1/neutron/plugins/ml2/drivers/l2pop/rpc.py#L48
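
To make the cast/call distinction concrete, here is a minimal generic oslo.messaging sketch (not the actual l2pop notifier; topic, context and arguments are made up, and a reachable transport_url is assumed): cast() is fire-and-forget, so a dropped message is invisible to the sender, while call() blocks for a reply and at least surfaces a MessagingTimeout.

```
# Generic oslo.messaging illustration of "cast" vs "call" (made-up topic and
# arguments; assumes a configured/reachable transport_url).
import oslo_messaging
from oslo_config import cfg

transport = oslo_messaging.get_rpc_transport(cfg.CONF)
target = oslo_messaging.Target(topic='example-l2pop-topic', version='1.0')
client = oslo_messaging.RPCClient(transport, target)
ctxt = {}

# cast(): fire-and-forget -- no reply, no error if the message is lost or the
# agent is too overloaded to process it in time.
client.prepare(fanout=True).cast(ctxt, 'add_fdb_entries', fdb_entries={})

# call(): blocks until the server returns (or raises MessagingTimeout), so a
# lost message is at least noticed by the caller.
# result = client.prepare(server='compute1').call(ctxt, 'example_method', arg=1)
```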

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/860270

Changed in neutron:
assignee: nobody → Arnaud Morin (arnaud-morin)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/866589

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Arnaud Morin <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/866589
Reason: merged in previous patchset

liujinxin (scilla)
Changed in neutron:
status: New → In Progress