Flows on br-int are not deleted completely

Bug #1912450 reported by uchenily
This bug affects 1 person
Affects: neutron
Status: New
Importance: High
Assigned to: Unassigned

Bug Description

Hello, in our environments we found that if we migrate (cold or live migration) or resize VMs, stale flows may be left behind. We can't reproduce it 100% of the time, but adding a sleep before 'get_vifs_by_ids' (https://opendev.org/openstack/neutron/src/tag/17.0.0/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L1911-L1912) makes it easier.

```
devices = devices_details_list.get('devices')
time.sleep(5) # make it easier to reproduce the problem
vif_by_id = self.int_br.get_vifs_by_ids(
```

Steps to reproduce the problem:

1. create a vm on compute01
openstack server create --image cirros --flavor m1.tiny --network vxlan9999 --wait --availability-zone :compute01 test20210120

2. Get the VM's interface info; here we need the port ID prefix 'c25976f2-ae' and the MAC address
nova interface-list test20210120
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| ACTIVE | c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 | 5ea08661-ab6b-45dd-aba5-42346cb6ae70 | 172.16.0.25 | fa:16:3e:27:b5:63 |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+

3. Dump the flows on compute01; we can grep the result by MAC and in_port
ovs-ofctl dump-flows br-int --no-stats | grep -E "fa:16:3e:27:b5:63|in_port=3746"
 cookie=0x5b39cece80919832, priority=10,arp,in_port=3746 actions=resubmit(,24)
 cookie=0x5b39cece80919832, priority=10,icmp6,in_port=3746,icmp_type=136 actions=resubmit(,24)
 cookie=0x5b39cece80919832, priority=9,in_port=3746 actions=resubmit(,25)
 cookie=0x5b39cece80919832, table=1, priority=20,dl_vlan=39,dl_dst=fa:16:3e:27:b5:63 actions=mod_dl_src:fa:16:3e:73:64:50,resubmit(,60)
 cookie=0x5b39cece80919832, table=24, priority=2,icmp6,in_port=3746,icmp_type=136,nd_target=fe80::f816:3eff:fe27:b563 actions=resubmit(,60)
 cookie=0x5b39cece80919832, table=24, priority=2,arp,in_port=3746,arp_spa=172.16.0.25 actions=resubmit(,25)
 cookie=0x5b39cece80919832, table=25, priority=2,in_port=3746,dl_src=fa:16:3e:27:b5:63 actions=resubmit(,60)
 cookie=0x5b39cece80919832, table=60, priority=20,dl_vlan=39,dl_dst=fa:16:3e:27:b5:63 actions=strip_vlan,output:3746
 cookie=0x5b39cece80919832, table=60, priority=9,in_port=3746,dl_src=fa:16:3e:27:b5:63 actions=load:0x27->NXM_NX_REG6[],resubmit(,61)
 cookie=0x5b39cece80919832, table=61, priority=12,dl_vlan=39,dl_dst=fa:16:3e:27:b5:63 actions=strip_vlan,output:3746
 cookie=0x5b39cece80919832, table=61, priority=12,reg6=0x27,dl_dst=fa:16:3e:27:b5:63 actions=output:3746
 cookie=0x5b39cece80919832, table=62, priority=10,in_port=3746,dl_src=fa:16:3e:27:b5:63,dl_dst=00:00:00:00:00:00/01:00:00:00:00:00 actions=mod_vlan_vid:39,output:3

4. Migrate this VM; you can use live migration or cold migration
openstack server migrate --live-migration --host compute02 test20210120 --wait --os-compute-api-version 2.30

5. After the migration, dump the flows again
ovs-ofctl dump-flows br-int --no-stats | grep -E "fa:16:3e:27:b5:63|in_port=3746"
 cookie=0x5b39cece80919832, table=1, priority=20,dl_vlan=39,dl_dst=fa:16:3e:27:b5:63 actions=mod_dl_src:fa:16:3e:73:64:50,resubmit(,60)
 cookie=0x5b39cece80919832, table=60, priority=20,dl_vlan=39,dl_dst=fa:16:3e:27:b5:63 actions=strip_vlan,output:3746
 cookie=0x5b39cece80919832, table=61, priority=12,dl_vlan=39,dl_dst=fa:16:3e:27:b5:63 actions=strip_vlan,output:3746
 cookie=0x5b39cece80919832, table=61, priority=12,reg6=0x27,dl_dst=fa:16:3e:27:b5:63 actions=output:3746

Here we found that tables 1, 60 and 61 have stale flows. The other flows were deleted because they match in_port=3746 and that device has been removed from br-int. You may need to repeat the migration several times to reproduce this.
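As an illustration of why only some flows survive, this small hypothetical Python helper (not part of neutron) splits the dump into flows that match on in_port, which are cleaned up when the port is removed from br-int, and the remaining flows, which are the ones left stale:

```python
# Hypothetical helper, not neutron code: classify the dumped flows above.
# Flows matching in_port=3746 are removed along with the port; flows that
# match only dl_dst (tables 1, 60, 61) survive as stale entries.
STALE_PORT = "in_port=3746"

def classify_flows(dump_lines):
    cleaned, stale = [], []
    for line in dump_lines:
        line = line.strip()
        if not line:
            continue
        (cleaned if STALE_PORT in line else stale).append(line)
    return cleaned, stale

flows = [
    "priority=10,arp,in_port=3746 actions=resubmit(,24)",
    "table=1, priority=20,dl_vlan=39,dl_dst=fa:16:3e:27:b5:63 actions=resubmit(,60)",
    "table=61, priority=12,reg6=0x27,dl_dst=fa:16:3e:27:b5:63 actions=output:3746",
]
cleaned, stale = classify_flows(flows)
print(len(cleaned), len(stale))  # 1 2
```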

# Here are the ovs-agent logs; I have removed unimportant port information from the 'current' list.
2021-01-20 06:13:54.292 1 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Starting to process devices in:{'added': set(), 'removed': set(), 'current': {'c25976f2-aea4-4b78-9a8f-e4ec2b1373e3', 'aaa6f513-5de3-4be5-a468-d49e3b15856f'}, 'updated': {'c25976f2-aea4-4b78-9a8f-e4ec2b1373e3'}} rpc_loop /var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2424
2021-01-20 06:13:59.329 1 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Processing port: c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 treat_devices_added_or_updated /var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1693
2021-01-20 06:13:59.329 1 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Port c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 updated. Details: {'device': 'c25976f2-aea4-4b78-9a8f-e4ec2b1373e3', 'network_id': '5ea08661-ab6b-45dd-aba5-42346cb6ae70', 'port_id': 'c25976f2-aea4-4b78-9a8f-e4ec2b1373e3', 'mac_address': 'fa:16:3e:27:b5:63', 'admin_state_up': True, 'network_type': 'vxlan', 'segmentation_id': 9999, 'physical_network': None, 'fixed_ips': [{'subnet_id': 'fd3edeff-604b-462e-a0e0-d4a3c24ee7cc', 'ip_address': '172.16.0.25'}], 'device_owner': 'compute:cn-north-3a', 'allowed_address_pairs': [], 'port_security_enabled': True, 'qos_policy_id': None, 'network_qos_policy_id': None, 'profile': {'migrating_to': 'compute03'}, 'security_groups': ['0a13b7db-62fa-4dd0-ab03-071ab1967bfe']}
2021-01-20 06:13:59.330 1 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_dvr_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] DVR: Plumbing compute port c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 _bind_port_on_dvr_subnet /var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py:472
2021-01-20 06:14:01.477 1 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Setting status for c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 to UP _bind_devices /var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:967
2021-01-20 06:14:02.404 1 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Configuration for devices up ['c25976f2-aea4-4b78-9a8f-e4ec2b1373e3'] and devices down [] completed.
2021-01-20 06:14:02.448 1 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Starting to process devices in:{'added': set(), 'removed': set(), 'current': {'c25976f2-aea4-4b78-9a8f-e4ec2b1373e3', 'aaa6f513-5de3-4be5-a468-d49e3b15856f'}, 'updated': {'c25976f2-aea4-4b78-9a8f-e4ec2b1373e3'}} rpc_loop /var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:2424
2021-01-20 06:14:07.485 1 DEBUG neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Processing port: c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 treat_devices_added_or_updated /var/lib/openstack/lib/python3.6/site-packages/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py:1693
2021-01-20 06:14:07.486 1 WARNING neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Port c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 was not found on the integration bridge and will therefore not be processed
2021-01-20 06:14:08.933 1 INFO neutron.plugins.ml2.drivers.openvswitch.agent.ovs_neutron_agent [req-6e6dde50-1636-45ec-a8ef-df1e15744209 - - - - -] Ports {'c25976f2-aea4-4b78-9a8f-e4ec2b1373e3'} skipped, changing status to down

Revision history for this message
uchenily (uchenily) wrote :

Port c25976f2-aea4-4b78-9a8f-e4ec2b1373e3 was not found on the integration bridge and will therefore not be processed

This log shows that when we process the port update, the device qvoc25976f2-ae has already been deleted, so the self.int_br.get_vifs_by_ids function will not find it and the port is treated as one of the 'skipped_devices'.

From the definition of the `treat_devices_skipped` function (https://opendev.org/openstack/neutron/src/tag/17.0.0/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py#L2030-L2039), we can see that it doesn't delete any flows; this is the root cause of the problem.

As a comparison, the treat_devices_removed function does delete flows when it executes the self.port_unbound method.
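To make that difference concrete, here is a minimal self-contained Python model of the two code paths (the class and method bodies are hypothetical simplifications, not neutron's actual implementation):

```python
# Illustrative model (hypothetical classes, not neutron code) of why skipped
# devices leave stale flows while removed devices do not.

class FakeBridge:
    """Stands in for br-int: a set of flows keyed by their match string."""
    def __init__(self):
        self.flows = set()

    def add_flow(self, match):
        self.flows.add(match)

    def delete_flows(self, substring):
        # Delete every flow whose match contains the substring, mimicking a
        # delete by MAC address.
        self.flows = {f for f in self.flows if substring not in f}

class FakeAgent:
    def __init__(self, bridge):
        self.int_br = bridge

    def port_unbound(self, mac):
        # treat_devices_removed reaches port_unbound, which removes the
        # port's flows (simplified to "delete flows matching the MAC").
        self.int_br.delete_flows(mac)

    def treat_devices_removed(self, mac):
        self.port_unbound(mac)

    def treat_devices_skipped(self, mac):
        # Mirrors the reported behavior: only the port status is updated on
        # the server side; no flow cleanup happens, hence the stale flows.
        pass

bridge = FakeBridge()
bridge.add_flow("table=1,dl_dst=fa:16:3e:27:b5:63")
bridge.add_flow("table=61,reg6=0x27,dl_dst=fa:16:3e:27:b5:63")

agent = FakeAgent(bridge)
agent.treat_devices_skipped("fa:16:3e:27:b5:63")
print(len(bridge.flows))  # 2 -- stale flows remain

agent.treat_devices_removed("fa:16:3e:27:b5:63")
print(len(bridge.flows))  # 0 -- port_unbound cleaned them up
```

In the real agent, port_unbound delegates the actual flow removal to the bridge code; the point here is only that treat_devices_skipped never reaches that path.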

Revision history for this message
Bence Romsics (bence-romsics) wrote :

This needs quite a complex environment to reproduce, so let me ask a few questions first.

It seems to me you have an ovs dvr environment with securitygroup.firewall_driver=iptables_hybrid and you're running version 17.0.0 (victoria). Is that all correct? If anything is different in your environment please let us know.

How often does the error occur with and without the sleep?

Is there any traffic broken by the stale flows? The stale flows listed above are all specific to a destination MAC, so I guess they cannot break any other traffic.

One suggestion, which may or may not be related to the problem: since we have the 'openvswitch' firewall driver, there's hardly any reason to use the older and less performant 'iptables_hybrid' driver, which adds some extra complexity. Even independently of this bug, it may be worth switching to the 'openvswitch' firewall driver. Can you reproduce the bug with the other driver?

Also, it seems to me we may have had an earlier bug in which the error was not the same, but the error conditions were just like here:
https://bugs.launchpad.net/neutron/+bug/1737892

Revision history for this message
uchenily (uchenily) wrote :

Hi Bence, thanks for your suggestions; we are actively considering using the openvswitch firewall driver.

Yes, this is a DVR environment and we use iptables_hybrid as the firewall driver, but the neutron version is Rocky (13.0.6).

In fact, it rarely happens before the `sleep` call is added, but the probability is greater than 50% once it is. In my opinion, it's related to the load on the host.

There is a broken-traffic problem because of the flow in table=61 that matches the reg6 and dl_dst fields (this flow is added for non-openflow firewall drivers: https://review.opendev.org/738551).

# test20210121 is another vm in compute01
$ nova interface-list test20210121
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| Port State | Port ID | Net ID | IP addresses | MAC Addr |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+
| ACTIVE | 8d971421-24a8-41ed-a4d8-df32e1218c0e | 5ea08661-ab6b-45dd-aba5-42346cb6ae70 | 172.16.0.10 | fa:16:3e:2c:56:3f |
+------------+--------------------------------------+--------------------------------------+--------------+-------------------+

root@compute01:~# ovs-appctl ofproto/trace br-int in_port=qvo8d971421-24,dl_src=fa:16:3e:2c:56:3f,dl_dst=fa:16:3e:27:b5:63 -generate
Flow: in_port=3838,vlan_tci=0x0000,dl_src=fa:16:3e:2c:56:3f,dl_dst=fa:16:3e:27:b5:63,dl_type=0x0000

bridge("br-int")
----------------
 0. in_port=3838, priority 9, cookie 0x64d2308a7b0a735a
    goto_table:25
25. in_port=3838,dl_src=fa:16:3e:2c:56:3f, priority 2, cookie 0x64d2308a7b0a735a
    goto_table:60
60. in_port=3838,dl_src=fa:16:3e:2c:56:3f, priority 9, cookie 0x64d2308a7b0a735a
    set_field:0x27->reg6
    resubmit(,61)
61. reg6=0x27,dl_dst=fa:16:3e:27:b5:63, priority 12, cookie 0x64d2308a7b0a735a
    output:3839
     >> Nonexistent output port

Final flow: reg6=0x27,in_port=3838,vlan_tci=0x0000,dl_src=fa:16:3e:2c:56:3f,dl_dst=fa:16:3e:27:b5:63,dl_type=0x0000
Megaflow: recirc_id=0,eth,in_port=3838,vlan_tci=0x0000/0x1fff,dl_src=fa:16:3e:2c:56:3f,dl_dst=fa:16:3e:27:b5:63,dl_type=0x0000
Datapath actions: 108

Revision history for this message
uchenily (uchenily) wrote :

Yes, this bug is related to https://bugs.launchpad.net/neutron/+bug/1737892. The patch https://review.openstack.org/574885 deleted the port resource for L2 agent extensions but didn't delete any flows.

Revision history for this message
Bence Romsics (bence-romsics) wrote :

On master I'm not able to reproduce this.

I built a multinode devstack:
* devstack0 all-in-one
* devstack0a compute-only

On devstack0a I set the fw driver to iptables_hybrid, added the sleep(5), restarted neutron-agent.

Created a vm:
openstack server create --image cirros-0.5.1-x86_64-disk --flavor m1.tiny --network private --wait --availability-zone :devstack0 vm0

Then moved it back and forth between the computes with 2 resizes:
# 0 -> 0a
openstack server show vm0 -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname
openstack server resize vm0 --flavor cirros256 --wait
openstack server resize confirm vm0
openstack server show vm0 -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname

# 0 <- 0a
openstack server show vm0 -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname
openstack server resize vm0 --flavor m1.tiny --wait
openstack server resize confirm vm0
openstack server show vm0 -f value -c OS-EXT-SRV-ATTR:hypervisor_hostname

After each move from devstack0a to devstack0 I checked for stale flows on br-int in devstack0a by looking at this number (and whether it changed):
sudo ovs-ofctl dump-flows br-int --no-stats | wc -l

However in 10 tries I did not see any stale flows.
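As a side note, counting lines with wc -l can miss a stale flow if another flow disappears at the same time; a hypothetical, more precise check (not part of the steps above) is to diff the two dumps:

```python
# Hypothetical helper: compare an 'ovs-ofctl dump-flows br-int --no-stats'
# dump taken before the VM existed with one taken after it was migrated
# away. Any line present only in the "after" dump is a stale-flow candidate.
def stale_flow_candidates(before_dump, after_dump):
    before = set(before_dump.splitlines())
    return [line for line in after_dump.splitlines()
            if line and line not in before]

before = "table=0, priority=0 actions=resubmit(,60)\n"
after = ("table=0, priority=0 actions=resubmit(,60)\n"
         "table=61, priority=12,reg6=0x27,dl_dst=fa:16:3e:27:b5:63 "
         "actions=output:3746\n")
print(stale_flow_candidates(after_dump=after, before_dump=before))
```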

Out of curiosity, I also did this 10 times with the ovs firewall driver and could not find stale flows that way either.

This suggests that either the bug is not present on master, or I'm missing something about how to reproduce it. Can you reproduce this on versions newer than Rocky?

Changed in neutron:
status: New → Incomplete
Revision history for this message
uchenily (uchenily) wrote :

Hi, I noticed that you resized this VM and then confirmed it. If the port update completed before the confirmation was performed, the reproduction wouldn't work.

Automatic confirmation(https://docs.openstack.org/nova/latest/admin/configuration/resize.html#automatic-confirm) will be helpful.

I suggest using live migration instead, or increasing the sleep time, to make sure the port update and port delete happen in the same OVS agent rpc loop cycle.

Revision history for this message
uchenily (uchenily) wrote :

I've submitted a patch; does anyone have any suggestions?
https://review.opendev.org/c/openstack/neutron/+/771903

Revision history for this message
Bence Romsics (bence-romsics) wrote :

I think you're getting better feedback in the code review from Oleg and Rodolfo than what I'm able to give here, so let's continue there.

Changed in neutron:
status: Incomplete → In Progress
importance: Undecided → High
Revision history for this message
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
status: In Progress → New
tags: added: timeout-abandon
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/771903
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Rodolfo Alonso <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/784554
Reason: Feel free to address the comments, rebase the patch and restore it.
