OVS: flow loop is created with openvswitch version 2.16

Bug #1969615 reported by Uwe Grawert
Affects: neutron | Status: Opinion | Importance: High | Assigned to: Unassigned

Bug Description

* Summary
neutron-openvswitch-agent causes a flow loop when running with Open vSwitch version 2.16.

* High level description
Running Neutron from the Xena release with the openvswitch plugin causes a flow loop when Open vSwitch version 2.16 is used. The problem does not occur when Open vSwitch version 2.15 is deployed.

* Pre-conditions
kolla-ansible based deployment using "source: ubuntu" from the stable/xena branch, with neutron_plugin_agent: "openvswitch". A 3-node cluster with basic OpenStack services is deployed.

* Version:
 ** OpenStack version: Xena
 ** Linux distro: kolla-ansible stable/xena, Ubuntu 20.04.4 LTS

* Step-by-step
1. Deploy OpenStack using kolla-ansible from the stable/xena branch
2. Create a project network/subnet for Octavia
3. Create Octavia health-manager ports in Neutron for the 3 control nodes
4. Create the ports on each control node as OVS bridge ports (see the sketch after this list)
5. Assign IP addresses to the o-hm0 interfaces on all 3 nodes
6. Try to ping one node from another node
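
For reference, steps 3-5 correspond roughly to the commands below (a sketch following the Octavia/OSISM documentation linked in a later comment, shown for ctl1; the MAC/IP values are the ones from the outputs below, <port-uuid> is a placeholder for the Neutron port ID, and the exact device-owner string is whatever the deployment uses):

# 3. Create the health-manager port in Neutron, bound to the control node
openstack port create --network lb-mgmt \
  --device-owner octavia:health-mgr \
  --host ctl1 \
  --mac-address fa:17:20:16:00:11 \
  --fixed-ip subnet=lb-mgmt,ip-address=172.16.0.11 \
  lb-mgmt-ctl1

# 4. Plug the port into br-int as an internal OVS interface named o-hm0
ovs-vsctl --may-exist add-port br-int o-hm0 \
  -- set Interface o-hm0 type=internal \
  -- set Interface o-hm0 external-ids:iface-id=<port-uuid> \
  -- set Interface o-hm0 external-ids:iface-status=active \
  -- set Interface o-hm0 external-ids:attached-mac=fa:17:20:16:00:11 \
  -- set Interface o-hm0 external-ids:skip_cleanup=true

# 5. Set the MAC and IP address on the interface
ip link set dev o-hm0 address fa:17:20:16:00:11
ip link set dev o-hm0 up
ip addr add 172.16.0.11/16 dev o-hm0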

ubuntu@ctl1:~$ openstack network show lb-mgmt
+---------------------------+--------------------------------------+
| Field | Value |
+---------------------------+--------------------------------------+
| admin_state_up | UP |
| availability_zone_hints | |
| availability_zones | nova |
| created_at | 2022-04-20T10:36:26Z |
| description | |
| dns_domain | None |
| id | c0c1b3ec-a6c3-4145-b94a-6c7fa4d7a740 |
| ipv4_address_scope | None |
| ipv6_address_scope | None |
| is_default | None |
| is_vlan_transparent | None |
| mtu | 1450 |
| name | lb-mgmt |
| port_security_enabled | True |
| project_id | 6cbb86e577a042499529110f6a1e8603 |
| provider:network_type | vxlan |
| provider:physical_network | None |
| provider:segmentation_id | 577 |
| qos_policy_id | None |
| revision_number | 2 |
| router:external | Internal |
| segments | None |
| shared | False |
| status | ACTIVE |
| subnets | bf004f5a-4cae-4277-a3f4-a4cf787033cb |
| tags | |
| updated_at | 2022-04-20T10:36:28Z |
+---------------------------+--------------------------------------+

ubuntu@ctl1:~$ openstack subnet show lb-mgmt
+----------------------+--------------------------------------+
| Field | Value |
+----------------------+--------------------------------------+
| allocation_pools | 172.16.1.1-172.16.255.254 |
| cidr | 172.16.0.0/16 |
| created_at | 2022-04-20T10:36:28Z |
| description | |
| dns_nameservers | |
| dns_publish_fixed_ip | None |
| enable_dhcp | True |
| gateway_ip | 172.16.0.1 |
| host_routes | |
| id | bf004f5a-4cae-4277-a3f4-a4cf787033cb |
| ip_version | 4 |
| ipv6_address_mode | None |
| ipv6_ra_mode | None |
| name | lb-mgmt |
| network_id | c0c1b3ec-a6c3-4145-b94a-6c7fa4d7a740 |
| project_id | 6cbb86e577a042499529110f6a1e8603 |
| revision_number | 0 |
| segment_id | None |
| service_types | |
| subnetpool_id | None |
| tags | |
| updated_at | 2022-04-20T10:36:28Z |
+----------------------+--------------------------------------+

openstack port list --device-owner octavia:health-mgrr
+--------------------------------------+--------------+-------------------+----------------------------------------------------------------------------+--------+
| ID | Name | MAC Address | Fixed IP Addresses | Status |
+--------------------------------------+--------------+-------------------+----------------------------------------------------------------------------+--------+
| b0c8a28b-b652-4dce-a1b2-4a81e74d74ad | lb-mgmt-ctl1 | fa:17:20:16:00:11 | ip_address='172.16.0.11', subnet_id='bf004f5a-4cae-4277-a3f4-a4cf787033cb' | ACTIVE |
| cfdb1171-21de-448b-a6b4-5c473e13ca12 | lb-mgmt-ctl3 | fa:17:20:16:00:13 | ip_address='172.16.0.13', subnet_id='bf004f5a-4cae-4277-a3f4-a4cf787033cb' | ACTIVE |
| ea66498a-edb9-415b-b49e-fb005b635d75 | lb-mgmt-ctl2 | fa:17:20:16:00:12 | ip_address='172.16.0.12', subnet_id='bf004f5a-4cae-4277-a3f4-a4cf787033cb' | ACTIVE |
+--------------------------------------+--------------+-------------------+----------------------------------------------------------------------------+--------+

ubuntu@ctl1:~$ ip a s o-hm0
10: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:17:20:16:00:11 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.11/16 brd 172.16.255.255 scope global o-hm0
       valid_lft forever preferred_lft forever
    inet6 fe80::f817:20ff:fe16:11/64 scope link
       valid_lft forever preferred_lft forever

ubuntu@ctl1:~$ ovs-vsctl --columns name,external-ids,ofport,error find Interface name='o-hm0'
name : o-hm0
external_ids : {attached-mac="fa:17:20:16:00:11", iface-id="b0c8a28b-b652-4dce-a1b2-4a81e74d74ad", iface-status=active, skip_cleanup="true"}
ofport : 4
error : []

ubuntu@ctl3:~$ ip a s o-hm0
9: o-hm0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:17:20:16:00:13 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.13/16 brd 172.16.255.255 scope global o-hm0
       valid_lft forever preferred_lft forever
    inet6 fe80::f817:20ff:fe16:13/64 scope link
       valid_lft forever preferred_lft forever

ubuntu@ctl3:~$ ovs-vsctl --columns name,external-ids,ofport,error find Interface name='o-hm0'
name : o-hm0
external_ids : {attached-mac="fa:17:20:16:00:13", iface-id="cfdb1171-21de-448b-a6b4-5c473e13ca12", iface-status=active, skip_cleanup="true"}
ofport : 3
error : []

ubuntu@ctl1:~$ ping 172.16.0.13
PING 172.16.0.13 (172.16.0.13) 56(84) bytes of data.
^C
--- 172.16.0.13 ping statistics ---
2 packets transmitted, 0 received, 100% packet loss, time 1025ms

ctl1:/var/log/kolla/openvswitch/ovs-vswitchd.log:

2022-04-20T10:38:14.051Z|00004|ofproto_dpif_xlate(handler1)|WARN|over max translation depth 64 on bridge br-int while processing ct_state=new|trk,ct_nw_src=172.16.0.11,ct_nw_dst=172.16.0.13,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,icmp,in_port=4,vlan_tci=0x0000,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_src=172.16.0.11,nw_dst=172.16.0.13,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
2022-04-20T10:38:14.051Z|00005|ofproto_dpif_upcall(handler1)|WARN|Flow: ct_state=new|trk,ct_nw_src=172.16.0.11,ct_nw_dst=172.16.0.13,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,icmp,in_port=6,vlan_tci=0x0000,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_src=172.16.0.11,nw_dst=172.16.0.13,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0

bridge("br-int")
----------------
 0. priority 0, cookie 0x557db8f9194e8693
    goto_table:60
60. priority 3, cookie 0x557db8f9194e8693
    NORMAL
     >>>> received packet on unknown port 6 <<<<
     >> no input bundle, dropping

Final flow: unchanged
Megaflow: recirc_id=0,eth,ip,in_port=6,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_frag=no
Datapath actions: drop
2022-04-20T10:38:47.322Z|00092|connmgr|INFO|br-tun<->tcp:127.0.0.1:6633: 10 flow_mods in the 2 s starting 53 s ago (10 adds)
2022-04-20T10:39:12.418Z|00006|ofproto_dpif_xlate(handler1)|WARN|over max translation depth 64 on bridge br-int while processing ct_state=new|trk,ct_nw_src=172.16.0.11,ct_nw_dst=172.16.0.13,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,icmp,in_port=4,vlan_tci=0x0000,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_src=172.16.0.11,nw_dst=172.16.0.13,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
2022-04-20T10:57:38.785Z|00007|ofproto_dpif_xlate(handler1)|WARN|over max translation depth 64 on bridge br-int while processing ct_state=new|trk,ct_nw_src=172.16.0.11,ct_nw_dst=172.16.0.13,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,icmp,in_port=4,vlan_tci=0x0000,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_src=172.16.0.11,nw_dst=172.16.0.13,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0
2022-04-20T10:57:38.785Z|00008|ofproto_dpif_upcall(handler1)|WARN|Dropped 1 log messages in last 1107 seconds (most recently, 1107 seconds ago) due to excessive rate
2022-04-20T10:57:38.786Z|00009|ofproto_dpif_upcall(handler1)|WARN|Flow: ct_state=new|trk,ct_nw_src=172.16.0.11,ct_nw_dst=172.16.0.13,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,icmp,in_port=6,vlan_tci=0x0000,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_src=172.16.0.11,nw_dst=172.16.0.13,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=8,icmp_code=0

bridge("br-int")
----------------
 0. priority 0, cookie 0x557db8f9194e8693
    goto_table:60
60. priority 3, cookie 0x557db8f9194e8693
    NORMAL
     >>>> received packet on unknown port 6 <<<<
     >> no input bundle, dropping

Final flow: unchanged
Megaflow: recirc_id=0,eth,ip,in_port=6,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_frag=no
Datapath actions: drop
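
(For reference: the translation of the failing ping flow can also be inspected on demand with ofproto/trace, using the flow fields from the log above; a generic sketch, assuming ofport 4 is the o-hm0 interface on ctl1 as shown earlier:)

ovs-appctl ofproto/trace br-int in_port=4,icmp,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_src=172.16.0.11,nw_dst=172.16.0.13,nw_ttl=64,icmp_type=8,icmp_code=0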

Tags: ovs
Uwe Grawert (ugrawert) wrote :
tags: added: ovs
Oleg Bondarev (obondarev) wrote :

Is "fa:17:20:00:00:00" an overridden 'base_mac' config on your env? Did you try with default one?
What's the port with ofport 6 on ctl1?
Did you compare OVS flows on br-int and br-tun for working and non-working case?
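
(For reference, the ofport-to-interface mapping and the flows asked about here can be collected with standard OVS tooling; a generic sketch, not output from the affected hosts:)

# Show ofport numbers and interface names on the bridge
ovs-ofctl show br-int

# Dump the OpenFlow tables on both bridges, to compare working vs. non-working hosts
ovs-ofctl dump-flows br-int
ovs-ofctl dump-flows br-tun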

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
Uwe Grawert (ugrawert) wrote :

Hi Oleg,

The MAC addresses are arbitrary. These ports are created according to the Octavia documentation; the steps are documented here: https://docs.osism.io/deployment/services/loadbalancer.html#create-neutron-ports-for-health-manager-access
I did not use the default ones. This method has been working for a long time now, and I don't think there is anything wrong with defining the MAC addresses explicitly.

Since I don't have the deployment from the initial bug report anymore, I cannot tell what is on ofport 6. But I did collect logs from deployments with OVS 2.15 and 2.16 respectively; I am attaching those logs.

Since I am not so fluent in OpenFlow, I can't figure out much from the flow dumps. But flow dumps from both deployments are attached in logs.tgz.

Uwe Grawert (ugrawert) wrote :

@Oleg,

In the new deployments I did, the error message reads:

Final flow: unchanged
Megaflow: recirc_id=0,eth,ip,in_port=7,dl_src=fa:17:20:16:00:11,dl_dst=fa:17:20:16:00:13,nw_frag=no
Datapath actions: drop

So in this deployment the packet supposedly arrived on ofport 7, but there is no port 7 on the bridge.

LIU Yulong (dragon889) wrote :

Where is the output of "ovs-ofctl show br-int" and "ovs-ofctl show br-tun"?

And "ovs-appctl dpctl/show" and "ovs-appctl dpctl/dump-flows [DP]" should be useful as well.

Looks like there is a residual port on the OVS bridges. Did this host run an upgrade from 2.15 to 2.16?
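
(A note on the [DP] placeholder above: "ovs-appctl dpctl/show" lists the datapaths, typically a single kernel datapath named system@ovs-system, and that name can then be passed to the dump command; a generic sketch:)

ovs-appctl dpctl/show
ovs-appctl dpctl/dump-flows system@ovs-system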

Uwe Grawert (ugrawert) wrote :

@liu

For the OVS 2.16 deployment, I've added the output of:
ovs-ofctl show br-int
ovs-ofctl show br-tun
ovs-appctl dpctl/show

ovs-appctl dpctl/dump-flows didn't print anything.

Yes, OVS has been redeployed from 2.15 to 2.16, using "kolla-ansible deploy --tags openvswitch".

Oleg Bondarev (obondarev) wrote :

So the flows look the same for both 2.15 and 2.16 (no surprise here); it's just that in the 2.16 case this weird ofport 7 appears out of nowhere according to the vswitchd log, and in fact there is no such ofport on the bridge.

Also, the flow counters are zero in the 2.16 case:

cookie=0xb722108b439955c3, duration=81.938s, table=0, n_packets=0, n_bytes=0, idle_age=81, priority=0 actions=resubmit(,60)

For 2.15 we see packets:

cookie=0xb722108b439955c3, duration=631.481s, table=0, n_packets=35, n_bytes=2870, idle_age=20, priority=0 actions=resubmit(,60)

Not sure it's a Neutron issue; probably the Open vSwitch folks could point to some ways to debug this.

Changed in neutron:
status: Confirmed → Opinion
Mohammed Naser (mnaser) wrote :

Did you get a chance to make progress or discover anything more on this issue?

Just hit this with the exact same setup, using OVS 2.17.

Mohammed Naser (mnaser) wrote :

Looks like this is the loop/trace that is happening:

https://paste.opendev.org/show/b6UTwCIVlX0u0J4CSONS/

Mohammed Naser (mnaser) wrote :

FWIW, we've discovered that this issue only happens within a certain range of kernels: the kernel has to be new enough to have that bug.

Running Open vSwitch with this fix https://github.com/openvswitch/ovs/commit/250e1a6dd28b62be34cab181bae69853c6cf34f8 (2.17.3 contains it) has resolved this issue for us.
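
(To check whether a given deployment already carries that fix, comparing the running vswitchd version against 2.17.3 is usually enough; in a kolla-ansible deployment the daemon runs in a container, assumed here to use the default openvswitch_vswitchd name:)

docker exec openvswitch_vswitchd ovs-vswitchd --version

# or, on a host with Open vSwitch installed directly:
ovs-vswitchd --version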

Mohammed Naser (mnaser) wrote :

context: https://<email address hidden>/msg08762.html
