ovs offload broken from neutron 16.3.0 onwards

Bug #1931696 reported by Edward Hope-Morley
This bug affects 1 person
Affects                               Status        Importance  Assigned to         Milestone
OpenStack Neutron Open vSwitch Charm  Fix Released  High        Edward Hope-Morley
Ubuntu Cloud Archive                  New           Undecided   Unassigned
  Queens                              New           Undecided   Unassigned
  Rocky                               New           Undecided   Unassigned
  Stein                               New           Undecided   Unassigned
  Train                               New           Undecided   Unassigned
  Ussuri                              New           Undecided   Unassigned
  Victoria                            New           Undecided   Unassigned
  Wallaby                             New           Undecided   Unassigned
  Xena                                New           Undecided   Unassigned
neutron                               In Progress   High        Edward Hope-Morley
neutron (Ubuntu)                      New           Undecided   Unassigned
  Bionic                              New           Undecided   Unassigned
  Focal                               New           Undecided   Unassigned
  Hirsute                             Won't Fix     Undecided   Unassigned
  Impish                              Won't Fix     Undecided   Unassigned

Bug Description

The 16.3.0 release of neutron introduced patch [1] which breaks the use of non-offloaded ports on a node that has ovs offload enabled. Our setup is as follows:

  * single tenant network (vxlan)
  * two vms with one port each where one port has offload enabled and the other does not
  * from non-offloaded vm I am unable to ping my gateway and I see the following:

# grep dropping /var/log/openvswitch/ovs-vswitchd.log
2021-06-11T09:37:16.271Z|00553|ofproto_dpif_xlate(handler150)|WARN|dropping VLAN 1 tagged packet received on port qr-446b3a35-d0 configured as VLAN 1 access port on bridge br-int while processing recirc_id=0x1a,ct_state=est|rpl|trk,ct_zone=1,ct_nw_src=172.16.0.126,ct_nw_dst=172.16.0.1,ct_nw_proto=1,ct_tp_src=8,ct_tp_dst=0,eth,icmp,in_port=3,vlan_tci=0x0000,dl_src=fa:16:3e:66:e8:2f,dl_dst=fa:16:3e:3c:50:c1,nw_src=172.16.0.1,nw_dst=172.16.0.126,nw_tos=0,nw_ecn=0,nw_ttl=64,icmp_type=0,icmp_code=0

which is from [2]. The patch [1] is configurable by setting explicitly_egress_direct=True in /etc/neutron/plugins/ml2/openvswitch_agent.ini, and that has fixed connectivity for me, so I am going to submit a patch to make this configurable via the charm.

[1] https://opendev.org/openstack/neutron/commit/d865165cc8cbd50a3e79a25065ef9a310d7c9396
[2] https://github.com/openvswitch/ovs/blob/branch-2.13/ofproto/ofproto-dpif-xlate.c#L2220
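When checking whether other ports on a hypervisor are hit by the same drop, it can help to pull the port, bridge and VLAN out of each warning. The following is a minimal sketch that assumes the message wording matches the log line quoted above; the helper names parse_drop and summarize are hypothetical and not part of any OVS or neutron tooling.

```python
import re

# Pattern for the access-port drop warning as worded in the log line above.
DROP_RE = re.compile(
    r"dropping VLAN (?P<vlan>\d+) tagged packet received on port "
    r"(?P<port>\S+) configured as VLAN (?P<access_vlan>\d+) access port "
    r"on bridge (?P<bridge>\S+)"
)


def parse_drop(line):
    """Return a dict with port, bridge and VLAN ids, or None if no match."""
    match = DROP_RE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    info["vlan"] = int(info["vlan"])
    info["access_vlan"] = int(info["access_vlan"])
    return info


def summarize(lines):
    """Count drop warnings per (bridge, port) pair."""
    counts = {}
    for line in lines:
        info = parse_drop(line)
        if info is not None:
            key = (info["bridge"], info["port"])
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Feeding summarize() the lines of /var/log/openvswitch/ovs-vswitchd.log would give a per-port count of these warnings, which makes it easier to see whether only router (qr-*) ports are affected.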

Changed in charm-neutron-openvswitch:
assignee: nobody → Edward Hope-Morley (hopem)
importance: Undecided → High
Changed in charm-neutron-openvswitch:
status: New → In Progress
Changed in charm-neutron-openvswitch:
milestone: none → 21.10
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-openvswitch (master)

Reviewed: https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/795981
Committed: https://opendev.org/openstack/charm-neutron-openvswitch/commit/0f7d135507def07196aa98cc4cf9fd518641a369
Submitter: "Zuul (22348)"
Branch: master

commit 0f7d135507def07196aa98cc4cf9fd518641a369
Author: Edward Hope-Morley <email address hidden>
Date: Fri Jun 11 11:33:10 2021 +0100

    Set explicitly_egress_direct=True for ml2 ovs

    This fixes a regression introduced in 16.3.0 neutron
    release that causes non-offloaded ports to break on
    hypervisors that have offloaded enabled.

    Closes-Bug: #1931696
    Change-Id: I1e884eac26d51c825736f34bcbfdccc906944b8d

Changed in charm-neutron-openvswitch:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-neutron-openvswitch (stable/21.04)
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-neutron-openvswitch (stable/21.04)

Reviewed: https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/798072
Committed: https://opendev.org/openstack/charm-neutron-openvswitch/commit/97e3f63b43e68bd0d01e350509c83024a320e444
Submitter: "Zuul (22348)"
Branch: stable/21.04

commit 97e3f63b43e68bd0d01e350509c83024a320e444
Author: Edward Hope-Morley <email address hidden>
Date: Fri Jun 11 11:33:10 2021 +0100

    Set explicitly_egress_direct=True for ml2 ovs

    This fixes a regression introduced in 16.3.0 neutron
    release that causes non-offloaded ports to break on
    hypervisors that have offloaded enabled.

    Closes-Bug: #1931696
    Change-Id: I1e884eac26d51c825736f34bcbfdccc906944b8d
    (cherry picked from commit 0f7d135507def07196aa98cc4cf9fd518641a369)

Trent Lloyd (lathiat) wrote :

This may have been a bad option to enable by default, based on the notes from https://docs.openstack.org/releasenotes/neutron/queens.html, and may be breaking dvr/l3ha in some cases; this is being discussed in detail in https://bugs.launchpad.net/neutron/+bug/1945306

Throwing in a copy of a patch I had lying around from another investigation to make this configurable instead. Not rocket science but stashing in case it's wanted later.

diff --git a/config.yaml b/config.yaml
index 764ddee..1018dff 100644
--- a/config.yaml
+++ b/config.yaml
@@ -457,3 +457,8 @@ options:
     description: |
       Allow the charm and packages to restart services automatically when
       required.
+  explicitly-egress-direct:
+    type: boolean
+    default: False
+    description: |
+      Set explicitly_egress_direct on neutron-openvswitch
diff --git a/hooks/neutron_ovs_context.py b/hooks/neutron_ovs_context.py
index 1c97eba..1ff0a9e 100644
--- a/hooks/neutron_ovs_context.py
+++ b/hooks/neutron_ovs_context.py
@@ -216,6 +216,8 @@ class OVSPluginContext(context.NeutronContext):
         ovs_ctxt['enable_dpdk'] = conf['enable-dpdk']
         ovs_ctxt['keepalived_healthcheck_interval'] = \
             conf['keepalived-healthcheck-interval']
+        ovs_ctxt['explicitly_egress_direct'] = conf['explicitly-egress-direct']
+
         ovs_ctxt['disable_mlockall'] = self.disable_mlockall()

         net_dev_mtu = neutron_api_settings.get('network_device_mtu')
diff --git a/templates/queens/openvswitch_agent.ini b/templates/queens/openvswitch_agent.ini
index 74ccefa..cd4a889 100644
--- a/templates/queens/openvswitch_agent.ini
+++ b/templates/queens/openvswitch_agent.ini
@@ -26,6 +26,9 @@ polling_interval = {{ polling_interval }}
 {% if extension_drivers -%}
 extensions = {{ extension_drivers }}
 {% endif -%}
+{% if explicitly_egress_direct -%}
+explicitly_egress_direct = {{ explicitly_egress_direct }}
+{% endif -%}

 [securitygroup]
 {% if neutron_security_groups and not enable_dpdk -%}
diff --git a/templates/ussuri/openvswitch_agent.ini b/templates/ussuri/openvswitch_agent.ini
index 2f42f50..cd4a889 100644
--- a/templates/ussuri/openvswitch_agent.ini
+++ b/templates/ussuri/openvswitch_agent.ini
@@ -26,8 +26,9 @@ polling_interval = {{ polling_interval }}
 {% if extension_drivers -%}
 extensions = {{ extension_drivers }}
 {% endif -%}
-# See LP 1931696
-explicitly_egress_direct = True
+{% if explicitly_egress_direct -%}
+explicitly_egress_direct = {{ explicitly_egress_direct }}
+{% endif -%}

 [securitygroup]
 {% if neutron_security_groups and not enable_dpdk -%}
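With the templates above, the option line is only rendered when the charm option is true, so a unit left at the default ends up with no explicitly_egress_direct line at all and falls back to neutron's default of False. A minimal sketch for reading the effective value back from a rendered file follows. It assumes the option sits in the [agent] section (as its position before [securitygroup] in the template suggests) and uses the stdlib configparser as a stand-in for oslo.config, which is what neutron itself uses; the helper name is hypothetical.

```python
import configparser


def explicitly_egress_direct(ini_text, section="agent"):
    """Return the effective boolean, falling back to False when unset.

    `section` is an assumption based on where the template places the option.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return parser.getboolean(
        section, "explicitly_egress_direct", fallback=False
    )
```

For example, passing the contents of /etc/neutron/plugins/ml2/openvswitch_agent.ini to this function shows whether a given hypervisor is running with the option enabled after a charm upgrade or config change.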

Hemanth Nakkina (hemanth-n) wrote :

Adding project neutron as this is still a problem when explicitly_egress_direct=False (which is the recommended value when DVR SNAT is used)

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/812641

Changed in neutron:
status: New → In Progress
Changed in neutron:
importance: Undecided → High
Changed in neutron:
assignee: nobody → Hemanth Nakkina (hemanth-n)
Edward Hope-Morley (hopem) wrote :

I have done some more testing on this, specifically re-testing the current neutron code that still contains the fix for bug 1897637, and my results are in the attached file 'lp1931696_test_results.txt'. In short, what I actually see is that without the patch from 1931696 and with the default of explicitly_egress_direct=False, things seem to work absolutely fine in all cases, i.e. offloaded and non-offloaded ports are able to ping each other and their gateway. See the test results for full details, but I think this is cause to consider reverting the patch from 1897637 since (a) it appears to be broken and (b) it is also breaking dvr_snat with l3ha when explicitly_egress_direct=True (see bug 1945306).

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/813415

Changed in neutron:
assignee: Hemanth Nakkina (hemanth-n) → Edward Hope-Morley (hopem)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Hemanth N <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/812641
Reason: Abandoning this patch as this is not the complete fix and created problems on ports of vnic_type direct, need to look more in-depth.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-neutron-openvswitch (stable/21.10)

Related fix proposed to branch: stable/21.10
Review: https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/813609

LIU Yulong (dragon889) wrote :

The patch https://review.opendev.org/c/openstack/neutron/+/666991, which introduced the config option ``explicitly_egress_direct=True/False``, fixed the following problems:
1. The egress flooding issue on br-int when the openvswitch (openflow) security group driver is enabled:
https://bugs.launchpad.net/neutron/+bug/1732067

2. Broken east-west traffic with DVR:
https://bugs.launchpad.net/neutron/+bug/1831534 (this bug is for VLAN networks, but the issue is not VLAN-only).

3. Some potential ingress flooding issues on br-int.

I have also listed some issues here:
https://bugs.launchpad.net/neutron/+bug/1934666/comments/5

So if you do not use explicitly_egress_direct=True, you have to face these issues.

Another thing: as I said in the release note, do not use ``explicitly_egress_direct=True`` on a host that runs both dvr_snat and the compute service. There are too many cases to cover; please try to combine the following cases for DVR:
1. vlan/vxlan
2. dvr/dvr+ha
3. agent mode(dvr, dvr_snat, dvr_no_external)
4. east-west traffic and north-south traffic with the Scenario of src and dest in or not in same host
5. IPv6
6. allowed_address_pair
7. enable/disable openflow firewall
8. HA router failover
The resulting number of cases is too large to cover.
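To give a rough sense of scale, the list above can be expanded into a test matrix. This is only a sketch: each axis loosely follows one numbered item, but the concrete values per axis (for example treating IPv6 as a two-valued IP version axis, or splitting item 4 into direction and placement) are assumptions made for illustration, and the name dvr_cases is hypothetical.

```python
from itertools import product

# One axis per numbered item above; the values chosen are assumptions.
AXES = {
    "network_type": ["vlan", "vxlan"],
    "router": ["dvr", "dvr+ha"],
    "agent_mode": ["dvr", "dvr_snat", "dvr_no_external"],
    "direction": ["east-west", "north-south"],
    "placement": ["same-host", "different-host"],
    "ip_version": [4, 6],
    "allowed_address_pair": [False, True],
    "openflow_firewall": [False, True],
    "ha_failover": [False, True],
}


def dvr_cases(axes=AXES):
    """Yield one dict per combination of axis values."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


total = sum(1 for _ in dvr_cases())
print(total)  # 768 with the axes as defined here
```

Not every combination is meaningful (HA failover only applies to HA routers, for instance), so the real count is lower, but even a pruned matrix stays in the hundreds.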

And FYI, we have marked dvr_snat + compute services as not supported.
https://review.opendev.org/c/openstack/neutron/+/801503

Edward Hope-Morley (hopem) wrote :

@dragon889 thanks for the info. To be clear, the patch we are reverting here is not the patch you reference that introduced explicitly_egress_direct, but a subsequent patch that alters flows for offloaded ports when explicitly_egress_direct=False and appears to have unintended side-effects.

Moshe Levi (moshele) wrote :

@hopem The problem was that without explicitly_egress_direct=False we can't offload traffic. We can't offload a rule that floods to all ports. When you say it works, did you check offload? I was aware that it breaks dvr_snat. What would be the point of using dvr_snat with hardware offload if it is not working? The compromise was to enable it with action NORMAL so that at least we can offload vlan/vxlan traffic, and for other cases you can use explicitly_egress_direct=True.

Changed in charm-neutron-openvswitch:
status: Fix Committed → Fix Released
Edward Hope-Morley (hopem) wrote :

@moshele I have re-tested without dvr-snat and these are the results:

Results (agent_mode=dvr, offload=true, explicitly_egress_direct=False):

  switchdev port:
    ping between vms same network/separate hypervisors: pass
    ping network gateway: fail
    ping external address: pass

  normal port:
    ping between vms same network/separate hypervisors: pass
    ping network gateway: fail
    ping external address: pass

Results (agent_mode=dvr, offload=true, explicitly_egress_direct=False, 1897637 patch reverted):

  switchdev port:
    ping between vms same network/separate hypervisors: pass
    ping network gateway: pass
    ping external address: pass

  normal port:
    ping between vms same network/separate hypervisors: pass
    ping network gateway: pass
    ping external address: pass

So as you can see, with your patch in a dvr env (computenode=dvr, networknode=dvr_snat) that has offload enabled, I am unable to ping my network gateway. I assume this is an unintended side-effect of your patch since it does not exist if I remove your patch.
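The two result sets can be compared mechanically to show exactly which checks differ. The sketch below simply transcribes the pass/fail values reported above (True = pass) and lists the checks that pass with the 1897637 patch reverted but fail with it applied; the names used are hypothetical.

```python
CHECKS = ("same-network ping", "gateway ping", "external ping")

# Transcribed from the results above: True = pass, False = fail.
WITH_PATCH = {
    "switchdev": {"same-network ping": True, "gateway ping": False,
                  "external ping": True},
    "normal": {"same-network ping": True, "gateway ping": False,
               "external ping": True},
}
PATCH_REVERTED = {
    "switchdev": {"same-network ping": True, "gateway ping": True,
                  "external ping": True},
    "normal": {"same-network ping": True, "gateway ping": True,
               "external ping": True},
}


def regressions(before, after):
    """Return (port_type, check) pairs that pass in `before` but fail in `after`."""
    return [
        (port, check)
        for port in before
        for check in CHECKS
        if before[port][check] and not after[port][check]
    ]


print(regressions(PATCH_REVERTED, WITH_PATCH))
```

In both port types the only difference is the gateway ping, which matches the conclusion drawn above.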

Edward Hope-Morley (hopem) wrote :

We've also found bug 1948656 which means that toggling explicitly_egress_direct does not remove the flow added when set to True.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-neutron-openvswitch (stable/21.10)

Reviewed: https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/813609
Committed: https://opendev.org/openstack/charm-neutron-openvswitch/commit/d465376858a7a88c1ccf4ea41f3c17fe5ceffd91
Submitter: "Zuul (22348)"
Branch: stable/21.10

commit d465376858a7a88c1ccf4ea41f3c17fe5ceffd91
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 11 11:23:29 2021 +0100

    Revert "Set explicitly_egress_direct=True for ml2 ovs"

    This reverts commit 0f7d135507def07196aa98cc4cf9fd518641a369.

    Related-Bug: #1931696
    Change-Id: I2ee90140f646170552fd3a638af2231ac9a38cad
    (cherry picked from commit 6fb737c1c88c4773bedf65810429224c031ab881)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-neutron-openvswitch (master)

Reviewed: https://review.opendev.org/c/openstack/charm-neutron-openvswitch/+/813407
Committed: https://opendev.org/openstack/charm-neutron-openvswitch/commit/6fb737c1c88c4773bedf65810429224c031ab881
Submitter: "Zuul (22348)"
Branch: master

commit 6fb737c1c88c4773bedf65810429224c031ab881
Author: Edward Hope-Morley <email address hidden>
Date: Mon Oct 11 11:23:29 2021 +0100

    Revert "Set explicitly_egress_direct=True for ml2 ovs"

    This reverts commit 0f7d135507def07196aa98cc4cf9fd518641a369.

    Related-Bug: #1931696
    Change-Id: I2ee90140f646170552fd3a638af2231ac9a38cad

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/813415
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Brian Murray (brian-murray) wrote :

The Hirsute Hippo has reached End of Life, so this bug will not be fixed for that release.

Changed in neutron (Ubuntu Hirsute):
status: New → Won't Fix
Brian Murray (brian-murray) wrote :

Ubuntu 21.10 (Impish Indri) has reached end of life, so this bug will not be fixed for that specific release.

Changed in neutron (Ubuntu Impish):
status: New → Won't Fix