DVR process flow not installed on physical bridge for shared tenant network

Bug #1783654 reported by Arjun Baindur
46
This bug affects 9 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Fix Released
Critical
Unassigned
Pike
Invalid
Undecided
Unassigned
Queens
Fix Released
Critical
Unassigned
Rocky
Fix Released
Critical
Unassigned
neutron
Fix Released
Critical
Swaminathan Vasudevan
neutron (Ubuntu)
Fix Released
Critical
Unassigned
Bionic
Fix Released
Critical
Unassigned
Cosmic
Fix Released
Critical
Unassigned

Bug Description

Seems like collateral from https://bugs.launchpad.net/neutron/+bug/1751396

In DVR, the distributed gateway port's IP and MAC are shared in the qrouter across all hosts.

The dvr_process_flow on the physical bridge (which replaces the shared router_distributed MAC address with the unique per-host MAC when its the source), is missing, and so is the drop rule which instructs the bridge to drop all traffic destined for the shared distributed MAC.

Because of this, we are seeing the router MAC on the network infrastructure, causing it on flap on br-int on every compute host:

root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 1
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 2
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 1
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0

Where port 1 is phy-br-vlan, connecting to the physical bridge, and port 11 is the correct local qr-interface. Because these dvr flows are missing on br-vlan, pkts w/ source mac ingress into the host and br-int learns it upstream.

The symptom is when pinging a VM's floating IP, we see occasional packet loss (10-30%), and sometimes the responses are sent upstream by br-int instead of the qrouter, so the ICMP replies come with fixed IP of the replier since no NAT'ing took place, and on the tenant network rather than external network.

When I force net_shared_only to False here, the problem goes away: https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L436

It should we noted we *ONLY* need to do this on our dvr_snat host. The dvr process's are missing on every compute host. But if we shut qrouter on the snat host, FIP functionality works and DVR mac stops flapping on others. Or if we apply fix only to snat host, it works. Perhaps there is something on SNAT node that is unique

Ubuntu SRU details:
-------------------
[Impact]
See above

[Test Case]
Deploy OpenStack with dvr enabled and then follow the steps above.

[Regression Potential]
The patches that are backported have already landed upstream in the corresponding stable branches, helping to minimize any regression potential.

Revision history for this message
Arjun Baindur (abaindur) wrote :

I noticed the log says tunnel bridge. Was this not intended to ever happen for VLAN/physical bridges?

        if net_shared_only:
            LOG.debug("Not applying DVR rules to tunnel bridge because %s "
                      "is a shared network", subnet_info.get('network_id'))

https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L436

Revision history for this message
YAMAMOTO Takashi (yamamoto) wrote :

is this pike?

Revision history for this message
Arjun Baindur (abaindur) wrote :

Yes, Pike. But the code in question appears unchanged on queens and master too. The DVR drop/mod flows will be skipped for any shared non-external network:

        if lvm.network_type == n_const.TYPE_VLAN:
            # TODO(vivek) remove the IPv6 related flows once SNAT is not
            # used for IPv6 DVR.
            br = self.phys_brs[lvm.physical_network]
        if lvm.network_type in constants.TUNNEL_NETWORK_TYPES:
            br = self.tun_br
        # TODO(vivek) remove the IPv6 related flows once SNAT is not
        # used for IPv6 DVR.
        port_net_info = None
        net_shared_only = False
        try:
            port_net_info = (
                self.plugin_rpc.get_network_info_for_id(
                    self.context, subnet_info.get('network_id')))
        except oslo_messaging.RemoteError as e:
            LOG.error('L2 agent could not get network_info_for_id '
                      'due to RPC error. It happens when the server '
                      'does not support this RPC API. Detailed message: '
                      '%s', e)
        if port_net_info:
            net_shared_only = (
                port_net_info[0]['shared'] and
                not port_net_info[0]['router:external'])
        if net_shared_only:
            LOG.debug("Not applying DVR rules to tunnel bridge because %s "
                      "is a shared network", subnet_info.get('network_id'))
        else:
            if ip_version == 4:
                if subnet_info['gateway_ip']:
                    br.install_dvr_process_ipv4(
                        vlan_tag=lvm.vlan,
                        gateway_ip=subnet_info['gateway_ip'])
            else:
                br.install_dvr_process_ipv6(
                    vlan_tag=lvm.vlan, gateway_mac=subnet_info['gateway_mac'])
            br.install_dvr_process(
                vlan_tag=lvm.vlan, vif_mac=port.vif_mac,
                dvr_mac_address=self.dvr_mac_address)

Arjun Baindur (abaindur)
Changed in neutron:
assignee: nobody → Arjun Baindur (abaindur)
Revision history for this message
Oleg Bondarev (obondarev) wrote :

Marked as critical as this regression is breaking connectivity for all shared tenant networks.

Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Changed in neutron:
assignee: Arjun Baindur (abaindur) → Swaminathan Vasudevan (swaminathan-vasudevan)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/595490
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fd72643a61f726145288b2a468b044e84d02c88e
Submitter: Zuul
Branch: master

commit fd72643a61f726145288b2a468b044e84d02c88e
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/596405

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/queens)

Reviewed: https://review.openstack.org/596405
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b70afb50138f9588a5165e1ca986f83856d5399d
Submitter: Zuul
Branch: stable/queens

commit b70afb50138f9588a5165e1ca986f83856d5399d
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f
    (cherry picked from commit fd72643a61f726145288b2a468b044e84d02c88e)

tags: added: in-stable-queens
Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hi Swaminathan. Should we backport fd72643a61f726145288b2a468b044e84d02c88e to Rocky? The code is in master and Queens, but not in Rocky.

Revision history for this message
Brian Haley (brian-haley) wrote :

Rodolfo - yes, we need both https://review.openstack.org/#/c/595496/ and https://review.openstack.org/#/c/595490/ picked to stable/rocky.

I even dropped a note in the first one, just forgot to push the button:

"I saw this was picked to stable/queens, we need it in stable/rocky as well I believe."

Changed in neutron (Ubuntu Bionic):
status: New → Triaged
Changed in neutron (Ubuntu Cosmic):
importance: Undecided → High
Changed in neutron (Ubuntu Bionic):
importance: Undecided → High
Changed in neutron (Ubuntu Cosmic):
status: New → Triaged
Changed in neutron (Ubuntu Cosmic):
importance: High → Critical
Changed in neutron (Ubuntu Bionic):
importance: High → Critical
Revision history for this message
Corey Bryant (corey.bryant) wrote :

This does not affect Ubuntu Pike because neutron 11.0.5 does not include the original patches that are being reverted.

description: updated
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:13.0.1-0ubuntu1

---------------
neutron (2:13.0.1-0ubuntu1) cosmic; urgency=medium

  * New stable point release for OpenStack Rocky.
  * d/p/revert-dvr-add-error-handling.patch: Cherry-picked from upstream to
    revert DVR regressions (LP: #1751396)
  * d/p/revert-dvr-inter-tenant.patch: Cherry-picked from upstream to revert
    DVR regression (LP: #1783654).

 -- Corey Bryant <email address hidden> Tue, 02 Oct 2018 17:18:19 -0400

Changed in neutron (Ubuntu Cosmic):
status: Triaged → Fix Released
Revision history for this message
Timo Aaltonen (tjaalton) wrote : Please test proposed package

Hello Arjun, or anyone else affected,

Accepted neutron into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:12.0.4-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
Revision history for this message
Corey Bryant (corey.bryant) wrote :

Hello Arjun, or anyone else affected,

Accepted neutron into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/rocky)

Reviewed: https://review.openstack.org/607349
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=694d7757d745665f6e4ab05efa6feebf4e47304c
Submitter: Zuul
Branch: stable/rocky

commit 694d7757d745665f6e4ab05efa6feebf4e47304c
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f
    (cherry picked from commit fd72643a61f726145288b2a468b044e84d02c88e)

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Regression testing successful for bionic-proposed (tempest results):

======
Totals
======
Ran: 92 tests in 1318.6413 sec.
 - Passed: 84
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 494.8999 sec.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/609442

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Regression testing successful for queens-proposed (tempest results):

======
Totals
======
Ran: 92 tests in 1000.6584 sec.
 - Passed: 84
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 465.0920 sec.

tags: added: verification-queens-done
removed: verification-queens-needed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/pike)

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/609440
Reason: Already proposed https://review.openstack.org/#/c/600141/

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/609442
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f2c3821dc76a50c4994b3b599867b99ff772559d
Submitter: Zuul
Branch: stable/ocata

commit f2c3821dc76a50c4994b3b599867b99ff772559d
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f
    (cherry picked from commit fd72643a61f726145288b2a468b044e84d02c88e)
    (cherry picked from commit b70afb50138f9588a5165e1ca986f83856d5399d)

tags: added: in-stable-ocata
Revision history for this message
Brian Murray (brian-murray) wrote :

I don't think the regression testing is sufficient for this SRU verification, please follow the test case in the description to verify that this is fixed.

tags: added: verification-needed-bionic
removed: verification-done-bionic
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/pike)

Reviewed: https://review.openstack.org/609440
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=020d745f5b859f93f0c550be221c350bc14e8d23
Submitter: Zuul
Branch: stable/pike

commit 020d745f5b859f93f0c550be221c350bc14e8d23
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f
    (cherry picked from commit fd72643a61f726145288b2a468b044e84d02c88e)
    (cherry picked from commit b70afb50138f9588a5165e1ca986f83856d5399d)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 13.0.2

This issue was fixed in the openstack/neutron 13.0.2 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.6

This issue was fixed in the openstack/neutron 11.0.6 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 12.0.5

This issue was fixed in the openstack/neutron 12.0.5 release.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This SRU is no longer needed as we're uploading neutron 12.0.5 to bionic (queens) which includes this fix.

Revision history for this message
Brian Murray (brian-murray) wrote : Reminder of SRU verification policy change

Thank you for taking the time to verify this stable release fix. We have noticed that you have used the verification-done tag for marking the bug as verified and would like to point out that due to a recent change in SRU bug verification policy fixes now have to be marked with per-release tags (i.e. verification-done-$RELEASE). Please remove the verification-done tag and add one for the release you have tested the package in. Thank you!

https://wiki.ubuntu.com/StableReleaseUpdates#Verification

Revision history for this message
Corey Bryant (corey.bryant) wrote :

This bug has been fixed in the neutron 12.0.5 stable point release via https://bugs.launchpad.net/cloud-archive/+bug/1795424. neutron 12.0.5 was tested with a tempest smoke test run against a juju-deployed openstack as described in https://wiki.ubuntu.com/OpenStackUpdates.
Please see test results at: https://bugs.launchpad.net/cloud-archive/+bug/1795424/comments/17

Thanks,
Corey

tags: added: verification-done-bionic
removed: verification-needed-bionic
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 14.0.0.0b1

This issue was fixed in the openstack/neutron 14.0.0.0b1 development milestone.

Revision history for this message
Corey Bryant (corey.bryant) wrote :

Marking as fix released for Bionic/Queens as it is included in 2:12.0.5-0ubuntu1 and 2:12.0.5-0ubuntu1~cloud0.

Changed in neutron (Ubuntu Bionic):
status: Fix Committed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron ocata-eol

This issue was fixed in the openstack/neutron ocata-eol release.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.