DVR process flow not installed on physical bridge for shared tenant network

Bug #1783654 reported by Arjun Baindur on 2018-07-25
46
This bug affects 9 people
Affects Status Importance Assigned to Milestone
Ubuntu Cloud Archive
Status tracked in Rocky
Pike
Undecided
Unassigned
Queens
Critical
Unassigned
Rocky
Critical
Unassigned
neutron
Critical
Swaminathan Vasudevan
neutron (Ubuntu)
Status tracked in Cosmic
Bionic
Critical
Unassigned
Cosmic
Critical
Unassigned

Bug Description

Seems like collateral from https://bugs.launchpad.net/neutron/+bug/1751396

In DVR, the distributed gateway port's IP and MAC are shared in the qrouter across all hosts.

The dvr_process_flow on the physical bridge (which replaces the shared router_distributed MAC address with the unique per-host MAC when its the source), is missing, and so is the drop rule which instructs the bridge to drop all traffic destined for the shared distributed MAC.

Because of this, we are seeing the router MAC on the network infrastructure, causing it on flap on br-int on every compute host:

root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 1
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 2
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
    1 4 fa:16:3e:42:a2:ec 1
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0
root@milhouse:~# ovs-appctl fdb/show br-int | grep fa:16:3e:42:a2:ec
   11 4 fa:16:3e:42:a2:ec 0

Where port 1 is phy-br-vlan, connecting to the physical bridge, and port 11 is the correct local qr-interface. Because these dvr flows are missing on br-vlan, pkts w/ source mac ingress into the host and br-int learns it upstream.

The symptom is when pinging a VM's floating IP, we see occasional packet loss (10-30%), and sometimes the responses are sent upstream by br-int instead of the qrouter, so the ICMP replies come with fixed IP of the replier since no NAT'ing took place, and on the tenant network rather than external network.

When I force net_shared_only to False here, the problem goes away: https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L436

It should we noted we *ONLY* need to do this on our dvr_snat host. The dvr process's are missing on every compute host. But if we shut qrouter on the snat host, FIP functionality works and DVR mac stops flapping on others. Or if we apply fix only to snat host, it works. Perhaps there is something on SNAT node that is unique

Ubuntu SRU details:
-------------------
[Impact]
See above

[Test Case]
Deploy OpenStack with dvr enabled and then follow the steps above.

[Regression Potential]
The patches that are backported have already landed upstream in the corresponding stable branches, helping to minimize any regression potential.

Arjun Baindur (abaindur) wrote :

I noticed the log says tunnel bridge. Was this not intended to ever happen for VLAN/physical bridges?

        if net_shared_only:
            LOG.debug("Not applying DVR rules to tunnel bridge because %s "
                      "is a shared network", subnet_info.get('network_id'))

https://github.com/openstack/neutron/blob/stable/pike/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py#L436

YAMAMOTO Takashi (yamamoto) wrote :

is this pike?

Arjun Baindur (abaindur) wrote :

Yes, Pike. But the code in question appears unchanged on queens and master too. The DVR drop/mod flows will be skipped for any shared non-external network:

        if lvm.network_type == n_const.TYPE_VLAN:
            # TODO(vivek) remove the IPv6 related flows once SNAT is not
            # used for IPv6 DVR.
            br = self.phys_brs[lvm.physical_network]
        if lvm.network_type in constants.TUNNEL_NETWORK_TYPES:
            br = self.tun_br
        # TODO(vivek) remove the IPv6 related flows once SNAT is not
        # used for IPv6 DVR.
        port_net_info = None
        net_shared_only = False
        try:
            port_net_info = (
                self.plugin_rpc.get_network_info_for_id(
                    self.context, subnet_info.get('network_id')))
        except oslo_messaging.RemoteError as e:
            LOG.error('L2 agent could not get network_info_for_id '
                      'due to RPC error. It happens when the server '
                      'does not support this RPC API. Detailed message: '
                      '%s', e)
        if port_net_info:
            net_shared_only = (
                port_net_info[0]['shared'] and
                not port_net_info[0]['router:external'])
        if net_shared_only:
            LOG.debug("Not applying DVR rules to tunnel bridge because %s "
                      "is a shared network", subnet_info.get('network_id'))
        else:
            if ip_version == 4:
                if subnet_info['gateway_ip']:
                    br.install_dvr_process_ipv4(
                        vlan_tag=lvm.vlan,
                        gateway_ip=subnet_info['gateway_ip'])
            else:
                br.install_dvr_process_ipv6(
                    vlan_tag=lvm.vlan, gateway_mac=subnet_info['gateway_mac'])
            br.install_dvr_process(
                vlan_tag=lvm.vlan, vif_mac=port.vif_mac,
                dvr_mac_address=self.dvr_mac_address)

Arjun Baindur (abaindur) on 2018-08-10
Changed in neutron:
assignee: nobody → Arjun Baindur (abaindur)
Oleg Bondarev (obondarev) wrote :

Marked as critical as this regression is breaking connectivity for all shared tenant networks.

Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
Changed in neutron:
assignee: Arjun Baindur (abaindur) → Swaminathan Vasudevan (swaminathan-vasudevan)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/595490
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=fd72643a61f726145288b2a468b044e84d02c88e
Submitter: Zuul
Branch: master

commit fd72643a61f726145288b2a468b044e84d02c88e
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f

Changed in neutron:
status: In Progress → Fix Released
tags: added: neutron-proactive-backport-potential

Reviewed: https://review.openstack.org/596405
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=b70afb50138f9588a5165e1ca986f83856d5399d
Submitter: Zuul
Branch: stable/queens

commit b70afb50138f9588a5165e1ca986f83856d5399d
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f
    (cherry picked from commit fd72643a61f726145288b2a468b044e84d02c88e)

tags: added: in-stable-queens

Hi Swaminathan. Should we backport fd72643a61f726145288b2a468b044e84d02c88e to Rocky? The code is in master and Queens, but not in Rocky.

Brian Haley (brian-haley) wrote :

Rodolfo - yes, we need both https://review.openstack.org/#/c/595496/ and https://review.openstack.org/#/c/595490/ picked to stable/rocky.

I even dropped a note in the first one, just forgot to push the button:

"I saw this was picked to stable/queens, we need it in stable/rocky as well I believe."

Changed in neutron (Ubuntu Bionic):
status: New → Triaged
Changed in neutron (Ubuntu Cosmic):
importance: Undecided → High
Changed in neutron (Ubuntu Bionic):
importance: Undecided → High
Changed in neutron (Ubuntu Cosmic):
status: New → Triaged
Changed in neutron (Ubuntu Cosmic):
importance: High → Critical
Changed in neutron (Ubuntu Bionic):
importance: High → Critical
Corey Bryant (corey.bryant) wrote :

This does not affect Ubuntu Pike because neutron 11.0.5 does not include the original patches that are being reverted.

description: updated
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package neutron - 2:13.0.1-0ubuntu1

---------------
neutron (2:13.0.1-0ubuntu1) cosmic; urgency=medium

  * New stable point release for OpenStack Rocky.
  * d/p/revert-dvr-add-error-handling.patch: Cherry-picked from upstream to
    revert DVR regressions (LP: #1751396)
  * d/p/revert-dvr-inter-tenant.patch: Cherry-picked from upstream to revert
    DVR regression (LP: #1783654).

 -- Corey Bryant <email address hidden> Tue, 02 Oct 2018 17:18:19 -0400

Changed in neutron (Ubuntu Cosmic):
status: Triaged → Fix Released

Hello Arjun, or anyone else affected,

Accepted neutron into bionic-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/neutron/2:12.0.4-0ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-bionic to verification-done-bionic. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-bionic. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in neutron (Ubuntu Bionic):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-bionic
Corey Bryant (corey.bryant) wrote :

Hello Arjun, or anyone else affected,

Accepted neutron into queens-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:queens-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-queens-needed to verification-queens-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-queens-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-queens-needed

Reviewed: https://review.openstack.org/607349
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=694d7757d745665f6e4ab05efa6feebf4e47304c
Submitter: Zuul
Branch: stable/rocky

commit 694d7757d745665f6e4ab05efa6feebf4e47304c
Author: Swaminathan Vasudevan <email address hidden>
Date: Thu Aug 23 05:54:17 2018 +0000

    Revert "DVR: Inter Tenant Traffic between networks not possible with shared net"

    This reverts commit d019790fe436b72cb05b8d0ff1f3a62ebd9e9bee.

    Closes-Bug: #1783654
    Change-Id: I4fd2610e185fb60cae62693cd4032ab700209b5f
    (cherry picked from commit fd72643a61f726145288b2a468b044e84d02c88e)

Corey Bryant (corey.bryant) wrote :

Regression testing successful for bionic-proposed (tempest results):

======
Totals
======
Ran: 92 tests in 1318.6413 sec.
 - Passed: 84
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 494.8999 sec.

tags: added: verification-done verification-done-bionic
removed: verification-needed verification-needed-bionic
Corey Bryant (corey.bryant) wrote :

Regression testing successful for queens-proposed (tempest results):

======
Totals
======
Ran: 92 tests in 1000.6584 sec.
 - Passed: 84
 - Skipped: 8
 - Expected Fail: 0
 - Unexpected Success: 0
 - Failed: 0
Sum of execute time for each test: 465.0920 sec.

tags: added: verification-queens-done
removed: verification-queens-needed

Change abandoned by Slawek Kaplonski (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/609440
Reason: Already proposed https://review.openstack.org/#/c/600141/

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers