OVN deployment with DVR environment incorrectly routes FIP traffic through Controller/Chassis-GW

Bug #1842988 reported by Simon Clarke
This bug affects 1 person
Affects    Status          Importance    Assigned to            Milestone
neutron    Fix Released    Undecided     Lucas Alvares Gomes
tripleo    Fix Released    Undecided     Unassigned

Bug Description

TripleO Stein. OVN deployment with DVR environment file incorrectly routes FIP traffic through Controller/Chassis-GW rather than locally.

Steps to reproduce
===========
1. Deployed the overcloud with OVN and DVR enabled.

The following neutron environment files were used (in addition to network isolation using bonded VLAN and other customizations)

 -e $TD/environments/services/neutron-ovn-dvr-ha.yaml \
 -e $TD/environments/services/neutron-ovn-dpdk.yaml \
 -e $TD/environments/services/neutron-ovn-sriov.yaml

2. After overcloud deployment, confirmed that the neutron conf files and chassis settings are correct.

neutron.conf -> enable_dvr=True
ml2_conf.ini -> enable_distributed_floating_ip=True
bridge_mapping on compute chassis -> ovn-bridge-mappings="datacentre:br-ex"

3. Deployed an instance on a Geneve tenant network with a floating IP on the VLAN external ‘datacentre’ network.
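
The configuration checks from step 2 could be scripted along these lines. This is a minimal sketch, assuming that `enable_dvr` sits in the `[DEFAULT]` section of neutron.conf and `enable_distributed_floating_ip` in the `[ovn]` section of ml2_conf.ini (the report does not name the sections), and that `ovn-bridge-mappings` is a comma-separated list of `physnet:bridge` pairs. The helper names are hypothetical, and the sample strings stand in for the deployed files.

```python
import configparser


def read_bool_option(ini_text, section, option):
    """Return a boolean option from INI text, or None if it is not set."""
    # strict=False tolerates duplicate options; interpolation=None keeps
    # literal '%' characters in values from raising errors.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    if not parser.has_option(section, option):
        return None
    return parser.getboolean(section, option)


def parse_bridge_mappings(value):
    """Split 'physnet:bridge[,physnet:bridge...]' into a dict."""
    mappings = {}
    for pair in value.strip('"').split(","):
        if not pair.strip():
            continue
        physnet, bridge = pair.strip().split(":", 1)
        mappings[physnet] = bridge
    return mappings


# Sample contents standing in for the deployed files (assumed layout).
NEUTRON_CONF = "[DEFAULT]\nenable_dvr=True\n"
ML2_CONF = "[ovn]\nenable_distributed_floating_ip=True\n"

dvr_enabled = read_bool_option(NEUTRON_CONF, "DEFAULT", "enable_dvr")
dfip_enabled = read_bool_option(ML2_CONF, "ovn", "enable_distributed_floating_ip")
mappings = parse_bridge_mappings('"datacentre:br-ex"')
print(dvr_enabled, dfip_enabled, mappings)
```

A result of `None` from `read_bool_option` means the option is absent from that section, which is worth distinguishing from an explicit `False`.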

Expected Result
=============
FIP traffic is routed through the same compute node as the instance via a local NAT rule.

Actual Result
============
FIP is operational, but traffic is routed through the Controller/Chassis-GW.

The matching NAT entry for the FIP shows that external_mac is null and logical_port is not set, so no local NAT routing occurs, which matches the observed behaviour.

Environment
===========

1. TripleO Stein using the latest current-tripleo-rdo container images with standard Compute role plus OvsDpdk and SR-IOV roles.

2. Ceph and Pure Storage
3. OVN networking (default in Stein) with the following neutron environment

  -e $TD/environments/services/neutron-ovn-dvr-ha.yaml \
  -e $TD/environments/services/neutron-ovn-dpdk.yaml \
  -e $TD/environments/services/neutron-ovn-sriov.yaml

    (in addition to network isolation using bonded VLAN and other customizations)

Confirmed that after deployment

• neutron.conf -> enable_dvr=True
• ml2_conf.ini -> enable_distributed_floating_ip=True
• bridge_mapping on compute chassis -> ovn-bridge-mappings="datacentre:br-ex"

Logs & Configs
===========

ovn-nbctl lr-nat-list neutron-a53687de-ac06-400a-9104-748d2807c55a

TYPE             EXTERNAL_IP    LOGICAL_IP        EXTERNAL_MAC    LOGICAL_PORT
dnat_and_snat    10.3.27.20     192.168.0.18
snat             10.3.25.207    192.168.0.0/24

Changed in tripleo:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
milestone: none → ussuri-1
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
Andreas Karis (akaris) wrote :

I ran into the same issue downstream, on Red Hat OSP 13, in two different, brand-new environments. I'm going to add further details from my troubleshooting.

First of all, a workaround or way to "fix" this is to reboot the instance. The external_mac field will then be populated. However, if one detaches and reattaches the FIP, one can easily reproduce the issue.

From my lab, here are steps to reproduce and verify this:

Create server without FIP:
~~~
(overcloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+------------+--------+------------+-------------+--------------------------------------------------------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------------+--------+------------+-------------+--------------------------------------------------------------------------+
| ac73af43-c22e-4e5b-be90-f2bc79b30ca4 | rhel-test1 | ACTIVE | - | Running | private1=2000:192:168:0:f816:3eff:fe3f:9ee4, 192.168.0.101, 172.31.0.201 |
| f8dccb8d-b39a-4231-b2f1-46dae48744b4 | rhel-test2 | ACTIVE | - | Running | private1=2000:192:168:0:f816:3eff:fe00:bcd6, 192.168.0.103, 172.31.0.217 |
| 1cd9d144-9951-4bff-8b97-3a6e02b79467 | rhel-test3 | ACTIVE | - | Running | private2=192.168.1.101, 2000:192:168:1:f816:3eff:feb7:5d7d |
+--------------------------------------+------------+--------+------------+-------------+--------------------------------------------------------------------------+
~~~

Check NAT rules:
~~~
[root@overcloud-controller-0 ~]# export SB=$(sudo ovs-vsctl get open . external_ids:ovn-remote | sed -e 's/\"//g')
[root@overcloud-controller-0 ~]# export NB=$(sudo ovs-vsctl get open . external_ids:ovn-remote | sed -e 's/\"//g' | sed -e 's/6642/6641/g')
[root@overcloud-controller-0 ~]# alias ovn-sbctl='sudo docker exec ovn_controller ovn-sbctl --db=$SB'
[root@overcloud-controller-0 ~]# alias ovn-nbctl='sudo docker exec ovn_controller ovn-nbctl --db=$NB'
[root@overcloud-controller-0 ~]# alias ovn-trace='sudo docker exec ovn_controller ovn-trace --db=$SB'
[root@overcloud-controller-0 ~]# ovn-nbctl find NAT type=dnat_and_snat
_uuid : 61b357c0-1c06-4c22-b0e5-50ef64813cc8
external_ids : {"neutron:fip_external_mac"="fa:16:3e:ee:4b:3b", "neutron:fip_id"="b0c03864-8c79-43b3-9240-370cfbcde904", "neutron:fip_port_id"="b7e9fb53-db4e-4b3e-a6e4-2acbbc4ba9ba", "neutron:revision_number"="2", "neutron:router_name"="neutron-e75c5fb4-29ee-450d-840f-a911c9896256"}
external_ip : "172.31.0.201"
external_mac : "fa:16:3e:ee:4b:3b"
logical_ip : "192.168.0.101"
logical_port : "b7e9fb53-db4e-4b3e-a6e4-2acbbc4ba9ba"
type : dnat_and_snat

_uuid : b0782824-00e9-4b66-bc4b-9d54cf8ec737
external_ids : {"neutron:fip_external_mac"="fa:16:3e:8b:04:4f", "neutron:fip_id"="4fc2b87a-a188-4468-ad40-8b03ed26b56d", "neutron:fip_port_id"="581de757-7b2e-4995-aa84-3c4bc58edd4d", "neutron:revision_number"="2", "neutron:router_name"="neutron-e75c5fb4-29ee-45...


Andreas Karis (akaris) wrote :

The issue can also be recreated by detaching and reattaching the FIP:
~~~
(overcloud) [stack@undercloud-0 ~]$ openstack server remove floating ip rhel-test3 172.31.0.210
(overcloud) [stack@undercloud-0 ~]$ openstack server add floating ip rhel-test3 172.31.0.210
(overcloud) [stack@undercloud-0 ~]$
~~~

~~~
[root@overcloud-controller-0 ~]# ovn-nbctl find NAT type=dnat_and_snat
_uuid : 61b357c0-1c06-4c22-b0e5-50ef64813cc8
external_ids : {"neutron:fip_external_mac"="fa:16:3e:ee:4b:3b", "neutron:fip_id"="b0c03864-8c79-43b3-9240-370cfbcde904", "neutron:fip_port_id"="b7e9fb53-db4e-4b3e-a6e4-2acbbc4ba9ba", "neutron:revision_number"="2", "neutron:router_name"="neutron-e75c5fb4-29ee-450d-840f-a911c9896256"}
external_ip : "172.31.0.201"
external_mac : "fa:16:3e:ee:4b:3b"
logical_ip : "192.168.0.101"
logical_port : "b7e9fb53-db4e-4b3e-a6e4-2acbbc4ba9ba"
type : dnat_and_snat

_uuid : b0782824-00e9-4b66-bc4b-9d54cf8ec737
external_ids : {"neutron:fip_external_mac"="fa:16:3e:8b:04:4f", "neutron:fip_id"="4fc2b87a-a188-4468-ad40-8b03ed26b56d", "neutron:fip_port_id"="581de757-7b2e-4995-aa84-3c4bc58edd4d", "neutron:revision_number"="2", "neutron:router_name"="neutron-e75c5fb4-29ee-450d-840f-a911c9896256"}
external_ip : "172.31.0.217"
external_mac : "fa:16:3e:8b:04:4f"
logical_ip : "192.168.0.103"
logical_port : "581de757-7b2e-4995-aa84-3c4bc58edd4d"
type : dnat_and_snat

_uuid : 2b58f203-5560-480d-869f-66ddb0d97ed8
external_ids : {"neutron:fip_external_mac"="fa:16:3e:9e:96:da", "neutron:fip_id"="cdeb3dc7-fdc8-493b-9536-edb2163c3d1c", "neutron:fip_port_id"="79f8f865-30c7-48ab-9017-76ce6c007c1b", "neutron:revision_number"="22", "neutron:router_name"="neutron-e75c5fb4-29ee-450d-840f-a911c9896256"}
external_ip : "172.31.0.210"
external_mac : []
logical_ip : "192.168.1.101"
logical_port : "79f8f865-30c7-48ab-9017-76ce6c007c1b"
type : dnat_and_snat
[root@overcloud-controller-0 ~]# ovn-nbctl lr-nat-list neutron-e75c5fb4-29ee-450d-840f-a911c9896256
TYPE             EXTERNAL_IP     LOGICAL_IP         EXTERNAL_MAC         LOGICAL_PORT
dnat_and_snat    172.31.0.201    192.168.0.101      fa:16:3e:ee:4b:3b    b7e9fb53-db4e-4b3e-a6e4-2acbbc4ba9ba
dnat_and_snat    172.31.0.210    192.168.1.101
dnat_and_snat    172.31.0.217    192.168.0.103      fa:16:3e:8b:04:4f    581de757-7b2e-4995-aa84-3c4bc58edd4d
snat             172.31.0.212    192.168.10.0/24
snat             172.31.0.212    192.168.0.0/24
snat             172.31.0.212    192.168.1.0/24
~~~
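
Spotting affected floating IPs in `ovn-nbctl find NAT` output by eye gets tedious once there are many entries. A minimal sketch of a parser follows, assuming the `column : value` record layout shown above with blank lines between records. The function names `parse_records` and `centralized_fips` are hypothetical, and the sample is abbreviated from the output above.

```python
def parse_records(text):
    """Parse 'column : value' blocks separated by blank lines into dicts."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        # Split on the first colon only: values such as MACs contain colons.
        key, sep, value = line.partition(":")
        if sep:
            current[key.strip()] = value.strip().strip('"')
    if current:
        records.append(current)
    return records


def centralized_fips(records):
    """Return external IPs of dnat_and_snat entries with no external_mac."""
    return [
        rec["external_ip"]
        for rec in records
        if rec.get("type") == "dnat_and_snat"
        and rec.get("external_mac") in (None, "", "[]")
    ]


# Abbreviated from the output above; external_ids columns are omitted.
SAMPLE = """\
_uuid : 61b357c0-1c06-4c22-b0e5-50ef64813cc8
external_ip : "172.31.0.201"
external_mac : "fa:16:3e:ee:4b:3b"
logical_ip : "192.168.0.101"
logical_port : "b7e9fb53-db4e-4b3e-a6e4-2acbbc4ba9ba"
type : dnat_and_snat

_uuid : 2b58f203-5560-480d-869f-66ddb0d97ed8
external_ip : "172.31.0.210"
external_mac : []
logical_ip : "192.168.1.101"
logical_port : "79f8f865-30c7-48ab-9017-76ce6c007c1b"
type : dnat_and_snat
"""

affected = centralized_fips(parse_records(SAMPLE))
print(affected)
```

Only `external_mac` is tested here, because in this reproduction `logical_port` is populated even though the entry is not distributed.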

Andreas Karis (akaris) wrote :

And for completeness, here's the output of `ovn-nbctl lr-nat-list` after rebooting that server:
~~~
(overcloud) [stack@undercloud-0 ~]$ nova reboot rhel-test3
Request to reboot server <Server: rhel-test3> has been accepted.
(overcloud) [stack@undercloud-0 ~]$
~~~

~~~
[root@overcloud-controller-0 ~]# ovn-nbctl lr-nat-list neutron-e75c5fb4-29ee-450d-840f-a911c9896256
TYPE             EXTERNAL_IP     LOGICAL_IP         EXTERNAL_MAC         LOGICAL_PORT
dnat_and_snat    172.31.0.201    192.168.0.101      fa:16:3e:ee:4b:3b    b7e9fb53-db4e-4b3e-a6e4-2acbbc4ba9ba
dnat_and_snat    172.31.0.210    192.168.1.101      fa:16:3e:9e:96:da    79f8f865-30c7-48ab-9017-76ce6c007c1b
dnat_and_snat    172.31.0.217    192.168.0.103      fa:16:3e:8b:04:4f    581de757-7b2e-4995-aa84-3c4bc58edd4d
snat             172.31.0.212    192.168.10.0/24
snat             172.31.0.212    192.168.0.0/24
snat             172.31.0.212    192.168.1.0/24
~~~

Changed in tripleo:
assignee: Lucas Alvares Gomes (lucasagomes) → nobody
tags: added: ovn
Changed in neutron:
assignee: nobody → Lucas Alvares Gomes (lucasagomes)
Lucas Alvares Gomes (lucasagomes) wrote :

Looking at the code, I found an inconsistency in the way we set the "external_mac" value when the network type is VLAN. That also explains why Andreas gets it working after rebooting the node.

When we create the FIP, this conditional [0] is executed, which means that we won't set the "external_mac" field if the network type is VLAN.

Now, when we reboot the node, the port flips to DOWN and then UP again, and both of these status changes generate an "event" which in the end invokes this code [1]. In the case of the event, we are *not* checking the network type, so the "external_mac" field gets set.

I believe we need to remove the VLAN check from [0].

[0] https://github.com/openstack/networking-ovn/blob/eda5d7f80d877601170631c5f5485370ea701f42/networking_ovn/common/ovn_client.py#L836-L837
[1] https://github.com/openstack/networking-ovn/blob/eda5d7f80d877601170631c5f5485370ea701f42/networking_ovn/ml2/mech_driver.py#L794-L800
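
To make the inconsistency concrete, here is a simplified model of the two code paths. It is a sketch under stated assumptions, not the networking-ovn code: the function names, the `vlan_check` flag, and the `distributed` parameter (standing in for the distributed floating IP setting) are all hypothetical.

```python
VLAN = "vlan"


def external_mac_on_fip_create(port_mac, network_type, distributed, vlan_check=True):
    """Model of the FIP-creation path [0].

    With the VLAN check in place, external_mac is left unset (None) when
    the FIP's provider network is of type VLAN.
    """
    if not distributed:
        return None
    if vlan_check and network_type == VLAN:
        return None
    return port_mac


def external_mac_on_port_event(port_mac, port_up, distributed):
    """Model of the port status event path [1].

    There is no network-type check here: external_mac is set when the
    port comes up and unset when it goes down.
    """
    if distributed and port_up:
        return port_mac
    return None


mac = "fa:16:3e:9e:96:da"

# FIP attached on a VLAN network: external_mac stays unset (centralized NAT).
on_create = external_mac_on_fip_create(mac, VLAN, distributed=True)

# Instance rebooted: the port-up event sets external_mac regardless of type.
after_reboot = external_mac_on_port_event(mac, port_up=True, distributed=True)

# Proposed change: drop the VLAN check so both paths agree.
on_create_fixed = external_mac_on_fip_create(mac, VLAN, distributed=True, vlan_check=False)

print(on_create, after_reboot, on_create_fixed)
```

In this model the two paths disagree only for VLAN networks, which matches the observation that a reboot populates the field while a fresh FIP association does not.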

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/703813

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/703813
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=c471c7330c6c7a642e11ceae5fd604177059a3e8
Submitter: Zuul
Branch: master

commit c471c7330c6c7a642e11ceae5fd604177059a3e8
Author: Lucas Alvares Gomes <email address hidden>
Date: Wed Jan 22 15:00:32 2020 +0000

    [OVN] Remove VLAN check when setting external_mac

    This patch reverts [0].

    The code wasn't accounting for VLAN provider networks, as stated in the
    bug #1842988, DVR won't work if the provider network (where the FIP is
    created) is VLAN.

    There was also an inconsistency in how the external_mac was set for the
    VLAN networks. Upon creating the FIP the code was checking for the
    network type and not setting the external_mac attribute in case the
    network was VLAN type. But, if the port went down and up again (e.g. if
    you reboot the VM) the event handler that set/unset the external_mac [1]
    wasn't checking the type. This is how people worked around the DVR
    problem (as stated in bug #1842988).

    For more information see bug #1842988.

    [0]
    https://github.com/openstack/networking-ovn/commit/c5aef51edc9843db605303ec8bd8610b6c55e9c2
    [1]
    https://github.com/openstack/networking-ovn/blob/eda5d7f80d877601170631c5f5485370ea701f42/networking_ovn/ml2/mech_driver.py#L794-L800

    Change-Id: Ifb795626dc9c2ac4f0104f491dd38c9b4cc902c9
    Closes-Bug: #1842988
    Signed-off-by: Lucas Alvares Gomes <email address hidden>

Changed in neutron:
status: In Progress → Fix Released
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-2 → ussuri-3
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 16.0.0.0b1

This issue was fixed in the openstack/neutron 16.0.0.0b1 development milestone.

wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-3 → ussuri-rc3
tags: added: neutron-proactive-backport-potential
wes hayutin (weshayutin)
Changed in tripleo:
milestone: ussuri-rc3 → victoria-1
Changed in tripleo:
milestone: victoria-1 → victoria-3
Changed in tripleo:
milestone: victoria-3 → wallaby-1
Changed in tripleo:
milestone: wallaby-1 → wallaby-2
Marios Andreou (marios-b) wrote :

Clearing out old bugs. There has been no update here in a while, so I am going to move it to Fix Released for tripleo too. Please move it back if you disagree. Thanks.

Changed in tripleo:
status: New → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn queens-eol

This issue was fixed in the openstack/networking-ovn queens-eol release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/networking-ovn rocky-eol

This issue was fixed in the openstack/networking-ovn rocky-eol release.
