DVR and floating IPs broken in latest 7.0.0.0rc1?

Bug #1792493 reported by Eric Miller
This bug affects 2 people
Affects         Status         Importance   Assigned to   Milestone
kolla-ansible   Invalid        Undecided    Unassigned
neutron         Fix Released   High         Unassigned

Bug Description

Kolla-Ansible 7.0.0.0rc1 with a binary image build (the source option is currently failing to build the ceilometer images) on CentOS 7.5 (latest updates)

What worked previously no longer appears to work. I'm not sure yet whether this is due to an update in CentOS 7.5, OVS, or something else, but compute nodes are no longer sending ARP replies to ARP requests for the floating IP.

For testing, I looked for the IP assigned to the fg interface in the FIP namespace (in my case, fg-ba492724-bd). This appears to be an IP on the ext-net network, but it is not the floating IP assigned to a VM. Let's call this address A.A.A.A and the floating IP B.B.B.B.
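
For anyone reproducing this, the fg interface can be located with something like the following (a rough sketch; the namespace name is assumed to follow the usual fip-<external network UUID> pattern, and the UUID below is only a placeholder):

# list the FIP namespaces on the compute node
ip netns list | grep fip-
# show the fg interface and its address inside that namespace
ip netns exec fip-<ext-net-uuid> ip addr show | grep -A 3 fg-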

When pinging each address from the Internet, I can tcpdump traffic on the compute node's physical port and see the ARP requests for both A.A.A.A and B.B.B.B, but no ARP replies.

I have attached a diagram showing what I believe to be the correct path for the packets.

There appears to be something broken between my two arrows.

Since tcpdump is not installed in the openvswitch_vswitchd container, nor is ovs-tcpdump, I can't figure out how to mirror and sniff ports on the br-ex and br-int bridges, at least in a containerized instance of OVS. If anyone knows a way to do this, I would really appreciate the help.
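
One possible workaround (a sketch only, using the standard OVS port-mirroring syntax; the "br-ex-snoop" name is made up) is to mirror the bridge to a dummy interface on the host and run tcpdump there:

# on the compute host: create a dummy interface to receive mirrored traffic
ip link add name br-ex-snoop type dummy
ip link set br-ex-snoop up

# attach it to br-ex and mirror all bridge traffic to it
# (prefix with "docker exec openvswitch_vswitchd" if ovs-vsctl is not on the host)
ovs-vsctl add-port br-ex br-ex-snoop
ovs-vsctl -- --id=@p get port br-ex-snoop \
          -- --id=@m create mirror name=snoop select-all=true output-port=@p \
          -- set bridge br-ex mirrors=@m

# sniff the mirrored traffic from the host
tcpdump -eni br-ex-snoop

# clean up afterwards
ovs-vsctl clear bridge br-ex mirrors
ovs-vsctl del-port br-ex br-ex-snoop
ip link del br-ex-snoop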

I haven't found any issues in the OVS configuration (ovs-vsctl show) - which matches the attached diagram.

Has anyone else had issues?

OVS returns this version info:
ovs-vsctl (Open vSwitch) 2.9.0
DB Schema 7.15.1

in case it helps.

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

It is interesting that OVS reports br-int's port for fg-ba492724-bd as down:

 7(fg-ba492724-bd): addr:00:00:00:00:00:00
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max

I can only assume this is not normal, right?

The fg-ba492724-bd interface in the FIP namespace is UP (public IP substituted with A.A.A.A):

26: fg-ba492724-bd: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:26:50:2d brd ff:ff:ff:ff:ff:ff
    inet A.A.A.A/24 brd A.A.A.255 scope global fg-ba492724-bd
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe26:502d/64 scope link
       valid_lft forever preferred_lft forever

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

I have a Queens environment that works, so I am comparing everything possible between it and the Rocky deployment.

Apparently, having the "fg" port down is normal (why, I have no idea). The working Queens system shows the same thing when running "ovs-ofctl show br-int" (output truncated to show only the fg port):

 6(fg-d57c7cc3-d3): addr:00:00:00:00:00:00
     config: PORT_DOWN
     state: LINK_DOWN
     speed: 0 Mbps now, 0 Mbps max

So far, everything in OVS and the network namespaces looks identical.

Also, the OVS version is identical (from "ovs-vsctl --version"):
ovs-vsctl (Open vSwitch) 2.9.0
DB Schema 7.15.1

The CentOS install on the Queens environment has a slightly older kernel:
working: Kernel: Linux 3.10.0-862.9.1.el7.x86_64
not working: Kernel: Linux 3.10.0-862.11.6.el7.x86_64

but the kernel version being the cause is only a guess - I'll continue to look for differences.

I'd greatly appreciate knowing whether anyone else has run into this issue.

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

One significant difference I see is the OVS flows created on the br-ex bridge:

Non-working system:

ovs-ofctl dump-flows br-ex

 cookie=0xcf59ae700b21fb77, duration=169.257s, table=0, n_packets=16, n_bytes=944, priority=4,in_port="phy-br-ex",dl_vlan=2 actions=strip_vlan,NORMAL

Working system:

ovs-ofctl dump-flows br-ex

 cookie=0xc68f93c404034a40, duration=2.224s, table=0, n_packets=0, n_bytes=0, priority=4,in_port="phy-br-ex",dl_vlan=2 actions=strip_vlan,NORMAL
 cookie=0xc68f93c404034a40, duration=24.644s, table=0, n_packets=307, n_bytes=28483, priority=2,in_port="phy-br-ex" actions=resubmit(,1)
 cookie=0xc68f93c404034a40, duration=24.977s, table=0, n_packets=1, n_bytes=60, priority=0 actions=NORMAL
 cookie=0xc68f93c404034a40, duration=24.641s, table=0, n_packets=5719479, n_bytes=630280414, priority=1 actions=resubmit(,3)
 cookie=0xc68f93c404034a40, duration=24.638s, table=1, n_packets=307, n_bytes=28483, priority=0 actions=resubmit(,2)
 cookie=0xc68f93c404034a40, duration=24.634s, table=2, n_packets=307, n_bytes=28483, priority=2,in_port="phy-br-ex" actions=drop
 cookie=0xc68f93c404034a40, duration=24.545s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:09:b2:a0 actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.510s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:44:65:55 actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.483s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:46:ad:5b actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.453s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:6b:a2:0b actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.428s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:7f:62:05 actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.400s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:9c:21:19 actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.375s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:b0:61:1c actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.351s, table=3, n_packets=0, n_bytes=0, priority=2,dl_src=fa:16:3f:f5:d2:d0 actions=output:"phy-br-ex"
 cookie=0xc68f93c404034a40, duration=24.631s, table=3, n_packets=5719479, n_bytes=630280414, priority=1 actions=NORMAL

These flows appear to be created by the "neutron_openvswitch_agent" container. So, maybe something isn't working in the Rocky version of this container?
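
To check the agent side of this, something like the following may help (the Kolla log path is an assumption based on the other Kolla log paths above):

# look for errors or a restart/resync loop in the agent
docker logs --tail 100 neutron_openvswitch_agent
tail -f /var/log/kolla/neutron/neutron-openvswitch-agent.log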

I'll keep looking...

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

I looked at the openvswitch log file here (from inside the openvswitch_vswitchd container):
/var/log/kolla/openvswitch/ovs-vswitchd.log

while restarting the neutron_openvswitch_agent container. It appears that the Rocky version has continuous re-connections like this (after adding a few flows to the bridges):

2018-09-14T22:43:48.311Z|63850|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-14T22:43:49.309Z|63851|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connecting...
2018-09-14T22:43:49.310Z|63852|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-09-14T22:43:49.311Z|63853|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-14T22:43:55.310Z|63854|rconn|INFO|br-prov002<->tcp:127.0.0.1:6633: connected
2018-09-14T22:43:55.311Z|63855|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-14T22:43:56.309Z|63856|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connecting...
2018-09-14T22:43:56.310Z|63857|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-09-14T22:43:56.311Z|63858|rconn|INFO|br-prov002<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-14T22:43:57.310Z|63859|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connected
2018-09-14T22:43:57.311Z|63860|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-14T22:43:58.308Z|63861|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connecting...
2018-09-14T22:43:58.310Z|63862|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-09-14T22:43:58.311Z|63863|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connection closed by peer

Right after restart of the neutron_openvswitch_agent container, the log entry for br-ex flows added shows:

2018-09-14T22:39:43.311Z|63573|connmgr|INFO|br-ex<->tcp:127.0.0.1:6633: 2 flow_mods in the 3 s starting 5 s ago (1 adds, 1 deletes)

That's only a single flow added.

Note that we have two provider networks (prov001 and prov002) in addition to the external network (br-ex).

This is what the working system log looks like - with no reconnections at all:

2018-09-14T22:38:38.455Z|00290|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connected
2018-09-14T22:38:38.455Z|00291|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
2018-09-14T22:38:38.455Z|00292|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connected
2018-09-14T22:38:38.458Z|00293|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-09-14T22:38:48.636Z|00294|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 60 flow_mods in the 9 s starting 10 s ago (46 adds, 14 deletes)
2018-09-14T22:38:49.188Z|00295|connmgr|INFO|br-prov001<->tcp:127.0.0.1:6633: 15 flow_mods 10 s ago (15 adds)
2018-09-14T22:38:49.249Z|00296|connmgr|INFO|br-ex<->tcp:127.0.0.1:6633: 16 flow_mods in the 4 s starting 10 s ago (16 adds)
2018-09-14T22:38:49.327Z|00297|connmgr|INFO|br-tun<->tcp:127.0.0.1:6633: 47 flow_mods in the 9 s starting 10 s ago (47 adds)
2018-09-14T22:39:48.635Z|00298|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 22 flow_mods in the 1 s starting 59 s ago (8 adds, 14 deletes)
2018-09-14T22:39:49.328Z|00299|connmgr|INFO|br-tun<->tcp:127.0.0.1:6633: 14 flow_mods in the 1 s starting 58 s ago (14 adds)

For br-ex, 16 flows were added, as opposed to 1 flow for Rocky.

I noticed that our Queens ...


Revision history for this message
Eric Miller (erickmiller) wrote :

Rocky has been re-deployed with only br-ex and prov001 external bridges and the same problem is occurring.

The ovs-vswitchd.log continues to show reconnections over and over (in the openvswitch_vswitchd container):

tail -f /var/log/kolla/openvswitch/ovs-vswitchd.log

2018-09-15T02:22:43.305Z|13561|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-15T02:22:44.303Z|13562|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connecting...
2018-09-15T02:22:44.304Z|13563|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connected
2018-09-15T02:22:44.305Z|13564|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-15T02:22:45.303Z|13565|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connecting...
2018-09-15T02:22:45.304Z|13566|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-09-15T02:22:45.305Z|13567|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-15T02:22:46.303Z|13568|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connecting...
2018-09-15T02:22:46.307Z|13569|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connected
2018-09-15T02:22:46.308Z|13570|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connection closed by peer
2018-09-15T02:22:47.303Z|13571|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connecting...
2018-09-15T02:22:47.304Z|13572|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-09-15T02:22:47.305Z|13573|rconn|INFO|br-prov001<->tcp:127.0.0.1:6633: connection closed by peer

So it definitely appears there is something broken in Rocky.

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

Just a quick note that kernel version doesn't appear to matter. My latest deploy was with no yum updates, so the kernel was:
Kernel: Linux 3.10.0-862.el7.x86_64

and the problem still occurs.

I am removing some things, including any Octavia configuration and the provider network associated with it, and redeploying, just to see if something in our configuration is causing this. A redeploy here means starting from bare metal, so we're re-imaging the physical servers with CentOS 7.5 (Minimal 1804 ISO).

I'll report back...

Revision history for this message
Eric Miller (erickmiller) wrote :

Ok, after simplifying the deployment (as I mentioned in the previous message), DVR is working as expected with floating IPs.

One interesting problem I had was iptables not allowing return traffic that was to be DNAT'd from the public IP back to the private IP (in the qrouter namespace). I compared everything between our working Queens configuration and the non-working Rocky configuration, and there was no difference in the iptables rules.

So, I did a yum update (remember, this was running CentOS 7.5 Minimal 1804 with no updates), and all of a sudden iptables works. I had been chasing that issue for a while.

Now, on to the original issue, which I suspect is related to provider networks or our Octavia configuration...

Revision history for this message
Eric Miller (erickmiller) wrote :

I believe I have narrowed it down to adding a provider network in the globals.yml file such as:

neutron_bridge_name: "br-ex,br-prov001"
neutron_external_interface: "eth0,eth1"

When these are set as follows, DVR floating IPs work:

neutron_bridge_name: "br-ex"
neutron_external_interface: "eth0"

I'm going to do some further testing to see if I can find out why, but I wanted to point out the issue in case anyone already knows about the problem and why it is happening.

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

I have done additional testing and the provider network configuration is the cause:

This works (floating IPs respond on the ext-net/physnet1 physical interface and SNAT traffic works as expected):
neutron_bridge_name: "br-ex"
neutron_external_interface: "eth0"

This does not work (floating IPs do NOT respond on the ext-net/physnet1 physical interface and SNAT traffic does NOT work as expected):
neutron_bridge_name: "br-ex,br-prov001"
neutron_external_interface: "eth0,eth1"

This prevents Octavia from working since it requires a provider network for communication between amphorae VMs and the Octavia worker service.

Note that there are no issues in a Queens deployment with Kolla-Ansible 6.1.0; this problem only occurs in the latest Rocky release candidate of Kolla-Ansible (7.0.0.0rc1).

Eric

Revision history for this message
Eric Miller (erickmiller) wrote :

I think I may have found the issue.

Note that I wasn't sure what other project/sub-project to add this ticket to, so I will add it to the Neutron project.

We use team interfaces with VLANs, where every VLAN sub-interface shares the same MAC address (see the bottom of this note for "ip a" output showing 3 VLANs on a team interface).

It appears that the OVS DVR Neutron agent assumes that MAC addresses will be unique across interfaces - at least as far as I can tell here (note that I am not a Python programmer):
https://github.com/openstack/neutron/blob/317cdbf40850964080a0a30d6212a3b536df1caa/neutron/plugins/ml2/drivers/openvswitch/agent/ovs_dvr_neutron_agent.py

Specifically look at the "registered_dvr_macs" variable, which is used in a few places, including the _add_dvr_mac method.

So, instead of keying on interface names or IDs, the code appears to assume that a MAC address is unique, but it is not unique at all, especially in VLAN configurations (which are obviously very common).

Anyone know if my theory is right?

Btw, the fact that it worked on Queens is likely because that installation did not use team interfaces; it was installed in VMs with virtual NICs.

Eric

10: team0.1000@team0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master ovs-system state UP group default qlen 1000
    link/ether ec:0d:9a:d9:09:16 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ee0d:9aff:fed9:916/64 scope link
       valid_lft forever preferred_lft forever
11: team0.1001@team0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ec:0d:9a:d9:09:16 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ee0d:9aff:fed9:916/64 scope link
       valid_lft forever preferred_lft forever
12: team0.1002@team0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ec:0d:9a:d9:09:16 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::ee0d:9aff:fed9:916/64 scope link
       valid_lft forever preferred_lft forever
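
A quick way to spot this kind of duplication across all interfaces (a simple one-liner, nothing specific to this deployment):

# count how many interfaces share each MAC; anything greater than 1 is suspect
ip -o link | grep -o 'link/ether [0-9a-f:]*' | sort | uniq -c | sort -rn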

Revision history for this message
Eric Miller (erickmiller) wrote :

It appears my theory was correct. I wrote some code to change the ifcfg files so that the MAC addresses of all VLAN interfaces are unique, and after a re-deploy, floating IPs work as expected, even with a provider network defined.
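
For illustration only (not the exact script used), the per-VLAN override looks roughly like this with iproute2; the persistent equivalent goes into the corresponding ifcfg files:

# assign a locally administered, unique MAC per VLAN interface (example values)
ip link set dev team0.1000 address 02:0d:9a:d9:09:16
ip link set dev team0.1001 address 02:0d:9a:d9:09:17
ip link set dev team0.1002 address 02:0d:9a:d9:09:18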

Eric

Miguel Lavalle (minsel)
tags: added: l3-dvr-backlog
Changed in neutron:
importance: Undecided → High
Boden R (boden)
Changed in neutron:
status: New → Triaged
Revision history for this message
Swaminathan Vasudevan (swaminathan-vasudevan) wrote :

The unique MAC address is only used for the DVR host MAC. We always have the same MAC for the distributed router ports, so I'm not sure why this should be an issue. I am assuming the problem is with respect to your external bridge configuration. Can you confirm?

Revision history for this message
Eric Miller (erickmiller) wrote :

It is only an issue with respect to the external bridge configuration, which includes ext-net for floating IPs. If multiple bridges are defined, such as br-ex (ext-net), br-prov001, br-prov002, etc., all associated with VLANs on the same physical interface or a team/bond interface, the same MAC address is given to all of these physical interfaces, and this appears to be where the problem lies. Once the MAC addresses were forcibly changed to be unique per VLAN interface, the external bridges have worked perfectly.

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

I think this error is caused by the issue described in [1]. If several physical bridges are attached to interfaces with the same MAC, their datapath_id will be the same. When using the native OpenFlow interface, all datapath IDs must be different.

Having OVS bridges with the same datapath_id while using the native interface will provoke the behaviour described in [2].

This problem was solved in https://review.openstack.org/#/q/379a9faf6206039903555ce7e3fc4221e5f06a7a

[1] https://bugs.launchpad.net/neutron/+bug/1697243
[2] https://bugs.launchpad.net/neutron/+bug/1792493/comments/6
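
For anyone hitting this, the collision can be confirmed (and, if needed, worked around) with standard ovs-vsctl commands; the bridge names below are the ones from this report:

# check whether both bridges ended up with the same datapath ID
ovs-vsctl get bridge br-ex datapath_id
ovs-vsctl get bridge br-prov001 datapath_id

# optional workaround: force a unique 16-hex-digit datapath ID on one bridge
ovs-vsctl set bridge br-prov001 other-config:datapath-id=0000aabbccddee01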

Revision history for this message
LIU Yulong (dragon889) wrote :

According to Rodolfo's explanation, we can close this bug.

Changed in neutron:
status: Triaged → Fix Released
Changed in kolla-ansible:
status: New → Invalid