External connectivity broken because of stale FIP rule

Bug #1859887 reported by Mithil Arun
This bug affects 2 people
Affects  Status  Importance  Assigned to  Milestone
neutron  New     Undecided   Unassigned

Bug Description

I have seen a few occurrences of this issue: a VM has no floating IP (FIP) attached, but has a port on a tenant network that is connected to an external network via a router. I expect the VM to be able to reach the external network, but no traffic gets through.

On the VM:
--snip--
[root@bob-trove-1 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:97:b3:3b brd ff:ff:ff:ff:ff:ff
    inet 172.20.7.16/24 brd 172.20.7.255 scope global dynamic eth0
       valid_lft 68868sec preferred_lft 68868sec
[root@bob-trove-1 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         172.20.7.1      0.0.0.0         UG    100    0        0 eth0
169.254.169.254 172.20.7.1      255.255.255.255 UGH   100    0        0 eth0
172.20.2.192    0.0.0.0         255.255.255.192 U     100    0        0 eth0
172.20.5.192    0.0.0.0         255.255.255.192 U     100    0        0 eth0
172.20.6.0      0.0.0.0         255.255.255.192 U     100    0        0 eth0
172.20.6.64     0.0.0.0         255.255.255.192 U     100    0        0 eth0
172.20.7.0      0.0.0.0         255.255.255.0   U     100    0        0 eth0
--snip--

From the router namespace:
--snip--
root@kvm02:/# ip netns exec qrouter-ea187315-b0c7-4f2e-98e9-128a923fca4e ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: rfp-ea187315-b@if292: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 4e:54:d8:b1:6a:6d brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 169.254.114.242/31 scope global rfp-ea187315-b
       valid_lft forever preferred_lft forever
    inet6 fe80::4c54:d8ff:feb1:6a6d/64 scope link
       valid_lft forever preferred_lft forever
15636: qr-81061dca-85: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:94:27:37 brd ff:ff:ff:ff:ff:ff
    inet 192.0.3.1/24 brd 192.0.3.255 scope global qr-81061dca-85
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe94:2737/64 scope link
       valid_lft forever preferred_lft forever
15703: qr-41aba180-7f: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:a5:64:9c brd ff:ff:ff:ff:ff:ff
    inet 172.20.7.1/24 brd 172.20.7.255 scope global qr-41aba180-7f
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fea5:649c/64 scope link
       valid_lft forever preferred_lft forever
13957: qr-1408b658-c8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:ac:80:c4 brd ff:ff:ff:ff:ff:ff
    inet 172.20.6.1/26 brd 172.20.6.63 scope global qr-1408b658-c8
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:feac:80c4/64 scope link
       valid_lft forever preferred_lft forever
11146: qr-127e45c0-8d: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:82:03:97 brd ff:ff:ff:ff:ff:ff
    inet 172.20.5.193/26 brd 172.20.5.255 scope global qr-127e45c0-8d
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe82:397/64 scope link
       valid_lft forever preferred_lft forever
11147: qr-3ebb2a27-9a: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:cc:b9:95 brd ff:ff:ff:ff:ff:ff
    inet 172.20.2.193/26 brd 172.20.2.255 scope global qr-3ebb2a27-9a
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fecc:b995/64 scope link
       valid_lft forever preferred_lft forever
13970: qr-35480bae-20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether fa:16:3e:23:89:f3 brd ff:ff:ff:ff:ff:ff
    inet 172.20.6.65/26 brd 172.20.6.127 scope global qr-35480bae-20
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe23:89f3/64 scope link
       valid_lft forever preferred_lft forever
root@kvm02:/# ip netns exec qrouter-ea187315-b0c7-4f2e-98e9-128a923fca4e ip rule
0: from all lookup local
32766: from all lookup main
32767: from all lookup default
36707: from 172.20.7.5 lookup 16
36709: from 172.20.2.248 lookup 16
37304: from 172.20.7.56 lookup 16
46130: from 172.20.7.36 lookup 16
46133: from 172.20.5.223 lookup 16
46134: from 172.20.2.217 lookup 16
46138: from 172.20.2.245 lookup 16
54173: from 172.20.7.16 lookup 16
57482: from 172.20.5.252 lookup 16
62083: from 172.20.7.76 lookup 16
72399: from 172.20.7.80 lookup 16
72454: from 172.20.7.37 lookup 16
2886992577: from 172.20.2.193/26 lookup 2886992577
2886993345: from 172.20.5.193/26 lookup 2886993345
2886993409: from 172.20.6.1/26 lookup 2886993409
2886993473: from 172.20.6.65/26 lookup 2886993473
2886993665: from 172.20.7.1/24 lookup 2886993665
3221226009: from 192.0.2.25/24 lookup 3221226009
3221226241: from 192.0.3.1/24 lookup 3221226241
root@kvm02:/# ip netns exec qrouter-ea187315-b0c7-4f2e-98e9-128a923fca4e ip route show table 16
default via 169.254.114.243 dev rfp-ea187315-b
root@kvm02:/#
--snip--
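
A side note on reading the rule list above: the ten-digit priorities match the 32-bit integer value of the corresponding qr- interface address, which makes it easy to tell the per-subnet rules apart from the per-address "lookup 16" rules. A quick check in Python, using only values copied from the output (the helper name is made up for illustration):

```python
import ipaddress


def ipv4_as_int(address: str) -> int:
    """Return the 32-bit integer value of a dotted-quad IPv4 address."""
    return int(ipaddress.IPv4Address(address))


# (qr- interface address, rule priority) pairs copied from the output above.
observed = [
    ("172.20.2.193", 2886992577),
    ("172.20.7.1", 2886993665),
    ("192.0.3.1", 3221226241),
]

for address, priority in observed:
    # Each comparison shows whether the priority equals the address as an integer.
    print(address, ipv4_as_int(address) == priority)
```

The five-digit priorities (36707 through 72454) do not follow this pattern; they belong to the per-address rules that point at table 16.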

The VM does not have a FIP attached, but the router namespace has a rule (54173: from 172.20.7.16 lookup 16) that forwards traffic to the FIP namespace.
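
To check a router namespace for other leftovers of this kind, the "ip rule" output can be compared against the fixed IPs that really have a floating IP. The following is only a sketch: the function name is hypothetical, the table number 16 is taken from the output above, and the set of fixed IPs with a FIP is assumed to be collected separately (for example from the floating IP listing of the cloud).

```python
import re

# Matches lines such as "54173: from 172.20.7.16 lookup 16".
RULE_RE = re.compile(r"^(\d+):\s+from\s+(\S+)\s+lookup\s+(\S+)$")


def find_stale_fip_rules(ip_rule_output, fixed_ips_with_fip, fip_table="16"):
    """Return (priority, source) for per-address rules that point at
    fip_table although the source address has no floating IP."""
    stale = []
    for line in ip_rule_output.splitlines():
        match = RULE_RE.match(line.strip())
        if not match:
            continue
        priority, source, table = match.groups()
        # Skip rules for other tables, "from all" rules and subnet rules.
        if table != fip_table or source == "all" or "/" in source:
            continue
        if source not in fixed_ips_with_fip:
            stale.append((int(priority), source))
    return stale


# A few lines copied from the "ip rule" output in this report.
sample = """\
0: from all lookup local
32766: from all lookup main
36707: from 172.20.7.5 lookup 16
54173: from 172.20.7.16 lookup 16
2886993665: from 172.20.7.1/24 lookup 2886993665
"""

# Assumption for illustration: only 172.20.7.5 currently has a FIP.
print(find_stale_fip_rules(sample, {"172.20.7.5"}))
```

With that assumed input, only the rule for 172.20.7.16 is reported.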

Attaching a FIP gets the traffic flowing, but removing it puts it back in this state. The only way to recover is to delete this ip rule manually.
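
The exact command used for the manual cleanup is not given above. A plausible form, assuming standard iproute2 syntax and the router ID, address and priority from this report, can be built like this (the helper name is hypothetical, and the sketch only prints the command rather than running it):

```python
def build_rule_delete_cmd(router_id, source_ip, priority, table=16):
    """Build the argv that deletes one per-address rule in a router namespace."""
    return [
        "ip", "netns", "exec", f"qrouter-{router_id}",
        "ip", "rule", "del",
        "from", source_ip,
        "priority", str(priority),
        "table", str(table),
    ]


cmd = build_rule_delete_cmd(
    "ea187315-b0c7-4f2e-98e9-128a923fca4e", "172.20.7.16", 54173
)
print(" ".join(cmd))
```

The printed command would have to be run as root on the compute node that hosts the namespace (kvm02 in the output above). Specifying both the source address and the priority keeps the deletion limited to the one stale rule.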

Brian Haley (brian-haley) wrote :

What version of neutron are you running? Curious if this is fixed in a later release.

tags: added: l3-dvr-backlog
Changed in neutron:
status: New → Incomplete
Mithil Arun (arun-mithil) wrote :

I'm running Neutron on Rocky.

Changed in neutron:
status: Incomplete → New
Brian Haley (brian-haley) wrote :

Thanks for the info. Are you able to verify this is still an issue on the master branch? Or are you on the very latest of stable/rocky? It just seems like something that has been fixed already.

Mithil Arun (arun-mithil) wrote :

Unfortunately, I'm unable to upgrade my cluster to master or latest rocky immediately. I'm currently on stable/rocky off commit #56c070c5a37f06515c9330274ae12d87e7468421.

I walked through the other commits on latest stable/rocky and I see this commit that comes closest, which I am already running:

commit 9749fd270c1f7493fe4daf8b0e8412fcf0412184
Author: LIU Yulong <email address hidden>
Date: Mon Oct 8 14:52:16 2018 +0800

    Prevent create port forwarding to port which has binding fip

    For dvr scenario, if port has a bound floating, and then create
    port forwarding to it, this port forwarding will not work, due to
    the traffic is redirected to dvr rules.

    This patch restricts such API request, if user try to create port
    forwarding to a port, check if it has bound floating IP first.
    This will be run for all type of routers, since neutron should
    not let user to waste public IP address on a port which already
    has a floating IP, it can take care all the procotol port
    numbers.

    Conflicts:
        neutron/services/portforwarding/pf_plugin.py

    Closes-Bug: #1799137
    Change-Id: I4ba4b023d79185f8d478d60ce16417d3501bf785
    (cherry picked from commit b8d2ab8543a27b03bde534ef994027d9b44556c4)

Can you point me to a specific review/commit that you think fixes this? While upgrading might involve a lot of paperwork, I am able to apply a patch to see if that fixes things.

Brian Haley (brian-haley) wrote :

I can't find a particular commit; it just seemed like a race condition in the agent that we fixed.

Just so it's clear, in order to reproduce this do you have to:

1) associate a floating IP
2) disassociate that floating IP

In other words, was the rule created in 1) but not cleaned up in 2)?

If this is the case, can you attach relevant parts of the l3-agent.log for both operations for a new port? Especially look for any tracebacks. This would need to be done with debug enabled so all the messages are printed.

Thanks

Mithil Arun (arun-mithil) wrote :

No, we're not doing anything with floating IPs at all, as far as this VM is concerned. The VM is created with only a fixed IP from a tenant network that's attached to an external network via a router.

These are the only logs in l3-agent around the time the VM was created:
--snip--
2020-01-02 20:42:40.506 31081 INFO neutron.agent.l3.agent [req-ecc73975-7759-4f95-a973-4c13cf24526b 5d6706b2a83e4bd99aa785eb1a090e2f b5c8283145054ef38c79253112b21e09 - - -] Got routers updated notification :[u'fd1da85e-773f-4fab-926f-8560caa4578d']
2020-01-02 20:42:40.507 31081 INFO neutron.agent.l3.agent [-] Starting router update for fd1da85e-773f-4fab-926f-8560caa4578d, action 3, priority 1
--snip--

I'm wondering whether this could happen if the fixed IP now attached to the VM previously belonged to an older VM that had a floating IP associated with it. The ip rule could be left over from that previous association. Just a wild idea, given the lack of errors in the logs.
