Floating IPs not removed on rfp interface in qrouter

Bug #1675187 reported by Arjun Baindur
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Brian Haley

Bug Description

We recently upgraded from Liberty to Newton, with a lot of active floating IPs configured at the time.

This is a DVR setup, with 2 VMs on the same network, both with floating IPs: FIPA and FIPB.

 - ssh into both FIPA and FIPB works from an external source such as a laptop
 - ssh from one VM into the other works via the internal fixed IP ONLY (for example, ssh into the floating IP of VM A, then ssh into the fixed IP of VM B, or vice versa)
 - ping from one VM to the floating IP of the other *appears* to work, but the pings continued even after the target VM was deleted. I suspect the rfp interface is responding to the ICMP itself, since it has the FIP address configured

I noticed that the qrouter namespace contained several /32 FIP addresses configured on the rfp interface, but newly created floating IPs were not being added as secondary IP addresses.

I suspect the fix for bug https://bugs.launchpad.net/neutron/+bug/1462154 is responsible. Note: that fix was backported into Mitaka, so the same issue should exist for a Liberty->Mitaka upgrade.

It removed the logic to both add and remove floating IPs on the rfp interface - the add/remove_floating_ip logic now only handles the routes in the fip namespace, the source IP rules in qrouter, and so on. The new agent is basically unaware of any floating IPs configured on the rfp interface.
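As a quick way to check whether a given router is affected, something along these lines works (a sketch only, using the router and interface IDs from this report; it needs root, and the exact rule layout varies by release):

```shell
# Hypothetical check, using the IDs from this report. Post-Mitaka DVR keeps
# per-FIP state as ip rules and routes, so the rfp- device should only carry
# its 169.254.x.x/31 link address; any /32 on it is a pre-Mitaka leftover.
NS=qrouter-37176403-cfb0-478d-b51c-971d89597cf5
DEV=rfp-37176403-c

# Per-FIP source rules the current agent maintains in the qrouter namespace:
ip netns exec "$NS" ip rule show

# Stale /32 addresses left behind by the old code, if any:
ip netns exec "$NS" ip -4 addr show dev "$DEV" | grep '/32' \
    || echo "no stale FIPs on $DEV"
```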

So it seems any pre-existing FIPs added as secondary IP addresses in qrouter remain as zombies. Attaching/reattaching, or deleting and re-creating, does not remove these IPs, since the removal logic is also gone! Please see: https://review.openstack.org/#/c/289172/13/neutron/agent/l3/dvr_local_router.py

As you can see below, there are a lot of FIPs on the 10.4.0.0/16 external network. Some correspond to VMs/floating IPs that were deleted; others are still active but experiencing the issue described above.

Manually removing the IP immediately fixed our issue (ssh to the floating IP from VM to VM worked immediately).
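The manual workaround can be scripted roughly as follows (a sketch only - needs root, and assumes the /31 link address is the one legitimate address on the device):

```shell
# Sketch of the manual cleanup: delete every /32 address from the rfp-
# device in the qrouter namespace. The 169.254.31.142/31 link address is
# left alone because only /32 entries are matched.
NS=qrouter-37176403-cfb0-478d-b51c-971d89597cf5
DEV=rfp-37176403-c

ip netns exec "$NS" ip -4 addr show dev "$DEV" \
    | awk '$1 == "inet" && $2 ~ /\/32$/ {print $2}' \
    | while read -r cidr; do
        echo "removing stale FIP $cidr"
        ip netns exec "$NS" ip addr del "$cidr" dev "$DEV"
      done
```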

root@barney:~# ip netns exec qrouter-37176403-cfb0-478d-b51c-971d89597cf5 ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: rfp-37176403-c: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 76:2f:be:73:9b:fc brd ff:ff:ff:ff:ff:ff
    inet 169.254.31.142/31 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.253.15/32 brd 10.4.253.15 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.253.118/32 brd 10.4.253.118 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.252.103/32 brd 10.4.252.103 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.252.105/32 brd 10.4.252.105 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.254.4/32 brd 10.4.254.4 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.254.52/32 brd 10.4.254.52 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.253.41/32 brd 10.4.253.41 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.254.228/32 brd 10.4.254.228 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.254.229/32 brd 10.4.254.229 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.253.212/32 brd 10.4.253.212 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.252.118/32 brd 10.4.252.118 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet 10.4.252.48/32 brd 10.4.252.48 scope global rfp-37176403-c
       valid_lft forever preferred_lft forever
    inet6 fe80::742f:beff:fe73:9bfc/64 scope link
       valid_lft forever preferred_lft forever
11: qr-43981e59-30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:42:a2:ec brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.1/16 brd 172.16.255.255 scope global qr-43981e59-30
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe42:a2ec/64 scope link
       valid_lft forever preferred_lft forever
40083: qr-6b09eb9d-40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:21:f8:fd brd ff:ff:ff:ff:ff:ff
    inet 192.168.42.100/24 brd 192.168.42.255 scope global qr-6b09eb9d-40
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fe21:f8fd/64 scope link
       valid_lft forever preferred_lft forever
46229: qr-39850654-40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
    link/ether fa:16:3e:f5:06:20 brd ff:ff:ff:ff:ff:ff
    inet 10.127.0.1/16 brd 10.127.255.255 scope global qr-39850654-40
       valid_lft forever preferred_lft forever
    inet6 fe80::f816:3eff:fef5:620/64 scope link
       valid_lft forever preferred_lft forever

Revision history for this message
Arjun Baindur (abaindur) wrote :

With the floating IPs remaining on the rfp- interface, this is definitely an issue. When a VM pings another floating IP, the reply is NOT from the target VM - the qrouter namespace is actually responding locally.

VM A, with FIP 10.4.253.32 and fixed IP 172.16.198.250, tries to ping VM B (10.4.254.43), whose FIP I manually removed from the rfp device. I see the ping being NATed and going out rfp, as well as 2 ICMP replies:

root@barney:~# ip netns exec qrouter-37176403-cfb0-478d-b51c-971d89597cf5 tcpdump -l -evvvnn -i any host 10.4.254.43

14:50:38.404521 In fa:16:3e:82:08:39 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 64, id 31859, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.198.250 > 10.4.254.43: ICMP echo request, id 26843, seq 11, length 64
14:50:38.404573 Out 76:2f:be:73:9b:fc ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 63, id 31859, offset 0, flags [DF], proto ICMP (1), length 84)
    10.4.254.32 > 10.4.254.43: ICMP echo request, id 26843, seq 11, length 64
14:50:38.404590 In b2:63:de:da:b5:48 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 31859, offset 0, flags [DF], proto ICMP (1), length 84)
    10.4.254.32 > 10.4.254.43: ICMP echo request, id 26843, seq 11, length 64
14:50:38.404904 Out 76:2f:be:73:9b:fc ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 63, id 16514, offset 0, flags [none], proto ICMP (1), length 84)
    10.4.254.43 > 10.4.254.32: ICMP echo reply, id 26843, seq 11, length 64
14:50:38.404930 In b2:63:de:da:b5:48 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 62, id 16514, offset 0, flags [none], proto ICMP (1), length 84)
    10.4.254.43 > 10.4.254.32: ICMP echo reply, id 26843, seq 11, length 64
14:50:38.404939 Out fa:16:3e:42:a2:ec ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 61, id 16514, offset 0, flags [none], proto ICMP (1), length 84)
    10.4.254.43 > 172.16.198.250: ICMP echo reply, id 26843, seq 11, length 64

When I ping VM C, 10.4.253.41, which still has its IP on the rfp device, it looks like qrouter is generating the ICMP reply internally. This explains why ping appears to work but ssh does not:

root@barney:~# ip netns exec qrouter-37176403-cfb0-478d-b51c-971d89597cf5 tcpdump -l -evvvnn -i any host 10.4.253.41

14:48:16.131074 In fa:16:3e:82:08:39 ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 64, id 40544, offset 0, flags [DF], proto ICMP (1), length 84)
    172.16.198.250 > 10.4.253.41: ICMP echo request, id 24022, seq 10, length 64
14:48:16.131134 Out fa:16:3e:42:a2:ec ethertype IPv4 (0x0800), length 100: (tos 0x0, ttl 64, id 5429, offset 0, flags [none], proto ICMP (1), length 84)
    10.4.253.41 > 172.16.198.250: ICMP echo reply, id 24022, seq 10, length 64

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Not sure if it's a DVR-only issue (I have seen some connectivity issues via FIP from its own instance in fullstack environments lately), so I added the general l3 tag too.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
tags: added: l3-dvr-backlog
tags: added: l3-ipam-dhcp
tags: added: needs-attention
Revision history for this message
Brian Haley (brian-haley) wrote :

This looks specific to DVR, since the legacy code is the only place where we'd actually add the IP to a device; DVR just adds rules and routes now.

I'll have to trace the code, since on restart of the l3-agent we should be going through code like add_floating_ip(), where it could make sure to remove any remnants; or there might be a better place that is called only once at startup. Xagent - I remember you saying on IRC that you were playing with some patches; feel free to share them if you found something that works.

Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
Revision history for this message
Arjun Baindur (abaindur) wrote :

Yeah, I added the following to the scan_fip_ports() function in dvr_fip_ns.py:

            existing_cidrs = [addr['cidr'] for addr in device.addr.list()]
            fip_cidrs = [c for c in existing_cidrs if
                         common_utils.is_cidr_host(c)]
            for fip_cidr in fip_cidrs:
                device.delete_addr_and_conntrack_state(fip_cidr)

That is inside the if device.exists() block. I'm not sure that's really the right place to add it - this technically needs to be done just once.

On the other hand, it looks like this function returns early whenever it has a fip_count > 0, and only gets invoked at startup or when a router is first created/updated with a gateway port. So maybe this is the best place to leave it.
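A self-contained sketch of the same filtering idea, using the stdlib ipaddress module in place of neutron's common_utils.is_cidr_host() helper (the function name here is mine, not neutron's):

```python
import ipaddress

def stale_fip_cidrs(existing_cidrs):
    """Return the host-route CIDRs (/32 for IPv4, /128 for IPv6) from a
    device's address list - the same test is_cidr_host() makes: the prefix
    length equals the full address width."""
    stale = []
    for cidr in existing_cidrs:
        net = ipaddress.ip_network(cidr, strict=False)
        if net.prefixlen == net.max_prefixlen:
            stale.append(cidr)
    return stale

# The /31 rfp link address is kept; the /32 FIP leftovers are flagged
# for deletion (and conntrack cleanup) on the device:
addrs = ["169.254.31.142/31", "10.4.253.15/32", "10.4.253.118/32"]
print(stale_fip_cidrs(addrs))  # -> ['10.4.253.15/32', '10.4.253.118/32']
```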

tags: removed: needs-attention
tags: added: newton-backport-potential ocata-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/451859

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/451859
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f84ce9a57ceaf6f931ca51e028bac64c93835d91
Submitter: Jenkins
Branch: master

commit f84ce9a57ceaf6f931ca51e028bac64c93835d91
Author: Brian Haley <email address hidden>
Date: Wed Mar 29 16:44:57 2017 -0400

    Remove stale floating IP addresses from rfp devices

    Old versions of the DVR code used to configure floating
    IP addresses directly on the rfp-* devices in the
    qrouter namespace. This was changed in Mitaka to use
    routes, but these stale floating IP addresses were
    never cleaned-up when the l3-agent was restarted.

    Add a check for these addresses and remove them at
    startup if they exist. This can be removed in a cycle
    since once they are cleaned they will not be added
    back by the agent.

    Change-Id: Id512d213cd7ee11da913a4e4b0da20c3ad5420b0
    Closes-bug: #1675187

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/453663

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/453664

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b1

This issue was fixed in the openstack/neutron 11.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/453663
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=0d1387c418c9540e28d977ff7d77f0cb7e8eb8b8
Submitter: Jenkins
Branch: stable/ocata

commit 0d1387c418c9540e28d977ff7d77f0cb7e8eb8b8
Author: Brian Haley <email address hidden>
Date: Wed Mar 29 16:44:57 2017 -0400

    Remove stale floating IP addresses from rfp devices

    Old versions of the DVR code used to configure floating
    IP addresses directly on the rfp-* devices in the
    qrouter namespace. This was changed in Mitaka to use
    routes, but these stale floating IP addresses were
    never cleaned-up when the l3-agent was restarted.

    Add a check for these addresses and remove them at
    startup if they exist. This can be removed in a cycle
    since once they are cleaned they will not be added
    back by the agent.

    Change-Id: Id512d213cd7ee11da913a4e4b0da20c3ad5420b0
    Closes-bug: #1675187
    (cherry picked from commit f84ce9a57ceaf6f931ca51e028bac64c93835d91)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/453664
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a1186b9ddcc111a27a11a61b4e39101f9c80e67d
Submitter: Jenkins
Branch: stable/newton

commit a1186b9ddcc111a27a11a61b4e39101f9c80e67d
Author: Brian Haley <email address hidden>
Date: Wed Mar 29 16:44:57 2017 -0400

    Remove stale floating IP addresses from rfp devices

    Old versions of the DVR code used to configure floating
    IP addresses directly on the rfp-* devices in the
    qrouter namespace. This was changed in Mitaka to use
    routes, but these stale floating IP addresses were
    never cleaned-up when the l3-agent was restarted.

    Add a check for these addresses and remove them at
    startup if they exist. This can be removed in a cycle
    since once they are cleaned they will not be added
    back by the agent.

    Change-Id: Id512d213cd7ee11da913a4e4b0da20c3ad5420b0
    Closes-bug: #1675187
    (cherry picked from commit f84ce9a57ceaf6f931ca51e028bac64c93835d91)

tags: added: in-stable-newton
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.4.0

This issue was fixed in the openstack/neutron 9.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.2

This issue was fixed in the openstack/neutron 10.0.2 release.
