Gratuitous ARP for floating IPs not so gratuitous

Bug #1715734 reported by James Denton
This bug affects 3 people
Affects: neutron
Status: Invalid
Importance: Medium
Assigned to: James Denton

Bug Description

OpenStack Release: Newton
OS: Ubuntu 16.04 LTS

When working in an environment with multiple application deployments that build up and tear down routers and floating IPs, we have observed that connectivity to new instances using recycled floating IPs may be impacted.

In this environment, the external provider network is connected to a Cisco Nexus 7010 with a default ARP cache timeout of 1500 seconds. We have observed that the L3 agent sends out the following arpings when floating IPs are assigned:

2017-09-07 16:57:17.396 13048 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-A', '-I', 'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.36'] create_process /openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2017-09-07 16:57:19.644 13048 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.29'] create_process /openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89
2017-09-07 16:57:19.913 13048 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/openstack/venvs/neutron-r14.1.0/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'ip', 'netns', 'exec', 'qrouter-3429e5bc-b41b-46fb-9bef-6ec6ccf0d3b6', 'arping', '-U', '-I', 'qg-6582bfec-7d', '-c', '1', '-w', '1.5', '172.29.77.44'] create_process /openstack/venvs/neutron-r14.1.0/lib/python2.7/site-packages/neutron/agent/linux/utils.py:89

Here's the respective packet capture:

18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 tell 172.29.77.39, length 28
18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 tell 172.29.77.39, length 28
18:09:06.366085 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 tell 172.29.77.39, length 28

The source address in all of those ARP requests is 172.29.77.39, the primary IP address on the qg interface. As a result, the ARP entry for the recycled floating IPs on the Nexus is not refreshed and remains stale. For the gratuitous ARP to be successful, the source IP needs to be changed to the respective floating IP, so that the source and destination IPs are the same. The following code change was made in ip_lib.py:

FROM:
arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
              # Pass -w to set timeout to ensure exit if interface
              # removed while running
              '-w', 1.5, address]

TO:
arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
              # Pass -w to set timeout to ensure exit if interface
              # removed while running
              '-w', 1.5, '-S', address, address]

With that change in place, the following packet capture reflects the new behavior:

18:10:30.389966 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.36 tell 172.29.77.36, length 28
18:10:30.390068 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.29 tell 172.29.77.29, length 28
18:10:30.390143 fa:16:3e:99:af:5c > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.44 tell 172.29.77.44, length 28

Since making the change, we have not had a failed deployment and all recycled floating IPs appear to be reachable immediately.
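
For reference, here is a minimal Python sketch of what "source and destination IPs are the same" means on the wire. It is an illustration only, not code from Neutron or arping; the helper names build_gratuitous_arp and is_gratuitous are hypothetical, and the layout is the standard 28-byte ARP payload for Ethernet/IPv4 (matching the "length 28" in the captures above).

```python
import socket
import struct

ARP_REQUEST = 1
ARP_REPLY = 2


def build_gratuitous_arp(mac: str, ip: str, op: int = ARP_REQUEST) -> bytes:
    """Return a 28-byte ARP payload announcing that `ip` is at `mac`.

    Hypothetical helper for illustration; it only builds bytes and does
    not send anything.
    """
    sender_mac = bytes.fromhex(mac.replace(":", ""))
    sender_ip = socket.inet_aton(ip)
    # Gratuitous ARP: the target IP is the same as the sender IP.
    target_ip = sender_ip
    # Broadcast target MAC, as tcpdump shows for the -U captures.
    target_mac = b"\xff" * 6
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,       # hardware type: Ethernet
        0x0800,  # protocol type: IPv4
        6,       # hardware address length
        4,       # protocol address length
        op,      # 1 = request, 2 = reply
        sender_mac,
        sender_ip,
        target_mac,
        target_ip,
    )


def is_gratuitous(payload: bytes) -> bool:
    """True when the sender and target protocol addresses match."""
    sender_ip = payload[14:18]
    target_ip = payload[24:28]
    return sender_ip == target_ip
```

A request built for 172.29.77.36 this way corresponds to "who-has 172.29.77.36 tell 172.29.77.36". Swapping the sender IP for the primary address 172.29.77.39 gives the shape seen in the first capture, which is_gratuitous rejects.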

Revision history for this message
Michael Johnson (johnsom) wrote :

Instead of setting a source, why aren't you using the -U flag? It's missing from that command line.

This:
arping_cmd = ['arping', arg, '-I', iface_name, '-c', 1,
              # Pass -w to set timeout to ensure exit if interface
              # removed while running
              '-w', 1.5, address]

Does not generate a gratuitous ARP.

Revision history for this message
Brian Haley (brian-haley) wrote :

I'm surprised '-S' worked as arping uses '-s' but maybe that was just a typo.

Also, this must be non-DVR as with my DVR setup I do see the source IP same as the destination IP.
I tested w/DVR and didn't notice any difference in the packets.

Do you plan on creating a patch?

Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
James Denton (james-denton) wrote :

@Michael

In this environment, -A and -U look the same:

ip netns exec qrouter-cebf978f-e8c9-4c50-b983-c263d20ab5c6 arping -A -I qg-d8cb3fb7-48 -c 1 -w 1.5 172.29.77.57
ARPING 172.29.77.57
Timeout

--- 172.29.77.57 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)

20:59:55.376150 fa:16:3e:82:94:ef > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.57 tell 172.29.77.22, length 28

root@543232-infra03-neutron-agents-container-6e46fcb9:~# ip netns exec qrouter-cebf978f-e8c9-4c50-b983-c263d20ab5c6 arping -U -I qg-d8cb3fb7-48 -c 1 -w 1.5 172.29.77.57
ARPING 172.29.77.57
Timeout

--- 172.29.77.57 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)

21:00:55.460283 fa:16:3e:82:94:ef > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.57 (ff:ff:ff:ff:ff:ff) tell 172.29.77.22, length 28

Both send a broadcast and are originating from the primary interface IP. It looks like the code supports both -U and -A, and may execute both. I think the issue is that the address on the interface is a /32 address and normally wouldn't be used for locally-originated outbound traffic.

According to the man page, -s can be used to specify source MAC, and -S can be used to specify source IP. Using -S we can see traffic sourced from the floating IP:

root@543232-infra03-neutron-agents-container-6e46fcb9:~# ip netns exec qrouter-cebf978f-e8c9-4c50-b983-c263d20ab5c6 arping -U -I qg-d8cb3fb7-48 -c 1 -w 1.5 -S 172.29.77.57 172.29.77.57
ARPING 172.29.77.57
Timeout

--- 172.29.77.57 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)

21:04:01.728458 fa:16:3e:82:94:ef > ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 103, p 0, ethertype ARP, Request who-has 172.29.77.57 (ff:ff:ff:ff:ff:ff) tell 172.29.77.57, length 28

----------

@Brian - Yes, this is non-DVR.

>> I do see the source IP same as the destination IP.

Perhaps floating IPs are implemented differently in fip namespaces vs qrouter namespaces? I don't have an OVS environment handy to find out.

Changed in neutron:
assignee: nobody → James Denton (james-denton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/502054

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
James Denton (james-denton) wrote :

Further investigation shows there is some inconsistency between Linux distributions with regard to the version of arping that may be installed. Ubuntu uses an arping that supports -s (MAC)/-S (IP), while openSUSE and CentOS appear to use the iputils-arping version, where -s sets the source IP.

What I have found is that the behavior differs between those versions, even with the same arguments. In this openSUSE environment, here is the interface that mimics what a router interface looks like with a floating IP:

3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 08:00:27:23:3c:64 brd ff:ff:ff:ff:ff:ff
    inet 10.254.254.11/24 brd 10.254.254.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet 10.254.254.89/32 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe23:3c64/64 scope link
       valid_lft forever preferred_lft forever

In these examples, I've mimicked the existing arping syntax Neutron uses. The following command with openSUSE generates an unsolicited ARP REPLY:

linux-p625:/home/jdenton # arping -A -I eth1 -c 1 -w 1.5 10.254.254.89
ARPING 10.254.254.89 from 10.254.254.89 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)

09:28:24.842357 08:00:27:23:3c:64 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Reply 10.254.254.89 is-at 08:00:27:23:3c:64, length 28

While the -U flag results in an ARP REQUEST (using the /32 as the source):

linux-p625:/home/jdenton # arping -U -I eth1 -c 1 -w 1.5 10.254.254.89
ARPING 10.254.254.89 from 10.254.254.89 eth1
Sent 1 probes (1 broadcast(s))
Received 0 response(s)

09:31:38.674689 08:00:27:23:3c:64 > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.254.254.89 (ff:ff:ff:ff:ff:ff) tell 10.254.254.89, length 28

In either case, the neighbor table ought to be updated appropriately. The behavior on CentOS was exactly the same.

In Ubuntu 16.04 LTS the behavior is as described in the initial bug report. Both the -A and -U flags result in ARP REQUESTS, and both are sourced from the primary interface IP -- not the floating IP. Therefore, the neighbor cache is not updated properly.

Ubuntu (w/ default arping):

sudo arping -A -I enp0s8 -c 1 -w 1.5 10.254.254.88

>> 09:48:18.034725 08:00:27:06:62:2e > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.254.254.88 tell 10.254.254.10, length 28

sudo arping -U -I enp0s8 -c 1 -w 1.5 10.254.254.88

>> 09:49:08.030762 08:00:27:06:62:2e > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 10.254.254.88 (ff:ff:ff:ff:ff:ff) tell 10.254.254.10, length 28
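
A rough way to check captures like these automatically is sketched below in Python. This is an assumption-laden helper, not part of Neutron: the function name announces is hypothetical, and the regular expressions only cover the tcpdump text format seen in this report. An unsolicited reply is treated as an announcement when the advertised IP is the floating IP, since tcpdump does not print the target IP on that line.

```python
import re

# Matches e.g. "Request who-has A tell B" and
# "Request who-has A (ff:ff:ff:ff:ff:ff) tell B".
_REQUEST = re.compile(
    r"Request who-has (?P<target>[\d.]+)"
    r"(?: \([0-9a-f:]+\))? tell (?P<sender>[\d.]+)"
)
# Matches e.g. "Reply A is-at 08:00:27:23:3c:64".
_REPLY = re.compile(r"Reply (?P<sender>[\d.]+) is-at (?P<mac>[0-9a-f:]+)")


def announces(line: str, floating_ip: str) -> bool:
    """Return True if a tcpdump ARP line advertises `floating_ip` itself.

    Hypothetical helper: a request counts only when both the target and
    the sender are the floating IP; a reply counts when it advertises
    the floating IP.
    """
    match = _REQUEST.search(line)
    if match:
        return (
            match["target"] == floating_ip
            and match["sender"] == floating_ip
        )
    match = _REPLY.search(line)
    if match:
        return match["sender"] == floating_ip
    return False
```

Run against the captures in this report, the openSUSE lines for 10.254.254.89 pass, while the Ubuntu lines for 10.254.254.88 fail because they are sourced from 10.254.254.10.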

-----

Remediation:

I was able to install the iputils-arping package on the Ubuntu host, which resulted in the expected behavior, consistent with the other Linux distros.

jdenton@ubuntu:~/neutron$ sudo apt install iputils-arping
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnet1
Use 'sudo apt autoremove' to remove it.
The following packages will be REMOVED:
  arping
The following NEW packages will b...


Revision history for this message
Brian Haley (brian-haley) wrote :

James - yes, neutron requires iputils-arping; the "other" arping package is known not to work. I think the Debian packaging actually depends on it if installing the neutron-l3-agent package.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by James Denton (<email address hidden>) on branch: master
Review: https://review.openstack.org/502054
Reason: Neutron requires iputils-arping, and this is related to the "other" arping. Closing.

Revision history for this message
James Denton (james-denton) wrote :

Thanks, Brian. I confirmed that the other 'arping' package was being installed over iputils-arping post-deploy by another set of playbooks. The difference in behavior between the two packages is subtle and not enough to cause any outright errors, but will affect users in a negative way as described in the initial report.

Feel free to close this bug, and thanks again for your help.
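
For anyone hitting the same thing on Debian/Ubuntu hosts, a post-deploy check along these lines might catch the wrong package early. This is only a sketch in Python: it assumes dpkg-query is available and that the two packages are named arping and iputils-arping as in the apt output above; the function names are hypothetical.

```python
import subprocess


def parse_dpkg_status(status: str) -> bool:
    """True when dpkg reports the package as fully installed."""
    return status.strip() == "install ok installed"


def package_installed(name: str) -> bool:
    """Ask dpkg-query whether `name` is installed (Debian/Ubuntu only)."""
    try:
        proc = subprocess.run(
            ["dpkg-query", "-W", "-f=${Status}", name],
            capture_output=True,
            text=True,
        )
    except OSError:
        # dpkg-query is missing, so this is probably not a Debian system.
        return False
    return proc.returncode == 0 and parse_dpkg_status(proc.stdout)


def diagnose(iputils_installed: bool, other_installed: bool) -> str:
    """Summarize which arping package state a host is in."""
    if iputils_installed and not other_installed:
        return "ok"
    if other_installed:
        return "wrong arping package installed"
    return "iputils-arping missing"


if __name__ == "__main__":
    print(diagnose(package_installed("iputils-arping"),
                   package_installed("arping")))
```

Keeping diagnose separate from the dpkg call makes the decision logic easy to exercise without touching the host.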

no longer affects: openstack-ansible
Changed in neutron:
status: In Progress → Invalid