Booting a VM with a floating IP and pinging it via that IP takes a long time, with errors in L3-agent logs, when using DVR

Bug #1625333 reported by Sai Sindhur Malleni
This bug affects 1 person
Affects: neutron
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We ran a Rally test that launches a VM, attaches a floating IP, and pings the VM via the floating IP 40 times, comparing legacy routers with DVR routers. The time taken to create the network and subnet, launch the VM, attach the floating IP, etc. is similar in the legacy and DVR cases. However, the time for the VM to become pingable via the floating IP (after it has been booted and given the floating IP) is much longer in some iterations with DVR. With legacy routers, the VM is ping-ready in less than a second, not counting the time to boot or attach the floating IP. With DVR, the VM is sometimes ping-ready in less than 1 second, but in other cases it takes around 250 seconds. Digging into the L3-agent logs on the compute nodes, we see the following for the instances that took the longest to become pingable via the floating IP:

https://paste.fedoraproject.org/431117/74312098/

Specifically we keep seeing errors like this:

2016-09-19 18:58:52.675 23696 DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'fip-790354c7-f286-4fd1-a4a1-ec9749c61fbf', 'arping', '-A', '-I', 'fg-6b5906d0-d9', '-c', '3', '-w', '4.5', '10.16.30.99'] execute_rootwrap_daemon /usr/lib/python2.7/site-packages/neutron/agent/linux/utils.py:99
2016-09-19 18:58:52.696 23696 ERROR neutron.agent.linux.utils [-] Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address

2016-09-19 18:58:52.697 23696 ERROR neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 10.16.30.99 on fg-6b5906d0-d9 in namespace fip-790354c7-f286-4fd1-a4a1-ec9749c61fbf
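
To correlate slow iterations with these errors, the L3-agent log can be scanned for the failure message. The following is a minimal sketch, assuming the log line format matches the excerpt above; the function name `find_garp_failures` is hypothetical and not part of neutron.

```python
import re

# Matches the ip_lib error line quoted in this report. The exact
# format is assumed from the excerpt and may differ across releases.
GARP_FAILURE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"Failed sending gratuitous ARP to (?P<ip>\S+) "
    r"on (?P<device>\S+) in namespace (?P<namespace>\S+)"
)


def find_garp_failures(lines):
    """Return one dict (timestamp, ip, device, namespace) per failure line."""
    failures = []
    for line in lines:
        match = GARP_FAILURE.search(line)
        if match:
            failures.append(match.groupdict())
    return failures
```

Passing an open log file to `find_garp_failures` yields the floating IPs whose gratuitous ARP failed, which can then be compared against the floating IPs of the slow Rally iterations.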

However, it is worth noting that after bumping the ping timeout up to about 300 seconds, all VMs are pingable, but it takes 100-200x as long as in the legacy case.
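
For anyone reproducing the measurement outside Rally, the "time until ping-ready" with a timeout can be sketched as below. This is an illustration only, not the code of the Rally plugin; `wait_until_reachable` and `ping_once` are hypothetical names, and `ping_once` assumes the Linux iputils `ping` with its `-c` and `-W` options.

```python
import subprocess
import time


def ping_once(ip):
    """Send one ICMP echo request; True if a reply arrived within 1 second."""
    return subprocess.call(["ping", "-c", "1", "-W", "1", ip]) == 0


def wait_until_reachable(probe, timeout, interval=1.0,
                         clock=time.monotonic, sleep=time.sleep):
    """Call probe() until it succeeds or timeout seconds have elapsed.

    Returns the elapsed seconds on success, or None on timeout.
    clock and sleep are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

For example, `wait_until_reachable(lambda: ping_once("10.16.30.99"), timeout=300)` would report how long the floating IP from the log excerpt took to answer, using the 300-second timeout mentioned above.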

This is the Rally plugin used (create network, subnet, boot server with FIP, and ping):
https://github.com/openstack/browbeat/blob/master/rally/rally-plugins/netcreate-boot-ping/netcreate_nova-boot-fip-ping.py

The Rally results for legacy:
https://paste.fedoraproject.org/431120/43123471/

The Rally results for DVR:
https://paste.fedoraproject.org/431119/47431226/

Assaf Muller (amuller)
tags: added: l3-dvr-backlog
Assaf Muller (amuller) wrote :

Can you paste the parameters for the Rally scenario so people could reproduce this?

Brian Haley (brian-haley) wrote :

The DVR code should have been setting net.ipv4.ip_nonlocal_bind=1 to allow this, either inside the FIP namespace, or the root namespace for later kernels. Is that in the logs? Or did the kernel change such that this doesn't work the same anymore?
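
One way to check the setting by hand is to read the sysctl both in the root namespace and inside the FIP namespace. The sketch below only illustrates that check and is not how neutron itself applies the setting; the helper names are hypothetical, and running a command inside a namespace with `ip netns exec` requires root.

```python
import subprocess

SYSCTL_KEY = "net.ipv4.ip_nonlocal_bind"


def nonlocal_bind_command(namespace=None):
    """Build the command that prints the sysctl value.

    With namespace=None the root namespace is queried; otherwise the
    command is wrapped in 'ip netns exec <namespace>'.
    """
    cmd = ["sysctl", "-n", SYSCTL_KEY]
    if namespace:
        cmd = ["ip", "netns", "exec", namespace] + cmd
    return cmd


def nonlocal_bind_enabled(namespace=None):
    """Return True if the sysctl reads 1 in the given namespace."""
    output = subprocess.check_output(nonlocal_bind_command(namespace))
    return output.decode().strip() == "1"
```

Calling `nonlocal_bind_enabled("fip-790354c7-f286-4fd1-a4a1-ec9749c61fbf")` on the affected compute node would show whether the value is set in the namespace named in the log excerpt. Note that a value of 1 only shows the setting is present, not that the kernel honors it.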

Assaf Muller (amuller) wrote :

That was also my line of thinking, Brian, and it may be correct, but Sindhur was saying that the TRACE did not happen for every floating IP allocation, only some.

Sai Sindhur Malleni (smalleni) wrote :

Assaf and Brian,

We wrote a Rally plugin for this, and the scenario and code for it can be found at:
https://github.com/openstack/browbeat/tree/master/rally/rally-plugins/netcreate-boot-ping
We run the scenario via browbeat.

If you look at the results here http://8.43.86.1:8088/smalleni/20160919-172902-browbeat-netcreate-boot-ping-10-iteration-0.html#/BrowbeatPlugin.create_network_nova_boot_ping/details, the green spikes, which denote waiting for ping, are seen only during some iterations of the Rally run, and we have been able to correlate these iterations with the errors in the L3-agent logs.

Terry Wilson (otherwiseguy) wrote :

There is currently a bug in RHEL/CentOS where the net.ipv4.ip_nonlocal_bind=1 setting doesn't work in a namespace. The feature was backported, but there was a patch that was missed.

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Terry, do you have a link to the patch? Anyway, closing the bug as invalid because it does not affect neutron per se.

Changed in neutron:
status: New → Invalid
Assaf Muller (amuller) wrote :

Are we convinced that the nonlocal_bind issue is the root cause? I recall Sai said that the error did not appear in the L3 agent every time, only some of the time.
