neutron dvr should lower proxy_delay when using proxy_arp

Bug #1920975 reported by Edward Hope-Morley
Affects                               Status        Importance  Assigned to  Milestone
OpenStack Neutron Open vSwitch Charm  New           Undecided   Unassigned
neutron                               Fix Released  Medium      Unassigned

Bug Description

Neutron DVR uses proxy_arp in fip namespaces to respond to ARP requests for instance floating IPs. As a result, every ARP reply that has to be proxied is held back by a random delay bounded by proxy_delay, which is up to 800ms by default, e.g.:

# ip netns exec fip-a297543b-9ef9-4bd5-b1ca-e85a726c1726 sysctl net.ipv4.{conf.fg-51f3e07b-2d.proxy_arp,neigh.fg-51f3e07b-2d.proxy_delay}
net.ipv4.conf.fg-51f3e07b-2d.proxy_arp = 1
net.ipv4.neigh.fg-51f3e07b-2d.proxy_delay = 80
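
The value 80 shown above and the 800ms quoted in the description are the same setting in different units. A minimal sketch of the conversion follows, assuming the sysctl is exposed in USER_HZ ticks and that USER_HZ is 100 (the usual value on Linux). The helper name proxy_delay_ms is hypothetical and is not part of neutron or the kernel.

```python
# Assumption: proxy_delay is reported in USER_HZ ticks and USER_HZ is 100.
USER_HZ = 100


def proxy_delay_ms(ticks: int, user_hz: int = USER_HZ) -> float:
    """Return the upper bound of the random proxy ARP delay in milliseconds."""
    return ticks * 1000 / user_hz


if __name__ == "__main__":
    # Default value from the sysctl output above.
    print(proxy_delay_ms(80))
    # Smallest non-zero setting.
    print(proxy_delay_ms(1))
```

Under these assumptions the default of 80 gives an upper bound of 800ms, and a value of 1 gives an upper bound of 10ms.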

The effect is visible when, for example, you ping a VM floating IP and the first request takes significantly longer than subsequent requests:

$ ping -c 5 10.5.150.90
PING 10.5.150.90 (10.5.150.90) 56(84) bytes of data.
64 bytes from 10.5.150.90: icmp_seq=1 ttl=60 time=491 ms
64 bytes from 10.5.150.90: icmp_seq=2 ttl=60 time=1.08 ms
64 bytes from 10.5.150.90: icmp_seq=3 ttl=60 time=1.39 ms
64 bytes from 10.5.150.90: icmp_seq=4 ttl=60 time=1.16 ms
64 bytes from 10.5.150.90: icmp_seq=5 ttl=60 time=1.03 ms

--- 10.5.150.90 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4007ms
rtt min/avg/max/mdev = 1.034/99.157/491.134/195.988 ms

To reproduce again, simply delete the ARP entry for the floating IP from the fip namespace on the source compute host.

By kernel standards this behaviour is by design when using the default settings, but some workloads may be impacted by this initial delay, especially in loaded environments where the ARP caches are under strain and hitting gc_thresh limits.

summary: - neutron dvr should lower arp_delay when using arp_proxy
+ neutron dvr should lower proxy_delay when using arp_proxy
summary: - neutron dvr should lower proxy_delay when using arp_proxy
+ neutron dvr should lower proxy_delay when using proxy_arp
Edward Hope-Morley (hopem) wrote :

Alternatively, this can easily be fixed by changing the default via the charm using the sysctl config option, but of course that would not fix existing fip namespaces.

description: updated
Edward Hope-Morley (hopem) wrote :
Changed in neutron:
status: New → In Progress
assignee: nobody → Edward Hope-Morley (hopem)
Changed in neutron:
importance: Undecided → Medium
Slawek Kaplonski (slaweq) wrote : auto-abandon-script

This bug has had a related patch abandoned and has been automatically un-assigned due to inactivity. Please re-assign yourself if you are continuing work or adjust the state as appropriate if it is no longer valid.

Changed in neutron:
assignee: Edward Hope-Morley (hopem) → nobody
status: In Progress → New
tags: added: timeout-abandon
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by "Slawek Kaplonski <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/782570
Reason: This review is > 4 weeks without comment, and failed Zuul jobs the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/782570
Committed: https://opendev.org/openstack/neutron/commit/d7f68a0ce76ffb9a93dfba167dfffba53189350d
Submitter: "Zuul (22348)"
Branch: master

commit d7f68a0ce76ffb9a93dfba167dfffba53189350d
Author: Edward Hope-Morley <email address hidden>
Date: Tue Mar 23 17:18:48 2021 +0000

    Set proxy_delay to one when using proxy ARP

    Neutron DVR uses proxy ARP in fip namespaces to respond
    to ARP requests for instance floating IPs. In doing so
    it is susceptible to a random delay of up to (by
    default) 800ms which is added to the time taken to
    respond to ARP requests. This causes an initial delay
    to ARP responses that is entirely avoidable by changing this
    parameter to one, instead of the default, to make it as
    short as possible.

    NOTE: Setting this to zero is actually undefined and will
    cause the kernel to choose a random delay from 0 to
    U32_MAX so is not advised. Gleaned from this comment in
    __get_random_u32_below(), which is eventually called
    from pneigh_enqueue():

    /*
     * This function is technically undefined for ceil == 0, and in fact
     * for the non-underscored constant version in the header, we build bug
     * on that. But for the non-constant case, it's convenient to have that
     * evaluate to being a straight call to get_random_u32(), so that
     * get_random_u32_inclusive() can work over its whole range without
     * undefined behavior.
     */

    Will propose a kernel change to fix this but cannot
    assume it will be in a distro kernel for a while.

    Change-Id: I0dc65b17ef436a97d0fcbd164d124ec59a1b2797
    Closes-Bug: #1920975
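
For namespaces that already exist, the same setting could be applied by hand. The sketch below only builds the command line and is not the code from the merged change. The function name proxy_delay_cmd is hypothetical, and the namespace and device names are the ones from the bug description. Following the NOTE in the commit message, it refuses values below one.

```python
import shlex


def proxy_delay_cmd(namespace: str, device: str, value: int = 1) -> list[str]:
    """Build an argv that sets proxy_delay for one device inside a namespace."""
    if value < 1:
        # Per the commit message, zero is undefined and may yield a huge delay.
        raise ValueError("proxy_delay must be at least 1")
    key = f"net.ipv4.neigh.{device}.proxy_delay"
    return ["ip", "netns", "exec", namespace, "sysctl", "-w", f"{key}={value}"]


if __name__ == "__main__":
    cmd = proxy_delay_cmd(
        "fip-a297543b-9ef9-4bd5-b1ca-e85a726c1726",  # namespace from the report
        "fg-51f3e07b-2d",  # device from the report
    )
    # Print the command for review; pass cmd to subprocess.run() to execute it.
    print(shlex.join(cmd))
```

Building the argv as a list keeps the namespace and device names out of shell parsing, and printing it first lets an operator check it before running it as root.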

Changed in neutron:
status: In Progress → Fix Released
Brian Haley (brian-haley) wrote :

Kernel patch sent, either link works.

https://marc.info/?l=linux-netdev&m=167450035611020&w=2
https://<email address hidden>/T/#u

Brian Haley (brian-haley) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 22.0.0.0rc1

This issue was fixed in the openstack/neutron 22.0.0.0rc1 release candidate.
