Arping code should detect missing interface and return early

Bug #1696893 reported by Brian Haley
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Low
Brian Haley

Bug Description

Since arping is spawned in a thread, if a router is added and quickly removed from a network, the arping calls could generate errors on the second or third loop, for example:

Exit code: 2; Stdin: ; Stdout: ; Stderr: arping: Device qr-1e77796c-2b not available.

This can happen in this scenario:

T(0): internal_network_added()
        port plugged
        arping started in thread

T(1): internal_network_removed()
        port unplugged

T(2): arping fails
T(3): arping fails

An example is in:

http://logs.openstack.org/02/469602/6/check/gate-tempest-dsvm-neutron-linuxbridge-ubuntu-xenial/7a048d9/logs/screen-q-l3.txt.gz#_Jun_09_00_23_55_483118

Just search for qr-1e77796c-2b in the logs before this time.

The arping code should detect this on a failure, log a warning and return early as there is no way to stop the thread once it is spawned.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/472500

Changed in neutron:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/472500
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=739daaa9555e734b94bef89f4fe1c5159c8fd435
Submitter: Jenkins
Branch: master

commit 739daaa9555e734b94bef89f4fe1c5159c8fd435
Author: Brian Haley <email address hidden>
Date: Thu Jun 8 22:40:19 2017 -0400

    Stop arping when interface gets deleted

    It is possible for an interface to be added to a
    router, have arping get started for it in a thread,
    then immediately remove the interface, causing
    arping errors in the l3-agent log. This concurrent
    deletion should be handled more gracefully by
    just logging a warning on the first detection and
    returning early.

    Change-Id: I615b60561b3b7f8c950d5f412fb4cdf7877b98f7
    Closes-bug: #1696893

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/474353

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/474354

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/newton)

Reviewed: https://review.openstack.org/474354
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=7ff7ca21bbeea9028bd8582386d536bee62cf6f0
Submitter: Jenkins
Branch: stable/newton

commit 7ff7ca21bbeea9028bd8582386d536bee62cf6f0
Author: Brian Haley <email address hidden>
Date: Thu Jun 8 22:40:19 2017 -0400

    Stop arping when interface gets deleted

    It is possible for an interface to be added to a
    router, have arping get started for it in a thread,
    then immediately remove the interface, causing
    arping errors in the l3-agent log. This concurrent
    deletion should be handled more gracefully by
    just logging a warning on the first detection and
    returning early.

    Ocata changes: added translation marker to a warning message because
    it's enforced in the branch. (In Pike, we reversed the direction, not
    adding those markers to log messages.)

    Newton changes: ip_lib.send_ip_addr_adv_notif receives a 'config' object
    with send_arp_for_ha option set to the number of ARP queries to issue,
    and not the number itself. Adopted the test case to this.

    Change-Id: I615b60561b3b7f8c950d5f412fb4cdf7877b98f7
    Closes-bug: #1696893
    (cherry picked from commit 739daaa9555e734b94bef89f4fe1c5159c8fd435)
    (cherry picked from commit 46f7925d10ddaa40ca3518f142e1135003051703)

tags: added: in-stable-newton
tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/474353
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=be13101ec4ad521556ec890aafa462244de50f7d
Submitter: Jenkins
Branch: stable/ocata

commit be13101ec4ad521556ec890aafa462244de50f7d
Author: Brian Haley <email address hidden>
Date: Thu Jun 8 22:40:19 2017 -0400

    Stop arping when interface gets deleted

    It is possible for an interface to be added to a
    router, have arping get started for it in a thread,
    then immediately remove the interface, causing
    arping errors in the l3-agent log. This concurrent
    deletion should be handled more gracefully by
    just logging a warning on the first detection and
    returning early.

    Ocata changes: added translation marker to a warning message because
    it's enforced in the branch. (In Pike, we reversed the direction, not
    adding those markers to log messages.)

    Change-Id: I615b60561b3b7f8c950d5f412fb4cdf7877b98f7
    Closes-bug: #1696893
    (cherry picked from commit 739daaa9555e734b94bef89f4fe1c5159c8fd435)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b3

This issue was fixed in the openstack/neutron 11.0.0.0b3 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.4.1

This issue was fixed in the openstack/neutron 9.4.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.3

This issue was fixed in the openstack/neutron 10.0.3 release.

Revision history for this message
Brian Haley (brian-haley) wrote :
Download full text (4.5 KiB)

Seems we should have detected a missing IP address as well.

http://logs.openstack.org/82/507382/1/gate/gate-tempest-dsvm-neutron-full-ubuntu-xenial/18a8b46/logs/screen-q-l3.txt.gz

arping was started:

neutron-l3-agent[14419]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-b00c483f-cb99-4767-83ba-a9dac434c656', 'arping', '-U', '-I', 'qg-80c462ce-92', '-c', '1', '-w', '1.5', '172.24.5.3'] {{(pid=14419) execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108}}
neutron-l3-agent[14419]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-b00c483f-cb99-4767-83ba-a9dac434c656', 'arping', '-A', '-I', 'qg-80c462ce-92', '-c', '1', '-w', '1.5', '172.24.5.3'] {{(pid=14419) execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108}}

l3-agent removes floating IP address:

neutron-l3-agent[14419]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-b00c483f-cb99-4767-83ba-a9dac434c656', 'ip', '-4', 'addr', 'del', '172.24.5.3/32', 'dev', 'qg-80c462ce-92'] {{(pid=14419) execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108}}
neutron-l3-agent[14419]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-b00c483f-cb99-4767-83ba-a9dac434c656', 'conntrack', '-D', '-d', '172.24.5.3'] {{(pid=14419) execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108}}
neutron-l3-agent[14419]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-b00c483f-cb99-4767-83ba-a9dac434c656', 'conntrack', '-D', '-q', '172.24.5.3'] {{(pid=14419) execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108}}
[...]
neutron-l3-agent[14419]: DEBUG neutron.agent.l3.agent [-] Finished a router update for b00c483f-cb99-4767-83ba-a9dac434c656 {{(pid=14419) _process_router_update /opt/stack/new/neutron/neutron/agent/l3/agent.py:544}}

But since arping is in a thread, it runs again:

neutron-l3-agent[14419]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-b00c483f-cb99-4767-83ba-a9dac434c656', 'arping', '-U', '-I', 'qg-80c462ce-92', '-c', '1', '-w', '1.5', '172.24.5.3'] {{(pid=14419) execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108}}
neutron-l3-agent[14419]: ERROR neutron.agent.linux.utils [-] Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign requested address
neutron-l3-agent[14419]:
neutron-l3-agent[14419]: DEBUG neutron.agent.linux.utils [-] Running command (rootwrap daemon): ['ip', 'netns', 'exec', 'qrouter-b00c483f-cb99-4767-83ba-a9dac434c656', 'ip', '-o', 'link', 'show', 'qg-80c462ce-92'] {{(pid=14419) execute_rootwrap_daemon /opt/stack/new/neutron/neutron/agent/linux/utils.py:108}}
neutron-l3-agent[14419]: ERROR neutron.agent.linux.ip_lib [-] Failed sending gratuitous ARP to 172.24.5.3 on qg-80c462ce-92 in namespace qrouter-b00c483f-cb99-4767-83ba-a9dac434c656: Exit code: 2; Stdin: ; Stdout: ; Stderr: bind: Cannot assign ...

Read more...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/507914

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/507914
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=80085abcd66c8625f830b082bfc6c31b125ddbb9
Submitter: Zuul
Branch: master

commit 80085abcd66c8625f830b082bfc6c31b125ddbb9
Author: Brian Haley <email address hidden>
Date: Wed Sep 27 11:57:31 2017 -0400

    Stop arping when IP address gets deleted

    A recent change decreased the severity of the arping log
    message when the interface was found to have been deleted,
    but it can also be the case that the interface exists, but
    the IP address was deleted concurrently. Downgrade log
    message to info since there is nothing to be done about it.

    Change-Id: Ie63d306802d07f7376e87daf5c88672f7d58093d
    Related-bug: #1696893

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.