OVS agents periodically fail to start in fullstack

Bug #1506503 reported by John Schwarz
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: John Schwarz
Milestone: (none listed)

Bug Description

Changeset [1] introduced a validation that the local_ip specified for tunneling is actually used by one of the devices on the machine running an OVS agent.

In fullstack, multiple tests may run concurrently, which can cause a race condition. Suppose an OVS agent starts running as part of test A. It retrieves the list of all devices on the host and starts a sequential loop over them. In the meantime, some *other* fullstack test (test B) completes and deletes the devices it created. The agent still has a deleted device in its list, and when it reaches that device it finds that the device no longer exists and crashes.
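The race described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not neutron's code: `FakeHost`, `DeviceNotFound`, `find_device_by_ip`, the `on_each` hook, and the `skip_missing` flag are all hypothetical names invented here. The hook simulates "test B" deleting its device while "test A" is still iterating over a stale snapshot of device names.

```python
class DeviceNotFound(Exception):
    """Hypothetical error: the device vanished after it was listed."""


class FakeHost:
    """In-memory stand-in for a host's devices (name -> list of IPs)."""

    def __init__(self, devices):
        self._devices = dict(devices)

    def list_device_names(self):
        # A snapshot: it can go stale as soon as it is returned.
        return list(self._devices)

    def get_ips(self, name):
        if name not in self._devices:
            raise DeviceNotFound(name)
        return self._devices[name]

    def delete(self, name):
        self._devices.pop(name, None)


def find_device_by_ip(host, ip, on_each=None, skip_missing=True):
    """Return the name of the device holding `ip`, or None if none does.

    With skip_missing=True, devices deleted after the snapshot was taken
    are skipped; with skip_missing=False, the lookup error propagates,
    which models the crash described in this report.
    """
    for name in host.list_device_names():
        if on_each is not None:
            on_each(name)  # hook used to simulate concurrent activity
        try:
            ips = host.get_ips(name)
        except DeviceNotFound:
            if skip_missing:
                continue
            raise
        if ip in ips:
            return name
    return None


if __name__ == "__main__":
    host = FakeHost({
        "lo": ["127.0.0.1"],
        "test-b-tap": ["10.0.0.2"],   # device owned by the other test
        "eth0": ["192.168.0.5"],
    })
    # "Test B" removes its device while the loop is already running.
    found = find_device_by_ip(
        host, "192.168.0.5", on_each=lambda _name: host.delete("test-b-tap")
    )
    print(found)
```

Calling the same function with `skip_missing=False` on a fresh `FakeHost` raises `DeviceNotFound` partway through the loop, which is the failure mode an agent would hit at startup.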

[1]: https://review.openstack.org/#/c/154043/

John Schwarz (jschwarz)
Changed in neutron:
assignee: nobody → John Schwarz (jschwarz)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/235399

Changed in neutron:
status: New → In Progress
Assaf Muller (amuller)
Changed in neutron:
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/235399
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f3dece785e591ea68ed5bbecbbfa4ac3a29fbc8f
Submitter: Jenkins
Branch: master

commit f3dece785e591ea68ed5bbecbbfa4ac3a29fbc8f
Author: John Schwarz <email address hidden>
Date: Thu Oct 15 17:34:01 2015 +0300

    get_device_by_ip: don't fail if device was deleted

    The function gets a list of all the devices that exists on the machine,
    and then iterates on them one at a time in order to find the correct
    device which holds the ip specified. However, if one of the devices was
    in the mean time deleted, the code will raise an Exception. In the ovs
    agent's case, this will cause it to not run at all (requiring a
    restart).

    Also, changes to a few tests of LinuxBridge were made because
    linuxbridge doesn't check that cfg.CONF.VXLAN.local_ip is not empty
    before using it (a bug, surely). Since it's out of scope of this patch
    to fix this, workarounds were implemented to make sure the tests ignore
    the option instead.

    Closes-Bug: #1506503
    Change-Id: Iad285d7c763b0e8e8f877c6892aadb0043e9a186

Changed in neutron:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/251753

Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/251753
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=f8d1bf22cc13397a0742fafc5e4d240208bf3166
Submitter: Jenkins
Branch: stable/liberty

commit f8d1bf22cc13397a0742fafc5e4d240208bf3166
Author: John Schwarz <email address hidden>
Date: Thu Oct 15 17:34:01 2015 +0300

    get_device_by_ip: don't fail if device was deleted

    The function gets a list of all the devices that exists on the machine,
    and then iterates on them one at a time in order to find the correct
    device which holds the ip specified. However, if one of the devices was
    in the mean time deleted, the code will raise an Exception. In the ovs
    agent's case, this will cause it to not run at all (requiring a
    restart).

    Also, changes to a few tests of LinuxBridge were made because
    linuxbridge doesn't check that cfg.CONF.VXLAN.local_ip is not empty
    before using it (a bug, surely). Since it's out of scope of this patch
    to fix this, workarounds were implemented to make sure the tests ignore
    the option instead.

    Closes-Bug: #1506503
    Change-Id: Iad285d7c763b0e8e8f877c6892aadb0043e9a186
    (cherry picked from commit f3dece785e591ea68ed5bbecbbfa4ac3a29fbc8f)
    Conflicts:
            neutron/tests/functional/cmd/test_linuxbridge_cleanup.py

tags: added: in-stable-liberty
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.2

This issue was fixed in the openstack/neutron 7.0.2 release.
