Fullstack native tests sometimes fail with an OVS agent failing to start with 'Address already in use' error

Bug #1551288 reported by Assaf Muller
This bug affects 1 person
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Assaf Muller
Milestone: (none listed)

Bug Description

Example failure:
test_connectivity(VLANs,Native) fails with this error:

http://paste.openstack.org/show/488585/

wait_until_env_is_up is timing out, which typically means that the expected number of agents failed to start. Indeed in this particular example I saw this line being output repeatedly in neutron-server.log:

[29/Feb/2016 04:16:31] "GET /v2.0/agents.json HTTP/1.1" 200 1870 0.005458

Fullstack calls GET on agents to determine whether the expected number of agents were started and are successfully reporting back to neutron-server.
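
A minimal sketch of that polling loop, for illustration only. The helper names and the `list_agents` callable (standing in for the GET on /v2.0/agents.json) are assumptions, not the actual fullstack code:

```python
import time


def wait_until_true(predicate, timeout=60, sleep=1):
    """Poll predicate() until it returns True or the timeout expires."""
    deadline = time.monotonic() + timeout
    while not predicate():
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(sleep)


def env_is_up(list_agents, expected_agents):
    """Return True once the expected number of agents report as alive.

    list_agents is a hypothetical callable standing in for the GET request;
    it is assumed to return a list of dicts with an 'alive' key.
    """
    agents = list_agents()
    return len(agents) == expected_agents and all(a["alive"] for a in agents)
```

If one agent crashes at startup, `env_is_up` never becomes true, and the only visible symptom is the timeout plus the repeated GET lines in the server log.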

We then see that one of the three OVS agents crashed with this TRACE:
http://paste.openstack.org/show/488586/

This happens only with the native tests using the Ryu library.

Assaf Muller (amuller)
tags: added: fullstack
Assaf Muller (amuller) wrote :
Dariusz Smigiel (smigiel-dariusz) wrote :

build_name:"gate-neutron-dsvm-fullstack" AND message:"eventlet.timeout.Timeout" AND build_status:"FAILURE" (137) count per 1h | (137 hits)

Changed in neutron:
status: New → Confirmed
Dariusz Smigiel (smigiel-dariusz) wrote :

The first occurrence of this problem was logged at 2016-02-20T14:19:57.123+00:00:
http://logs.openstack.org/74/282874/1/check/gate-neutron-dsvm-fullstack/0afd424/console.html
It is connected with this patchset: https://review.openstack.org/#/c/282874/

Dariusz Smigiel (smigiel-dariusz) wrote :

The failure ratio for this issue is very low: there were over 1.3 million successes in the last 10 days.

Changed in neutron:
importance: Undecided → High
YAMAMOTO Takashi (yamamoto) wrote :
IWAMOTO Toshihiro (iwamoto) wrote :

Hi Yamamoto-san,

Your analysis may be correct, but by the time the l2-agent is started, the neutron-server child process in question seems to have exited.
I'm slightly confused.

Also, it's a bit unfortunate that 2 processes picked the same port number from ~64k choices. Seeding the RNG with process ID might help. ;)

Changed in neutron:
assignee: nobody → Dariusz Smigiel (smigiel-dariusz)
Dariusz Smigiel (smigiel-dariusz) wrote :

This problem occurs when two different agents use the same bridge ID.
One agent deletes the bridge, then the second tries to delete it too and throws an error.
http://paste.openstack.org/show/490264/
http://paste.openstack.org/show/490265/

Assaf Muller (amuller) wrote :

@Darek, the error you pasted is from the linux bridge tests; this bug is about the OVS agent.

Dariusz Smigiel (smigiel-dariusz) wrote :

@Assaf, true. I probably focused too much on the error and didn't realize this bug is about OVS, not LinuxBridge :/
I'll look into this again.

Changed in neutron:
milestone: none → mitaka-rc1
Armando Migliaccio (armando-migliaccio) wrote :

The sooner we make this job voting the better.

Changed in neutron:
milestone: mitaka-rc1 → newton-1
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/292392

Changed in neutron:
assignee: Darek Smigiel (smigiel-dariusz) → Ihar Hrachyshka (ihar-hrachyshka)
status: Confirmed → In Progress
Changed in neutron:
milestone: newton-1 → mitaka-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/292392
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb567e9b32bf58cb5f74149f1f5cb9cb656e565e
Submitter: Jenkins
Branch: master

commit bb567e9b32bf58cb5f74149f1f5cb9cb656e565e
Author: Ihar Hrachyshka <email address hidden>
Date: Mon Mar 14 14:35:31 2016 +0100

    Reset RNG seed with current time and pid for each test started

    This will hopefully fix fullstack failures where different process
    fixtures running in parallel test processes and relying on the same
    random.choice() generator seeded by the same initial value could pick up
    the same value as a service free port, and spawn their respective
    resources using the same port.

    Which made one of those unlucky services to fail.

    Change-Id: I13cfa9392fd138c5e1b1b7d397b9ea91b2a47ed2
    Closes-Bug: #1551288
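
To illustrate the failure mode this commit targets, here is a minimal sketch. It assumes ports are drawn with `random.choice()` from a fixed range; the port range, function names, and the way time and pid are combined are illustrative assumptions, not a copy of the merged change:

```python
import os
import random
import time

# Assumed candidate range for illustration.
PORT_RANGE = range(1024, 65536)


def pick_port(rng):
    """Pick a candidate port with the generator's choice() method."""
    return rng.choice(PORT_RANGE)


def reseed(rng):
    """Reseed from the current time plus this process's pid.

    Two test processes started at nearly the same instant still get
    different seeds because their pids differ.
    """
    rng.seed(time.time() + os.getpid())


# Two generators seeded with the same value pick the same "random" port.
a, b = random.Random(42), random.Random(42)
same_a, same_b = pick_port(a), pick_port(b)
```

Reseeding only makes identical picks unlikely; it does not rule them out, which matches the later comments in this report.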

Changed in neutron:
status: In Progress → Fix Released
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0rc1

This issue was fixed in the openstack/neutron 8.0.0.0rc1 release candidate.

IWAMOTO Toshihiro (iwamoto) wrote :

I tried to find an RNG problem behind this bug a while ago, with no success.
What I saw instead was "Address already in use" errors with no offending simultaneous listen()s (judging from log files).
I suspected conflicts with ephemeral ports but didn't investigate further.

How many get_free_namespace_port calls happen in a test run? Apparent port conflicts may be happening fairly often (see [1]).

[1] https://en.wikipedia.org/wiki/Birthday_problem
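
The birthday-problem estimate can be sketched as follows, assuming each call draws a port uniformly and independently from a fixed pool (an idealization; the function name is hypothetical):

```python
def collision_probability(picks, pool_size):
    """Probability that at least two of `picks` uniform draws coincide."""
    p_unique = 1.0
    for i in range(picks):
        # The (i+1)-th draw must avoid the i values already taken.
        p_unique *= (pool_size - i) / pool_size
    return 1.0 - p_unique
```

Calling `collision_probability(n, 64512)` for the number of port picks in a run gives a rough upper-level feel for how quickly "~64k choices" stops being a safe margin as `n` grows.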

Assaf Muller (amuller) wrote :

Still seeing instances of this bug. I have a deterministic solution coming up.

Changed in neutron:
status: Fix Released → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/298056

Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → Assaf Muller (amuller)
status: Confirmed → In Progress
Changed in neutron:
assignee: Assaf Muller (amuller) → Cedric Brandily (cbrandily)
Changed in neutron:
assignee: Cedric Brandily (cbrandily) → Assaf Muller (amuller)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/298578

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/298056
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=03999961ac620249950d8bca628719e9c14c4382
Submitter: Jenkins
Branch: master

commit 03999961ac620249950d8bca628719e9c14c4382
Author: Assaf Muller <email address hidden>
Date: Thu Mar 24 22:14:07 2016 -0400

    Add fullstack cross-process port/ip address fixtures

    We've had a series of bugs with resources that need
    to be unique on the system across test runner
    processes. Ports are used by neutron-server and the
    OVS agent when run in native openflow mode. The function
    that generates ports looks up random unused ports and
    starts the service. However, it is raceful: By the time the
    port is found to be unused and the service is started,
    another test runner can pick the same random port.
    With close to 65536 ports to choose from, the chance
    for collision is low, but given enough test runs, it's
    happened a non-trivial amount of times, and given that
    a voting job needs a very low false-negative rate, we
    need a more robust solution. The same applies to IP
    addresses that are used by the OVS agent in tunneling
    mode, and for the LB agent in all modes. With IP addresses,
    we don't check if the IP address is used, we simply
    pick a random address from a large pool, and again
    we've seen a non-trivial amount of test failures.

    The bugs referenced below had simple, short term solutions
    applied but the bugs remain. This patch is a correct,
    long term solution that doesn't rely on chance.

    This patch adds a resource allocator that uses the disk
    to persist allocations. Access to the disk is guarded
    via a file lock. IP address, networks and ports fixtures
    use an allocator internally.

    Closes-Bug: #1551288
    Closes-Bug: #1561248
    Closes-Bug: #1560277
    Change-Id: I46c0ca138b806759128462f8d44a5fab96a106d3
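
The idea of a disk-backed allocator guarded by a file lock can be sketched as below. This is not the neutron implementation: the class name, the JSON file format, and the use of `fcntl.flock` (Unix only) are assumptions made for illustration.

```python
import fcntl
import json
import os
import random


class FileBackedAllocator:
    """Hypothetical allocator that persists allocations in a JSON file."""

    def __init__(self, path, pool):
        self.path = path
        self.pool = set(pool)

    def _locked(self, update):
        # Create the file if needed, then hold an exclusive lock for the
        # whole read-modify-write so that processes cannot interleave.
        fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o600)
        with os.fdopen(fd, "r+") as f:
            fcntl.flock(f, fcntl.LOCK_EX)
            try:
                raw = f.read()
                used = set(json.loads(raw)) if raw else set()
                result = update(used)
                f.seek(0)
                f.truncate()
                json.dump(sorted(used), f)
                f.flush()
                return result
            finally:
                fcntl.flock(f, fcntl.LOCK_UN)

    def allocate(self):
        def update(used):
            free = self.pool - used
            if not free:
                raise ValueError("pool exhausted")
            resource = random.choice(sorted(free))
            used.add(resource)
            return resource

        return self._locked(update)

    def release(self, resource):
        self._locked(lambda used: used.discard(resource))
```

Because every process reads the same file under the lock, two allocators pointed at the same path can never hand out the same value, regardless of how their random generators are seeded.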

Changed in neutron:
status: In Progress → Fix Released
IWAMOTO Toshihiro (iwamoto) wrote :

http://logstash.openstack.org/#dashboard/file/logstash.json?query=build_name%3A%5C%22gate-neutron-dsvm-fullstack%5C%22%20AND%20message%3A%5C%22self.wait_until_env_is_up()%5C%22%20AND%20build_status%3A%5C%22FAILURE%5C%22

build_name:"gate-neutron-dsvm-fullstack" AND message:"self.wait_until_env_is_up()" AND build_status:"FAILURE" (14)
build_name:"gate-neutron-dsvm-fullstack" AND message:"Finished: SUCCESS" AND build_status:"SUCCESS" (169)
count per 1h | (183 hits)

The ratio of ovs-native port listen failures to successful runs is about 1:12 over these 7 days.

IWAMOTO Toshihiro (iwamoto) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/319807

OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/323800

IWAMOTO Toshihiro (iwamoto) wrote :

FYI, the following query lists neutron-server startup failures. These failures happen almost as often as OVS agent startup failures.
Let's see if https://review.openstack.org/323800 works.

http://logstash.openstack.org/#/dashboard/file/logstash.json?query=build_name:\%22gate-neutron-dsvm-fullstack\%22%20AND%20message:\%22self.wait_until_env_is_up%28%29\%22%20AND%20build_status:\%22FAILURE\%22

IWAMOTO Toshihiro (iwamoto) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/323800
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=9319b1a8366e00ec2ea9a763b5c3558649758f51
Submitter: Jenkins
Branch: master

commit 9319b1a8366e00ec2ea9a763b5c3558649758f51
Author: IWAMOTO Toshihiro <email address hidden>
Date: Wed Jun 1 20:17:34 2016 +0900

    Fix get_free_namespace_port to actually avoid used ports

    ss output is a string. It needs to be converted to int before a set
    operation. The function seems to have been broken from the beginning
    (commit e3fa0112).

    Change-Id: I0d10360ae8807b3688d36912b42fa2a140c45e04
    Closes-Bug: #1551288
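
The class of bug described in this commit message can be illustrated with a small sketch. The function names and sample values are hypothetical, not the neutron code; the point is only that a set difference between integers and strings removes nothing:

```python
def free_ports_buggy(candidates, used_from_ss):
    """Subtract used ports without converting them: strings never match ints."""
    return set(candidates) - set(used_from_ss)


def free_ports_fixed(candidates, used_from_ss):
    """Convert the parsed port strings to int before the set operation."""
    return set(candidates) - {int(port) for port in used_from_ss}


# Illustrative values: candidate ports as ints, used ports as strings
# the way they would come out of parsed text output.
candidates = range(1024, 1028)
used = ["1025", "1026"]
```

With the buggy version, ports 1025 and 1026 stay in the candidate set even though they are in use, so the "free port" check silently does nothing.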

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b1

This issue was fixed in the openstack/neutron 9.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/mitaka)

Fix proposed to branch: stable/mitaka
Review: https://review.openstack.org/325767

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/325770

IWAMOTO Toshihiro (iwamoto) wrote :
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/mitaka)

Reviewed: https://review.openstack.org/325767
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=adc7418c3eb253ca4bb9731b97ab69b4e14550ad
Submitter: Jenkins
Branch: stable/mitaka

commit adc7418c3eb253ca4bb9731b97ab69b4e14550ad
Author: IWAMOTO Toshihiro <email address hidden>
Date: Wed Jun 1 20:17:34 2016 +0900

    Fix get_free_namespace_port to actually avoid used ports

    ss output is a string. It needs to be converted to int before a set
    operation. The function seems to have been broken from the beginning
    (commit e3fa0112).

    Change-Id: I0d10360ae8807b3688d36912b42fa2a140c45e04
    Closes-Bug: #1551288
    (cherry picked from commit 9319b1a8366e00ec2ea9a763b5c3558649758f51)

tags: added: in-stable-mitaka
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/325770
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bbdd53a84f1ebafb9dbf57e79cafba4767dc3ebc
Submitter: Jenkins
Branch: stable/liberty

commit bbdd53a84f1ebafb9dbf57e79cafba4767dc3ebc
Author: IWAMOTO Toshihiro <email address hidden>
Date: Wed Jun 1 20:17:34 2016 +0900

    Fix get_free_namespace_port to actually avoid used ports

    ss output is a string. It needs to be converted to int before a set
    operation. The function seems to have been broken from the beginning
    (commit e3fa0112).

    Change-Id: I0d10360ae8807b3688d36912b42fa2a140c45e04
    Closes-Bug: #1551288
    (cherry picked from commit 9319b1a8366e00ec2ea9a763b5c3558649758f51)

tags: added: in-stable-liberty
IWAMOTO Toshihiro (iwamoto) wrote :
tags: added: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/298578
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 9.0.0.0b2

This issue was fixed in the openstack/neutron 9.0.0.0b2 development milestone.

tags: removed: neutron-proactive-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/319807
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=91a983f40ad4448235272e24f678a230f32385d7
Submitter: Jenkins
Branch: master

commit 91a983f40ad4448235272e24f678a230f32385d7
Author: IWAMOTO Toshihiro <email address hidden>
Date: Mon May 23 18:35:05 2016 +0900

    Avoid allocating ports from ip_local_port_range

    Ports within ip_local_port_range can be used by the local side
    of connections. Avoid using them as there should be no downside
    from using narrower port range thanks to ExclusiveResource
    allocators.

    Change-Id: I30e8e40073117e63bf9a99f13000d83a87e64f29
    Closes-Bug: #1551288
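
A sketch of excluding the ephemeral range when building the candidate port list, assuming a Linux host where /proc/sys/net/ipv4/ip_local_port_range holds two whitespace-separated integers. The function names and the 1024 to 65535 outer bounds are illustrative assumptions, not the merged code:

```python
def parse_local_port_range(text):
    """Parse the 'low high' pair found in ip_local_port_range."""
    low, high = (int(value) for value in text.split())
    return low, high


def allocatable_ports(text, first=1024, last=65535):
    """Return ports in [first, last] that fall outside the ephemeral range."""
    low, high = parse_local_port_range(text)
    return [port for port in range(first, last + 1) if not low <= port <= high]


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the kernel setting as text (Linux only)."""
    with open(path) as f:
        return f.read()
```

On a Linux host, `allocatable_ports(read_local_port_range())` yields candidates that the kernel will not hand out as the local side of an outgoing connection.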

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.1.2

This issue was fixed in the openstack/neutron 7.1.2 release.

Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 8.2.0

This issue was fixed in the openstack/neutron 8.2.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 9.0.0.0b3

This issue was fixed in the openstack/neutron 9.0.0.0b3 development milestone.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Armando Migliaccio (<email address hidden>) on branch: master
Review: https://review.openstack.org/298578
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.
