test_keepalived_multiple_sighups_does_not_forfeit_mastership fails when neutron-server tries to bind with Linuxbridge driver (agent not enabled)

Bug #1696537 reported by Ihar Hrachyshka
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Miguel Lavalle

Bug Description

This happens locally and in gate. Gate example: http://logs.openstack.org/59/471059/2/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/df11b90/testr_results.html.gz

    Traceback (most recent call last):
      File "neutron/tests/base.py", line 118, in func
        return f(self, *args, **kwargs)
      File "neutron/tests/fullstack/test_l3_agent.py", line 252, in test_keepalived_multiple_sighups_does_not_forfeit_mastership
        tenant_id, '13.37.0.0/24', network['id'], router['id'])
      File "neutron/tests/fullstack/test_l3_agent.py", line 61, in _create_and_attach_subnet
        router_interface_info['port_id'])
      File "neutron/tests/fullstack/test_l3_agent.py", line 51, in block_until_port_status_active
        common_utils.wait_until_true(lambda: is_port_status_active(), sleep=1)
      File "neutron/common/utils.py", line 685, in wait_until_true
        raise WaitTimeout("Timed out after %d seconds" % timeout)
    neutron.common.utils.WaitTimeout: Timed out after 60 seconds

This is not a 100% failure rate; it depends on which driver the server picks to bind ports: ovs or linuxbridge. If it picks linuxbridge, the server just spins attempting to bind with it over and over until it bails out. It never tries to switch to ovs.
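
For reference, the fullstack helper that times out above boils down to a simple polling loop. The following is a simplified, self-contained sketch, not the exact test code; client.show_port stands in for the test's neutron client call:

    import time

    class WaitTimeout(Exception):
        """Raised when the port never reaches ACTIVE within the timeout."""

    def block_until_port_status_active(client, port_id, timeout=60, sleep=1):
        # Poll the server until it reports the port ACTIVE; the failure above
        # means 60 seconds passed without the port ever becoming ACTIVE.
        deadline = time.time() + timeout
        while time.time() < deadline:
            if client.show_port(port_id)['port']['status'] == 'ACTIVE':
                return
            time.sleep(sleep)
        raise WaitTimeout("Timed out after %d seconds" % timeout)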

In server log, we see this: http://logs.openstack.org/59/471059/2/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/df11b90/logs/dsvm-fullstack-logs/TestHAL3Agent.test_keepalived_multiple_sighups_does_not_forfeit_mastership/neutron-server--2017-06-05--21-41-34-957535.txt.gz#_2017-06-05_21_42_13_400

2017-06-05 21:42:13.400 12566 DEBUG neutron.plugins.ml2.drivers.mech_agent [req-6618e950-5260-404d-a511-e314408542f5 - - - - -] Port 4f8dcf10-6f91-4860-b239-6b04460244a3 on network 155ebfd5-20cf-44bc-9cb5-bc885b8d2eae not bound, no agent of type Linux bridge agent registered on host host-745fd526 bind_port /opt/stack/new/neutron/neutron/plugins/ml2/drivers/mech_agent.py:103
2017-06-05 21:42:13.401 12566 ERROR neutron.plugins.ml2.managers [req-6618e950-5260-404d-a511-e314408542f5 - - - - -] Failed to bind port 4f8dcf10-6f91-4860-b239-6b04460244a3 on host host-745fd526 for vnic_type normal using segments []
2017-06-05 21:42:13.401 12566 INFO neutron.plugins.ml2.plugin [req-6618e950-5260-404d-a511-e314408542f5 - - - - -] Attempt 2 to bind port 4f8dcf10-6f91-4860-b239-6b04460244a3
...
2017-06-05 21:42:13.822 12566 ERROR neutron.plugins.ml2.managers [req-6618e950-5260-404d-a511-e314408542f5 - - - - -] Failed to bind port 4f8dcf10-6f91-4860-b239-6b04460244a3 on host host-745fd526 for vnic_type normal using segments []

The fullstack test case configures both ml2 mechanism drivers (openvswitch and linuxbridge).

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

I wonder if the bug would affect a regular, non-fullstack installation.

Changed in neutron:
importance: Undecided → High
status: New → Confirmed
tags: added: fullstack
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

It seems like sometimes context.network.network_segments is None (or empty, not sure), which makes the ml2 manager skip binding with the active agent (ovs) and switch to the next configured driver, linuxbridge.
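
For context, the agent-based mechanism drivers bind roughly as in the sketch below. This is a simplified free-function rendering of bind_port() in neutron/plugins/ml2/drivers/mech_agent.py, not the actual method; the point is that with an empty segment list the inner loop never runs, the driver declines the port, and the manager falls through to the next configured driver:

    def bind_port_with_agents(driver, context):
        # Simplified: the real driver also logs the "no agent of type ...
        # registered on host" message when host_agents() comes back empty.
        for agent in context.host_agents(driver.agent_type):
            if not agent.get('alive'):
                continue
            for segment in context.segments_to_bind:  # [] here -> nothing to try
                if driver.try_to_bind_segment_for_agent(context, segment, agent):
                    return True
        return False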

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Looking at the logs, it seems there are several threads running that all update the same port, trying to bind it and transition it to ACTIVE. Some of them succeed, while others loop with:

StaleDataError: UPDATE statement on table 'standardattributes' expected to update 1 row(s); 0 were matched. _notify_port_updated /opt/stack/neutron/neutron/plugins/ml2/plugin.py:654

I suspect there is some issue with parallel execution of provisioning blocks/binding logic.
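
For anyone unfamiliar with that error: it is SQLAlchemy's optimistic-locking failure, raised when a versioned UPDATE matches zero rows because the revision was already bumped underneath the updater. A minimal stand-alone reproduction follows; it is not neutron code, and the model is a hypothetical stand-in for the standardattributes table:

    from sqlalchemy import Column, Integer, String, create_engine, text
    from sqlalchemy.orm import declarative_base, sessionmaker

    Base = declarative_base()

    class StandardAttribute(Base):  # hypothetical stand-in, not neutron's model
        __tablename__ = 'standardattributes'
        id = Column(Integer, primary_key=True)
        description = Column(String(255))
        revision_number = Column(Integer, nullable=False)
        __mapper_args__ = {'version_id_col': revision_number}

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)
    session = sessionmaker(bind=engine)()

    session.add(StandardAttribute(id=1, description='port'))
    session.commit()

    obj = session.get(StandardAttribute, 1)
    # Simulate a concurrent worker bumping the revision behind our back.
    session.execute(text("UPDATE standardattributes "
                         "SET revision_number = revision_number + 1"))
    obj.description = 'stale update'
    session.commit()  # StaleDataError: expected to update 1 row(s); 0 were matched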

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/472454

Changed in neutron:
assignee: nobody → Kevin Benton (kevinbenton)
status: Confirmed → In Progress
Changed in neutron:
assignee: Kevin Benton (kevinbenton) → Brian Haley (brian-haley)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/475955

Changed in neutron:
assignee: Brian Haley (brian-haley) → Miguel Lavalle (minsel)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/472454
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3c1a25d9683263505b00cb0d6b04add1831f0ca0
Submitter: Jenkins
Branch: master

commit 3c1a25d9683263505b00cb0d6b04add1831f0ca0
Author: Kevin Benton <email address hidden>
Date: Thu Jun 8 14:56:19 2017 -0700

    Make HA deletion attempt on RouterNotFound race

    The L3 HA RPC code that creates HA interfaces can race
    with an HA router deletion on the server side. The L3 HA
    code ends up creating a port on the HA network while the
    server side is deleting the router and the HA network.

    This stops the L3 HA network from being deleted because
    it has a new port without a bound segment, which leaves the
    HA network in a segmentless condition and no ports after
    the L3 RPC code cleans up its port.

    This adjusts the L3 RPC logic to attempt an HA network cleanup
    whenever it encounters the concurrent router deletion case
    to ensure that the HA network gets cleaned up.

    To make this more robust in the future, we may need the L3
    HA code to recognize when an HA network has no segments and
    automatically create a new one.

    Change-Id: Idd301f6df92e9bc37187e8ed8ec00004e67da928
    Closes-Bug: #1696537
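
In other words, the fix makes the L3 RPC side retry the HA network cleanup when it loses the race with router deletion. A hedged sketch of that pattern is below; the method names are illustrative placeholders, not the actual patch (see the review linked above for the real change):

    class RouterNotFound(Exception):
        """Stand-in for the RouterNotFound exception raised by the plugin."""

    def ensure_ha_port(plugin, context, router_id, tenant_id):
        ha_network = plugin.get_ha_network(context, tenant_id)
        try:
            return plugin.create_ha_port(context, router_id, ha_network)
        except RouterNotFound:
            # The router was deleted concurrently; our port may have blocked
            # the server-side HA network deletion, so retry the cleanup here
            # instead of leaving a segmentless, port-less HA network behind.
            plugin.safe_delete_ha_network(context, ha_network)
            raise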

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b3

This issue was fixed in the openstack/neutron 11.0.0.0b3 development milestone.

tags: added: neutron-proactive-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/499981

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/ocata)

Reviewed: https://review.openstack.org/499981
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=990ea6935ede810cd48ab9d7b6086663851c404c
Submitter: Jenkins
Branch: stable/ocata

commit 990ea6935ede810cd48ab9d7b6086663851c404c
Author: Kevin Benton <email address hidden>
Date: Thu Jun 8 14:56:19 2017 -0700

    Make HA deletion attempt on RouterNotFound race

    The L3 HA RPC code that creates HA interfaces can race
    with an HA router deletion on the server side. The L3 HA
    code ends up creating a port on the HA network while the
    server side is deleting the router and the HA network.

    This stops the L3 HA network from being deleted because
    it has a new port without a bound segment, which leaves the
    HA network in a segmentless condition and no ports after
    the L3 RPC code cleans up its port.

    This adjusts the L3 RPC logic to attempt an HA network cleanup
    whenever it encounters the concurrent router deletion case
    to ensure that the HA network gets cleaned up.

    To make this more robust in the future, we may need the L3
    HA code to recognize when an HA network has no segments and
    automatically create a new one.

    Change-Id: Idd301f6df92e9bc37187e8ed8ec00004e67da928
    Closes-Bug: #1696537
    (cherry picked from commit 3c1a25d9683263505b00cb0d6b04add1831f0ca0)

tags: added: in-stable-ocata
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 10.0.4

This issue was fixed in the openstack/neutron 10.0.4 release.
