Agents report as started before neutron recognizes them as active

Bug #1525901 reported by Brent Eagles
This bug affects 2 people
Affects    Status          Importance    Assigned to    Milestone
neutron    Fix Released    Medium        Unassigned
Kilo       Fix Released    Undecided     Unassigned

Bug Description

In HA deployments, there is a potential race condition between the openvswitch agent and other agents that "own", depend on, or manipulate ports. When the neutron server resumes after a failover, it is not immediately aware of openvswitch agents that were also activated by the failover, and it acts as though there are no active openvswitch agents (openvswitch is only an example; this most likely affects other L2 agents too). If an agent such as the L3 agent starts and begins its resync before the neutron server is aware of the active openvswitch agent, the ports for the routers on that agent are marked "binding_failed". This is currently a "terminal" state for the port, because neutron does not attempt to rebind failed bindings on the same host.

Unfortunately, the neutron agents give the outside service manager (systemd, pacemaker, etc.) no deterministic indication, not even a best-effort one, that they have fully initialized and that the neutron server should be aware that they are active. Agents should follow the same pattern as WSGI-based services and notify systemd once it can reasonably be assumed that the neutron server knows they are alive. That way, service startup ordering logic or constraints can start an agent that depends on other agents *after* neutron should be aware that the required agents are active.
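The pattern proposed above can be sketched roughly as follows. This is only an illustration, not the code of any actual fix: the `Agent` class, its `heartbeat` method and the `report_state` callable are hypothetical names, and the sketch talks to the socket named by the `NOTIFY_SOCKET` environment variable directly (the sd_notify protocol used by `Type=notify` units) instead of going through a library helper.

```python
import os
import socket


def notify_systemd_ready(state=b"READY=1"):
    """Send a readiness datagram to systemd; return True if one was sent."""
    addr = os.environ.get("NOTIFY_SOCKET")
    if not addr:
        return False  # not running under a Type=notify unit
    if addr.startswith("@"):
        addr = "\0" + addr[1:]  # "@" prefix means abstract namespace socket
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state, addr)
    return True


class Agent:
    """Hypothetical agent that signals readiness after its first report."""

    def __init__(self, report_state):
        # report_state: callable that reports to the server and raises
        # if the report could not be delivered (an assumption of this sketch).
        self._report_state = report_state
        self._first_report_done = False

    def heartbeat(self):
        self._report_state()
        if not self._first_report_done:
            # Only now is it reasonable to assume the server knows the
            # agent is alive, so only now tell the service manager.
            self._first_report_done = True
            notify_systemd_ready()
```

With something like this in place, a dependent service could be ordered after the agent's unit, and systemd would hold it back until the agent's first state report has gone through.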

Miguel Angel Ajo (mangelajo) wrote :

I've triaged this bug myself; you can reproduce it by:

1) starting 2 or 3 network nodes and setting up HA routers
2) creating a few HA routers (10 would suffice)
3) stopping the ovs-agent, l3-agent and dhcp-agent on all the nodes for T > agent_down_time
4) starting them all at once.

About 50% of the time:

1) The l3-agent tries to rebind some of the router ports before any ovs-agent has reported itself (via heartbeat) as UP.
2) As a result, the port is moved into the binding failed status.
3) Then the ovs-agent boots up and puts the ports on the dead internal VLAN (4095).
4) This recovers if you restart the l3-agent, because it tries to rebind the port again, and by then an agent is up.
[5) I'm not sure now whether you also needed to restart the OVS agent or not]
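The timing problem in these steps can be modelled with a small toy. It is a sketch only: `ServerView` and its methods are hypothetical and are not neutron code, and the `agent_down_time` value is just illustrative. The point is that a bind attempt arriving after the old heartbeat has gone stale, but before the restarted ovs-agent has sent a new one, fails.

```python
from dataclasses import dataclass, field


@dataclass
class ServerView:
    """Hypothetical stand-in for the server's view of agent liveness."""

    agent_down_time: float
    # Maps (host, agent_type) to the time of the last heartbeat seen.
    last_heartbeat: dict = field(default_factory=dict)

    def heartbeat(self, host, agent_type, now):
        self.last_heartbeat[(host, agent_type)] = now

    def is_alive(self, host, agent_type, now):
        seen = self.last_heartbeat.get((host, agent_type))
        return seen is not None and now - seen < self.agent_down_time

    def bind_port(self, host, now):
        # Binding needs an L2 agent the server currently believes is alive.
        return "bound" if self.is_alive(host, "ovs", now) else "binding_failed"


server = ServerView(agent_down_time=75)
server.heartbeat("net1", "ovs", now=0)      # last report before the outage
early = server.bind_port("net1", now=100)   # l3-agent resyncs first
server.heartbeat("net1", "ovs", now=101)    # ovs-agent reports afterwards
late = server.bind_port("net1", now=102)    # a retry would now succeed
```

Here `early` ends up as "binding_failed" and `late` as "bound", which matches the observation that restarting the l3-agent (forcing another bind attempt) recovers the ports.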

Changed in neutron:
importance: Undecided → Medium
tags: added: ovs
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/254920
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=398d10e323216d46f0152a90f15e17e0c18aa0a2
Submitter: Jenkins
Branch: master

commit 398d10e323216d46f0152a90f15e17e0c18aa0a2
Author: Brent Eagles <email address hidden>
Date: Tue Dec 8 12:32:21 2015 -0330

    Add systemd notification after reporting initial state

    This patch adds a notification for systemd after the agent has reported
    its initial state to the Neutron server. This enables configuring
    orderly startup of services that are dependent on the server having a
    healthy openvswitch agent running.

    Related-Bug: #1525901

    Change-Id: I8d08f1b2ae196b1e48f9d91e06966687c0a8bd43

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/liberty)

Related fix proposed to branch: stable/liberty
Review: https://review.openstack.org/270889

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/kilo)

Related fix proposed to branch: stable/kilo
Review: https://review.openstack.org/273635

OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/270889
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=e80f5dc2498ee4bca315efcf4c6f5a203f3486af
Submitter: Jenkins
Branch: stable/liberty

commit e80f5dc2498ee4bca315efcf4c6f5a203f3486af
Author: Brent Eagles <email address hidden>
Date: Tue Dec 8 12:32:21 2015 -0330

    Add systemd notification after reporting initial state

    This patch adds a notification for systemd after the agent has reported
    its initial state to the Neutron server. This enables configuring
    orderly startup of services that are dependent on the server having a
    healthy openvswitch agent running.

    Conflicts:
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
     neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py

    Related-Bug: #1525901

    Change-Id: I8d08f1b2ae196b1e48f9d91e06966687c0a8bd43
    (cherry picked from commit 398d10e323216d46f0152a90f15e17e0c18aa0a2)

tags: added: in-stable-liberty
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/kilo)

Reviewed: https://review.openstack.org/273635
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=da9f4f17b051bbb565cf92b2c32bdf5a1ce09027
Submitter: Jenkins
Branch: stable/kilo

commit da9f4f17b051bbb565cf92b2c32bdf5a1ce09027
Author: Brent Eagles <email address hidden>
Date: Tue Dec 8 12:32:21 2015 -0330

    Add systemd notification after reporting initial state

    This patch adds a notification for systemd after the agent has reported
    its initial state to the Neutron server. This enables configuring
    orderly startup of services that are dependent on the server having a
    healthy openvswitch agent running.

    Conflicts:
     neutron/plugins/openvswitch/agent/ovs_neutron_agent.py
     neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py

    Related-Bug: #1525901

    Change-Id: I8d08f1b2ae196b1e48f9d91e06966687c0a8bd43
    (cherry picked from commit 398d10e323216d46f0152a90f15e17e0c18aa0a2)

tags: added: in-stable-kilo
Changed in neutron:
status: New → Fix Released