Resync OVS, L3, DHCP agents upon revival

Bug #1505166 reported by Eugene Nikanorov
This bug affects 2 people
Affects: neutron
Status: Fix Released
Importance: High
Assigned to: Eugene Nikanorov

Bug Description

In some cases, on a loaded cloud where neutron runs over RabbitMQ in clustered mode, one of the RabbitMQ cluster members can get stuck replicating queues.
During that period, agents connected through that instance cannot communicate with the server or send heartbeats.

Neutron-server will reschedule resources away from such agents in that case. Afterwards, when RabbitMQ finishes syncing, the agents will "revive" but will not do anything to clean up the resources that were rescheduled during their "sleep".

As a result, resources can be left in a failed or conflicting state (dhcp/router namespaces, ports with binding_failed).
They should be either deleted or synchronized with the server state.
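The failure sequence described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: the `Agent` class, the `reschedule_from_dead` helper, and the 75-second threshold are hypothetical stand-ins chosen for illustration and are not taken from the neutron code base.

```python
from dataclasses import dataclass, field

# Illustrative threshold in seconds; an assumption, not a value from this report.
AGENT_DOWN_TIME = 75


@dataclass
class Agent:
    name: str
    last_heartbeat: float
    # Resources the agent has configured locally (e.g. router namespaces).
    local_resources: set = field(default_factory=set)


def is_alive(agent, now):
    """Server-side view: an agent is alive if its heartbeat is recent enough."""
    return now - agent.last_heartbeat <= AGENT_DOWN_TIME


def reschedule_from_dead(agents, bindings, now):
    """Rebind every resource owned by a dead agent to the first live agent."""
    live = [a for a in agents if is_alive(a, now)]
    for resource, owner in list(bindings.items()):
        if not is_alive(owner, now) and live:
            new_owner = live[0]
            bindings[resource] = new_owner
            new_owner.local_resources.add(resource)


a = Agent("agent-a", last_heartbeat=0.0, local_resources={"router-1"})
b = Agent("agent-b", last_heartbeat=0.0)
bindings = {"router-1": a}

# agent-b keeps reporting; agent-a is cut off by the stuck RabbitMQ node.
b.last_heartbeat = 100.0
reschedule_from_dead([a, b], bindings, now=100.0)

# agent-a "revives" and reports again, but never cleans up its local state.
a.last_heartbeat = 130.0
stale = {r for r in a.local_resources if bindings[r] is not a}
```

After the run, `stale` holds the resources that agent-a still has configured locally even though the server has bound them elsewhere, which is the conflicting state the report asks to have deleted or resynchronized.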

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/233557

Changed in neutron:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by enikanorov (<email address hidden>) on branch: master
Review: https://review.openstack.org/233557
Reason: accidentally changed the Change-Id.

Changed in neutron:
importance: Undecided → Medium
Ryan Moats (rmoats) wrote :

LP missed this:

Fix proposed to branch: master
Review: https://review.openstack.org/232661

Further, this problem is being seen by operators running kilo, so marking for liberty/kilo backport potential.

tags: added: kilo-backport-potential liberty-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/232661
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3b6bd917e4b968a47a5aacb7f590143fc83816d9
Submitter: Jenkins
Branch: master

commit 3b6bd917e4b968a47a5aacb7f590143fc83816d9
Author: Eugene Nikanorov <email address hidden>
Date: Mon Oct 12 13:59:01 2015 +0400

    Resync L3, DHCP and OVS/LB agents upon revival

    In big and busy clusters there could be a condition when
    rabbitmq clustering mechanism synchronizes queues and during
    this period agents connected to that instance of rabbitmq
    can't communicate with the server and server considers them
    dead moving resources away. After agent become active again,
    it needs to cleanup state entries and synchronize its state
    with neutron-server.
    The solution is to make agents aware of their state from
    neutron-server point of view. This is done by changing state
    reports from cast to call that would return agent's status.
    When agent was dead and becomes alive, it would receive special
    AGENT_REVIVED status indicating that it should refresh its
    local data which it would not do otherwise.

    Closes-Bug: #1505166
    Change-Id: Id28248f4f75821fbacf46e2c44e40f27f59172a9
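
The mechanism in the commit message (the state report becomes a blocking call whose return value tells the agent how the server sees it) can be sketched roughly as follows. This is a minimal illustration, not the merged patch: `FakeServer`, `FakeAgent`, the `needs_full_sync` flag, and the string values of the status constants are assumptions; only the AGENT_REVIVED name comes from the commit message.

```python
# Status values returned by the server. Only the AGENT_REVIVED name appears in
# the commit message; the other names and all string values are assumptions.
AGENT_NEW = "new"
AGENT_ALIVE = "alive"
AGENT_REVIVED = "revived"


class FakeServer:
    """Stand-in for neutron-server's handling of agent state reports."""

    def __init__(self, down_time):
        self.down_time = down_time
        self.last_seen = {}

    def report_state(self, agent_id, now):
        # Because this is a call rather than a cast, the agent gets a reply.
        previous = self.last_seen.get(agent_id)
        self.last_seen[agent_id] = now
        if previous is None:
            return AGENT_NEW
        if now - previous > self.down_time:
            # The server had considered this agent dead in the meantime.
            return AGENT_REVIVED
        return AGENT_ALIVE


class FakeAgent:
    """Stand-in for an L3, DHCP or OVS/LB agent's periodic report loop."""

    def __init__(self, agent_id, server):
        self.agent_id = agent_id
        self.server = server
        self.needs_full_sync = False

    def report_state(self, now):
        status = self.server.report_state(self.agent_id, now)
        if status == AGENT_REVIVED:
            # Schedule a refresh of local data that would not happen otherwise.
            self.needs_full_sync = True
        return status


server = FakeServer(down_time=75)
agent = FakeAgent("ovs-agent-1", server)
# Reports at t=0, 30 and 60 are regular; the gap before t=200 exceeds down_time.
statuses = [agent.report_state(t) for t in (0, 30, 60, 200)]
```

The trade-off in this design is that a call waits for a reply while a cast does not, so the agent's report loop becomes sensitive to server and message-bus latency in exchange for learning its own status.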

Changed in neutron:
status: In Progress → Fix Committed
Changed in neutron:
importance: Medium → High
Thierry Carrez (ttx) wrote : Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Changed in neutron:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/271804

OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/271804
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aebd27f3b7cbbb6277440de029e773eb781dec9e
Submitter: Jenkins
Branch: stable/liberty

commit aebd27f3b7cbbb6277440de029e773eb781dec9e
Author: Eugene Nikanorov <email address hidden>
Date: Mon Oct 12 13:59:01 2015 +0400

    Resync L3, DHCP and OVS/LB agents upon revival

    In big and busy clusters there could be a condition when
    rabbitmq clustering mechanism synchronizes queues and during
    this period agents connected to that instance of rabbitmq
    can't communicate with the server and server considers them
    dead moving resources away. After agent become active again,
    it needs to cleanup state entries and synchronize its state
    with neutron-server.
    The solution is to make agents aware of their state from
    neutron-server point of view. This is done by changing state
    reports from cast to call that would return agent's status.
    When agent was dead and becomes alive, it would receive special
    AGENT_REVIVED status indicating that it should refresh its
    local data which it would not do otherwise.

    Conflicts:
     neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
     neutron/tests/unit/agent/dhcp/test_agent.py
     neutron/tests/unit/plugins/ml2/drivers/linuxbridge/agent/test_linuxbridge_neutron_agent.py
     neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py

    Closes-Bug: #1505166
    Change-Id: Id28248f4f75821fbacf46e2c44e40f27f59172a9
    (cherry picked from commit 3b6bd917e4b968a47a5aacb7f590143fc83816d9)

tags: added: in-stable-liberty
Doug Hellmann (doug-hellmann) wrote : Fix included in openstack/neutron 7.0.3

This issue was fixed in the openstack/neutron 7.0.3 release.

tags: removed: kilo-backport-potential liberty-backport-potential