neutron

Resync OVS, L3, DHCP agents upon revival

Bug #1505166 reported by Eugene Nikanorov on 2015-10-12

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
neutron	Fix Released	High	Eugene Nikanorov

Bug Description

In some cases, on a loaded cloud where neutron runs over rabbitmq in clustered mode, one of the rabbitmq cluster members can get stuck replicating queues.
During that period, agents that connect via that instance can't communicate or send heartbeats.

Neutron-server will reschedule resources away from such agents in that case. Later, when rabbitmq finishes syncing, the agents will "revive", but will not do anything to clean up resources that were rescheduled during their "sleep".

As a result, resources can be left in a failed or conflicting state (dhcp/router namespaces, ports with binding_failed).
They should be either deleted or synchronized with the server state.
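
To illustrate the failure mode, here is a minimal Python sketch of the server side. It is not neutron code: the names `AGENT_DOWN_TIME`, `is_agent_down` and `reschedule_from_dead_agents` are hypothetical, and the 75-second threshold is an assumed value chosen for illustration. It only shows how a missed heartbeat leads to resources being moved while the silent agent keeps its local state.

```python
from datetime import datetime, timedelta

# Assumed threshold for illustration; not taken from this bug report.
AGENT_DOWN_TIME = timedelta(seconds=75)


def is_agent_down(last_heartbeat, now):
    """An agent is considered dead once its last heartbeat is too old."""
    return now - last_heartbeat > AGENT_DOWN_TIME


def reschedule_from_dead_agents(agents, bindings, now):
    """Move resources bound to dead agents onto the first live agent.

    agents:   dict mapping host name -> datetime of last heartbeat
    bindings: dict mapping resource id -> host name (updated in place)
    Returns a dict of resource id -> previous host for everything moved.
    """
    alive = [host for host, seen in agents.items()
             if not is_agent_down(seen, now)]
    moved = {}
    for resource, host in list(bindings.items()):
        if host not in alive and alive:
            bindings[resource] = alive[0]
            moved[resource] = host
    return moved
```

In this sketch, the host that went silent is never told that its resources were moved, so whatever it configured locally (a namespace, for example) stays in place after it reconnects.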

OpenStack Infra (hudson-openstack) wrote on 2015-10-12: Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/233557

Changed in neutron:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2015-10-12: Change abandoned on neutron (master)

Change abandoned by enikanorov (<email address hidden>) on branch: master
Review: https://review.openstack.org/233557
Reason: accidentally changed change-id.

Ihar Hrachyshka (ihar-hrachyshka) on 2015-10-12

Changed in neutron:
importance:	Undecided → Medium

Ryan Moats (rmoats) wrote on 2015-11-10:

LP missed this:

Fix proposed to branch: master
Review: https://review.openstack.org/232661

Further, operators running kilo are seeing this problem, so marking for liberty/kilo backport potential.

tags:

added: kilo-backport-potential liberty-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2015-11-19: Fix merged to neutron (master)

Reviewed: https://review.openstack.org/232661
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3b6bd917e4b968a47a5aacb7f590143fc83816d9
Submitter: Jenkins
Branch: master

commit 3b6bd917e4b968a47a5aacb7f590143fc83816d9
Author: Eugene Nikanorov <email address hidden>
Date: Mon Oct 12 13:59:01 2015 +0400

Resync L3, DHCP and OVS/LB agents upon revival

    In big and busy clusters there could be a condition when
    rabbitmq clustering mechanism synchronizes queues and during
    this period agents connected to that instance of rabbitmq
    can't communicate with the server and server considers them
    dead moving resources away. After agent become active again,
    it needs to cleanup state entries and synchronize its state
    with neutron-server.
    The solution is to make agents aware of their state from
    neutron-server point of view. This is done by changing state
    reports from cast to call that would return agent's status.
    When agent was dead and becomes alive, it would receive special
    AGENT_REVIVED status indicating that it should refresh its
    local data which it would not do otherwise.

Closes-Bug: #1505166
Change-Id: Id28248f4f75821fbacf46e2c44e40f27f59172a9
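
The approach described in the commit message can be sketched as follows. This is a simplified model, not the code merged in the change: `FakeServer`, `FakeAgent`, `needs_full_sync`, the `down_time` parameter and the `AGENT_NEW`/`AGENT_ALIVE` names and all string values are assumptions made for illustration. Only the `AGENT_REVIVED` status name and the idea of a state report that returns the agent's status come from the commit message.

```python
# Status values are assumed; only the AGENT_REVIVED name is from the commit.
AGENT_NEW = "new"
AGENT_ALIVE = "alive"
AGENT_REVIVED = "revived"


class FakeServer:
    """Stands in for neutron-server: tracks heartbeats, returns a status."""

    def __init__(self, down_time=75):
        self.down_time = down_time   # seconds of silence before "dead"
        self.last_seen = {}          # host -> timestamp of last report

    def report_state(self, host, now):
        # Modeled as a blocking call (not a cast) so the agent gets a reply.
        previous = self.last_seen.get(host)
        self.last_seen[host] = now
        if previous is None:
            return AGENT_NEW
        if now - previous > self.down_time:
            # The server had considered this agent dead in the meantime.
            return AGENT_REVIVED
        return AGENT_ALIVE


class FakeAgent:
    """Stands in for an L3/DHCP/OVS agent reporting its state."""

    def __init__(self, host, server):
        self.host = host
        self.server = server
        self.needs_full_sync = False

    def report_state(self, now):
        status = self.server.report_state(self.host, now)
        if status == AGENT_REVIVED:
            # Flag a full resync so stale local resources get cleaned up.
            self.needs_full_sync = True
        return status
```

With a fire-and-forget cast the agent has no way to learn that it was declared dead. Returning the status from the report gives it the signal it needs to schedule a resync.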

Changed in neutron:
status:	In Progress → Fix Committed

Ihar Hrachyshka (ihar-hrachyshka) on 2015-11-27

Changed in neutron:
importance:	Medium → High

Thierry Carrez (ttx) wrote on 2015-12-03: Fix included in openstack/neutron 8.0.0.0b1

This issue was fixed in the openstack/neutron 8.0.0.0b1 development milestone.

Doug Hellmann (doug-hellmann) on 2015-12-03

Changed in neutron:
status:	Fix Committed → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-01-24: Fix proposed to neutron (stable/liberty)

Fix proposed to branch: stable/liberty
Review: https://review.openstack.org/271804

OpenStack Infra (hudson-openstack) wrote on 2016-01-29: Fix merged to neutron (stable/liberty)

Reviewed: https://review.openstack.org/271804
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=aebd27f3b7cbbb6277440de029e773eb781dec9e
Submitter: Jenkins
Branch: stable/liberty

commit aebd27f3b7cbbb6277440de029e773eb781dec9e
Author: Eugene Nikanorov <email address hidden>
Date: Mon Oct 12 13:59:01 2015 +0400

Resync L3, DHCP and OVS/LB agents upon revival

    Conflicts:
     neutron/plugins/ml2/drivers/linuxbridge/agent/linuxbridge_neutron_agent.py
     neutron/plugins/ml2/drivers/openvswitch/agent/ovs_neutron_agent.py
     neutron/tests/unit/agent/dhcp/test_agent.py
     neutron/tests/unit/plugins/ml2/drivers/linuxbridge/agent/test_linuxbridge_neutron_agent.py
     neutron/tests/unit/plugins/ml2/drivers/openvswitch/agent/test_ovs_neutron_agent.py

    Closes-Bug: #1505166
    Change-Id: Id28248f4f75821fbacf46e2c44e40f27f59172a9
    (cherry picked from commit 3b6bd917e4b968a47a5aacb7f590143fc83816d9)

tags:

added: in-stable-liberty

Doug Hellmann (doug-hellmann) wrote on 2016-02-10: Fix included in openstack/neutron 7.0.3

This issue was fixed in the openstack/neutron 7.0.3 release.

Ihar Hrachyshka (ihar-hrachyshka) on 2016-10-07

tags:

removed: kilo-backport-potential liberty-backport-potential

Duplicates of this bug

Bug #1433940
