state machines get lost after failovers

Bug #1527396 reported by Adam Gandelman on 2015-12-17

Bug Description

Clustering multiple orchestrators and bouncing a resource back and forth between them via failover causes the state machine for a resource to eventually get lost:

1. Spin up 1 orchestrator on host=astara1, create a router.
2. Spin up a 2nd orchestrator on host=astara2, rebalance occurs and router is now owned by astara2
3. Issue a POLL command.
4. Spin down orchestrator on astara2, router fails back over to astara1.

At this point the router should be managed by astara1, but it is not: commands are ignored, and the orchestrator behaves as if the resource has been deleted. After step 2 some cleanup runs on the no-longer-managed state machines; I believe this leads the tenant resource manager (TRM) to conclude that the unmanaged resource has actually been deleted from Neutron.

Changed in astara:
status: New → Incomplete
status: Incomplete → New
importance: Undecided → High
assignee: nobody → Adam Gandelman (gandelman-a)
Adam Gandelman (gandelman-a) wrote :

So after a rebalance, when a resource is mapped away from astara1 over to astara2, astara1 attempts to clean up its local state machines and remove management of the resource from its tenant resource manager. It does this by recycling the same code we use for deleting a resource, which flags the resource as having been deleted. Upon the next rebalance, astara1 attempts to recreate state machines for resources that now map to it, but creation never happens because the resource was flagged as deleted.
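The root cause described above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`TenantResourceManager`, `unmanage`, `get_state_machine`), not the actual astara API: reusing the delete path for rebalance cleanup also sets the "deleted" flag, which then blocks recreation on failback.

```python
class TenantResourceManager:
    """Toy model of a TRM that tracks per-resource state machines."""

    def __init__(self):
        self.state_machines = {}
        self.deleted = set()  # resources believed deleted from Neutron

    def delete(self, resource_id):
        # Real deletion: drop the state machine and remember the deletion
        # so late-arriving commands for the resource are ignored.
        self.state_machines.pop(resource_id, None)
        self.deleted.add(resource_id)

    def unmanage(self, resource_id):
        # Buggy rebalance cleanup: recycles delete(), so the resource is
        # also flagged as deleted even though it still exists in Neutron.
        self.delete(resource_id)

    def get_state_machine(self, resource_id):
        if resource_id in self.deleted:
            return None  # flagged deleted: commands are silently dropped
        return self.state_machines.setdefault(resource_id, object())


trm = TenantResourceManager()
trm.get_state_machine('router-1')   # astara1 manages the router
trm.unmanage('router-1')            # rebalance: router moves to astara2
sm = trm.get_state_machine('router-1')  # failback: SM is never recreated
```

After the failback, `sm` is `None`: the TRM refuses to recreate the state machine because the rebalance cleanup left the resource flagged as deleted.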

Fix proposed to branch: master

Changed in astara:
status: New → In Progress
Changed in astara:
milestone: none → mitaka-2

Submitter: Jenkins
Branch: master

commit f2360d861f3904c8a06d94175be553fe5e7bab05
Author: Adam Gandelman <email address hidden>
Date: Thu Dec 17 15:16:35 2015 -0800

    Cleanup SM management during rebalance events.

    This cleans up the worker's handling of rebalance events a bit
    and ensures we don't drop state machines in a way that prevents
    them from later being recreated. It also avoids a bug where, upon
    failing over resources to a new orchestrator, we create a state
    machine per worker, instead of dispatching them to one single worker.

    To do this, the scheduler is passed into workers as well as the
    process name, allowing them to more intelligently figure out what
    they need to manage after a cluster event.

    Finally, this ensures a config update is issued to appliances after
    they have moved to a new orchestrator after a cluster event.

    Change-Id: I76bf702c33ac6ff831270e7185a6aa3fc4c464ca
    Partial-bug: #1524068
    Closes-bug: #1527396
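The dispatch idea in the fix can be sketched as follows. This is a hedged illustration, not the actual astara code: the names (`Scheduler`, `Worker`, `target_worker`, `handle_rebalance`) and the md5-based hashing are assumptions. The point is that, with the scheduler and its own name in hand, each worker can decide after a cluster event which resources it owns, so exactly one worker recreates a state machine and the others drop theirs without flagging deletion.

```python
import hashlib


class Scheduler:
    """Maps resource IDs to worker names via a stable hash."""

    def __init__(self, workers):
        self.workers = workers

    def target_worker(self, resource_id):
        # Stable hash so every worker computes the same owner.
        digest = hashlib.md5(resource_id.encode()).hexdigest()
        return self.workers[int(digest, 16) % len(self.workers)]


class Worker:
    def __init__(self, name, scheduler):
        self.name = name
        self.scheduler = scheduler
        self.state_machines = {}

    def handle_rebalance(self, resource_ids):
        for rid in resource_ids:
            if self.scheduler.target_worker(rid) == self.name:
                # This worker owns the resource: (re)create its state
                # machine (and, per the fix, issue a config update to
                # the appliance now on a new orchestrator).
                self.state_machines.setdefault(rid, object())
            else:
                # Not ours: drop the SM without marking it deleted, so
                # it can be recreated here after a later cluster event.
                self.state_machines.pop(rid, None)


sched = Scheduler(['worker-0', 'worker-1'])
workers = [Worker(n, sched) for n in sched.workers]
for w in workers:
    w.handle_rebalance(['router-1'])
owners = [w.name for w in workers if 'router-1' in w.state_machines]
# exactly one worker ends up managing router-1
```

The key design point, per the commit message, is passing the scheduler and process name into the workers so ownership is computed consistently everywhere, rather than every worker creating its own state machine for a failed-over resource.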

Changed in astara:
status: In Progress → Fix Released