state machines get lost after failovers
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Astara | Fix Released | High | Adam Gandelman | | |
Bug Description
Clustering multiple orchestrators and bouncing a resource back and forth between them via failover causes the state machine for a resource to eventually get lost:
1. Spin up one orchestrator on host=astara1 and create a router.
2. Spin up a second orchestrator on host=astara2. A rebalance occurs and the router is now owned by astara2.
3. Issue a POLL command.
4. Spin down the orchestrator on astara2. The router fails back over to astara1.
At this point the router should be managed by astara1, but it is not: commands are ignored and the orchestrator thinks the resource has been deleted. After step 2, some cleanup happens on state machines that are no longer managed. I believe this cleanup confuses the TRM (tenant resource manager) into treating the unmanaged resource as deleted from Neutron.
Changed in astara: | |
status: | New → Incomplete |
status: | Incomplete → New |
importance: | Undecided → High |
assignee: | nobody → Adam Gandelman (gandelman-a) |
Changed in astara: | |
milestone: | none → mitaka-2 |
So after a rebalance, when a resource is mapped away from astara1 over to astara2, astara1 attempts to clean up its local state machines and remove management of the resource from its tenant resource manager. It does this by reusing the same code we use for deleting a resource, which flags the resource as having been deleted. Upon the next rebalance, astara1 attempts to recreate state machines for the resources that now map to it, but creation doesn't happen because the resource was flagged as deleted.
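The failure mode described above can be illustrated with a minimal sketch. This is not the actual Astara code: the class `TenantResourceManager` and the methods `unmanage_buggy`, `unmanage_fixed` and `fail_back` are hypothetical names invented for illustration, and the "fixed" variant is only one possible approach (dropping local state without setting the deleted flag), not necessarily the change that was released.

```python
class TenantResourceManager:
    """Hypothetical, simplified stand-in; not the actual Astara class."""

    def __init__(self):
        self.state_machines = {}  # resource_id -> state machine (a dict here)
        self.deleted = set()      # resource ids believed deleted from Neutron

    def get_or_create(self, resource_id):
        # Creation is refused for anything flagged as deleted.
        if resource_id in self.deleted:
            return None
        return self.state_machines.setdefault(resource_id, {"id": resource_id})

    def delete_resource(self, resource_id):
        # Real deletion: drop the state machine and remember the deletion.
        self.state_machines.pop(resource_id, None)
        self.deleted.add(resource_id)

    def unmanage_buggy(self, resource_id):
        # Reuses the delete path after a rebalance, so the deleted flag leaks.
        self.delete_resource(resource_id)

    def unmanage_fixed(self, resource_id):
        # Drops local state only; the resource still exists in Neutron.
        self.state_machines.pop(resource_id, None)


def fail_back(unmanage_name):
    """Replay the scenario: own, lose on rebalance, regain on failover."""
    trm = TenantResourceManager()
    trm.get_or_create("router-1")            # astara1 owns the router
    getattr(trm, unmanage_name)("router-1")  # rebalance moves it to astara2
    return trm.get_or_create("router-1")     # astara2 goes down, router maps back


if __name__ == "__main__":
    print("buggy:", fail_back("unmanage_buggy"))
    print("fixed:", fail_back("unmanage_fixed"))
```

In the buggy path the final `get_or_create` call returns nothing because the earlier cleanup marked the router as deleted, which matches the symptom of commands being ignored after failback. Keeping "stop managing" separate from "deleted" lets the state machine be recreated.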