Allocations may not be removed from dest node during failed migrations
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | High | Matt Riedemann |
Pike | Fix Committed | High | Matt Riedemann |
Bug Description
This could also be true for cold migrate/
As of this change in Pike:
https:/
Once all computes are upgraded, the resource tracker will no longer "heal" allocations in Placement for its local node. Healing means creating allocations for the node if the instance is on it, or removing allocations for the instance if the instance is not on the node.
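To make the "healing" behaviour concrete, here is a minimal sketch in Python. It is not nova's resource tracker code: the function name `heal_allocations` and the dict-based stand-in for Placement are assumptions made purely for illustration.

```python
def heal_allocations(placement, node, instances_on_node, resources_by_instance):
    """Hypothetical sketch of "healing" allocations for one compute node.

    placement: dict mapping (instance_uuid, node) -> resources dict; this is
    an in-memory stand-in for Placement, not the real API.
    """
    # Create allocations for instances that are on this node but have none.
    for instance in instances_on_node:
        placement.setdefault((instance, node), resources_by_instance[instance])
    # Remove allocations held against this node by instances that are not on it.
    stale = [key for key in placement
             if key[1] == node and key[0] not in instances_on_node]
    for key in stale:
        del placement[key]
    return placement


placement = {
    ("inst-a", "node1"): {"VCPU": 1},
    ("inst-b", "node1"): {"VCPU": 2},  # inst-b is no longer on node1
}
healed = heal_allocations(
    placement,
    "node1",
    instances_on_node={"inst-a", "inst-c"},
    resources_by_instance={"inst-a": {"VCPU": 1}, "inst-c": {"VCPU": 4}},
)
```

In this sketch the stale `inst-b` allocation is dropped and a missing `inst-c` allocation is created. Without such a pass, nothing on the compute node corrects a leftover allocation.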
During live migration, conductor will call the scheduler to select a host, and the scheduler is also going to claim resources against the dest node:
https:/
https:/
https:/
The problem during live migration is that once the scheduler picks a host, conductor performs some additional checks:
https:/
These checks could fail, in which case conductor will retry the scheduler to get another host, until one is found that passes the pre-migration checks or the number of retries is exhausted.
The problem is the allocation created in Placement for the destination node, which failed some later pre-migration check, is never cleaned up if the update_
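The leak can be illustrated with a small Python sketch of the retry loop. Everything here is hypothetical: `FakePlacement`, `find_destination`, and the `passes_checks` callback are stand-ins for illustration, not nova or Placement APIs.

```python
class FakePlacement:
    """In-memory stand-in for Placement; tracks (instance_uuid, node) pairs."""

    def __init__(self):
        self.allocations = set()

    def claim(self, instance, node):
        self.allocations.add((instance, node))


def find_destination(placement, instance, candidates, passes_checks):
    """Hypothetical retry loop that never cleans up after a failed check."""
    for node in candidates:
        # The scheduler picks the node and claims resources against it.
        placement.claim(instance, node)
        # Conductor then runs its additional pre-migration checks.
        if passes_checks(node):
            return node
        # On failure we simply retry; the claim on `node` is left behind.
    return None


placement = FakePlacement()
dest = find_destination(
    placement, "inst-1", ["node-a", "node-b"],
    passes_checks=lambda node: node == "node-b",
)
# node-a failed its checks, yet ("inst-1", "node-a") is still allocated.
leaked = placement.allocations - {("inst-1", dest)}
```

Here `leaked` contains the allocation against `node-a`, the host that was rejected after the claim was made.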
We could roll back the allocation in conductor on a failure, or we could put some other kind of periodic cleanup task in the compute service which looks for failed migrations where the destination node in the migration record is for that node, and removes any failed allocations for that node and the given instance.
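Both options can be sketched in Python as follows. This is only an illustration under stated assumptions: allocations are modelled as a set of (instance_uuid, node) pairs, migration records as plain dicts with hypothetical keys, and neither function reflects the fix that was actually merged.

```python
def select_destination_with_rollback(allocations, instance, candidates, passes_checks):
    """Option 1 (sketch): roll back the claim as soon as a check fails."""
    for node in candidates:
        allocations.add((instance, node))          # claim made at scheduling time
        if passes_checks(node):
            return node
        allocations.discard((instance, node))      # undo the claim before retrying
    return None


def cleanup_failed_migrations(allocations, migrations, this_node):
    """Option 2 (sketch): periodic task run on the compute service for this_node.

    migrations: iterable of dicts with hypothetical keys
    "status", "dest_node" and "instance_uuid".
    """
    removed = []
    for migration in migrations:
        if migration["status"] == "failed" and migration["dest_node"] == this_node:
            key = (migration["instance_uuid"], this_node)
            if key in allocations:
                allocations.discard(key)
                removed.append(key)
    return removed


allocations = set()
dest = select_destination_with_rollback(
    allocations, "inst-1", ["node-a", "node-b"],
    passes_checks=lambda node: node == "node-b",
)
```

The rollback option cleans up immediately but only covers failures that conductor sees. The periodic task also catches allocations leaked by other paths, at the cost of a delay. A real cleanup task would additionally need to confirm that the instance is not actually running on the node before removing its allocation.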
Changed in nova:
status: New → Triaged
Changed in nova:
importance: Undecided → High
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Semi-related, but probably a separate bug, is that there is the ability to cancel an in-progress live migration and that also does not remove the allocation on the destination node for the instance.