Reverting migration-based allocations leaks allocations if the server is deleted
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Triaged | Medium | Unassigned | |
| Queens | | Undecided | Unassigned | |
| Rocky | | Undecided | Unassigned | |
| Stein | | Undecided | Unassigned | |
| Train | | Undecided | Unassigned | |
Bug Description
This came up in the cross-cell resize review:
https:/
And I was able to recreate with a functional test here:
https:/
That test is doing a cross-cell cold migration, but looking at the code, we can hit the same issue for a same-cell resize/cold migrate once we have swapped the allocations, so that the source node allocations are held by the migration consumer and the instance holds allocations on the target node (created by the scheduler):
If something fails between that swap and the cast to prep_resize, the task will roll back and revert the allocations, so the target node allocations are dropped and the source node allocations are moved back to the instance:
Furthermore, if the instance was deleted by the time we perform that revert swap, the move_allocations method will recreate the allocations on the source node for the now-deleted instance, since we don't assert consumer generations during the swap:
This results in leaking allocations for the source node since the instance is deleted.
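To make the failure mode concrete, here is a minimal, self-contained sketch that simulates placement's consumer-generation semantics. This is hypothetical illustration code, not Nova's implementation: FakePlacement, its methods, this move_allocations helper and the UUID strings are all made up, and the real move_allocations swaps both consumers in a single PUT /allocations request rather than two writes.

```python
class FakePlacement:
    """Toy model of placement's per-consumer allocations and generations."""

    def __init__(self):
        # consumer_id -> {'generation': int, 'allocations': {rp: resources}}
        self.consumers = {}

    def get_allocations(self, consumer_id):
        record = self.consumers.get(consumer_id)
        if record is None:
            # Like placement, an unknown consumer yields empty allocations
            # and *no* 'consumer_generation' key at all.
            return {'allocations': {}}
        return {'allocations': dict(record['allocations']),
                'consumer_generation': record['generation']}

    def put_allocations(self, consumer_id, allocations, consumer_generation):
        record = self.consumers.get(consumer_id)
        if record is None:
            # Creating a new consumer requires a generation of None.
            if consumer_generation is not None:
                raise ValueError('consumer generation conflict')
            if allocations:
                self.consumers[consumer_id] = {
                    'generation': 1, 'allocations': dict(allocations)}
        elif consumer_generation != record['generation']:
            raise ValueError('consumer generation conflict')
        elif allocations:
            record['allocations'] = dict(allocations)
            record['generation'] += 1
        else:
            # An empty allocations PUT removes the consumer entirely.
            del self.consumers[consumer_id]


def move_allocations(placement, source_id, target_id):
    """Move all allocations from one consumer to another.

    Mirrors the problematic pattern: .get('consumer_generation') silently
    returns None when the target consumer no longer exists, and placement
    then treats the PUT as creating a brand-new consumer.
    """
    source = placement.get_allocations(source_id)
    target = placement.get_allocations(target_id)
    placement.put_allocations(
        target_id, {**target['allocations'], **source['allocations']},
        target.get('consumer_generation'))
    placement.put_allocations(
        source_id, {}, source.get('consumer_generation'))


placement = FakePlacement()
# State at the failure point: the source node allocations were already
# swapped to the migration consumer, and the server was then deleted, so
# the instance consumer is gone from placement.
placement.put_allocations(
    'migration-uuid', {'source-node-rp': {'VCPU': 2}}, None)

# The rollback moves the allocations back to the deleted instance...
move_allocations(placement, 'migration-uuid', 'instance-uuid')
# ...and the deleted server holds source node allocations again: leaked.
assert 'instance-uuid' in placement.consumers
```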
Changed in nova:
status: New → Triaged
Matt Riedemann (mriedem) wrote : | #1 |

Note that we could have the same issue in the compute service, for example if the server is deleted during the resize claim and we get to this exception block handler:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4724
To revert the allocations here:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4574
which calls move_allocations, which has the same problem described above.
We could also have the same issue if the instance is gone during resize_instance on the source host:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4896
Or during finish_revert_resize, I guess:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/compute/manager.py#L4454
I'm not sure about the fix yet, but we might want callers to optionally tell the move_allocations method whether it should require the target consumer's (the instance's, in this case) generation, so we don't use a .get() here:
https://github.com/openstack/nova/blob/1a226aaa9e8c969ddfdfe198c36f7966b1f692f3/nova/scheduler/client/report.py#L1886
Actually, if the target consumer no longer exists in placement, the 'consumer_generation' key won't exist in that allocations response and we'd have to handle it earlier, like this:
https://review.opendev.org/#/c/688832/2/nova/scheduler/client/report.py
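Building on the FakePlacement sketch above, here is a rough sketch of that idea. The flag name, exception and helper are hypothetical illustrations, not the actual proposed patch:

```python
class ConsumerNotFound(Exception):
    """The move target no longer exists in placement (hypothetical)."""


def move_allocations_strict(placement, source_id, target_id,
                            require_target_generation=False):
    source = placement.get_allocations(source_id)
    target = placement.get_allocations(target_id)
    if require_target_generation and 'consumer_generation' not in target:
        # The instance is gone from placement, so there is nothing to move
        # the allocations back to; raising here lets the caller delete the
        # source (migration) consumer's allocations instead of leaking.
        raise ConsumerNotFound(target_id)
    placement.put_allocations(
        target_id, {**target['allocations'], **source['allocations']},
        target.get('consumer_generation'))
    placement.put_allocations(
        source_id, {}, source.get('consumer_generation'))
```

With require_target_generation=True, the rollback shown in the description above would raise instead of re-creating allocations for the deleted server.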
Matt Riedemann (mriedem) wrote : | #2 |
I think we can claim this as a regression going back to migration-based allocations added in Queens. Looking at where a resize would fail during prep_resize in Pike, we call this method to remove the 'doubled up' allocations for the instance:
https:/
That's to remove the dest host provider from the allocations created by the scheduler.
In this case if the instance was deleted concurrently and had no allocations, we'd get here and return False:
https:/
And we'd log an error there and here in the resource tracker:
https:/
But we wouldn't leak any allocations for the deleted instance, which is the bug we have had since Queens when reverting the allocation swap after the instance is gone.
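For contrast, here is a rough sketch of the shape of that Pike-era cleanup, reusing the FakePlacement toy above (the function name is illustrative, and Pike's placement predates consumer generations, which the toy still checks): it only rewrote the instance's own current allocations, so a concurrently deleted instance meant nothing was written back and nothing leaked.

```python
def remove_dest_from_instance_allocations(placement, instance_id, dest_rp):
    """Drop the dest host provider from the instance's own allocations."""
    current = placement.get_allocations(instance_id)
    if not current['allocations']:
        # Instance deleted concurrently: nothing to clean up and, crucially,
        # nothing gets re-created. Callers logged an error and moved on.
        return False
    remaining = {rp: resources
                 for rp, resources in current['allocations'].items()
                 if rp != dest_rp}
    placement.put_allocations(instance_id, remaining,
                              current.get('consumer_generation'))
    return True
```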
Changed in nova:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : | #3 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #4 |
Related fix proposed to branch: master
Review: https:/
summary:
- MigrationTask rollback can leak allocations for a deleted server
+ Reverting migration-based allocations leaks allocations if the server is
+ deleted
OpenStack Infra (hudson-openstack) wrote : | #5 |
Related fix proposed to branch: master
Review: https:/
OpenStack Infra (hudson-openstack) wrote : | #6 |
Fix proposed to branch: master
Review: https:/
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : | #7 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 24318f8cd4b71f1
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 16 13:08:08 2019 -0400
Add functional recreate test for bug 1848343
This adds a functional test to recreate a bug where the
instance is deleted after conductor has swapped allocations
to the migration consumer but before casting to compute. In
this case, the scheduler fails due to NoValidHost which is
entirely reasonable. The bug is that the conductor task rollback
code re-creates the allocations on the source node for the
now-deleted instance and as such those allocations get leaked.
Note that we have similar exposures in the live migration
task and reverting allocations when resize fails in the
compute service. A TODO is left inline to add tests for those
separately.
Change-Id: Ifd156ac8789d3f
Related-Bug: #1848343
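Mapping that scenario onto the FakePlacement toy from the bug description (a sketch of the sequence only; the real functional test drives the whole nova stack, and the names here are illustrative):

```python
placement = FakePlacement()

# 1. The instance starts with allocations on the source compute node.
placement.put_allocations(
    'instance-uuid', {'source-node-rp': {'VCPU': 2}}, None)

# 2. Conductor swaps them to the migration consumer before scheduling.
move_allocations(placement, 'instance-uuid', 'migration-uuid')

# 3. The user deletes the server. After the swap the instance consumer
#    holds nothing, so only the migration consumer remains in placement.

# 4. The scheduler raises NoValidHost, the task rolls back, and the swap
#    back re-creates allocations for the now-deleted instance: the leak.
move_allocations(placement, 'migration-uuid', 'instance-uuid')
assert 'instance-uuid' in placement.consumers
```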
OpenStack Infra (hudson-openstack) wrote : | #8 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 252ee93086a4854
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 16 14:03:59 2019 -0400
Add live migration recreate test for bug 1848343
This adds a live migration functional recreate test
like Ifd156ac8789d3f
for cold migrate/resize.
Change-Id: I856db36d63779d
Related-Bug: #1848343
OpenStack Infra (hudson-openstack) wrote : | #9 |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 760ccb32dbbb0ba
Author: Matt Riedemann <email address hidden>
Date: Wed Oct 16 15:23:50 2019 -0400
Add compute side revert allocation test for bug 1848343
This adds a functional recreate test for a scenario where
reverting migration-based allocations during resize failure
in the compute service results in leaking allocations for a
deleted server.
Change-Id: Iac4dd9feebb1a4
Related-Bug: #1848343
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
Reason: I'm not working on this.
Changed in nova:
assignee: Matt Riedemann (mriedem) → nobody
status: In Progress → Triaged