Allocations are not cleaned up in placement for instance 'local delete' case
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | Fix Released | Medium | Matt Riedemann | |
| Pike | Fix Released | Medium | Matt Riedemann | |
| Queens | Fix Released | Medium | Matt Riedemann | |
Bug Description
This is semi-related to bug 1661312 for evacuate.
This is the case:
1. Create an instance on host A successfully. There are allocation records in the placement API for the instance (consumer for the allocation records) and host A (resource provider).
2. Host A goes down.
3. Delete the instance. This triggers the local delete flow in the compute API where we can't RPC cast to the compute to delete the instance because the nova-compute service is down. So we do the delete in the database from the compute API (local to compute API, hence local delete).
The problem is in #3 we don't remove the allocations for the instance from the host A resource provider during the local delete flow.
Maybe this doesn't matter while host A is down, since the scheduler can't schedule to it anyway. But if host A comes back up, it will have allocations tied to it for deleted instances.
On init_host in the compute service we call _complete_
Matt Riedemann (mriedem) wrote : | #2 |
Another thing to note here is the nova-api service does not currently use the placement API for anything. If we fixed this by doing allocation cleanup from nova-api, we'd have to make nova-api dependent on placement, which is something we might not want to do since only nova-scheduler and nova-compute depend on placement today. We might be able to just fix this by doing a cleanup during init_host in the compute service.
Matt Riedemann (mriedem) wrote : | #3 |
To move forward on this, we should write a functional test which does the following (a rough sketch of the check follows the list):
0. Get the allocations for our single compute host; these are used later (there should be none to start).
1. Create a server (this lands on our single compute host in the functional test environment).
2. Stop the compute service (so that the delete will take the local delete path in the API).
3. Delete the instance (this goes down the _local_delete path in the API code).
4. Bring the compute host back up; this runs through the init_host routine and should update everything via the resource tracker and placement. I'm not sure how to tell when this is 'done'.
5. Get the allocations for the compute host and compare them to the original allocations from step 0. If they are the same, everything is good and we cleaned up somewhere. If they differ, meaning the deleted instance is still consuming allocations on the host, then we have this bug and it needs to be fixed.
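As a rough illustration of the check in steps 0 and 5, here is a minimal standalone sketch that does the allocation comparison against a live deployment using the placement REST API rather than the nova functional test framework. The endpoint URL, token, and resource provider UUID are placeholders, and steps 1-4 are assumed to happen out of band.

```python
# Minimal sketch, not the actual functional test: the placement endpoint,
# admin token and resource provider UUID below are assumed placeholders.
import requests

PLACEMENT = "http://placement.example.com/placement"
HEADERS = {"X-Auth-Token": "ADMIN_TOKEN", "Accept": "application/json"}
RP_UUID = "uuid-of-the-compute-node-resource-provider"


def get_allocations(rp_uuid):
    """Return the allocations against a resource provider, keyed by consumer."""
    url = "%s/resource_providers/%s/allocations" % (PLACEMENT, rp_uuid)
    resp = requests.get(url, headers=HEADERS)
    resp.raise_for_status()
    return resp.json()["allocations"]


# Step 0: allocations before the test (should be empty on a fresh host).
before = get_allocations(RP_UUID)

# Steps 1-4 (create a server, stop nova-compute, delete the server, restart
# nova-compute) happen out of band, e.g. via the compute API and systemctl.

# Step 5: allocations after the compute host is back up.
after = get_allocations(RP_UUID)

leaked = set(after) - set(before)
if leaked:
    print("Bug reproduced: deleted instances still hold allocations: %s" % leaked)
else:
    print("Allocations were cleaned up as expected.")
```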
Changed in nova: | |
status: | New → Confirmed |
importance: | Undecided → Medium |
Changed in nova: | |
assignee: | nobody → Sylvain Bauza (sylvain-bauza) |
Related fix proposed to branch: master
Review: https:/
Changed in nova: | |
status: | Confirmed → In Progress |
Changed in nova: | |
assignee: | Sylvain Bauza (sylvain-bauza) → nobody |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 4ba7baf451512d7
Author: melanie witt <email address hidden>
Date: Fri Jun 2 23:09:30 2017 +0000
Add functional test for local delete allocations
This adds a test that shows we clean up allocations for an instance
that was local deleted while the compute host was down, when the
compute host comes back up.
There might still be a problem if the compute host is never brought
back up, as allocations will still exist for the instance and show
up as usage during usage queries to placement.
Related-Bug: #1679750
Change-Id: Ia68a5a69783963
Balazs Gibizer (balazs-gibizer) wrote : | #6 |
Is this still a valid bug? It seems that the related functional test asserts the expected behavior and it is passing.
Changed in nova: | |
status: | In Progress → Incomplete |
Matt Riedemann (mriedem) wrote : | #7 |
As the commit message in Mel's test patch points out:
"There might still be a problem if the compute host is never brought
back up, as allocations will still exist for the instance and show
up as usage during usage queries to placement."
If the compute is brought back up and "heals" the allocation, things are OK. But if the compute is never brought back up, it's a problem: the GET /usages call to placement would be wrong, and we plan on eventually using that for our quota/limit checks in the API when a cell is down.
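For reference, the usage query in question is the per-provider usages call in the placement API. A minimal sketch follows, with the endpoint, token and provider UUID as placeholders:

```python
# Sketch of the usages query mentioned above; endpoint, token and resource
# provider UUID are placeholders.
import requests

PLACEMENT = "http://placement.example.com/placement"
HEADERS = {"X-Auth-Token": "ADMIN_TOKEN", "Accept": "application/json"}
RP_UUID = "uuid-of-the-compute-node-resource-provider"

resp = requests.get("%s/resource_providers/%s/usages" % (PLACEMENT, RP_UUID),
                    headers=HEADERS)
resp.raise_for_status()
# If allocations leaked, these totals still include resources held by deleted
# instances, which is what would skew quota/limit checks built on top of them.
print(resp.json()["usages"])  # e.g. {"VCPU": 8, "MEMORY_MB": 16384, "DISK_GB": 80}
```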
Changed in nova: | |
status: | Incomplete → Confirmed |
Matt Riedemann (mriedem) wrote : | #8 |
Alex also raised a good question in this conductor code during a build:
https:/
That is a similar type of 'local delete' issue, but might be a separate bug.
Matt Riedemann (mriedem) wrote : | #9 |
Related patch: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 57a3af6921dea72
Author: Dan Smith <email address hidden>
Date: Sun Nov 5 15:32:01 2017 -0800
Clean up allocations if instance deleted during build
When we notice that an instance was deleted after scheduling, we punt on
instance creation. When that happens, the scheduler will have created
allocations already so we need to delete those to avoid leaking resources.
Related-Bug: #1679750
Change-Id: I54806fe4325752
Related fix proposed to branch: stable/pike
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit 2cd20613e323775
Author: Dan Smith <email address hidden>
Date: Sun Nov 5 15:32:01 2017 -0800
Clean up allocations if instance deleted during build
When we notice that an instance was deleted after scheduling, we punt on
instance creation. When that happens, the scheduler will have created
allocations already so we need to delete those to avoid leaking resources.
Related-Bug: #1679750
Change-Id: I54806fe4325752
(cherry picked from commit 57a3af6921dea72
tags: | added: in-stable-pike |
Henry Spanka (henryspanka) wrote : | #13 |
Hello,
I currently have the same issue. Is there any way to remove stale resources and clean up the placement api?
For example, the hypervisor itself reports 123 vCPUs while the placement API reports 146 vCPUs.
Jay Pipes (jaypipes) wrote : | #14 |
@henryspanka: could you provide some more information about your setup, please? Does the hypervisor report 123 **free vCPUs** and placement is reporting 146 **total** vCPUs? Because that would be an expected result.
Henry Spanka (henryspanka) wrote : | #15 |
@jaypipes: The hypervisor reports 123 **in use** vCPUs while the placement API reports 146 **in use** vCPUs.
However I have already traced that down to bug #1732976 so it does not apply here anymore.
Matt Riedemann (mriedem) wrote : | #16 |
@Henry, if you need to clean up placement, there is the osc-placement CLI plugin available for some of this:
https:/
I.e., you can remove the allocations for a given consumer (a deleted instance) if nova failed to clean them up when the instance was deleted. That's a workaround until we have a fix for this bug in nova.
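A rough example of that workaround with the osc-placement plugin installed (the UUIDs are placeholders):

```console
# Find the compute node's resource provider UUID.
$ openstack resource provider list

# Show what a suspect consumer (the deleted instance's UUID) still holds.
$ openstack resource provider allocation show <instance-uuid>

# Remove the stale allocations for that consumer.
$ openstack resource provider allocation delete <instance-uuid>

# Verify the provider's reported usage drops accordingly.
$ openstack resource provider usage show <resource-provider-uuid>
```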
Henry Spanka (henryspanka) wrote : | #17 |
@Matt, I already made some scripts that check whether the records in the placement API match the nova instances and, if not, print a SQL command that corrects the issue.
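For anyone in the same situation, here is a rough sketch of that kind of consistency check done purely through the APIs instead of SQL; the endpoints and token are placeholders, and pagination/multi-cell handling is omitted:

```python
# Rough audit sketch: find placement consumers that no longer map to a nova
# instance. Endpoints and token are placeholders; pagination is omitted.
# Note that consumers can also be migration UUIDs for in-progress migrations,
# so hits here should be double-checked before deleting anything.
import requests

PLACEMENT = "http://placement.example.com/placement"
NOVA = "http://nova.example.com/compute/v2.1"
HEADERS = {"X-Auth-Token": "ADMIN_TOKEN", "Accept": "application/json"}

# All instance UUIDs nova still knows about (admin view, all tenants).
servers = requests.get(NOVA + "/servers?all_tenants=1", headers=HEADERS).json()
known = {s["id"] for s in servers["servers"]}

# Walk every resource provider and collect consumers that hold allocations.
providers = requests.get(PLACEMENT + "/resource_providers", headers=HEADERS).json()
for rp in providers["resource_providers"]:
    allocs = requests.get(
        "%s/resource_providers/%s/allocations" % (PLACEMENT, rp["uuid"]),
        headers=HEADERS).json()["allocations"]
    for consumer in allocs:
        if consumer not in known:
            print("Stale allocations on %s for consumer %s" % (rp["name"], consumer))
```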
Changed in nova: | |
assignee: | nobody → Matt Riedemann (mriedem) |
Fix proposed to branch: master
Review: https:/
Changed in nova: | |
status: | Confirmed → In Progress |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit ea9d0af31395fbe
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400
Delete allocations from API if nova-compute is down
When performing a "local delete" of an instance, we
need to delete the allocations that the instance has
against any resource providers in Placement.
It should be noted that without this change, restarting
the nova-compute service will delete the allocations
for its compute node (assuming the compute node UUID
is the same as before the instance was deleted). That
is shown in the existing functional test modified here.
The more important reason for this change is that in
order to fix bug 1756179, we need to make sure the
resource provider allocations for a given compute node
are gone by the time the compute service is deleted.
This adds a new functional test and a release note for
the new behavior and need to configure nova-api for
talking to placement, which is idempotent if
not configured thanks to the @safe_connect decorator
used in SchedulerReport
Closes-Bug: #1679750
Related-Bug: #1756179
Change-Id: If507e23f0b7e5f
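For context, the effect of this fix is that the local delete path in the compute API now asks the scheduler report client to drop the instance's allocations. In terms of the raw placement API, that boils down to a call like the following sketch (endpoint and token are placeholders; the real code goes through SchedulerReportClient rather than calling placement directly):

```python
# Sketch of the placement call the local delete path now results in; nova does
# this via the scheduler report client, not a hand-rolled request.
import requests

PLACEMENT = "http://placement.example.com/placement"
HEADERS = {"X-Auth-Token": "ADMIN_TOKEN"}
instance_uuid = "uuid-of-the-locally-deleted-instance"  # placeholder

# Deleting the consumer's allocations releases its resources on every provider.
resp = requests.delete("%s/allocations/%s" % (PLACEMENT, instance_uuid),
                       headers=HEADERS)
resp.raise_for_status()  # 204 No Content on success
```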
Changed in nova: | |
status: | In Progress → Fix Released |
Fix proposed to branch: stable/queens
Review: https:/
Related fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 3abd5f5797737b5
Author: Matt Riedemann <email address hidden>
Date: Wed May 16 14:28:26 2018 -0400
Update placement upgrade docs for nova-api dependency on placement
Change If507e23f0b7e5f
dependent on placement for deleting an instance when the nova-compute
service on which that instance is running is down, also known as
"local delete".
Change I7b8622b178d504
dependent on placement for deleting a nova-compute service record.
Both changes are idempotent if nova-api isn't configured to use
placement, but warnings will show up in the logs.
This change updates the upgrade docs to mention the new dependency.
Change-Id: I941a8f4b321e4c
Related-Bug: #1679750
Related-Bug: #1756179
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/queens
commit cba1a3e2c1b1612
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400
Delete allocations from API if nova-compute is down
When performing a "local delete" of an instance, we
need to delete the allocations that the instance has
against any resource providers in Placement.
It should be noted that without this change, restarting
the nova-compute service will delete the allocations
for its compute node (assuming the compute node UUID
is the same as before the instance was deleted). That
is shown in the existing functional test modified here.
The more important reason for this change is that in
order to fix bug 1756179, we need to make sure the
resource provider allocations for a given compute node
are gone by the time the compute service is deleted.
This adds a new functional test and a release note for
the new behavior and need to configure nova-api for
talking to placement, which is idempotent if
not configured thanks to the @safe_connect decorator
used in SchedulerReport
Closes-Bug: #1679750
Related-Bug: #1756179
Change-Id: If507e23f0b7e5f
(cherry picked from commit ea9d0af31395fbe
This issue was fixed in the openstack/nova 17.0.5 release.
This issue was fixed in the openstack/nova 18.0.0.0b2 development milestone.
Dr. Jens Harbott (j-harbott) wrote : | #26 |
It looks like we may be affected by this issue on Pike; is it possible to backport the fixes there?
Matt Riedemann (mriedem) wrote : | #27 |
Well, we backported it to stable/queens and it's marked for Pike as well, so I'm not sure whether I intended to backport it to Pike and just forgot, or whether there was a good reason to avoid doing so. Note that it's not just a single change; there is another related bug fix tied up in this series:
https:/
Matt Riedemann (mriedem) wrote : | #28 |
Someone else reported this biting them in Pike today:
So I can start working on backports.
Fix proposed to branch: stable/pike
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit cd50dcaf3e51722
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400
Delete allocations from API if nova-compute is down
When performing a "local delete" of an instance, we
need to delete the allocations that the instance has
against any resource providers in Placement.
It should be noted that without this change, restarting
the nova-compute service will delete the allocations
for its compute node (assuming the compute node UUID
is the same as before the instance was deleted). That
is shown in the existing functional test modified here.
The more important reason for this change is that in
order to fix bug 1756179, we need to make sure the
resource provider allocations for a given compute node
are gone by the time the compute service is deleted.
This adds a new functional test and a release note for
the new behavior and need to configure nova-api for
talking to placement, which is idempotent if
not configured thanks to the @safe_connect decorator
used in SchedulerReport
Closes-Bug: #1679750
Related-Bug: #1756179
Conflicts:
NOTE(mriedem): The compute/api conflict is due to not
having change I393118861d1f92
in Pike. In addition to this, the call to
delete_
context parameter which was introduced in change
If38e4a6d49
also not in Pike.
Change-Id: If507e23f0b7e5f
(cherry picked from commit ea9d0af31395fbe
(cherry picked from commit cba1a3e2c1b1612
This issue was fixed in the openstack/nova 16.1.5 release.
There might be something in the periodic resource tracker updates that fixes the allocations once host A comes back up, but I'm not sure.
We probably need a functional test to run this scenario and see what happens.