Bug #1679750 “Allocations are not cleaned up in placement for in...” : Bugs : OpenStack Compute (nova)

Matt Riedemann (mriedem) wrote on 2017-04-04:

#1

There might be something in the periodic resource tracker updates that fixes the allocations once host A comes back up, but I'm not sure.

We probably need a functional test to run this scenario and see what happens.

tags:

added: placement

Matt Riedemann (mriedem) wrote on 2017-04-04:

#2

Another thing to note here is the nova-api service does not currently use the placement API for anything. If we fixed this by doing allocation cleanup from nova-api, we'd have to make nova-api dependent on placement, which is something we might not want to do since only nova-scheduler and nova-compute depend on placement today. We might be able to just fix this by doing a cleanup during init_host in the compute service.

Matt Riedemann (mriedem) wrote on 2017-04-10:

#3

To move forward on this, we should write a functional test which does the following:

0. get the allocations for our single compute host, this is used later (it should be 0 to start)
1. create a server (this would be on our single compute host in the functional test env)
2. stop the compute (this triggers the local delete path in the api)
3. delete the instance (goes down the _local_delete path in the api code)
4. bring up the compute host again, this runs through the init_host routine and should update anything with the resource tracker and placement; I'm not sure how to tell when this is 'done'.
5. get the allocations for the compute host and compare them to the original allocations from step 0. If they are the same, then everything is good, we cleaned up somewhere. If they are different, meaning the deleted instance is still consuming allocations on the host, then we have this bug and it needs to be fixed.
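
The steps above can be sketched with a toy in-memory model. This is not the functional test that was later written and it uses none of nova's test fixtures; FakePlacement, FakeCloud and the heal_on_init_host flag are hypothetical names invented for illustration. It only shows the shape of the check: record usage, create, stop compute, delete, restart, compare.

```python
# Toy model only: these classes are hypothetical stand-ins, not nova or
# placement code.
class FakePlacement:
    """Tracks allocations as {provider: {consumer_uuid: vcpus}}."""

    def __init__(self):
        self.allocations = {}

    def allocate(self, provider, consumer, vcpus):
        self.allocations.setdefault(provider, {})[consumer] = vcpus

    def delete_allocation(self, provider, consumer):
        self.allocations.get(provider, {}).pop(consumer, None)

    def usage(self, provider):
        return sum(self.allocations.get(provider, {}).values())


class FakeCloud:
    def __init__(self, heal_on_init_host):
        self.placement = FakePlacement()
        self.instances = set()  # instance UUIDs the API still knows about
        self.compute_up = True
        self.heal_on_init_host = heal_on_init_host

    def create_server(self, uuid, vcpus=1):
        self.instances.add(uuid)
        self.placement.allocate("host1", uuid, vcpus)

    def stop_compute(self):
        self.compute_up = False

    def delete_server(self, uuid):
        self.instances.discard(uuid)
        if self.compute_up:
            # Normal path: the compute service removes the allocation.
            self.placement.delete_allocation("host1", uuid)
        # Otherwise this is the "local delete" path; in this model the
        # allocation is left behind, as the bug describes.

    def start_compute(self):
        self.compute_up = True
        if self.heal_on_init_host:
            # Assumed healing step: drop allocations for unknown instances.
            for consumer in list(self.placement.allocations.get("host1", {})):
                if consumer not in self.instances:
                    self.placement.delete_allocation("host1", consumer)


def run_scenario(heal_on_init_host):
    cloud = FakeCloud(heal_on_init_host)
    before = cloud.placement.usage("host1")  # step 0
    cloud.create_server("inst-1")            # step 1
    cloud.stop_compute()                     # step 2
    cloud.delete_server("inst-1")            # step 3
    cloud.start_compute()                    # step 4
    after = cloud.placement.usage("host1")   # step 5
    return before, after
```

If run_scenario returns two equal numbers, something cleaned up; if the second is larger, the deleted instance is still counted against the host.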

Sean Dague (sdague) on 2017-04-11

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Medium

Sylvain Bauza (sylvain-bauza) on 2017-04-24

Changed in nova:
assignee:	nobody → Sylvain Bauza (sylvain-bauza)

OpenStack Infra (hudson-openstack) wrote on 2017-06-02: Related fix proposed to nova (master)

#4

Related fix proposed to branch: master
Review: https://review.openstack.org/470578

Sean Dague (sdague) on 2017-06-23

Changed in nova:
status:	Confirmed → In Progress

Matt Riedemann (mriedem) on 2017-08-03

Changed in nova:
assignee:	Sylvain Bauza (sylvain-bauza) → nobody

OpenStack Infra (hudson-openstack) wrote on 2017-08-09: Related fix merged to nova (master)

#5

Reviewed: https://review.openstack.org/470578
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4ba7baf451512d72b14f51cefe762642fabeda3e
Submitter: Jenkins
Branch: master

commit 4ba7baf451512d72b14f51cefe762642fabeda3e
Author: melanie witt <email address hidden>
Date: Fri Jun 2 23:09:30 2017 +0000

Add functional test for local delete allocations

    This adds a test that shows we clean up allocations for an instance
    that was local deleted while the compute host was down, when the
    compute host comes back up.

    There might still be a problem if the compute host is never brought
    back up, as allocations will still exist for the instance and show
    up as usage during usage queries to placement.

Related-Bug: #1679750

Change-Id: Ia68a5a69783963b063264edde84006973bb77ceb

Balazs Gibizer (balazs-gibizer) wrote on 2017-08-15:

#6

Is this still a valid bug? It seems that the related functional test asserts the expected behavior and it is passing.

Changed in nova:
status:	In Progress → Incomplete

Matt Riedemann (mriedem) wrote on 2017-08-22:

#7

As the commit message in Mel's test patch points out:

"There might still be a problem if the compute host is never brought
back up, as allocations will still exist for the instance and show
up as usage during usage queries to placement."

If the compute is brought back up and "heals" the allocation, things are OK. But if the compute is never brought back up, it will be a problem: the GET /usages call to placement would be wrong, and we plan on eventually using that for our quota/limit checks in the API if a cell is down.

Changed in nova:
status:	Incomplete → Confirmed

Matt Riedemann (mriedem) wrote on 2017-09-07:

#8

Alex also raised a good question in this conductor code during a build:

https://review.openstack.org/#/c/501408/8/nova/conductor/manager.py@1049

That is a similar type of 'local delete' issue, but might be a separate bug.

Matt Riedemann (mriedem) wrote on 2017-11-06:

#9

Related patch: https://review.openstack.org/#/c/517836/

OpenStack Infra (hudson-openstack) wrote on 2017-11-06:

#10

Reviewed: https://review.openstack.org/517836
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=57a3af6921dea72b3b2972f66e089578331dbb63
Submitter: Zuul
Branch: master

commit 57a3af6921dea72b3b2972f66e089578331dbb63
Author: Dan Smith <email address hidden>
Date: Sun Nov 5 15:32:01 2017 -0800

Clean up allocations if instance deleted during build

    When we notice that an instance was deleted after scheduling, we punt on
    instance creation. When that happens, the scheduler will have created
    allocations already so we need to delete those to avoid leaking resources.

Related-Bug: #1679750
Change-Id: I54806fe43257528fbec7d44c841ee4abb14c9dff
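
The idea in this commit message can be illustrated with a small sketch. It is not the conductor code from the patch; build_instance, the allocations dict and the is_deleted callback are hypothetical names chosen for the example. It assumes only what the message says: the claim already exists when the deletion is noticed, so abandoning the build has to remove it.

```python
# Hypothetical names throughout; this is not nova's conductor code.
def build_instance(instance_uuid, allocations, is_deleted):
    """Return True if the build proceeds, False if it was abandoned."""
    # The scheduler has already claimed resources by the time we get here.
    allocations[instance_uuid] = {"VCPU": 1, "MEMORY_MB": 512}
    if is_deleted(instance_uuid):
        # The instance was deleted after scheduling: give up on the build
        # and remove the claim so the resources are not leaked.
        allocations.pop(instance_uuid, None)
        return False
    return True


allocations = {}
kept = build_instance("inst-a", allocations, is_deleted=lambda uuid: False)
dropped = build_instance("inst-b", allocations, is_deleted=lambda uuid: True)
```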

OpenStack Infra (hudson-openstack) wrote on 2017-11-06: Related fix proposed to nova (stable/pike)

#11

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/517876

OpenStack Infra (hudson-openstack) wrote on 2017-11-07: Related fix merged to nova (stable/pike)

#12

Reviewed: https://review.openstack.org/517876
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2cd20613e3237754df7c4e73995ac5fdc10a1298
Submitter: Zuul
Branch: stable/pike

commit 2cd20613e3237754df7c4e73995ac5fdc10a1298
Author: Dan Smith <email address hidden>
Date: Sun Nov 5 15:32:01 2017 -0800

Clean up allocations if instance deleted during build

    When we notice that an instance was deleted after scheduling, we punt on
    instance creation. When that happens, the scheduler will have created
    allocations already so we need to delete those to avoid leaking resources.

    Related-Bug: #1679750
    Change-Id: I54806fe43257528fbec7d44c841ee4abb14c9dff
    (cherry picked from commit 57a3af6921dea72b3b2972f66e089578331dbb63)

tags:

added: in-stable-pike

Henry Spanka (henryspanka) wrote on 2018-01-21:

#13

Hello,
I currently have the same issue. Is there any way to remove stale resources and clean up the placement api?

For example, the hypervisor itself reports 123 vCPUs while the placement API reports 146 vCPUs.

Jay Pipes (jaypipes) wrote on 2018-02-19:

#14

@henryspanka: could you provide some more information about your setup, please? Does the hypervisor report 123 **free vCPUs** and placement is reporting 146 **total** vCPUs? Because that would be an expected result.

Henry Spanka (henryspanka) wrote on 2018-02-19:

#15

@jaypipes: The hypervisor reports 123 **in use** vCPUs while the placement API reports 146 **in use** vCPUs.

However, I have already traced that down to bug #1732976, so it does not apply here anymore.

Matt Riedemann (mriedem) wrote on 2018-03-26:

#16

@Henry, if you need to clean up placement, there is the osc-placement CLI plugin available for some of this:

https://docs.openstack.org/osc-placement/latest/index.html

That is, you can remove allocations for a given consumer (deleted instance) if nova failed to clean up allocations for that instance when it was deleted. That's a workaround until we have a fix for this bug in nova.
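
For anyone scripting this workaround rather than using the CLI, the placement REST API has a DELETE /allocations/{consumer_uuid} call that removes all allocations for one consumer. Below is a minimal sketch that only builds such a request and does not send it. The function name, the endpoint URL and the token value are assumptions for illustration; a real deployment would obtain both from keystone.

```python
import urllib.request


def build_delete_allocations_request(placement_url, consumer_uuid, token):
    """Build (but do not send) a DELETE /allocations/{consumer_uuid} request.

    placement_url and token are placeholders supplied by the caller.
    """
    url = f"{placement_url.rstrip('/')}/allocations/{consumer_uuid}"
    return urllib.request.Request(
        url, method="DELETE", headers={"X-Auth-Token": token}
    )


# Example values are made up; nothing is sent over the network here.
req = build_delete_allocations_request(
    "http://placement.example.test/placement/",
    "11111111-2222-3333-4444-555555555555",
    "not-a-real-token",
)
```

Sending it would be a matter of passing req to urllib.request.urlopen, after double-checking that the consumer really is a deleted instance.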

Henry Spanka (henryspanka) wrote on 2018-03-26:

#17

@Matt, I already made some scripts that check whether the records in the placement API match the nova instances and, if not, print a SQL command that corrects the issue.
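
The scripts mentioned here are not shown in the thread. As a rough sketch of the comparison they describe, assuming the consumer UUIDs from placement and the instance UUIDs from nova have already been collected by some other means, the check reduces to a set difference. The function name and the sample UUIDs are made up, and no SQL is generated.

```python
def find_stale_consumers(placement_consumers, nova_instance_uuids):
    """Return consumer UUIDs that hold allocations but match no instance."""
    return sorted(set(placement_consumers) - set(nova_instance_uuids))


# Made-up sample data for illustration.
consumers_in_placement = ["uuid-a", "uuid-b", "uuid-c"]
instances_in_nova = ["uuid-a", "uuid-c"]
stale = find_stale_consumers(consumers_in_placement, instances_in_nova)
```

A consumer that matches no instance is only a candidate for cleanup: depending on the release, consumers can also be in-progress migrations, so each result should be checked before anything is deleted.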

Matt Riedemann (mriedem) on 2018-04-12

Changed in nova:
assignee:	nobody → Matt Riedemann (mriedem)

OpenStack Infra (hudson-openstack) wrote on 2018-04-12: Fix proposed to nova (master)

#18

Fix proposed to branch: master
Review: https://review.openstack.org/560706

Changed in nova:
status:	Confirmed → In Progress

OpenStack Infra (hudson-openstack) wrote on 2018-04-20: Fix merged to nova (master)

#19

Reviewed: https://review.openstack.org/560706
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ea9d0af31395fbe1686fa681cd91226ee580796e
Submitter: Zuul
Branch: master

commit ea9d0af31395fbe1686fa681cd91226ee580796e
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400

Delete allocations from API if nova-compute is down

    When performing a "local delete" of an instance, we
    need to delete the allocations that the instance has
    against any resource providers in Placement.

    It should be noted that without this change, restarting
    the nova-compute service will delete the allocations
    for its compute node (assuming the compute node UUID
    is the same as before the instance was deleted). That
    is shown in the existing functional test modified here.

    The more important reason for this change is that in
    order to fix bug 1756179, we need to make sure the
    resource provider allocations for a given compute node
    are gone by the time the compute service is deleted.

    This adds a new functional test and a release note for
    the new behavior and need to configure nova-api for
    talking to placement, which is idempotent if
    not configured thanks to the @safe_connect decorator
    used in SchedulerReportClient.

Closes-Bug: #1679750
Related-Bug: #1756179

Change-Id: If507e23f0b7e5fa417041c3870d77786498f741d
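
The last paragraph of this commit message relies on the @safe_connect decorator to turn the placement call into a no-op when nova-api is not configured for placement. The sketch below shows that general pattern only. It is not nova's implementation: PlacementUnavailable, safe_connect_sketch and delete_allocation_for_instance_sketch are hypothetical names, and the single exception type stands in for whatever connection or configuration errors the real decorator handles.

```python
import functools
import logging

LOG = logging.getLogger(__name__)


class PlacementUnavailable(Exception):
    """Hypothetical error for a missing or unreachable placement endpoint."""


def safe_connect_sketch(func):
    """Log a warning and return None instead of raising."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except PlacementUnavailable as exc:
            LOG.warning("Placement unavailable, skipping %s: %s",
                        func.__name__, exc)
            return None

    return wrapper


@safe_connect_sketch
def delete_allocation_for_instance_sketch(instance_uuid, configured):
    # 'configured' is an illustrative flag standing in for real config.
    if not configured:
        raise PlacementUnavailable("no placement endpoint configured")
    return True
```

With this shape, an unconfigured nova-api still completes the local delete; the only visible effect is a warning in the logs, which matches the behavior the later docs change describes.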

Changed in nova:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2018-04-20: Fix proposed to nova (stable/queens)

#20

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/563236

OpenStack Infra (hudson-openstack) wrote on 2018-05-16: Related fix proposed to nova (master)

#21

Related fix proposed to branch: master
Review: https://review.openstack.org/568925

OpenStack Infra (hudson-openstack) wrote on 2018-05-19: Related fix merged to nova (master)

#22

Reviewed: https://review.openstack.org/568925
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=3abd5f5797737b54c10ea85d4b833aff054d1bee
Submitter: Zuul
Branch: master

commit 3abd5f5797737b54c10ea85d4b833aff054d1bee
Author: Matt Riedemann <email address hidden>
Date: Wed May 16 14:28:26 2018 -0400

Update placement upgrade docs for nova-api dependency on placement

    Change If507e23f0b7e5fa417041c3870d77786498f741d makes nova-api
    dependent on placement for deleting an instance when the nova-compute
    service on which that instance is running is down, also known as
    "local delete".

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 makes nova-api
    dependent on placement for deleting a nova-compute service record.

    Both changes are idempotent if nova-api isn't configured to use
    placement, but warnings will show up in the logs.

    This change updates the upgrade docs to mention the new dependency.

    Change-Id: I941a8f4b321e4c90a45f7d9fccb74489fae0d62d
    Related-Bug: #1679750
    Related-Bug: #1756179

OpenStack Infra (hudson-openstack) wrote on 2018-05-22: Fix merged to nova (stable/queens)

#23

Reviewed: https://review.openstack.org/563236
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cba1a3e2c1b161204a3662a0d9fbf33da38aa7d3
Submitter: Zuul
Branch: stable/queens

commit cba1a3e2c1b161204a3662a0d9fbf33da38aa7d3
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400

Delete allocations from API if nova-compute is down

    When performing a "local delete" of an instance, we
    need to delete the allocations that the instance has
    against any resource providers in Placement.

    It should be noted that without this change, restarting
    the nova-compute service will delete the allocations
    for its compute node (assuming the compute node UUID
    is the same as before the instance was deleted). That
    is shown in the existing functional test modified here.

    The more important reason for this change is that in
    order to fix bug 1756179, we need to make sure the
    resource provider allocations for a given compute node
    are gone by the time the compute service is deleted.

    This adds a new functional test and a release note for
    the new behavior and need to configure nova-api for
    talking to placement, which is idempotent if
    not configured thanks to the @safe_connect decorator
    used in SchedulerReportClient.

Closes-Bug: #1679750
Related-Bug: #1756179

Change-Id: If507e23f0b7e5fa417041c3870d77786498f741d
(cherry picked from commit ea9d0af31395fbe1686fa681cd91226ee580796e)

OpenStack Infra (hudson-openstack) wrote on 2018-06-04: Fix included in openstack/nova 17.0.5

#24

This issue was fixed in the openstack/nova 17.0.5 release.

OpenStack Infra (hudson-openstack) wrote on 2018-06-08: Fix included in openstack/nova 18.0.0.0b2

#25

This issue was fixed in the openstack/nova 18.0.0.0b2 development milestone.

Dr. Jens Harbott (j-harbott) wrote on 2018-06-28:

#26

It looks like we may be affected by this issue on pike, is it possible to backport the fixes there?

Matt Riedemann (mriedem) wrote on 2018-06-28:

#27

Well, we backported it to stable/queens and it's marked for pike as well, so I'm not sure whether I intended to backport it to pike and just forgot, or whether there was a good reason to avoid backporting it to pike. Note that it's not just a single change, though: there is another related bug fix tied up in this series:

https://review.openstack.org/#/q/topic:bug/1756179+branch:stable/queens

Matt Riedemann (mriedem) wrote on 2018-07-05:

#28

Someone else reported this biting them in Pike today:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2018-07-05.log.html#t2018-07-05T18:38:49

So I can start working on backports.

OpenStack Infra (hudson-openstack) wrote on 2018-07-05: Fix proposed to nova (stable/pike)

#29

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/580498

OpenStack Infra (hudson-openstack) wrote on 2018-07-09: Fix merged to nova (stable/pike)

#30

Reviewed: https://review.openstack.org/580498
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cd50dcaf3e51722c9510d417c1724d8cdafe450b
Submitter: Zuul
Branch: stable/pike

commit cd50dcaf3e51722c9510d417c1724d8cdafe450b
Author: Matt Riedemann <email address hidden>
Date: Wed Apr 11 21:24:43 2018 -0400

Delete allocations from API if nova-compute is down

    When performing a "local delete" of an instance, we
    need to delete the allocations that the instance has
    against any resource providers in Placement.

    It should be noted that without this change, restarting
    the nova-compute service will delete the allocations
    for its compute node (assuming the compute node UUID
    is the same as before the instance was deleted). That
    is shown in the existing functional test modified here.

    The more important reason for this change is that in
    order to fix bug 1756179, we need to make sure the
    resource provider allocations for a given compute node
    are gone by the time the compute service is deleted.

    This adds a new functional test and a release note for
    the new behavior and need to configure nova-api for
    talking to placement, which is idempotent if
    not configured thanks to the @safe_connect decorator
    used in SchedulerReportClient.

    Closes-Bug: #1679750
    Related-Bug: #1756179

    Conflicts:
          nova/compute/api.py

    NOTE(mriedem): The compute/api conflict is due to not
    having change I393118861d1f921cc2d71011ddedaf43a2e8dbdf
    in Pike. In addition to this, the call to
    delete_allocation_for_instance() does not include the
    context parameter which was introduced in change
    If38e4a6d49910f0aa5016e1bcb61aac2be416fa7 which is
    also not in Pike.

    Change-Id: If507e23f0b7e5fa417041c3870d77786498f741d
    (cherry picked from commit ea9d0af31395fbe1686fa681cd91226ee580796e)
    (cherry picked from commit cba1a3e2c1b161204a3662a0d9fbf33da38aa7d3)

OpenStack Infra (hudson-openstack) wrote on 2018-09-24: Fix included in openstack/nova 16.1.5

#31

This issue was fixed in the openstack/nova 16.1.5 release.

OpenStack Compute (nova)

Allocations are not cleaned up in placement for instance 'local delete' case

Affects                   Status         Importance  Assigned to
OpenStack Compute (nova)  Fix Released   Medium      Matt Riedemann
Pike                      Fix Committed  Medium      Matt Riedemann
Queens                    Fix Committed  Medium      Matt Riedemann