The allocation table has residual records when an instance is evacuated and the source physical node is removed

Bug #1829479 reported by Sun Mengyun
This bug affects 8 people
Affects                    Status         Importance   Assigned to      Milestone
OpenStack Compute (nova)   Fix Released   Medium       Matt Riedemann
Xena                       Fix Released   Undecided    Unassigned

Bug Description

Description
===========

When a compute node service goes down because of a failure, we evacuate the instances located on it. After a successful evacuation, the relevant records in the allocation table are not cleared; they are only cleared when the compute service on the source node is restored.

Unfortunately, if the failed node is down because of an unrecoverable failure and the compute service on it will never be restored, residual records remain in the allocation table.

Furthermore, if we try to delete the down compute service, the record associated with this service is not deleted from the resource_provider table, because of the residual records in the allocation table.

Perhaps after a successful evacuation, we need to add operations to clear the allocation table, not just after the source node service is restored.
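
The behaviour described above can be illustrated with a small in-memory sketch. This is not nova code: `ToyCloud` and its methods are hypothetical names, and the model only assumes what the report states, namely that the source-side allocation is cleaned up when the source compute service starts again and not at evacuation time.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCloud:
    """Hypothetical toy model of the reported behaviour (not nova code)."""

    # instance name -> set of hosts that hold an allocation for it
    allocations: dict = field(default_factory=dict)
    # (instance, source host) pairs whose source allocation awaits cleanup
    pending_cleanup: list = field(default_factory=list)

    def boot(self, instance, host):
        self.allocations[instance] = {host}

    def evacuate(self, instance, source, dest):
        # The destination gets an allocation, the source one is NOT removed.
        self.allocations[instance].add(dest)
        self.pending_cleanup.append((instance, source))

    def restart_compute(self, host):
        # Cleanup only happens here, when the source service comes back.
        still_pending = []
        for instance, source in self.pending_cleanup:
            if source == host:
                self.allocations[instance].discard(host)
            else:
                still_pending.append((instance, source))
        self.pending_cleanup = still_pending


cloud = ToyCloud()
cloud.boot("vm1", "node1")
cloud.evacuate("vm1", "node1", "node2")
print(cloud.allocations["vm1"])  # holds both node1 and node2 until node1 restarts
```

If `restart_compute("node1")` is never called, as in the unrecoverable-failure case, the `node1` entry stays forever in this model, which is the residual record the report is about.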

Steps to reproduce
==================

1. Take down a compute service.

2. Evacuate the instances on it.

3. Delete the compute service with the command: nova service-delete uuid

Expected result
===============

The compute service is deleted successfully, and resource_provider has no relevant record.

Actual result
=============

The compute service is deleted successfully, but resource_provider still has the relevant record.

Revision history for this message
Sun Mengyun (kmehxhcr) wrote :

This may sound like bug https://bugs.launchpad.net/nova/+bug/1724172, but that bug considers the scenario of reusing the original physical node, rather than the case where the node is completely unavailable.

Chris Dent (cdent)
tags: added: compute placement scheduler
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

If the compute service is deleted and never started up again, then the fix for https://bugs.launchpad.net/nova/+bug/1724172 cannot clean up the allocations, as that fix runs when the compute service comes up after the compute host is redeployed.

In this bug I see two possible cases:
a) If the compute service is deleted and the compute host name is never used again, then I don't see why it is a problem to keep some allocations in placement, as we are only leaking allocations for something that will never be used again. An admin can also manually delete the allocations and the compute RP in placement after the compute service is deleted.

b) If the compute service is later re-created with the same host name, then the bug fix in https://bugs.launchpad.net/nova/+bug/1724172 will clean up the allocations in placement when the compute service restarts.

Dear reporter, which use case are you targeting with this bug report?

Cheers,
gibi

Changed in nova:
status: New → Incomplete
Revision history for this message
Matt Riedemann (mriedem) wrote :

Chris Friesen brought up what sounds like a similar issue in IRC today:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-06-05.log.html#t2019-06-05T20:13:23

A host goes down and their tooling automatically evacuates the instances from it. The allocations will still be on the source host in this case because nova doesn't remove the allocations from the evacuated host until the service is restarted.

If you try to delete the compute service in this case, deleting the resource provider will fail here, but the failure is ignored:

https://github.com/openstack/nova/blob/653515a45032811b6bc2f1d0fb651472005496ec/nova/scheduler/client/report.py#L2183

Which means we'll continue to delete the compute_nodes and services table records for that service:

https://github.com/openstack/nova/blob/653515a45032811b6bc2f1d0fb651472005496ec/nova/api/openstack/compute/services.py#L279

But a resource provider still exists with that hostname, so trying to restart the compute service after that will fail because a provider already exists with that name but has a different UUID (which maybe makes this related to bug 1817833).
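
The restart failure described above can be sketched as follows. This is a toy registry, not the placement API: `ToyProviderRegistry`, `ProviderNameConflict` and `start_compute_after_service_delete` are hypothetical names. The sketch assumes only what the comment says, that a provider name must be unique and that the restarted service comes up with a different UUID for the same hostname.

```python
import uuid


class ProviderNameConflict(Exception):
    """Toy stand-in for a duplicate provider name being rejected."""


class ToyProviderRegistry:
    def __init__(self):
        # provider UUID -> provider name
        self.names_by_uuid = {}

    def create(self, rp_uuid, name):
        # Reject a name that is already registered under another UUID.
        for existing_uuid, existing_name in self.names_by_uuid.items():
            if existing_name == name and existing_uuid != rp_uuid:
                raise ProviderNameConflict(
                    f"{name!r} already exists with UUID {existing_uuid}")
        self.names_by_uuid[rp_uuid] = name


def start_compute_after_service_delete(registry, hostname):
    # The old compute node record is gone, so a fresh UUID is generated
    # while the provider name is still derived from the hostname.
    new_uuid = str(uuid.uuid4())
    registry.create(new_uuid, hostname)
    return new_uuid


registry = ToyProviderRegistry()
registry.create("orphaned-rp-uuid", "host1")  # left behind by the failed delete
try:
    start_compute_after_service_delete(registry, "host1")
except ProviderNameConflict as exc:
    print("restart failed:", exc)
```

In this model the only way forward is to remove the orphaned entry first, which in turn requires clearing the allocations that kept it from being deleted.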

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in nova:
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/663737

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/663737
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2629d65fbc15d8698f98117e0d6072810f70da03
Submitter: Zuul
Branch: master

commit 2629d65fbc15d8698f98117e0d6072810f70da03
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
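
A rough sketch of the gap this commit message describes, under simplifying assumptions: allocations are modelled as plain sets, and `delete_provider_with_cascade` is a hypothetical function, not the real `delete_resource_provider`. It mirrors the two points in the message: the cascade step only finds instances currently on the host being deleted, and the resulting in-use error is swallowed.

```python
class ResourceProviderInUse(Exception):
    """Toy stand-in for the error named in the commit message."""


def delete_provider_with_cascade(allocations, providers, instance_hosts, host):
    """Try to delete provider `host`, cascading over instances on that host.

    allocations:    consumer UUID -> set of provider names it consumes from
    providers:      set of provider names
    instance_hosts: instance UUID -> host the instance currently runs on
    Returns True if the provider was deleted, False if it was left orphaned.
    """
    # Cascade step: only instances whose *current* host is `host` are found,
    # so an evacuated instance (now on another host) is missed.
    for instance, current_host in instance_hosts.items():
        if current_host == host:
            allocations.pop(instance, None)

    try:
        if any(host in used for used in allocations.values()):
            raise ResourceProviderInUse(host)
        providers.discard(host)
        return True
    except ResourceProviderInUse:
        # The error is ignored, so the caller carries on deleting the service.
        return False


allocations = {"vm-evacuated": {"node1", "node2"}, "vm-local": {"node1"}}
providers = {"node1", "node2"}
instance_hosts = {"vm-evacuated": "node2", "vm-local": "node1"}
deleted = delete_provider_with_cascade(
    allocations, providers, instance_hosts, "node1")
print(deleted, sorted(providers))
```

Here `vm-local` is cleaned up by the cascade, but `vm-evacuated` still consumes from `node1`, so the provider survives while the caller believes the delete went through.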

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678100

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

I have a recreate on devstack with some notes on cleaning up the allocations for the instance against the source compute node resource provider:

http://paste.openstack.org/show/785587/
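
As an illustration of what such a manual cleanup amounts to, the sketch below builds a replacement allocation body for one consumer with the source provider dropped. It is not taken from the paste. The helper name `drop_provider_from_allocations` is hypothetical, and the input shape (an `allocations` mapping keyed by provider UUID, plus `project_id`, `user_id` and `consumer_generation`) is an assumption about what the placement allocations response for a consumer looks like at a recent microversion; check it against the placement API reference before relying on it.

```python
import copy


def drop_provider_from_allocations(current, source_rp_uuid):
    """Return a body that keeps every allocation except the source provider's.

    `current` is ASSUMED to look like:
      {"allocations": {rp_uuid: {"generation": N, "resources": {...}}, ...},
       "project_id": ..., "user_id": ..., "consumer_generation": ...}
    """
    remaining = {
        rp_uuid: {"resources": copy.deepcopy(alloc["resources"])}
        for rp_uuid, alloc in current["allocations"].items()
        if rp_uuid != source_rp_uuid
    }
    if len(remaining) == len(current["allocations"]):
        raise ValueError("consumer has no allocation against that provider")
    return {
        "allocations": remaining,
        "project_id": current["project_id"],
        "user_id": current["user_id"],
        "consumer_generation": current["consumer_generation"],
    }


current = {
    "allocations": {
        "source-rp": {"generation": 3, "resources": {"VCPU": 1, "MEMORY_MB": 512}},
        "dest-rp": {"generation": 5, "resources": {"VCPU": 1, "MEMORY_MB": 512}},
    },
    "project_id": "project-1",
    "user_id": "user-1",
    "consumer_generation": 2,
}
body = drop_provider_from_allocations(current, "source-rp")
print(sorted(body["allocations"]))
```

The function is pure on purpose: it only computes the body, so it can be inspected before anything is sent to placement. Once the source provider no longer appears in any consumer's allocations, deleting the provider itself should no longer be rejected as in use.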

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/691427

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/691427
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dcd3f516d2fa44c4056a307a11f6e14433476fb0
Submitter: Zuul
Branch: master

commit dcd3f516d2fa44c4056a307a11f6e14433476fb0
Author: Matt Riedemann <email address hidden>
Date: Fri Oct 25 16:42:09 2019 -0400

    doc: add troubleshooting guide for cleaning up orphaned allocations

    While we do not have an automated fix for bug 1829479 this provides
    a troubleshooting document for working around that issue where
    allocations from a server that was evacuated from a down host need
    to be cleaned up manually in order to delete the resource provider
    and associated compute node/service.

    In general this is also a useful guide for linking up the various
    resources and terms in nova and how they are reflected in placement
    with the relevant commands which is probably something we should
    do more of in our docs.

    Change-Id: I120e1ddd7946a371888bfc890b5979f2e19288cd
    Related-Bug: #1829479

Revision history for this message
Matt Riedemann (mriedem) wrote :

Created a related bug 1852610 for the orphaned provider scenario with a pending resize/cold migrate where the source compute service/node is deleted.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/695932

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/696582

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/695932
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b18e42d20bd7d341e713292bdb179ae8e5530d33
Submitter: Zuul
Branch: stable/stein

commit b18e42d20bd7d341e713292bdb179ae8e5530d33
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/698106

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/696582
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6c704cc1c5648947b7a9b1ccbfd8037caa436766
Submitter: Zuul
Branch: master

commit 6c704cc1c5648947b7a9b1ccbfd8037caa436766
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 28 09:18:21 2019 -0500

    Add resource provider allocation unset example to troubleshooting doc

    Now that the openstack resource provider allocation unset command is
    available [1] this change adds a note about using it in the troubleshooting
    doc for cleaning up orphaned allocations.

    Sub-sections are used to try and separate the two non-heal_allocations
    solutions with the recommended solution first (using the new unset command).

    While in here I noticed a typo in the heal_allocations section as well and
    fixed it.

    [1] I627bfd1ff699d075028da6afafbe7fb9b2f13058

    Change-Id: I896bb68c4bdd35d051ef3e95e19bdeb472f9bc99
    Related-Bug: #1829479

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/698106
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6eda7409fff75449c97843b2d6ead0b3267a1099
Submitter: Zuul
Branch: stable/rocky

commit 6eda7409fff75449c97843b2d6ead0b3267a1099
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    NOTE(mriedem): Note that in this backport a simple version of
    assertFlavorMatchesUsage is added since the original version from
    change If6aa37d9b6b48791e070799ab026c816fda4441c is not in Rocky.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)
    (cherry picked from commit b18e42d20bd7d341e713292bdb179ae8e5530d33)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.opendev.org/699698

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/699698
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=23ca5e5ac9b90ff45074ae9171f63ca060ebcedd
Submitter: Zuul
Branch: stable/queens

commit 23ca5e5ac9b90ff45074ae9171f63ca060ebcedd
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Iea283322124cb35fc0bc6d25f35548621e8c8c2f in Queens so the
    change to ProviderUsageBaseTestCase is made in test_servers.py
    rather than integrated_helpers.py.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)
    (cherry picked from commit b18e42d20bd7d341e713292bdb179ae8e5530d33)
    (cherry picked from commit 6eda7409fff75449c97843b2d6ead0b3267a1099)

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/778696
Committed: https://opendev.org/openstack/nova/commit/e5a34fffdf97fcda7d0abfdc9e23485479ca2c4f
Submitter: "Zuul (22348)"
Branch: master

commit e5a34fffdf97fcda7d0abfdc9e23485479ca2c4f
Author: Takashi Kajinami <email address hidden>
Date: Thu Mar 4 22:27:25 2021 +0900

    Clean up allocations left by evacuation when deleting service

    When a compute node goes down and all instances on the compute node
    are evacuated, allocation records about these instance are still left
    in the source compute node until nova-compute service is again started
    on the node. However if a compute node is completely broken, it is not
    possible to start the service again.
    In this situation deleting nova-compute service for the compute node
    doesn't delete its resource provider record, and even if a user tries
    to delete the resource provider, the delete request is rejected because
    allocations are still left on that node.

    This change ensures that remaining allocations left by successful
    evacuations are cleared when deleting a nova-compute service, to avoid
    any resource provider record left even if a compute node can't be
    recovered. Migration records are still left in 'done' status to trigger
    clean-up tasks in case the compute node is recovered later.

    Closes-Bug: #1829479
    Change-Id: I3ce6f6275bfe09d43718c3a491b3991a804027bd
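
The approach described in this commit message can be sketched roughly as below. This is a simplified model, not the merged change: `Migration` is a hypothetical dataclass, `clean_evacuation_allocations` is a hypothetical helper, and allocations are modelled as sets of host names. It follows the two stated rules: only allocations left by successful ('done') evacuations from the deleted host are removed, and the migration records themselves are left untouched.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    """Hypothetical, trimmed-down migration record."""
    instance_uuid: str
    source_compute: str
    migration_type: str
    status: str


def clean_evacuation_allocations(allocations, migrations, host):
    """Drop allocations that done evacuations left behind on `host`.

    allocations: instance UUID -> set of hosts it holds allocations on
    Returns the list of instance UUIDs that were cleaned up.
    """
    cleaned = []
    for migration in migrations:
        if (migration.migration_type == "evacuation"
                and migration.source_compute == host
                and migration.status == "done"):
            held = allocations.get(migration.instance_uuid, set())
            if host in held:
                held.discard(host)
                cleaned.append(migration.instance_uuid)
            # migration.status is deliberately left as 'done'.
    return cleaned


allocations = {"vm1": {"node1", "node2"}, "vm2": {"node1", "node3"}}
migrations = [
    Migration("vm1", "node1", "evacuation", "done"),
    Migration("vm2", "node1", "resize", "finished"),
]
print(clean_evacuation_allocations(allocations, migrations, "node1"))
```

In this model the unconfirmed resize for `vm2` is not touched, which matches the thread's note that the pending resize/cold migrate scenario is tracked separately as bug 1852610.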

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/xena)

Fix proposed to branch: stable/xena
Review: https://review.opendev.org/c/openstack/nova/+/816954

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/xena)

Reviewed: https://review.opendev.org/c/openstack/nova/+/816954
Committed: https://opendev.org/openstack/nova/commit/037e588788e60d7b51ebe2cbb0787b3008f402fd
Submitter: "Zuul (22348)"
Branch: stable/xena

commit 037e588788e60d7b51ebe2cbb0787b3008f402fd
Author: Takashi Kajinami <email address hidden>
Date: Thu Mar 4 22:27:25 2021 +0900

    Clean up allocations left by evacuation when deleting service

    When a compute node goes down and all instances on the compute node
    are evacuated, allocation records about these instance are still left
    in the source compute node until nova-compute service is again started
    on the node. However if a compute node is completely broken, it is not
    possible to start the service again.
    In this situation deleting nova-compute service for the compute node
    doesn't delete its resource provider record, and even if a user tries
    to delete the resource provider, the delete request is rejected because
    allocations are still left on that node.

    This change ensures that remaining allocations left by successful
    evacuations are cleared when deleting a nova-compute service, to avoid
    any resource provider record left even if a compute node can't be
    recovered. Migration records are still left in 'done' status to trigger
    clean-up tasks in case the compute node is recovered later.

    Closes-Bug: #1829479
    Change-Id: I3ce6f6275bfe09d43718c3a491b3991a804027bd
    (cherry picked from commit e5a34fffdf97fcda7d0abfdc9e23485479ca2c4f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.1.0

This issue was fixed in the openstack/nova 24.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 25.0.0.0rc1

This issue was fixed in the openstack/nova 25.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/844753

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/844753
Committed: https://opendev.org/openstack/nova/commit/783598eec147223f41cc31e3488119bf0aeed656
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 783598eec147223f41cc31e3488119bf0aeed656
Author: Takashi Kajinami <email address hidden>
Date: Thu Mar 4 22:27:25 2021 +0900

    Clean up allocations left by evacuation when deleting service

    When a compute node goes down and all instances on the compute node
    are evacuated, allocation records about these instance are still left
    in the source compute node until nova-compute service is again started
    on the node. However if a compute node is completely broken, it is not
    possible to start the service again.
    In this situation deleting nova-compute service for the compute node
    doesn't delete its resource provider record, and even if a user tries
    to delete the resource provider, the delete request is rejected because
    allocations are still left on that node.

    This change ensures that remaining allocations left by successful
    evacuations are cleared when deleting a nova-compute service, to avoid
    any resource provider record left even if a compute node can't be
    recovered. Migration records are still left in 'done' status to trigger
    clean-up tasks in case the compute node is recovered later.

    Conflicts:
            nova/scheduler/client/report.py

    Closes-Bug: #1829479
    Change-Id: I3ce6f6275bfe09d43718c3a491b3991a804027bd
    (cherry picked from commit e5a34fffdf97fcda7d0abfdc9e23485479ca2c4f)
    (cherry picked from commit 037e588788e60d7b51ebe2cbb0787b3008f402fd)

tags: added: in-stable-wallaby
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova wallaby-eom

This issue was fixed in the openstack/nova wallaby-eom release.
