The allocation table has residual records when instance is evacuated and the source physical node is removed

Bug #1829479 reported by Sun Mengyun on 2019-05-17
This bug affects 6 people
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Matt Riedemann

Bug Description

Description
===========

When a compute node service goes down due to a failure, we evacuate the instances located on it. After a successful evacuation, the relevant records in the allocation table are not cleared. They are only cleared when the compute service on the source node is restored.

Unfortunately, if the node is down because of an unrecoverable failure and the compute service on it will never be restored, the residual records stay in the allocation table.

Furthermore, if we try to delete the down compute service, the record associated with this service is not deleted from the resource_provider table, because of the residual record in the allocation table.

Perhaps we need to clear the allocation table right after a successful evacuation, not only after the source node service is restored.
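
The difference between the current behaviour and the suggestion above can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the `allocations` dict, the `boot`, `evacuate` and `providers_in_use` helpers, and the `cleanup_source` flag are hypothetical names made up for illustration and are not nova or placement code.

```python
from collections import defaultdict

# Toy stand-in for the allocation table (not the real schema):
# instance UUID -> {resource provider name -> resources}.
allocations = defaultdict(dict)


def boot(instance, host, resources):
    """Record an allocation for a new instance against its host."""
    allocations[instance][host] = dict(resources)


def evacuate(instance, source, dest, cleanup_source=False):
    """Copy the allocation to dest; optionally drop the source allocation."""
    allocations[instance][dest] = dict(allocations[instance][source])
    if cleanup_source:
        # Hypothetical behaviour suggested in this report.
        del allocations[instance][source]


def providers_in_use():
    """Names of all providers that still hold at least one allocation."""
    return {host for held in allocations.values() for host in held}


# Behaviour described in the report: the source allocation stays behind.
boot("vm-1", "node-a", {"VCPU": 2, "MEMORY_MB": 4096})
evacuate("vm-1", "node-a", "node-b")
leaked = providers_in_use()

# Suggested behaviour: the source allocation is removed on evacuation.
allocations.clear()
boot("vm-1", "node-a", {"VCPU": 2, "MEMORY_MB": 4096})
evacuate("vm-1", "node-a", "node-b", cleanup_source=True)
cleaned = providers_in_use()
```

In the first run `node-a` is still referenced after the evacuation, which is what later blocks the removal of its resource provider record.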

Steps to reproduce
==================

1. Take a compute service down.

2. Evacuate the instances on it.

3. Delete the compute service with the command: nova service-delete uuid

Expected result
===============

The compute service is deleted successfully, and the resource_provider table has no related record.

Actual result
=============

The compute service is deleted successfully, but the resource_provider table still has the related record.

Sun Mengyun (kmehxhcr) wrote :

This may sound like https://bugs.launchpad.net/nova/+bug/1724172, but that bug covers the scenario of reusing the original physical node, rather than the case where the node is completely unavailable.

Chris Dent (cdent) on 2019-05-17
tags: added: compute placement scheduler
Balazs Gibizer (balazs-gibizer) wrote :

If the compute service is deleted and never started up again, then the fix for https://bugs.launchpad.net/nova/+bug/1724172 cannot clean up the allocation, as that cleanup runs when the compute service comes up after the compute host is redeployed.

In this bug I see two possible cases:
a) If the compute service is deleted and the compute host name is never used again, then I don't see why it is a problem to keep some allocations in placement, as we are only leaking allocations for something that will never be used again. An admin can also manually delete the allocation and the compute RP in placement after the compute service is deleted.

b) If the compute service is later re-created with the same host name, then the bugfix in https://bugs.launchpad.net/nova/+bug/1724172 will clean up the allocation in placement when the compute service restarts.

Dear reporter, which use case are you targeting with this bug report?

Cheers,
gibi

Changed in nova:
status: New → Incomplete
Matt Riedemann (mriedem) wrote :

Chris Friesen brought up what sounds like a similar issue in IRC today:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-06-05.log.html#t2019-06-05T20:13:23

A host goes down and their tooling automatically evacuates the instances from it. The allocations will still be on the source host in this case because nova doesn't remove the allocations from the evacuated host until the service is restarted.

If you try to delete the compute service in this case, deleting the resource provider will fail here, but the failure is ignored:

https://github.com/openstack/nova/blob/653515a45032811b6bc2f1d0fb651472005496ec/nova/scheduler/client/report.py#L2183

Which means we'll continue to delete the compute_nodes and services table records for that service:

https://github.com/openstack/nova/blob/653515a45032811b6bc2f1d0fb651472005496ec/nova/api/openstack/compute/services.py#L279

But a resource provider still exists with that hostname, so trying to restart the compute service after that will fail because a provider already exists with that name but has a different UUID (which maybe makes this related to bug 1817833).
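
The sequence in this comment (ignored failure, orphaned provider, name conflict on restart) can be sketched with a toy model. Everything here is an assumption made for illustration: `ToyPlacement`, `ProviderNameConflict` and `delete_compute_service` are hypothetical names, and the local `ResourceProviderInUse` class only borrows the name of the error mentioned in this report. None of it is the real nova or placement code.

```python
import uuid


class ResourceProviderInUse(Exception):
    """Toy error: the provider still has allocations against it."""


class ProviderNameConflict(Exception):
    """Toy error: a provider with this name already exists."""


class ToyPlacement:
    def __init__(self):
        self.providers = {}    # provider uuid -> name
        self.allocations = {}  # consumer -> set of provider uuids

    def create_provider(self, name):
        if name in self.providers.values():
            raise ProviderNameConflict(name)
        rp_uuid = str(uuid.uuid4())
        self.providers[rp_uuid] = name
        return rp_uuid

    def delete_provider(self, rp_uuid):
        if any(rp_uuid in held for held in self.allocations.values()):
            raise ResourceProviderInUse(rp_uuid)
        del self.providers[rp_uuid]


def delete_compute_service(placement, services, host, rp_uuid):
    try:
        placement.delete_provider(rp_uuid)
    except ResourceProviderInUse:
        pass  # swallowed, mirroring the ignored failure described above
    services.discard(host)  # the service record is removed regardless


placement = ToyPlacement()
services = {"node-a", "node-b"}
rp_a = placement.create_provider("node-a")
rp_b = placement.create_provider("node-b")
# State after evacuating vm-1 from node-a to node-b.
placement.allocations["vm-1"] = {rp_a, rp_b}

delete_compute_service(placement, services, "node-a", rp_a)
orphaned = "node-a" not in services and rp_a in placement.providers

try:
    # A restarted compute service asks for a provider with the same name.
    placement.create_provider("node-a")
    restart_failed = False
except ProviderNameConflict:
    restart_failed = True
```

In this model the service record is gone while the provider remains, and the later create call fails on the duplicate name.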

Matt Riedemann (mriedem) wrote :
Changed in nova:
status: Incomplete → Triaged
importance: Undecided → Medium

Reviewed: https://review.opendev.org/663737
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2629d65fbc15d8698f98117e0d6072810f70da03
Submitter: Zuul
Branch: master

commit 2629d65fbc15d8698f98117e0d6072810f70da03
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
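
The gap in the cascade logic described in this commit message can be illustrated with a few lines of Python. This is a simplified sketch, not the nova implementation: the function below only shares its name with `delete_resource_provider`, and the `instances` and `allocations` dicts are hypothetical stand-ins for the database and placement state.

```python
def delete_resource_provider(instances, allocations, host, cascade=True):
    """Toy model: return True if the provider for host could be deleted.

    instances:   instance id -> host the instance currently runs on
    allocations: instance id -> set of provider names holding allocations
    """
    if cascade:
        # Only instances currently on the host are considered.
        for instance, current_host in instances.items():
            if current_host == host:
                allocations.get(instance, set()).discard(host)
    # Deletion is refused while any allocation still references the host.
    return not any(host in held for held in allocations.values())


# vm-1 still lives on node-a; vm-2 was evacuated from node-a to node-b.
instances = {"vm-1": "node-a", "vm-2": "node-b"}
allocations = {"vm-1": {"node-a"}, "vm-2": {"node-a", "node-b"}}

deleted_before = delete_resource_provider(instances, allocations, "node-a")

# After the leftover allocation is removed by hand, deletion succeeds.
allocations["vm-2"].discard("node-a")
deleted_after = delete_resource_provider(instances, allocations, "node-a")
```

The evacuated instance is missed because its current host is already the destination, so its allocation against the source keeps the provider in use.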

Fix proposed to branch: master
Review: https://review.opendev.org/678100

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Triaged → In Progress
Matt Riedemann (mriedem) wrote :

I have a recreate on devstack with some notes on cleaning up the allocations for the instance against the source compute node resource provider:

http://paste.openstack.org/show/785587/
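
As a rough idea of what such a manual cleanup involves, the helper below takes a consumer's current allocations and builds a replacement body that no longer references the source provider. It is a hedged sketch: `allocations_without_provider` is a hypothetical helper, the dict layout is an assumption modelled on the placement allocations API (verify field names and microversion against the API reference), and the identifiers in the example are placeholders rather than real UUIDs. No request is sent.

```python
def allocations_without_provider(current, source_rp_uuid):
    """Build an allocations body that omits one resource provider.

    `current` is assumed to look like the response of
    GET /allocations/{consumer_uuid} at a microversion that includes
    consumer_generation, project_id and user_id.
    """
    remaining = {
        rp_uuid: {"resources": dict(alloc["resources"])}
        for rp_uuid, alloc in current["allocations"].items()
        if rp_uuid != source_rp_uuid
    }
    if not remaining:
        # Guard against wiping the only allocation the server has.
        raise ValueError("refusing to remove the consumer's only allocation")
    return {
        "allocations": remaining,
        "consumer_generation": current["consumer_generation"],
        "project_id": current["project_id"],
        "user_id": current["user_id"],
    }


# Placeholder data for an evacuated server with a leftover source allocation.
current = {
    "allocations": {
        "rp-source": {"generation": 3,
                      "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
        "rp-dest": {"generation": 7,
                    "resources": {"VCPU": 2, "MEMORY_MB": 4096}},
    },
    "consumer_generation": 2,
    "project_id": "demo-project",
    "user_id": "demo-user",
}
body = allocations_without_provider(current, "rp-source")
```

The resulting body keeps only the destination provider's resources and carries the consumer generation forward, so a concurrent change to the consumer could be detected by the service that receives it.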

Reviewed: https://review.opendev.org/691427
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=dcd3f516d2fa44c4056a307a11f6e14433476fb0
Submitter: Zuul
Branch: master

commit dcd3f516d2fa44c4056a307a11f6e14433476fb0
Author: Matt Riedemann <email address hidden>
Date: Fri Oct 25 16:42:09 2019 -0400

    doc: add troubleshooting guide for cleaning up orphaned allocations

    While we do not have an automated fix for bug 1829479 this provides
    a troubleshooting document for working around that issue where
    allocations from a server that was evacuated from a down host need
    to be cleaned up manually in order to delete the resource provider
    and associated compute node/service.

    In general this is also a useful guide for linking up the various
    resources and terms in nova and how they are reflected in placement
    with the relevant commands which is probably something we should
    do more of in our docs.

    Change-Id: I120e1ddd7946a371888bfc890b5979f2e19288cd
    Related-Bug: #1829479

Matt Riedemann (mriedem) wrote :

Created a related bug 1852610 for the orphaned provider scenario with a pending resize/cold migrate where the source compute service/node is deleted.

Reviewed: https://review.opendev.org/695932
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b18e42d20bd7d341e713292bdb179ae8e5530d33
Submitter: Zuul
Branch: stable/stein

commit b18e42d20bd7d341e713292bdb179ae8e5530d33
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)

tags: added: in-stable-stein

Reviewed: https://review.opendev.org/696582
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6c704cc1c5648947b7a9b1ccbfd8037caa436766
Submitter: Zuul
Branch: master

commit 6c704cc1c5648947b7a9b1ccbfd8037caa436766
Author: Matt Riedemann <email address hidden>
Date: Thu Nov 28 09:18:21 2019 -0500

    Add resource provider allocation unset example to troubleshooting doc

    Now that the openstack resource provider allocation unset command is
    available [1] this change adds a note about using it in the troubleshooting
    doc for cleaning up orphaned allocations.

    Sub-sections are used to try and separate the two non-heal_allocations
    solutions with the recommended solution first (using the new unset command).

    While in here I noticed a typo in the heal_allocations section as well and
    fixed it.

    [1] I627bfd1ff699d075028da6afafbe7fb9b2f13058

    Change-Id: I896bb68c4bdd35d051ef3e95e19bdeb472f9bc99
    Related-Bug: #1829479

Reviewed: https://review.opendev.org/698106
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6eda7409fff75449c97843b2d6ead0b3267a1099
Submitter: Zuul
Branch: stable/rocky

commit 6eda7409fff75449c97843b2d6ead0b3267a1099
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    NOTE(mriedem): Note that in this backport a simple version of
    assertFlavorMatchesUsage is added since the original version from
    change If6aa37d9b6b48791e070799ab026c816fda4441c is not in Rocky.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)
    (cherry picked from commit b18e42d20bd7d341e713292bdb179ae8e5530d33)

tags: added: in-stable-rocky

Reviewed: https://review.opendev.org/699698
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=23ca5e5ac9b90ff45074ae9171f63ca060ebcedd
Submitter: Zuul
Branch: stable/queens

commit 23ca5e5ac9b90ff45074ae9171f63ca060ebcedd
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in an
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Conflicts:
          nova/tests/functional/integrated_helpers.py

    NOTE(mriedem): The conflict is due to not having change
    Iea283322124cb35fc0bc6d25f35548621e8c8c2f in Queens so the
    change to ProviderUsageBaseTestCase is made in test_servers.py
    rather than integrated_helpers.py.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
    (cherry picked from commit 2629d65fbc15d8698f98117e0d6072810f70da03)
    (cherry picked from commit b18e42d20bd7d341e713292bdb179ae8e5530d33)
    (cherry picked from commit 6eda7409fff75449c97843b2d6ead0b3267a1099)

tags: added: in-stable-queens