The allocation table has residual records when instance is evacuated and the source physical node is removed

Bug #1829479 reported by Sun Mengyun on 2019-05-17
This bug affects 6 people
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Unassigned

Bug Description

Description

===========

When a compute node's service goes down due to a failure, we evacuate the instances located on it. After a successful evacuation, the relevant records in the allocation table are not cleared; they are only cleared when the compute service on the source node is restored.

Unfortunately, if the failed node is down because of an unrecoverable failure and the compute service on it will never be restored, residual records remain in the allocation table.

Furthermore, if we try to delete the down compute service, the record associated with this service will not be deleted from the resource_provider table because of the residual records in the allocation table.

Perhaps we need to clear the allocation table right after a successful evacuation, not only once the source node's service is restored.
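
To illustrate the cleanup proposed above: after an evacuation the instance (the consumer) holds allocations against both the source and the destination provider, so the cleanup has to drop only the source provider's share rather than delete all of the consumer's allocations. This is a minimal sketch, not nova's implementation. The function name `drop_source_allocation` and the sample provider names are hypothetical, and the dict layout is assumed to follow the body returned by placement's `GET /allocations/{consumer_uuid}`.

```python
import copy


def drop_source_allocation(consumer_allocs, source_rp_uuid):
    """Return a copy of a consumer's allocations without the source provider.

    consumer_allocs is assumed to look like
    {"allocations": {rp_uuid: {"resources": {...}}, ...}, ...}.
    The input dict is left unmodified.
    """
    pruned = copy.deepcopy(consumer_allocs)
    # Remove only the evacuated (source) provider; keep the destination.
    pruned["allocations"].pop(source_rp_uuid, None)
    return pruned


# Hypothetical evacuated instance with allocations on both hosts.
before = {
    "allocations": {
        "source-rp": {"resources": {"VCPU": 2, "MEMORY_MB": 2048}},
        "dest-rp": {"resources": {"VCPU": 2, "MEMORY_MB": 2048}},
    },
}
after = drop_source_allocation(before, "source-rp")
print(sorted(after["allocations"]))
```

In a real deployment the pruned body would presumably be written back with `PUT /allocations/{consumer_uuid}`; that call is left out here so the sketch runs without a placement endpoint.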

Steps to reproduce

==================

1. Take down a compute service

2. Evacuate the instances on it

3. Delete the compute service with the command: nova service-delete uuid

Expected result

===============

The compute service is deleted successfully, and resource_provider has no relevant record

Actual result

=============

The compute service is deleted successfully, but resource_provider still has the relevant record

Sun Mengyun (kmehxhcr) wrote :

This may sound like bug https://bugs.launchpad.net/nova/+bug/1724172, but that bug covers the scenario of reusing the original physical node, rather than the case where the node is completely unavailable.

Chris Dent (cdent) on 2019-05-17
tags: added: compute placement scheduler
Balazs Gibizer (balazs-gibizer) wrote :

If the compute service is deleted and never started up again, then the fix for https://bugs.launchpad.net/nova/+bug/1724172 cannot clean up the allocation, as that runs when the compute service comes up after the compute host is redeployed.

In this bug I see two possible cases:
a) If the compute-service is deleted and the compute host name is never used again, then I don't see why it is a problem to keep some allocation in placement, as we are only leaking allocation for something that will never be used again. The admin can also go and manually delete the allocation and the compute RP in placement after the compute service delete.

b) If the compute-service is later re-created with the same host name, then the bugfix in https://bugs.launchpad.net/nova/+bug/1724172 will clean up the allocation in placement at compute service restart.

Dear reporter, which use case are you targeting with this bug report?

Cheers,
gibi

Changed in nova:
status: New → Incomplete
Matt Riedemann (mriedem) wrote :

Chris Friesen brought up what sounds like a similar issue in IRC today:

http://eavesdrop.openstack.org/irclogs/%23openstack-nova/%23openstack-nova.2019-06-05.log.html#t2019-06-05T20:13:23

A host goes down and their tooling automatically evacuates the instances from it. The allocations will still be on the source host in this case because nova doesn't remove the allocations from the evacuated host until the service is restarted.

If you try to delete the compute service in this case it will fail here but be ignored:

https://github.com/openstack/nova/blob/653515a45032811b6bc2f1d0fb651472005496ec/nova/scheduler/client/report.py#L2183

Which means we'll continue to delete the compute_nodes and services table records for that service:

https://github.com/openstack/nova/blob/653515a45032811b6bc2f1d0fb651472005496ec/nova/api/openstack/compute/services.py#L279

But a resource provider still exists with that hostname, so trying to restart the compute service after that will fail because a provider already exists with that name but has a different UUID (which maybe makes this related to bug 1817833).
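
The restart failure described above can be pictured with a toy model. This is only a sketch under stated assumptions, not placement or nova code: `ToyPlacement`, `NameConflict`, and the host name `compute-1` are hypothetical, and the only behaviour modelled is that a provider name may not be reused under a different UUID.

```python
import uuid


class NameConflict(Exception):
    """Raised when a provider name is already taken by another UUID."""


class ToyPlacement:
    """Toy stand-in for placement: provider names must be unique."""

    def __init__(self):
        self.providers = {}  # provider uuid -> provider name

    def create_provider(self, rp_uuid, name):
        for existing_uuid, existing_name in self.providers.items():
            if existing_name == name and existing_uuid != rp_uuid:
                raise NameConflict(name)
        self.providers[rp_uuid] = name


placement = ToyPlacement()
# Orphaned provider left behind after the compute service was deleted.
placement.create_provider(str(uuid.uuid4()), "compute-1")

# The restarted service reports the same hostname under a new UUID.
try:
    placement.create_provider(str(uuid.uuid4()), "compute-1")
    restarted = True
except NameConflict:
    restarted = False
print(restarted)
```

In this model the second registration is rejected, which corresponds to the compute service being unable to start until the orphaned provider is removed.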

Matt Riedemann (mriedem) wrote :
Changed in nova:
status: Incomplete → Triaged
importance: Undecided → Medium

Reviewed: https://review.opendev.org/663737
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2629d65fbc15d8698f98117e0d6072810f70da03
Submitter: Zuul
Branch: master

commit 2629d65fbc15d8698f98117e0d6072810f70da03
Author: Matt Riedemann <email address hidden>
Date: Thu Jun 6 13:41:09 2019 -0400

    Add functional recreate test for bug 1829479 and bug 1817833

    Change I7b8622b178d5043ed1556d7bdceaf60f47e5ac80 started deleting
    the associated resource provider when a compute service is deleted.
    However, the delete_resource_provider cascade=True logic only looks
    for instances on the given compute service host being deleted which
    will miss (1) allocations remaining from evacuated servers and
    (2) unconfirmed migrations.

    Attempting to delete the resource provider results in a
    ResourceProviderInUse error which delete_resource_provider ignores
    for legacy reasons. This results in the compute service being
    deleted but the resource provider being orphaned. What's more,
    attempting to restart the now-deleted compute service will fail
    because nova-compute will try to create a new resource provider
    with a new uuid but with the same name (based on the hypervisor
    hostname). That failure is actually reported in bug 1817833.

    Change-Id: I69f52f1282c8361c9cdf90a523f3612139cb8423
    Related-Bug: #1829479
    Related-Bug: #1817833
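
The gap described in the commit message can be sketched with a small model. It is an illustration only and does not reproduce nova's `delete_resource_provider`: the helper names, the `ProviderInUse` exception, and the host and server names are hypothetical, and allocations are reduced to a mapping from consumer to the set of providers it allocates against.

```python
class ProviderInUse(Exception):
    """Toy counterpart of the ResourceProviderInUse error."""


def delete_provider(provider, allocations):
    """Refuse to delete a provider that still has allocations."""
    if any(provider in rps for rps in allocations.values()):
        raise ProviderInUse(provider)


def delete_provider_cascade(provider, allocations, instances_on_host):
    """Return True if the provider was deleted, False if it was orphaned.

    allocations maps consumer -> set of providers it allocates against.
    instances_on_host lists the consumers still recorded on this host.
    """
    # The cascade only covers instances still recorded on the host.
    for consumer in instances_on_host:
        allocations.get(consumer, set()).discard(provider)
    try:
        delete_provider(provider, allocations)
    except ProviderInUse:
        # The error is swallowed, so the caller carries on regardless.
        return False
    return True


# The evacuated server runs on host2 but still has an allocation on host1,
# and it is no longer listed as an instance on host1.
allocations = {"evacuated-server": {"host1", "host2"}}
deleted = delete_provider_cascade("host1", allocations, instances_on_host=[])
print(deleted)
```

Because the evacuated server is not among the instances on the host, its allocation survives the cascade and the provider stays behind even though the caller sees no error.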
