Allocation of an evacuated instance is not cleaned up on the source host if the instance is no longer defined on the hypervisor

Bug #1724172 reported by Balazs Gibizer on 2017-10-17
Affects                     Importance  Assigned to
OpenStack Compute (nova)    Medium      Balazs Gibizer
  Queens                    Medium      Unassigned
  Rocky                     Medium      Unassigned
  Stein                     Medium      Balazs Gibizer

Bug Description

Nova does not clean up the allocation of an evacuated instance on the recovered source compute host if the instance is no longer defined on the hypervisor.

To reproduce:
* Boot an instance
* Kill the compute host the instance is booted on
* Evacuate the instance
* Recover the original compute host in a way that clears the instance definition from the hypervisor (e.g. redeploy the compute host).
* Check the allocations of the instance in placement API. The allocation against the source compute host is not cleaned up.

The compute manager is supposed to clean up evacuated instances in its init_host method by calling _destroy_evacuated_instances. However, that function only iterates over the instances known to the hypervisor [1].

[1] https://github.com/openstack/nova/blob/5e4c98a58f1afeaa903829f5e3f28cd6dc30bf4b/nova/compute/manager.py#L654
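A minimal, self-contained sketch of the failure mode (hypothetical function and variable names, not the real nova code): when the cleanup loop is driven by what the hypervisor reports, an instance that was wiped from the hypervisor is skipped, and its placement allocation on the source host leaks. Driving the loop from the evacuation migration records instead removes the allocation regardless of hypervisor state.

```python
# Hypothetical sketch of the bug: cleanup keyed on the hypervisor's view
# misses instances the hypervisor no longer knows about.

def cleanup_evacuations_buggy(driver_instances, evacuation_migrations, allocations):
    """Delete source-host allocations only for instances the driver still reports."""
    evacuated = {m["instance_uuid"] for m in evacuation_migrations}
    for inst_uuid in list(driver_instances):  # iterates the hypervisor view only
        if inst_uuid in evacuated:
            allocations.pop(inst_uuid, None)

def cleanup_evacuations_fixed(driver_instances, evacuation_migrations, allocations):
    """Drive the cleanup from the evacuation migration records instead."""
    for m in evacuation_migrations:  # hypervisor view no longer gates the cleanup
        allocations.pop(m["instance_uuid"], None)

# Redeployed host: the hypervisor reports nothing, but an evacuation happened.
migrations = [{"instance_uuid": "inst-1", "status": "done"}]
allocs = {"inst-1": {"VCPU": 2}}

leaked = dict(allocs)
cleanup_evacuations_buggy([], migrations, leaked)
assert "inst-1" in leaked       # allocation leaks on the source host

cleaned = dict(allocs)
cleanup_evacuations_fixed([], migrations, cleaned)
assert "inst-1" not in cleaned  # allocation is removed
```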

tags: added: evacuate
Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)

Related fix proposed to branch: master
Review: https://review.openstack.org/512553

Fix proposed to branch: master
Review: https://review.openstack.org/512623

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

But will the source provider get the same uuid? Or will re-deploying the compute generate a new compute node uuid and thus a new provider in placement? I guess the compute node lookup in the RT is based on hostname, so as long as the service/compute record wasn't deleted before the re-deploy it should still exist and use the same uuid.

https://github.com/openstack/nova/blob/509a2cca241f61311579c5f53dafd15ad2a40a63/nova/compute/resource_tracker.py#L787

Balazs Gibizer (balazs-gibizer) wrote :

Yeah, I always thought that if the host has the same hostname then it is considered the same compute host. From nova-compute perspective it is just a nova-compute service restart that happened due to a host redeploy. The nova-compute service does not know that it was re-deployed. It sees the same nova.conf and hostname and therefore it will use the same compute node object.
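A hedged illustration of the point above (hypothetical store class, not the real ResourceTracker): because the compute node record is looked up by hostname, a redeployed host with the same hostname reuses the existing record and therefore the same placement provider uuid; only an unknown hostname gets a fresh uuid.

```python
import uuid

class ComputeNodeStore:
    """Hypothetical stand-in for the hostname-keyed compute node lookup."""

    def __init__(self):
        self._by_host = {}

    def get_or_create(self, hostname):
        # An existing record wins; only an unknown hostname creates a new uuid.
        if hostname not in self._by_host:
            self._by_host[hostname] = str(uuid.uuid4())
        return self._by_host[hostname]

store = ComputeNodeStore()
first = store.get_or_create("compute-1")   # initial deploy
second = store.get_or_create("compute-1")  # restart after a redeploy
assert first == second  # same compute node uuid, same placement provider
```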

Reviewed: https://review.opendev.org/512623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9cacaad14e8c18e99e85d9dc04308fee91303f8f
Submitter: Zuul
Branch: master

commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.opendev.org/662189
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b8f2cd689f0a747778080ba4b6e148e71eb53085
Submitter: Zuul
Branch: stable/stein

commit b8f2cd689f0a747778080ba4b6e148e71eb53085
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)

This issue was fixed in the openstack/nova 19.0.1 release.

Reviewed: https://review.opendev.org/512552
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2794748d9c58623045023f34c7793c58ce41447c
Submitter: Zuul
Branch: master

commit 2794748d9c58623045023f34c7793c58ce41447c
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic causing faults
    to slip through. During that bug fix only the minimal change was done
    in the functional env regarding compute service restart to reproduce
    the reported fault. However the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service could be restarted without
    losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic as it
    simulates such a compute host restart that cleans the hypervisor state
    as well (e.g. compute host redeployment). However this type of restart
    shows another bug in the code path that destroys and deallocates
    evacuated instance from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute restart. A subsequent patch will
    add a test that uses the clean hypervisor case to reproduce the
    revealed bug.

    Related-Bug: #1724172
    Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1
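The restart strategy the commit message describes can be sketched as follows (hypothetical helper and classes, not the real functional-test fixture): stop the old service object and construct a brand new one for the same host, so no manager state can leak between the two, while handing the new driver the old driver's instance list to mimic a pure service restart in which the hypervisor keeps its instances.

```python
class FakeDriver:
    """Hypothetical stand-in for the functional env's fake virt driver."""

    def __init__(self, instances=None):
        self.instances = list(instances or [])

class ComputeService:
    """Hypothetical stand-in for a running nova-compute service."""

    def __init__(self, host, driver):
        self.host = host
        self.driver = driver
        self.running = True

    def stop(self):
        self.running = False

def restart_compute_service(old):
    # New service object, hence brand-new manager state; the hypervisor
    # contents are carried over so running servers survive the restart.
    old.stop()
    return ComputeService(old.host, FakeDriver(old.driver.instances))

svc = ComputeService("compute-1", FakeDriver(["inst-1"]))
new = restart_compute_service(svc)
assert new is not svc and new.driver.instances == ["inst-1"]
```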

Reviewed: https://review.opendev.org/512553
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4deab182ba59ee4112c28213f922c051179ba948
Submitter: Zuul
Branch: master

commit 4deab182ba59ee4112c28213f922c051179ba948
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    Add functional test coverage for bug 1724172

    Change-Id: I83bc056e35d3f3b93a58fb615db596166fb9ad57
    Related-Bug: #1724172
