Allocation of an evacuated instance is not cleaned on the source host if instance is not defined on the hypervisor

Bug #1724172 reported by Balazs Gibizer
Affects                    Status         Importance  Assigned to
OpenStack Compute (nova)   Fix Released   Medium      Balazs Gibizer
  Pike                     Fix Released   Low         Balazs Gibizer
  Queens                   Fix Released   Medium      Elod Illes
  Rocky                    Fix Committed  Medium      Balazs Gibizer
  Stein                    Fix Committed  Medium      Balazs Gibizer

Bug Description

Nova does not clean up the allocation of an evacuated instance on the recovered source compute host if the instance is no longer defined on the hypervisor.

To reproduce:
* Boot an instance
* Kill the compute host the instance is booted on
* Evacuate the instance
* Recover the original compute host in a way that clears the instance definition from the hypervisor (e.g. redeploy the compute host).
* Check the allocations of the instance in placement API. The allocation against the source compute host is not cleaned up.

The compute manager is supposed to clean up evacuated instances during its init_host method by calling _destroy_evacuated_instances. However, that function only iterates over the instances known to the hypervisor [1].

[1] https://github.com/openstack/nova/blob/5e4c98a58f1afeaa903829f5e3f28cd6dc30bf4b/nova/compute/manager.py#L654
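The flaw can be illustrated with a simplified model (this is not the actual nova code; the function and variable names here are invented for illustration). Cleanup keyed on the hypervisor's instance list misses allocations for instances the hypervisor no longer reports, while keying on the evacuation records themselves frees them regardless:

```python
# Simplified sketch of the bug in _destroy_evacuated_instances: cleanup that
# iterates over hypervisor-known instances leaks allocations for evacuated
# instances the redeployed hypervisor no longer reports.

def cleanup_buggy(driver_instances, evacuated, allocations):
    """Frees allocations only for evacuated instances the driver still sees."""
    for uuid in driver_instances:
        if uuid in evacuated:
            allocations.pop(uuid, None)

def cleanup_fixed(driver_instances, evacuated, allocations):
    """Frees allocations for every evacuated instance, whether or not the
    hypervisor still reports it (the behavior the fix introduces)."""
    for uuid in evacuated:
        allocations.pop(uuid, None)

# An instance evacuated while the host was down; after a redeploy the
# hypervisor reports an empty instance list.
allocations = {"inst-1": {"VCPU": 1}}
cleanup_buggy(driver_instances=[], evacuated={"inst-1"}, allocations=allocations)
assert "inst-1" in allocations       # leaked allocation: this is the bug
cleanup_fixed(driver_instances=[], evacuated={"inst-1"}, allocations=allocations)
assert "inst-1" not in allocations   # the fixed path frees it
```

The actual fix drives the cleanup from the evacuation migration records in the database rather than from the driver's view of the hypervisor.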

tags: added: evacuate
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/512552

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/512553

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/512623

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

But will the source provider get the same uuid? Or will re-deploying the compute generate a new compute node uuid and thus a new provider in placement? I guess the compute node lookup in the RT is based on hostname, so as long as the service/compute wasn't deleted before the re-deploy it should still exist and use the same uuid.

https://github.com/openstack/nova/blob/509a2cca241f61311579c5f53dafd15ad2a40a63/nova/compute/resource_tracker.py#L787

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Yeah, I always thought that if the host has the same hostname then it is considered the same compute host. From nova-compute perspective it is just a nova-compute service restart that happened due to a host redeploy. The nova-compute service does not know that it was re-deployed. It sees the same nova.conf and hostname and therefore it will use the same compute node object.
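The identity rule discussed in these two comments can be sketched as follows (a hypothetical model, not the resource tracker's real API): because the compute node record is resolved by host/node name, a redeployed host with an unchanged hostname reuses the existing record and therefore the same placement provider uuid.

```python
# Hypothetical sketch: compute node records keyed by (host, nodename).
# A redeploy with the same hostname resolves to the pre-existing record,
# so the placement provider uuid is stable across the redeploy.

compute_nodes = {("compute-1", "compute-1"): "uuid-aaa"}  # pre-existing record

def get_or_create_node(host, nodename):
    key = (host, nodename)
    if key not in compute_nodes:
        # Only a genuinely new host/node pair gets a fresh uuid.
        compute_nodes[key] = f"uuid-new-{host}"
    return compute_nodes[key]

# After the redeploy, the restarted service looks itself up by hostname
# and gets the same provider uuid back.
assert get_or_create_node("compute-1", "compute-1") == "uuid-aaa"
```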

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/512623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9cacaad14e8c18e99e85d9dc04308fee91303f8f
Submitter: Zuul
Branch: master

commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/662189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/662189
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b8f2cd689f0a747778080ba4b6e148e71eb53085
Submitter: Zuul
Branch: stable/stein

commit b8f2cd689f0a747778080ba4b6e148e71eb53085
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/512552
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2794748d9c58623045023f34c7793c58ce41447c
Submitter: Zuul
Branch: master

commit 2794748d9c58623045023f34c7793c58ce41447c
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic causing faults
    to slip through. During that bug fix only the minimal change was done
    in the functional env regarding compute service restart to reproduce
    the reported fault. However the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service could be restarted without
losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic as it
    simulates such a compute host restart that cleans the hypervisor state
    as well (e.g. compute host redeployment). However this type of restart
    shows another bug in the code path that destroys and deallocates
    evacuated instance from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute restart. A subsequent patch will
add a test that uses the clean hypervisor case to reproduce the
    revealed bug.

    Related-Bug: #1724172
    Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/512553
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4deab182ba59ee4112c28213f922c051179ba948
Submitter: Zuul
Branch: master

commit 4deab182ba59ee4112c28213f922c051179ba948
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    Add functional test coverage for bug 1724172

    Change-Id: I83bc056e35d3f3b93a58fb615db596166fb9ad57
    Related-Bug: #1724172

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/687550

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/687873

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/687912

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/687550
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=09de94e39bbcfc7f8130638e73a8248e49cb6ab7
Submitter: Zuul
Branch: stable/rocky

commit 09de94e39bbcfc7f8130638e73a8248e49cb6ab7
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Conflicts:
          nova/compute/manager.py
    Conflict is due to not having I8ec3a3a697e55941ee447d0b52d29785717e4bf0
    in Rocky. Also changes were needed in test_compute_mgr.py because
    I2af45a9540e7ccd60ace80d9fcadc79972da7df7 is missing from Rocky.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)
    (cherry picked from commit b8f2cd689f0a747778080ba4b6e148e71eb53085)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/687873
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=52e68f121eda49dbb817404d3ab1468c2059e1a3
Submitter: Zuul
Branch: stable/queens

commit 52e68f121eda49dbb817404d3ab1468c2059e1a3
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)
    (cherry picked from commit b8f2cd689f0a747778080ba4b6e148e71eb53085)
    (cherry picked from commit 09de94e39bbcfc7f8130638e73a8248e49cb6ab7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/687912
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=07a938d388b03a641c677a937599cfea4e36a13a
Submitter: Zuul
Branch: stable/pike

commit 07a938d388b03a641c677a937599cfea4e36a13a
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note: test_compute_mgr.py needed to be changed because patch
    I7891b98f225f97ad47f189afb9110ef31c810717 is missing from stable/pike

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)
    (cherry picked from commit b8f2cd689f0a747778080ba4b6e148e71eb53085)
    (cherry picked from commit 09de94e39bbcfc7f8130638e73a8248e49cb6ab7)
    (cherry picked from commit 52e68f121eda49dbb817404d3ab1468c2059e1a3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/703103

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/703103
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b874c409c11b5d83508d2f0276a9a648f72192a4
Submitter: Zuul
Branch: stable/stein

commit b874c409c11b5d83508d2f0276a9a648f72192a4
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic causing faults
    to slip through. During that bug fix only the minimal change was done
    in the functional env regarding compute service restart to reproduce
    the reported fault. However the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service could be restarted without
    losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic as it
    simulates such a compute host restart that cleans the hypervisor state
    as well (e.g. compute host redeployment). However this type of restart
    shows another bug in the code path that destroys and deallocates
    evacuated instance from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute restart. A subsequent patch will
    add a test that uses the clean hypervisor case to reproduce the
    revealed bug.

    Related-Bug: #1724172

    On stable/stein:

    Closes-Bug: #1859766

    Note: mock package import added to nova/test.py (due to not having patch
    Ibe7cb29620f06d31059f2a5f94ca180b8671046e in stable/stein)

    Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1
    (cherry picked from commit 2794748d9c58623045023f34c7793c58ce41447c)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.3.0

This issue was fixed in the openstack/nova 18.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713033

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/713033
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=53a893f7c97e35de3e9ac26101827cdb43ed35cc
Submitter: Zuul
Branch: stable/rocky

commit 53a893f7c97e35de3e9ac26101827cdb43ed35cc
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic causing faults
    to slip through. During that bug fix only the minimal change was done
    in the functional env regarding compute service restart to reproduce
    the reported fault. However the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service could be restarted without
    losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic as it
    simulates such a compute host restart that cleans the hypervisor state
    as well (e.g. compute host redeployment). However this type of restart
    shows another bug in the code path that destroys and deallocates
    evacuated instance from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute restart. A subsequent patch will
    add a test that uses the clean hypervisor case to reproduce the
    revealed bug.

    Related-Bug: #1724172

    On stable/stein:

    Closes-Bug: #1859766

    Conflicts:
        doc/notification_samples/libvirt-connect-error.json
        nova/test.py
        nova/tests/functional/libvirt/test_reshape.py
        nova/tests/functional/test_servers.py

    NOTE(elod.illes): files conflicts details:
    * libvirt-connect-error.json:
      File added only in Stein with libvirt.error notification
      transformation patch I7d2287ce06d77c0afdef0ea8bdfb70f6c52d3c50
    * test.py:
      Patches Iecf4dcf8e648c9191bf8846428683ec81812c026 (Remove patching
      the mock lib) and Ibb8c12fb2799bb5ceb9e3d72a2b86dbb4f14451e (Use a
      static resource tracker in compute manager) were not backported to
      Rocky
    * test_reshape.py:
      File added only in Stein in the frame of 'Handling Reshaped Provider
      Trees' feature, with patch Ide797ebf7790d69042ae275ebec6ced3fa4787b6
    * test_servers.py:
      Patch I7cbd5d9fb875ebf72995362e0b6693492ce32051 (Reject forced move
      wit...


tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.
