Allocation of an evacuated instance is not cleaned on the source host if instance is not defined on the hypervisor

Bug #1724172 reported by Balazs Gibizer
Affects                    Status         Importance  Assigned to
OpenStack Compute (nova)   Fix Released   Medium      Balazs Gibizer
  Pike                     Fix Released   Low         Balazs Gibizer
  Queens                   Fix Released   Medium      Elod Illes
  Rocky                    Fix Committed  Medium      Balazs Gibizer
  Stein                    Fix Committed  Medium      Balazs Gibizer

Bug Description

Nova does not clean up the allocation of an evacuated instance on the recovered source compute host if the instance is no longer defined on the hypervisor.

To reproduce:
* Boot an instance
* Kill the compute host the instance is booted on
* Evacuate the instance
* Recover the original compute host in a way that clears the instance definition from the hypervisor (e.g. redeploy the compute host).
* Check the allocations of the instance in placement API. The allocation against the source compute host is not cleaned up.

The compute manager is supposed to clean up evacuated instances during its init_host method by calling _destroy_evacuated_instances. However, that function only iterates over the instances known to the hypervisor [1].

[1] https://github.com/openstack/nova/blob/5e4c98a58f1afeaa903829f5e3f28cd6dc30bf4b/nova/compute/manager.py#L654
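The flaw can be illustrated with a simplified model (this is not the actual nova code; the function and variable names here are invented for illustration). Cleanup keyed on the hypervisor's instance list misses allocations for instances the hypervisor no longer reports, while keying on the evacuation records themselves frees them regardless:

```python
# Simplified sketch of the bug in _destroy_evacuated_instances: cleanup that
# iterates over hypervisor-known instances leaks allocations for evacuated
# instances the redeployed hypervisor no longer reports.

def cleanup_buggy(driver_instances, evacuated, allocations):
    """Frees allocations only for evacuated instances the driver still sees."""
    for uuid in driver_instances:
        if uuid in evacuated:
            allocations.pop(uuid, None)

def cleanup_fixed(driver_instances, evacuated, allocations):
    """Frees allocations for every evacuated instance, whether or not the
    hypervisor still reports it (the behavior the fix introduces)."""
    for uuid in evacuated:
        allocations.pop(uuid, None)

# An instance evacuated while the host was down; after a redeploy the
# hypervisor reports an empty instance list.
allocations = {"inst-1": {"VCPU": 1}}
cleanup_buggy(driver_instances=[], evacuated={"inst-1"}, allocations=allocations)
assert "inst-1" in allocations       # leaked allocation: this is the bug
cleanup_fixed(driver_instances=[], evacuated={"inst-1"}, allocations=allocations)
assert "inst-1" not in allocations   # the fixed path frees it
```

The actual fix drives the cleanup from the evacuation migration records in the database rather than from the driver's view of the hypervisor.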

tags: added: evacuate
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/512552

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.openstack.org/512553

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/512623

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

But will the source provider get the same uuid? Or will re-deploying the compute generate a new compute node uuid and thus a new provider in placement? I guess the compute node lookup in the RT is based on hostname, so as long as the service/compute wasn't deleted before the re-deploy it should still exist and use the same uuid.

https://github.com/openstack/nova/blob/509a2cca241f61311579c5f53dafd15ad2a40a63/nova/compute/resource_tracker.py#L787

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Yeah, I always thought that if the host has the same hostname then it is considered the same compute host. From nova-compute perspective it is just a nova-compute service restart that happened due to a host redeploy. The nova-compute service does not know that it was re-deployed. It sees the same nova.conf and hostname and therefore it will use the same compute node object.
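The identity rule discussed in these two comments can be sketched as follows (a hypothetical model, not the resource tracker's real API): because the compute node record is resolved by host/node name, a redeployed host with an unchanged hostname reuses the existing record and therefore the same placement provider uuid.

```python
# Hypothetical sketch: compute node records keyed by (host, nodename).
# A redeploy with the same hostname resolves to the pre-existing record,
# so the placement provider uuid is stable across the redeploy.

compute_nodes = {("compute-1", "compute-1"): "uuid-aaa"}  # pre-existing record

def get_or_create_node(host, nodename):
    key = (host, nodename)
    if key not in compute_nodes:
        # Only a genuinely new host/node pair gets a fresh uuid.
        compute_nodes[key] = f"uuid-new-{host}"
    return compute_nodes[key]

# After the redeploy, the restarted service looks itself up by hostname
# and gets the same provider uuid back.
assert get_or_create_node("compute-1", "compute-1") == "uuid-aaa"
```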

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/512623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9cacaad14e8c18e99e85d9dc04308fee91303f8f
Submitter: Zuul
Branch: master

commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/662189

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/662189
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b8f2cd689f0a747778080ba4b6e148e71eb53085
Submitter: Zuul
Branch: stable/stein

commit b8f2cd689f0a747778080ba4b6e148e71eb53085
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.1

This issue was fixed in the openstack/nova 19.0.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/512552
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=2794748d9c58623045023f34c7793c58ce41447c
Submitter: Zuul
Branch: master

commit 2794748d9c58623045023f34c7793c58ce41447c
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic causing faults
    to slip through. During that bug fix only the minimal change was done
    in the functional env regarding compute service restart to reproduce
    the reported fault. However the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service could be restarted without
losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic as it
    simulates such a compute host restart that cleans the hypervisor state
    as well (e.g. compute host redeployment). However this type of restart
    shows another bug in the code path that destroys and deallocates
    evacuated instance from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute restart. A subsequent patch will
add a test that uses the clean hypervisor case to reproduce the
    revealed bug.

    Related-Bug: #1724172
    Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/512553
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4deab182ba59ee4112c28213f922c051179ba948
Submitter: Zuul
Branch: master

commit 4deab182ba59ee4112c28213f922c051179ba948
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    Add functional test coverage for bug 1724172

    Change-Id: I83bc056e35d3f3b93a58fb615db596166fb9ad57
    Related-Bug: #1724172

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.0.0.0rc1

This issue was fixed in the openstack/nova 20.0.0.0rc1 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/687550

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/687873

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.opendev.org/687912

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/687550
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=09de94e39bbcfc7f8130638e73a8248e49cb6ab7
Submitter: Zuul
Branch: stable/rocky

commit 09de94e39bbcfc7f8130638e73a8248e49cb6ab7
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Conflicts:
          nova/compute/manager.py
    Conflict is due to not having I8ec3a3a697e55941ee447d0b52d29785717e4bf0
    in Rocky. Also changes were needed in test_compute_mgr.py because
    I2af45a9540e7ccd60ace80d9fcadc79972da7df7 is missing from Rocky.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)
    (cherry picked from commit b8f2cd689f0a747778080ba4b6e148e71eb53085)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.opendev.org/687873
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=52e68f121eda49dbb817404d3ab1468c2059e1a3
Submitter: Zuul
Branch: stable/queens

commit 52e68f121eda49dbb817404d3ab1468c2059e1a3
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note that the functional test coverage will be added on top as it needs
    some refactoring that would make the bugfix non backportable.

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)
    (cherry picked from commit b8f2cd689f0a747778080ba4b6e148e71eb53085)
    (cherry picked from commit 09de94e39bbcfc7f8130638e73a8248e49cb6ab7)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.opendev.org/687912
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=07a938d388b03a641c677a937599cfea4e36a13a
Submitter: Zuul
Branch: stable/pike

commit 07a938d388b03a641c677a937599cfea4e36a13a
Author: Balazs Gibizer <email address hidden>
Date: Tue Oct 17 15:06:59 2017 +0200

    cleanup evacuated instances not on hypervisor

    When the compute host recovers from a failure the compute manager
    destroys instances that were evacuated from the host while it was down.
    However these code paths only consider evacuated instances that are
    still reported by the hypervisor. This means that if the compute
    host is recovered in a way that the hypervisor lost the definition
    of the instances (for example the compute host was redeployed) then
    the allocation of these instances will not be deleted.

    This patch makes sure that the instance allocation is cleaned up
    even if the driver doesn't return that instance as exists on the
    hypervisor.

    Note: test_compute_mgr.py needed to be changed because patch
    I7891b98f225f97ad47f189afb9110ef31c810717 is missing from stable/pike

    Change-Id: I4bc81b482380c5778781659c4d167a712316dab4
    Closes-Bug: #1724172
    (cherry picked from commit 9cacaad14e8c18e99e85d9dc04308fee91303f8f)
    (cherry picked from commit b8f2cd689f0a747778080ba4b6e148e71eb53085)
    (cherry picked from commit 09de94e39bbcfc7f8130638e73a8248e49cb6ab7)
    (cherry picked from commit 52e68f121eda49dbb817404d3ab1468c2059e1a3)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/stein)

Related fix proposed to branch: stable/stein
Review: https://review.opendev.org/703103

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/703103
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b874c409c11b5d83508d2f0276a9a648f72192a4
Submitter: Zuul
Branch: stable/stein

commit b874c409c11b5d83508d2f0276a9a648f72192a4
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic causing faults
    to slip through. During that bug fix only the minimal change was done
    in the functional env regarding compute service restart to reproduce
    the reported fault. However the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service could be restarted without
    losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic as it
    simulates such a compute host restart that cleans the hypervisor state
    as well (e.g. compute host redeployment). However this type of restart
    shows another bug in the code path that destroys and deallocates
    evacuated instance from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute restart. A subsequent patch will
    add a test that uses the clean hypervisor case to reproduce the
    revealed bug.

    Related-Bug: #1724172

    On stable/stein:

    Closes-Bug: #1859766

    Note: mock package import added to nova/test.py (due to not having patch
    Ibe7cb29620f06d31059f2a5f94ca180b8671046e in stable/stein)

    Change-Id: I9d6cd6259659a35383c0c9c21db72a9434ba86b1
    (cherry picked from commit 2794748d9c58623045023f34c7793c58ce41447c)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.3.0

This issue was fixed in the openstack/nova 18.3.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.opendev.org/713033

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.opendev.org/713033
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=53a893f7c97e35de3e9ac26101827cdb43ed35cc
Submitter: Zuul
Branch: stable/rocky

commit 53a893f7c97e35de3e9ac26101827cdb43ed35cc
Author: Balazs Gibizer <email address hidden>
Date: Wed May 1 23:38:40 2019 +0200

    Enhance service restart in functional env

    Bugfix Icaf1bae8cb040b939f916a19ce026031ddb84af7 showed that restarting
    a compute service in the functional env is unrealistic causing faults
    to slip through. During that bug fix only the minimal change was done
    in the functional env regarding compute service restart to reproduce
    the reported fault. However the restart of the compute service could
    be made even more realistic.

    This patch simulates a compute service restart in the functional env
    by stopping the original compute service and starting a totally new
    compute service for the same host and node. This way we can make sure
    that we get a brand new ComputeManager in the new service and no
    state can leak between the old and the new service.

    This change revealed another shortcoming of the functional env.
    In the real world the nova-compute service could be restarted without
    losing any running servers on the compute host. But with the naive
    implementation of this change the compute service is re-created. This
    means that a new ComputeManager is instantiated that loads a new
    FakeDriver instance as well. That new FakeDriver instance then reports
    an empty hypervisor. This behavior is not totally unrealistic as it
    simulates such a compute host restart that cleans the hypervisor state
    as well (e.g. compute host redeployment). However this type of restart
    shows another bug in the code path that destroys and deallocates
    evacuated instance from the source host. Therefore this patch
    implements the compute service restart in a way that simulates only a
    service restart and not a full compute restart. A subsequent patch will
    add a test that uses the clean hypervisor case to reproduce the
    revealed bug.

    Related-Bug: #1724172

    On stable/stein:

    Closes-Bug: #1859766

    Conflicts:
        doc/notification_samples/libvirt-connect-error.json
        nova/test.py
        nova/tests/functional/libvirt/test_reshape.py
        nova/tests/functional/test_servers.py

    NOTE(elod.illes): files conflicts details:
    * libvirt-connect-error.json:
      File added only in Stein with libvirt.error notification
      transformation patch I7d2287ce06d77c0afdef0ea8bdfb70f6c52d3c50
    * test.py:
      Patches Iecf4dcf8e648c9191bf8846428683ec81812c026 (Remove patching
      the mock lib) and Ibb8c12fb2799bb5ceb9e3d72a2b86dbb4f14451e (Use a
      static resource tracker in compute manager) were not backported to
      Rocky
    * test_reshape.py:
      File added only in Stein in the frame of 'Handling Reshaped Provider
      Trees' feature, with patch Ide797ebf7790d69042ae275ebec6ced3fa4787b6
    * test_servers.py:
      Patch I7cbd5d9fb875ebf72995362e0b6693492ce32051 (Reject forced move
      wit...


tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova pike-eol

This issue was fixed in the openstack/nova pike-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova queens-eol

This issue was fixed in the openstack/nova queens-eol release.
