_destroy_evacuated_instances fails and kills n-cpu startup if lazy-loading flavor on a deleted instance

Bug #1794996 reported by Matt Riedemann
This bug affects 2 people
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   High        Matt Riedemann
Pike                      Fix Committed  High        Matt Riedemann
Queens                    Fix Committed  High        Matt Riedemann
Rocky                     Fix Committed  High        Matt Riedemann

Bug Description

Seen here:

http://logs.openstack.org/00/604400/5/check/nova-live-migration/6aa7a4b/logs/subnode-2/screen-n-cpu.txt.gz#_Sep_26_23_53_06_475005

Sep 26 23:53:06.475005 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: DEBUG nova.objects.instance [None req-0b213f62-b666-4a02-b465-7f36d24e2fbf None None] Lazy-loading 'flavor' on Instance uuid 1d4eaa12-1477-4528-a06b-52e5672c6c61 {{(pid=31767) obj_load_attr /opt/stack/new/nova/nova/objects/instance.py:1109}}
Sep 26 23:53:06.547498 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service [None req-0b213f62-b666-4a02-b465-7f36d24e2fbf None None] Error starting thread.: InstanceNotFound_Remote: Instance 1d4eaa12-1477-4528-a06b-52e5672c6c61 could not be found.
Sep 26 23:53:06.547786 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: Traceback (most recent call last):
Sep 26 23:53:06.548012 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/conductor/manager.py", line 126, in _object_dispatch
Sep 26 23:53:06.548241 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: return getattr(target, method)(*args, **kwargs)
Sep 26 23:53:06.548456 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 184, in wrapper
Sep 26 23:53:06.548670 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: result = fn(cls, context, *args, **kwargs)
Sep 26 23:53:06.548881 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/objects/instance.py", line 503, in get_by_uuid
Sep 26 23:53:06.549100 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: use_slave=use_slave)
Sep 26 23:53:06.549319 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 210, in wrapper
Sep 26 23:53:06.549530 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: return f(*args, **kwargs)
Sep 26 23:53:06.549743 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/objects/instance.py", line 495, in _db_instance_get_by_uuid
Sep 26 23:53:06.549991 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: columns_to_join=columns_to_join)
Sep 26 23:53:06.550272 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/db/api.py", line 758, in instance_get_by_uuid
Sep 26 23:53:06.550515 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: return IMPL.instance_get_by_uuid(context, uuid, columns_to_join)
Sep 26 23:53:06.551049 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 168, in wrapper
Sep 26 23:53:06.551268 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: return f(*args, **kwargs)
Sep 26 23:53:06.551480 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 255, in wrapped
Sep 26 23:53:06.551696 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: return f(context, *args, **kwargs)
Sep 26 23:53:06.551908 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 1843, in instance_get_by_uuid
Sep 26 23:53:06.552125 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: columns_to_join=columns_to_join)
Sep 26 23:53:06.552344 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: File "/opt/stack/new/nova/nova/db/sqlalchemy/api.py", line 1852, in _instance_get_by_uuid
Sep 26 23:53:06.552561 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: raise exception.InstanceNotFound(instance_id=uuid)
Sep 26 23:53:06.552772 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: InstanceNotFound: Instance 1d4eaa12-1477-4528-a06b-52e5672c6c61 could not be found.
Sep 26 23:53:06.552989 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service Traceback (most recent call last):
Sep 26 23:53:06.553210 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_service/service.py", line 796, in run_service
Sep 26 23:53:06.553420 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service service.start()
Sep 26 23:53:06.553628 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/opt/stack/new/nova/nova/service.py", line 162, in start
Sep 26 23:53:06.553878 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service self.manager.init_host()
Sep 26 23:53:06.554108 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/opt/stack/new/nova/nova/compute/manager.py", line 1204, in init_host
Sep 26 23:53:06.554318 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service evacuated_instances = self._destroy_evacuated_instances(context)
Sep 26 23:53:06.554527 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/opt/stack/new/nova/nova/compute/manager.py", line 710, in _destroy_evacuated_instances
Sep 26 23:53:06.554741 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service context, instance, cn_uuid, self.reportclient):
Sep 26 23:53:06.554950 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/opt/stack/new/nova/nova/scheduler/utils.py", line 998, in remove_allocation_from_compute
Sep 26 23:53:06.555159 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service flavor = instance.flavor
Sep 26 23:53:06.555383 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 67, in getter
Sep 26 23:53:06.555665 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service self.obj_load_attr(name)
Sep 26 23:53:06.555876 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/opt/stack/new/nova/nova/objects/instance.py", line 1139, in obj_load_attr
Sep 26 23:53:06.556086 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service self._load_flavor()
Sep 26 23:53:06.556883 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/opt/stack/new/nova/nova/objects/instance.py", line 961, in _load_flavor
Sep 26 23:53:06.557104 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service expected_attrs=['flavor'])
Sep 26 23:53:06.557321 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 177, in wrapper
Sep 26 23:53:06.557531 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service args, kwargs)
Sep 26 23:53:06.557741 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 241, in object_class_action_versions
Sep 26 23:53:06.557979 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service args=args, kwargs=kwargs)
Sep 26 23:53:06.558191 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 179, in call
Sep 26 23:53:06.558409 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service retry=self.retry)
Sep 26 23:53:06.558638 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 133, in _send
Sep 26 23:53:06.558853 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service retry=retry)
Sep 26 23:53:06.559062 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
Sep 26 23:53:06.559271 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service call_monitor_timeout, retry=retry)
Sep 26 23:53:06.559485 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 575, in _send
Sep 26 23:53:06.559707 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service raise result
Sep 26 23:53:06.559918 ubuntu-xenial-rax-ord-0002347061 nova-compute[31767]: ERROR oslo_service.service InstanceNotFound_Remote: Instance 1d4eaa12-1477-4528-a06b-52e5672c6c61 could not be found.

This happens in a test for evacuate, where we take down nova-compute on one host and evacuate instances from it. Those instances are then deleted, and the compute service is re-enabled and restarted. On restart, nova-compute tries to clean up allocations in placement for the evacuated instances. To do that, it lazy-loads the flavor from the deleted instance, which fails because the context is not using read_deleted='yes'.

This is similar to bug 1745977, which was fixed with change:

https://review.openstack.org/#/q/Ide6cc5bb1fce2c9aea9fa3efdf940e8308cd9ed0

But that change only handled lazy-loading of generic attributes (system_metadata in that case), not flavor.
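
To make the failure mode concrete, here is a minimal sketch of it, not nova's actual code. The Context, Instance, ROWS and load_flavor names are hypothetical stand-ins, and only the 'yes' and 'no' values of read_deleted are modeled. It assumes a soft-deleted row stays in the table but is hidden from lookups unless the context asks for deleted rows.

```python
class InstanceNotFound(Exception):
    """Stand-in for nova's InstanceNotFound exception."""


class Context:
    """Hypothetical request context carrying only the read_deleted flag."""

    def __init__(self, read_deleted="no"):
        self.read_deleted = read_deleted


# Toy table: the row for the deleted instance still exists but is flagged.
ROWS = {
    "1d4eaa12": {"deleted": True, "flavor": "m1.tiny"},
}


def instance_get_by_uuid(context, uuid):
    """Return the row, hiding soft-deleted rows unless read_deleted is 'yes'."""
    row = ROWS.get(uuid)
    if row is None or (row["deleted"] and context.read_deleted != "yes"):
        raise InstanceNotFound("Instance %s could not be found." % uuid)
    return row


class Instance:
    """Toy instance whose flavor is fetched from ROWS on first access."""

    def __init__(self, context, uuid):
        self._context = context
        self.uuid = uuid
        self._flavor = None

    @property
    def flavor(self):
        # Lazy-load: the lookup only happens when the attribute is first read.
        if self._flavor is None:
            row = instance_get_by_uuid(self._context, self.uuid)
            self._flavor = row["flavor"]
        return self._flavor


def load_flavor(read_deleted):
    """Return the flavor name, or None if the lookup raises InstanceNotFound."""
    try:
        return Instance(Context(read_deleted), "1d4eaa12").flavor
    except InstanceNotFound:
        return None
```

In this toy model, load_flavor("no") hits the InstanceNotFound path, which corresponds to the error in the traceback, while load_flavor("yes") finds the soft-deleted row.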

Matt Riedemann (mriedem) wrote :

This goes back to Pike because that's when the resource provider allocation cleanup code was added to the _destroy_evacuated_instances method.

Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/606106

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/606122

Changed in nova:
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/606106
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d252f81573cdfe7a0966f134608bb85d17311e33
Submitter: Zuul
Branch: master

commit d252f81573cdfe7a0966f134608bb85d17311e33
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 10:58:48 2018 -0400

    Add functional regression test for bug 1794996

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests from the hypervisor
    and allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change adds a functional regression test to recreate
    the bug. A subsequent change with the fix will update
    the test to show the bug is fixed.

    Note that assertFlavorMatchesAllocation and
    _boot_and_check_allocations are redefined in the test
    class because If6aa37d9b6b48791e070799ab026c816fda4441c
    refactored those methods which will cause problems with
    backports of this test. The redefined methods will be
    removed in a follow up cleanup patch.

    Change-Id: I19b0d8baea5440f5d5bc49a6956d9a97bf031a05
    Related-Bug: #1794996

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/606122
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=05cd8d128211adbbfb3cf5d626034ccd0f75a452
Submitter: Zuul
Branch: master

commit 05cd8d128211adbbfb3cf5d626034ccd0f75a452
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 11:18:14 2018 -0400

    Fix InstanceNotFound during _destroy_evacuated_instances

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests on the hypervisor and
    allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change does two things in the _destroy_evacuated_instances
    method:

    1. Loads the evacuated instances with a read_deleted='yes'
       context when calling _get_instances_on_driver(). This
       should be fine since _get_instances_on_driver() is already
       returning deleted instances anyway (InstanceList.get_by_filters
       defaults to read deleted instances unless the filters tell
       it otherwise - which we don't in this case). This is needed
       so that things like driver.destroy() don't raise
       InstanceNotFound while lazy-loading fields on the instance.

    2. Skips the call to remove_allocation_from_compute() if the
       evacuated instance is already deleted. If the instance is
       already deleted, its allocations should have been cleaned
       up by its hosting compute service (or the API).

    The functional regression test is updated to show the bug is
    now fixed.

    Change-Id: I1f4b3540dd453650f94333b36d7504ba164192f7
    Closes-Bug: #1794996
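
The second point of the commit message above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the change: FakeInstance, destroy_guest and remove_allocation are hypothetical names standing in for the instance object, driver.destroy() and remove_allocation_from_compute(), and the instances are assumed to have already been loaded with a read_deleted='yes' context.

```python
from dataclasses import dataclass


@dataclass
class FakeInstance:
    """Hypothetical stand-in for an evacuated instance record."""
    uuid: str
    deleted: bool


def destroy_evacuated_instances(instances, destroy_guest, remove_allocation):
    """Clean up evacuated instances on startup.

    Returns the uuids of the instances whose allocations were removed.
    """
    cleaned = []
    for instance in instances:
        # The guest is removed from the hypervisor whether or not the
        # instance record has been deleted in the meantime.
        destroy_guest(instance)
        if instance.deleted:
            # Assumed already cleaned up by the hosting compute service
            # (or the API), so skip the allocation removal entirely.
            continue
        remove_allocation(instance)
        cleaned.append(instance.uuid)
    return cleaned


# Small demo: one live instance and one that was deleted after evacuation.
destroyed, deallocated = [], []
cleaned = destroy_evacuated_instances(
    [FakeInstance("live", deleted=False), FakeInstance("gone", deleted=True)],
    destroy_guest=destroyed.append,
    remove_allocation=deallocated.append,
)
```

In the demo both guests are destroyed, but only the live instance has its allocation removed, so the flavor of the deleted instance never needs to be lazy-loaded on that path.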

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/575190
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=604819b29c0bd43969747d32f6e3d818b3cbece7
Submitter: Zuul
Branch: master

commit 604819b29c0bd43969747d32f6e3d818b3cbece7
Author: Dan Smith <email address hidden>
Date: Wed Jun 13 11:14:37 2018 -0700

    Always read-deleted=yes on lazy-load

    For some reason we were only reading deleted instances when loading generic
    fields and not things like flavor. That weird behavior isn't very helpful,
    so this makes us always read deleted for that case. Some of the fields, like
    tags, will short-circuit that and just immediately lazy-load an empty set.
    But for anything else, we should allow reading that data if it's still there.

    With this change, we are able to remove a specific read_deleted='yes' usage
    from ComputeManager._destroy_evacuated_instances() which is handled with
    the generic solution. TestEvacuateDeleteServerRestartOriginalCompute asserts
    that the evacuate scenario is still fixed.

    Related-Bug: #1794996
    Related-Bug: #1745977

    Change-Id: I8ec3a3a697e55941ee447d0b52d29785717e4bf0
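
The generic approach described in the commit message above might look roughly like this. It is a sketch only: temporary_read_deleted, db_get and this toy Instance class are hypothetical and do not reproduce nova's implementation. The idea shown is that every lazy-load temporarily reads deleted rows, while a field such as tags short-circuits without touching the database.

```python
import contextlib


class InstanceNotFound(Exception):
    """Stand-in for nova's InstanceNotFound exception."""


class Context:
    """Hypothetical request context carrying only the read_deleted flag."""

    def __init__(self, read_deleted="no"):
        self.read_deleted = read_deleted


@contextlib.contextmanager
def temporary_read_deleted(context):
    """Set read_deleted='yes' for the duration of the block, then restore."""
    old = context.read_deleted
    context.read_deleted = "yes"
    try:
        yield context
    finally:
        context.read_deleted = old


# Toy table holding one soft-deleted instance row.
ROWS = {"abc": {"deleted": True, "flavor": "m1.small",
                "system_metadata": {"k": "v"}}}


def db_get(context, uuid):
    """Return the row, hiding soft-deleted rows unless read_deleted is 'yes'."""
    row = ROWS.get(uuid)
    if row is None or (row["deleted"] and context.read_deleted != "yes"):
        raise InstanceNotFound("Instance %s could not be found." % uuid)
    return row


class Instance:
    """Toy instance that lazy-loads attributes from ROWS."""

    def __init__(self, context, uuid):
        self._context = context
        self.uuid = uuid

    def load_attr(self, name):
        if name == "tags":
            # Short-circuit: return an empty collection without a DB call.
            return []
        # Every other field is loaded with deleted rows visible.
        with temporary_read_deleted(self._context) as ctxt:
            return db_get(ctxt, self.uuid)[name]


ctxt = Context()  # read_deleted defaults to 'no'
inst = Instance(ctxt, "abc")
flavor = inst.load_attr("flavor")
```

Restoring the flag in the finally clause keeps the caller's context unchanged after the lazy-load, so the wider request still hides deleted rows by default.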

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/rocky)

Related fix proposed to branch: stable/rocky
Review: https://review.openstack.org/623348

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.openstack.org/623349

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/queens)

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/623354

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.openstack.org/623355

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/623358

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/623359

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/623348
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=83d74dbbb6b45bfeada0c0b9ac13385b126709bb
Submitter: Zuul
Branch: stable/rocky

commit 83d74dbbb6b45bfeada0c0b9ac13385b126709bb
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 10:58:48 2018 -0400

    Add functional regression test for bug 1794996

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests from the hypervisor
    and allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change adds a functional regression test to recreate
    the bug. A subsequent change with the fix will update
    the test to show the bug is fixed.

    Note that assertFlavorMatchesAllocation and
    _boot_and_check_allocations are redefined in the test
    class because If6aa37d9b6b48791e070799ab026c816fda4441c
    refactored those methods which will cause problems with
    backports of this test. The redefined methods will be
    removed in a follow up cleanup patch.

    Change-Id: I19b0d8baea5440f5d5bc49a6956d9a97bf031a05
    Related-Bug: #1794996
    (cherry picked from commit d252f81573cdfe7a0966f134608bb85d17311e33)

tags: added: in-stable-rocky
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/rocky)

Reviewed: https://review.openstack.org/623349
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0208d64397731afa829bc08cd7b3b6494f0f05d5
Submitter: Zuul
Branch: stable/rocky

commit 0208d64397731afa829bc08cd7b3b6494f0f05d5
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 11:18:14 2018 -0400

    Fix InstanceNotFound during _destroy_evacuated_instances

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests on the hypervisor and
    allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change does two things in the _destroy_evacuated_instances
    method:

    1. Loads the evacuated instances with a read_deleted='yes'
       context when calling _get_instances_on_driver(). This
       should be fine since _get_instances_on_driver() is already
       returning deleted instances anyway (InstanceList.get_by_filters
       defaults to read deleted instances unless the filters tell
       it otherwise - which we don't in this case). This is needed
       so that things like driver.destroy() don't raise
       InstanceNotFound while lazy-loading fields on the instance.

    2. Skips the call to remove_allocation_from_compute() if the
       evacuated instance is already deleted. If the instance is
       already deleted, its allocations should have been cleaned
       up by its hosting compute service (or the API).

    The functional regression test is updated to show the bug is
    now fixed.

    Conflicts:
          nova/compute/manager.py
          nova/tests/unit/compute/test_compute.py

    NOTE(mriedem): The conflicts are due to not having change
    I2af45a9540e7ccd60ace80d9fcadc79972da7df7 in Rocky.

    Change-Id: I1f4b3540dd453650f94333b36d7504ba164192f7
    Closes-Bug: #1794996
    (cherry picked from commit 05cd8d128211adbbfb3cf5d626034ccd0f75a452)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 18.1.0

This issue was fixed in the openstack/nova 18.1.0 release.

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/623354
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c3fd5e5061b837a78a95705074239c3d2e41e644
Submitter: Zuul
Branch: stable/queens

commit c3fd5e5061b837a78a95705074239c3d2e41e644
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 10:58:48 2018 -0400

    Add functional regression test for bug 1794996

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests from the hypervisor
    and allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change adds a functional regression test to recreate
    the bug. A subsequent change with the fix will update
    the test to show the bug is fixed.

    Note that assertFlavorMatchesAllocation and
    _boot_and_check_allocations are redefined in the test
    class because If6aa37d9b6b48791e070799ab026c816fda4441c
    refactored those methods which will cause problems with
    backports of this test. The redefined methods will be
    removed in a follow up cleanup patch.

    NOTE(mriedem): The ProviderUsageBaseTestCase import
    had to change since Iea283322124cb35fc0bc6d25f35548621e8c8c2f
    is not in Queens.

    Change-Id: I19b0d8baea5440f5d5bc49a6956d9a97bf031a05
    Related-Bug: #1794996
    (cherry picked from commit d252f81573cdfe7a0966f134608bb85d17311e33)
    (cherry picked from commit 83d74dbbb6b45bfeada0c0b9ac13385b126709bb)

tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/queens)

Reviewed: https://review.openstack.org/623355
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6c7e53e21059f80325d728cf7dee2766da7a9471
Submitter: Zuul
Branch: stable/queens

commit 6c7e53e21059f80325d728cf7dee2766da7a9471
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 11:18:14 2018 -0400

    Fix InstanceNotFound during _destroy_evacuated_instances

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests on the hypervisor and
    allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change does two things in the _destroy_evacuated_instances
    method:

    1. Loads the evacuated instances with a read_deleted='yes'
       context when calling _get_instances_on_driver(). This
       should be fine since _get_instances_on_driver() is already
       returning deleted instances anyway (InstanceList.get_by_filters
       defaults to read deleted instances unless the filters tell
       it otherwise - which we don't in this case). This is needed
       so that things like driver.destroy() don't raise
       InstanceNotFound while lazy-loading fields on the instance.

    2. Skips the call to remove_allocation_from_compute() if the
       evacuated instance is already deleted. If the instance is
       already deleted, its allocations should have been cleaned
       up by its hosting compute service (or the API).

    The functional regression test is updated to show the bug is
    now fixed.

    Change-Id: I1f4b3540dd453650f94333b36d7504ba164192f7
    Closes-Bug: #1794996
    (cherry picked from commit 05cd8d128211adbbfb3cf5d626034ccd0f75a452)
    (cherry picked from commit 0208d64397731afa829bc08cd7b3b6494f0f05d5)

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/623358
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=91455a522738572def003dfe5e4f0bec08255074
Submitter: Zuul
Branch: stable/pike

commit 91455a522738572def003dfe5e4f0bec08255074
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 10:58:48 2018 -0400

    Add functional regression test for bug 1794996

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests from the hypervisor
    and allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change adds a functional regression test to recreate
    the bug. A subsequent change with the fix will update
    the test to show the bug is fixed.

    Note that assertFlavorMatchesAllocation and
    _boot_and_check_allocations are redefined in the test
    class because If6aa37d9b6b48791e070799ab026c816fda4441c
    refactored those methods which will cause problems with
    backports of this test. The redefined methods will be
    removed in a follow up cleanup patch.

    NOTE(mriedem): The restart_compute_service() method
    needed to be added to the functional test class because
    change I17f67a02b27a90658df48856963ea3fb327e81dc is not
    in Pike.

    Change-Id: I19b0d8baea5440f5d5bc49a6956d9a97bf031a05
    Related-Bug: #1794996
    (cherry picked from commit d252f81573cdfe7a0966f134608bb85d17311e33)
    (cherry picked from commit 83d74dbbb6b45bfeada0c0b9ac13385b126709bb)
    (cherry picked from commit c3fd5e5061b837a78a95705074239c3d2e41e644)

tags: added: in-stable-pike
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/623359
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fb61e864b98dde4955ab29f7a39c9935fbd5ffd0
Submitter: Zuul
Branch: stable/pike

commit fb61e864b98dde4955ab29f7a39c9935fbd5ffd0
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 28 11:18:14 2018 -0400

    Fix InstanceNotFound during _destroy_evacuated_instances

    The _destroy_evacuated_instances method on compute
    startup tries to cleanup guests on the hypervisor and
    allocations held against that compute node resource
    provider by evacuated instances, but doesn't take into
    account that those evacuated instances could have been
    deleted in the meantime which leads to a lazy-load
    InstanceNotFound error that kills the startup of the
    compute service.

    This change does two things in the _destroy_evacuated_instances
    method:

    1. Loads the evacuated instances with a read_deleted='yes'
       context when calling _get_instances_on_driver(). This
       should be fine since _get_instances_on_driver() is already
       returning deleted instances anyway (InstanceList.get_by_filters
       defaults to read deleted instances unless the filters tell
       it otherwise - which we don't in this case). This is needed
       so that things like driver.destroy() don't raise
       InstanceNotFound while lazy-loading fields on the instance.

    2. Skips the call to remove_allocation_from_compute() if the
       evacuated instance is already deleted. If the instance is
       already deleted, its allocations should have been cleaned
       up by its hosting compute service (or the API).

    The functional regression test is updated to show the bug is
    now fixed.

    Conflicts:
          nova/compute/manager.py

    NOTE(mriedem): The conflict is due to not having change
    I1073faca6760bff3da0aaf3e8357bd8e64854be3 in Pike.

    Change-Id: I1f4b3540dd453650f94333b36d7504ba164192f7
    Closes-Bug: #1794996
    (cherry picked from commit 05cd8d128211adbbfb3cf5d626034ccd0f75a452)
    (cherry picked from commit 0208d64397731afa829bc08cd7b3b6494f0f05d5)
    (cherry picked from commit 6c7e53e21059f80325d728cf7dee2766da7a9471)

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/ocata)

Related fix proposed to branch: stable/ocata
Review: https://review.openstack.org/639357

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.0.0.0rc1

This issue was fixed in the openstack/nova 19.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.10

This issue was fixed in the openstack/nova 17.0.10 release.

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.1.8

This issue was fixed in the openstack/nova 16.1.8 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ocata)

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/ocata
Review: https://review.opendev.org/639357
