Evacuate cleanup fails at _delete_allocation_for_moved_instance

Bug #1721652 reported by Charles Volzka
This bug affects 2 people
Affects                   Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)  Fix Released   High        Balazs Gibizer
Pike                      Fix Committed  High        Matt Riedemann

Bug Description

Description
===========
After an evacuation, when nova-compute is restarted on the source host, the clean up of the old instance on the source host fails. The traceback in nova-compute.log ends with:
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 679, in _destroy_evacuated_instances
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service instance, migration.source_node)
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1216, in delete_allocation_for_evacuated_instance
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service instance, node, 'evacuated', node_type)
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 1227, in _delete_allocation_for_moved_instance
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service cn_uuid = self.compute_nodes[node].uuid
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service KeyError: u'<SOURCE_HOST_NAME>'
2017-10-04 05:32:18.725 5575 ERROR oslo_service.service

Steps to reproduce
==================
1. Deploy an instance on Host A.
2. Shut down Host A.
3. Evacuate the instance to Host B.
4. Power Host A back on.
5. Wait for the cleanup of the old instance allocation to occur.

Expected result
===============
Clean up of old instance from Host A is successful

Actual result
=============
The old instance cleanup appears to work, but there is a traceback in the log and the allocation is not cleaned up.

Environment
===========
(pike)nova-compute/now 10:16.0.0-201710030907

Additional Info:
================
Problem seems to come from this change: https://github.com/openstack/nova/commit/0de806684f5d670dd5f961f7adf212961da3ed87 at:
rt = self._get_resource_tracker()
rt.delete_allocation_for_evacuated_instance
This is called very early in the init_host flow to clean up the allocations. The problem is that at this point in startup the resource tracker's self.compute_nodes dict is still empty, so delete_allocation_for_evacuated_instance blows up with a KeyError at:
cn_uuid = self.compute_nodes[node].uuid
The resource tracker's self.compute_nodes dict is actually populated later in the startup process via update_available_resource() -> _update_available_resource() -> _init_compute_node(). It is not populated when the tracker is first created, which appears to be the assumption made by the referenced commit.
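The ordering problem described above can be sketched with a toy model. This is only an illustration under stated assumptions: ToyResourceTracker and init_host_then_update are hypothetical names, and the class keeps just the one dict lookup seen in the traceback; it is not nova's actual resource tracker.

```python
class ToyResourceTracker:
    """Hypothetical stand-in for the resource tracker; not nova's real class."""

    def __init__(self):
        # Empty until the first update_available_resource() call.
        self.compute_nodes = {}

    def update_available_resource(self, nodename):
        # Stand-in for the later startup step that registers the node.
        self.compute_nodes[nodename] = {"uuid": "uuid-for-" + nodename}

    def delete_allocation_for_evacuated_instance(self, instance, nodename):
        # Same shape as the failing lookup: KeyError if the node is unknown.
        cn_uuid = self.compute_nodes[nodename]["uuid"]
        return (instance, cn_uuid)


def init_host_then_update(nodename):
    """Run the cleanup before and after the tracker has been populated."""
    rt = ToyResourceTracker()
    try:
        rt.delete_allocation_for_evacuated_instance("inst-1", nodename)
        early = "ok"
    except KeyError:
        # This is the situation reported in the traceback.
        early = "KeyError"
    rt.update_available_resource(nodename)
    late = rt.delete_allocation_for_evacuated_instance("inst-1", nodename)
    return early, late
```

In this sketch the early call fails and the late call succeeds, which matches the claim that the cleanup simply runs too soon in the startup sequence.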

Chris Dent (cdent)
tags: added: placement resource-tracker
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

This seems like a valid bug. Unfortunately the provided functional test did not catch the reported problem.

Changed in nova:
status: New → Confirmed
Revision history for this message
Matt Riedemann (mriedem) wrote :

We'll have to backport whatever the fix is to stable/pike:

https://review.openstack.org/#/q/I0df401a7c91f012fdb25cb0e6b344ca51de8c309

Changed in nova:
importance: Undecided → High
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

So the functional test did not catch this because there we just stop and start the compute service without destroying it, so some in-memory state survives. We need to enhance the functional test to reproduce the problem.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/510176

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/510938

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/510176
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a3c556963eaf408a29b27a1f6fc44bc424e48efc
Submitter: Jenkins
Branch: master

commit a3c556963eaf408a29b27a1f6fc44bc424e48efc
Author: Balazs Gibizer <email address hidden>
Date: Fri Oct 6 18:25:17 2017 +0200

    Reproduce bug 1721652 in the functional test env

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The solution for the original bug included a functional regression
    test. However, the functional test env is not capable of fully
    simulating a nova-compute service restart. The service is only stopped
    and then started again. This allows some in-memory state of the compute
    service to survive the simulated restart. As a result, the functional
    test was not able to catch that the assumption described above is not
    correct.

    Related-Bug: #1721652
    Related-Bug: #1709902
    Change-Id: Icaf1bae8cb040b939f916a19ce026031ddb84af7
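
Why a stop/start cycle hides the bug while a real restart exposes it can be illustrated with a small sketch. ToyComputeService and knows_node_before_start are hypothetical names invented for this example; the nova functional test fixtures are not shown here and may differ.

```python
class ToyComputeService:
    """Hypothetical service whose node knowledge lives only in memory."""

    def __init__(self):
        self.compute_nodes = {}
        self.running = False

    def start(self):
        self.running = True
        # State learned while running stays on the object.
        self.compute_nodes.setdefault("host-a", "uuid-a")

    def stop(self):
        # Stopping does not clear the in-memory state.
        self.running = False


def knows_node_before_start(recreate):
    """Report whether the node is known just before the second start."""
    service = ToyComputeService()
    service.start()
    service.stop()
    if recreate:
        # A real process restart begins with a fresh object.
        service = ToyComputeService()
    return "host-a" in service.compute_nodes
```

With recreate=False the node is still known after the stop, so an early cleanup would succeed; with recreate=True the dict is empty again, which is the condition a real restart produces.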

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/511759

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/512716

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/510938
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9252ffdacf262008bc41409d4fb574ec472dc913
Submitter: Zuul
Branch: master

commit 9252ffdacf262008bc41409d4fb574ec472dc913
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 12 16:07:28 2017 +0200

    fix cleaning up evacuated instances

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The resource tracker's compute_nodes dict is only initialized during
    the first update_available_resource call that happens in the
    pre_start_hook, while _destroy_evacuated_instances is called from
    init_host, which is called before the pre_start_hook.
    The _destroy_evacuated_instances call uses the
    _delete_allocation_for_moved_instance that relies on the resource
    tracker's compute_nodes dict.

    This patch inlines _delete_allocation_for_moved_instance in
    _destroy_evacuated_instances and queries the db for the compute node
    uuid. As ironic uses 1:M host:node setup we cannot ask the db only once
    about the node uuid as different instances might be on different nodes.

    Change-Id: I35749374ff09b0e98064c75ff9c33dad577579c6
    Closes-Bug: #1721652
    Related-Bug: #1709902
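
The approach described in this commit message, looking up the compute node uuid per instance instead of relying on the tracker's in-memory dict, might look roughly like the sketch below. COMPUTE_NODE_DB, get_node_uuid, and cleanup_evacuated are hypothetical names, and the dict merely stands in for whatever database query nova performs.

```python
# Toy "database" of compute nodes keyed by (host, nodename); hypothetical data.
COMPUTE_NODE_DB = {
    ("ironic-host", "node-1"): "uuid-1",
    ("ironic-host", "node-2"): "uuid-2",
}


def get_node_uuid(host, nodename):
    """Hypothetical stand-in for a per-node database query."""
    return COMPUTE_NODE_DB[(host, nodename)]


def cleanup_evacuated(host, evacuations):
    """Return (instance, node_uuid) pairs whose allocations would be removed.

    `evacuations` maps an instance id to the name of its source node.
    The uuid is looked up once per instance, because with a 1:M host:node
    setup different instances can sit on different nodes of the same host.
    """
    removed = []
    for instance, source_node in sorted(evacuations.items()):
        removed.append((instance, get_node_uuid(host, source_node)))
    return removed
```

The lookup does not depend on any state built up during startup, so in this sketch it works no matter how early in the service initialization it is called.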

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/511759
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1657b8721d82056cb6e9b2ec7ae7690ca215cb14
Submitter: Zuul
Branch: stable/pike

commit 1657b8721d82056cb6e9b2ec7ae7690ca215cb14
Author: Balazs Gibizer <email address hidden>
Date: Fri Oct 6 18:25:17 2017 +0200

    Reproduce bug 1721652 in the functional test env

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The solution for the original bug included a functional regression
    test. However, the functional test env is not capable of fully
    simulating a nova-compute service restart. The service is only stopped
    and then started again. This allows some in-memory state of the compute
    service to survive the simulated restart. As a result, the functional
    test was not able to catch that the assumption described above is not
    correct.

    Related-Bug: #1721652
    Related-Bug: #1709902
    Change-Id: Icaf1bae8cb040b939f916a19ce026031ddb84af7
    (cherry picked from commit a3c556963eaf408a29b27a1f6fc44bc424e48efc)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/512716
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fb968e18b2472b0dd7231ff3244b683d59f04cd0
Submitter: Zuul
Branch: stable/pike

commit fb968e18b2472b0dd7231ff3244b683d59f04cd0
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 12 16:07:28 2017 +0200

    fix cleaning up evacuated instances

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The resource tracker's compute_nodes dict is only initialized during
    the first update_available_resource call that happens in the
    pre_start_hook, while _destroy_evacuated_instances is called from
    init_host, which is called before the pre_start_hook.
    The _destroy_evacuated_instances call uses the
    _delete_allocation_for_moved_instance that relies on the resource
    tracker's compute_nodes dict.

    This patch inlines _delete_allocation_for_moved_instance in
    _destroy_evacuated_instances and queries the db for the compute node
    uuid. As ironic uses 1:M host:node setup we cannot ask the db only once
    about the node uuid as different instances might be on different nodes.

    NOTE(mriedem): A couple of changes had to be made to the compute
    manager code since I0883c2ba1989c5d5a46e23bcbcda53598707bcbc is
    not in stable/pike.

    Change-Id: I35749374ff09b0e98064c75ff9c33dad577579c6
    Closes-Bug: #1721652
    Related-Bug: #1709902
    (cherry picked from commit 9252ffdacf262008bc41409d4fb574ec472dc913)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b1

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.2

This issue was fixed in the openstack/nova 16.0.2 release.
