source host allocation not cleaned up in placement after evacuation

Bug #1709902 reported by Balazs Gibizer
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Balazs Gibizer

Bug Description

1) boot a server
2) kill the compute (optionally force-down it)
3) evacuate the server
4) start up the original compute
5) check the allocations in placement

We expect the allocation on the original compute to be removed when that compute starts up (init_host) after the evacuation, but it is not.
The compute host periodic resource healing also skips this case here https://review.openstack.org/#/c/491850/4/nova/compute/resource_tracker.py@1084

Here is a patch to reproduce the problem in the functional test env: https://review.openstack.org/#/c/492548/
Here is the debug log for that run: https://pastebin.com/hzb33Awu
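
For step 5, a minimal sketch of how the leftover allocation could be spotted. It assumes the body of a placement `GET /allocations/{consumer_uuid}` response is a dict with an `allocations` key mapping resource provider UUIDs to their resources. The function name `stale_allocations` and the UUID strings are hypothetical, used only for illustration, and the sketch ignores setups where one instance legitimately holds allocations on several providers.

```python
def stale_allocations(allocations_body, current_rp_uuid):
    """Return allocations held against any provider other than the current host.

    allocations_body: parsed JSON body of GET /allocations/{consumer_uuid}
    current_rp_uuid:  resource provider UUID of the host the server runs on now
    """
    allocations = allocations_body.get("allocations", {})
    return {
        rp_uuid: alloc
        for rp_uuid, alloc in allocations.items()
        if rp_uuid != current_rp_uuid
    }


# Hypothetical response after evacuation: the server still holds
# resources on both the source and the destination provider.
body = {
    "allocations": {
        "source-rp-uuid": {"resources": {"VCPU": 1, "MEMORY_MB": 512}},
        "dest-rp-uuid": {"resources": {"VCPU": 1, "MEMORY_MB": 512}},
    }
}

stale = stale_allocations(body, "dest-rp-uuid")
stale_providers = sorted(stale)
```

With the bug present, `stale_providers` still names the source provider after the original compute has been restarted; once the cleanup works it should be empty.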

Revision history for this message
Matt Riedemann (mriedem) wrote :

As mentioned in https://review.openstack.org/#/c/492548/1/nova/tests/functional/test_servers.py - I think we likely need to clean up the stale allocations for the source compute host when the init_host routine in the compute manager is cleaning up locally deleted instances that were evacuated to another host.

Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: evacu
tags: added: evacuate placement
removed: evacu
tags: added: openstack-version.pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/493037

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/492548
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=96938426e6351fd0ed0e485c04fccb11e37e91bc
Submitter: Jenkins
Branch: master

commit 96938426e6351fd0ed0e485c04fccb11e37e91bc
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 10 16:08:23 2017 +0200

    test server evacuation with placement

    This patch tests evacuation between two Pike compute hosts and
    checks the resource state in the placement API.

    Related-Bug: #1709902

    Change-Id: Idedb8a911a2ab0c096f4c9f61c5db362b08758ba

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/494623

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/pike)

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/494625

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/493037
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9b9c2c52f323c536fb90bb29a40efbed1b129594
Submitter: Jenkins
Branch: master

commit 9b9c2c52f323c536fb90bb29a40efbed1b129594
Author: Balazs Gibizer <email address hidden>
Date: Fri Aug 11 11:40:35 2017 +0200

    delete allocation of evacuated instance

    After evacuation the instance has allocations on both the source and
    the destination computes. This is OK as the source compute is down.
    However after the source compute is brought up the allocation from
    the source host needs to be cleaned up.

    Closes-Bug: #1709902
    Change-Id: I0df401a7c91f012fdb25cb0e6b344ca51de8c309

Changed in nova:
status: In Progress → Fix Released
Dan Smith (danms)
tags: added: pike-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/494623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fef0a58d4b868637b0139e5c7000c0702cb71ff0
Submitter: Jenkins
Branch: stable/pike

commit fef0a58d4b868637b0139e5c7000c0702cb71ff0
Author: Balazs Gibizer <email address hidden>
Date: Thu Aug 10 16:08:23 2017 +0200

    test server evacuation with placement

    This patch tests evacuation between two Pike compute hosts and
    checks the resource state in the placement API.

    Related-Bug: #1709902

    Change-Id: Idedb8a911a2ab0c096f4c9f61c5db362b08758ba
    (cherry picked from commit 96938426e6351fd0ed0e485c04fccb11e37e91bc)

tags: added: in-stable-pike
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/494625
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=0de806684f5d670dd5f961f7adf212961da3ed87
Submitter: Jenkins
Branch: stable/pike

commit 0de806684f5d670dd5f961f7adf212961da3ed87
Author: Balazs Gibizer <email address hidden>
Date: Fri Aug 11 11:40:35 2017 +0200

    delete allocation of evacuated instance

    After evacuation the instance has allocations on both the source and
    the destination computes. This is OK as the source compute is down.
    However after the source compute is brought up the allocation from
    the source host needs to be cleaned up.

    Closes-Bug: #1709902
    Change-Id: I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    (cherry picked from commit 9b9c2c52f323c536fb90bb29a40efbed1b129594)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.0.0rc2

This issue was fixed in the openstack/nova 16.0.0.0rc2 release candidate.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/510938

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/510176
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a3c556963eaf408a29b27a1f6fc44bc424e48efc
Submitter: Jenkins
Branch: master

commit a3c556963eaf408a29b27a1f6fc44bc424e48efc
Author: Balazs Gibizer <email address hidden>
Date: Fri Oct 6 18:25:17 2017 +0200

    Reproduce bug 1721652 in the functional test env

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The solution for the original bug included a functional regression
    test. However, the functional test env is not capable of fully
    simulating a nova-compute service restart. The service is only stopped
    and then started again, which allows some in-memory state of the
    compute service to survive the simulated restart. As a result, the
    functional test was not able to catch that the assumption described
    above is incorrect.

    Related-Bug: #1721652
    Related-Bug: #1709902
    Change-Id: Icaf1bae8cb040b939f916a19ce026031ddb84af7
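
The gap between a simulated and a real restart described in this commit message can be illustrated with a toy model. This is only a sketch: `ToyComputeService` and its attributes are hypothetical stand-ins, not nova code. It shows that stopping and starting the same object keeps its in-memory cache, while a fresh object (a new process) starts with an empty one.

```python
class ToyComputeService:
    """Hypothetical stand-in for a compute service; not nova code."""

    def __init__(self):
        # In-memory cache; empty whenever a new process starts.
        self.compute_nodes = {}
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        # Note: the in-memory cache is NOT cleared here.
        self.running = False

    def update_available_resource(self):
        # Populates the cache, as a periodic task would.
        self.compute_nodes["node1"] = "node1-uuid"


# Simulated restart: the same object is stopped and started again.
service = ToyComputeService()
service.start()
service.update_available_resource()
service.stop()
service.start()
cache_after_stop_start = dict(service.compute_nodes)

# Real restart: a brand new object, as in a new process.
fresh = ToyComputeService()
fresh.start()
cache_after_real_restart = dict(fresh.compute_nodes)
```

Code that reads the cache during startup works in the first case and fails in the second, which is why a stop/start based test can pass while the real service breaks.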

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/pike)

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/511759

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/pike
Review: https://review.openstack.org/512716

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/510938
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9252ffdacf262008bc41409d4fb574ec472dc913
Submitter: Zuul
Branch: master

commit 9252ffdacf262008bc41409d4fb574ec472dc913
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 12 16:07:28 2017 +0200

    fix cleaning up evacuated instances

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The resource tracker's compute_nodes dict is only initialized during
    the first update_available_resource call, which happens in
    pre_start_hook, while _destroy_evacuated_instances is called from
    init_host, which runs before pre_start_hook.
    The _destroy_evacuated_instances call uses the
    _delete_allocation_for_moved_instance that relies on the resource
    tracker's compute_nodes dict.

    This patch inlines _delete_allocation_for_moved_instance in
    _destroy_evacuated_instances and queries the db for the compute node
    uuid. As ironic uses a 1:M host:node setup, we cannot ask the db only
    once about the node uuid, as different instances might be on different
    nodes.

    Change-Id: I35749374ff09b0e98064c75ff9c33dad577579c6
    Closes-Bug: #1721652
    Related-Bug: #1709902
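
The ordering problem and the per-instance lookup described in this commit message can be sketched with a toy model. Everything here is hypothetical and not nova code: `ToyResourceTracker`, `FAKE_DB`, and both cleanup functions are illustrative names, and the `KeyError` is just what a plain dict raises, not necessarily the exception nova hit.

```python
class ToyResourceTracker:
    """Hypothetical tracker whose node cache is filled only on first update."""

    def __init__(self):
        self.compute_nodes = {}

    def update_available_resource(self, nodes):
        self.compute_nodes.update(nodes)


# Hypothetical "db": node name -> compute node uuid (1:M host:node, as in ironic).
FAKE_DB = {"node-a": "uuid-a", "node-b": "uuid-b"}


def cleanup_with_tracker(tracker, evacuated):
    """Original approach: relies on the tracker's in-memory cache."""
    return [(inst, tracker.compute_nodes[node]) for inst, node in evacuated]


def cleanup_with_db(db, evacuated):
    """Fixed approach: look up the node uuid per instance."""
    return [(inst, db[node]) for inst, node in evacuated]


evacuated = [("inst-1", "node-a"), ("inst-2", "node-b")]
tracker = ToyResourceTracker()

# init_host phase: the tracker cache is still empty, so the lookup fails.
try:
    cleanup_with_tracker(tracker, evacuated)
    tracker_lookup_failed = False
except KeyError:
    tracker_lookup_failed = True

# The db-based lookup works at this point and resolves each instance's node.
removed = cleanup_with_db(FAKE_DB, evacuated)

# pre_start_hook phase: only now does the tracker learn about the nodes.
tracker.update_available_resource(FAKE_DB)
```

Looking up the node per instance, rather than once per host, matters because the two evacuated instances in the example sit on different nodes of the same host.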

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/pike)

Reviewed: https://review.openstack.org/511759
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1657b8721d82056cb6e9b2ec7ae7690ca215cb14
Submitter: Zuul
Branch: stable/pike

commit 1657b8721d82056cb6e9b2ec7ae7690ca215cb14
Author: Balazs Gibizer <email address hidden>
Date: Fri Oct 6 18:25:17 2017 +0200

    Reproduce bug 1721652 in the functional test env

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The solution for the original bug included a functional regression
    test. However, the functional test env is not capable of fully
    simulating a nova-compute service restart. The service is only stopped
    and then started again, which allows some in-memory state of the
    compute service to survive the simulated restart. As a result, the
    functional test was not able to catch that the assumption described
    above is incorrect.

    Related-Bug: #1721652
    Related-Bug: #1709902
    Change-Id: Icaf1bae8cb040b939f916a19ce026031ddb84af7
    (cherry picked from commit a3c556963eaf408a29b27a1f6fc44bc424e48efc)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/512716
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fb968e18b2472b0dd7231ff3244b683d59f04cd0
Submitter: Zuul
Branch: stable/pike

commit fb968e18b2472b0dd7231ff3244b683d59f04cd0
Author: Balazs Gibizer <email address hidden>
Date: Thu Oct 12 16:07:28 2017 +0200

    fix cleaning up evacuated instances

    When bug 1709902 was fixed in I0df401a7c91f012fdb25cb0e6b344ca51de8c309
    the fix assumed that when the _destroy_evacuated_instances() is called
    during the init of the nova-compute service the resource tracker
    already knows the compute node ids associated to the given compute
    host. This is not true and therefore _destroy_evacuated_instances fails
    with an exception and does not clean up the evacuated instance.

    The resource tracker's compute_nodes dict is only initialized during
    the first update_available_resource call, which happens in
    pre_start_hook, while _destroy_evacuated_instances is called from
    init_host, which runs before pre_start_hook.
    The _destroy_evacuated_instances call uses the
    _delete_allocation_for_moved_instance that relies on the resource
    tracker's compute_nodes dict.

    This patch inlines _delete_allocation_for_moved_instance in
    _destroy_evacuated_instances and queries the db for the compute node
    uuid. As ironic uses a 1:M host:node setup, we cannot ask the db only
    once about the node uuid, as different instances might be on different
    nodes.

    NOTE(mriedem): A couple of changes had to be made to the compute
    manager code since I0883c2ba1989c5d5a46e23bcbcda53598707bcbc is
    not in stable/pike.

    Change-Id: I35749374ff09b0e98064c75ff9c33dad577579c6
    Closes-Bug: #1721652
    Related-Bug: #1709902
    (cherry picked from commit 9252ffdacf262008bc41409d4fb574ec472dc913)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 17.0.0.0b1

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.

Revision history for this message
Jiri Suchomel (jsuchome) wrote :

The placement setup is present in Ocata and Newton. Is there any reason NOT to backport this fix?

Revision history for this message
Matt Riedemann (mriedem) wrote :

Newton is end of life upstream. I'm not sure how cleanly the backport would apply to Ocata, but if the bug exists in Ocata then I don't see why we wouldn't backport it there as well. The only major difference is in Ocata, the ResourceTracker in nova-compute should periodically (every minute by default) heal the allocations for the instances running on that host. That stopped in Pike once all your computes were upgraded because the scheduler (in Pike) creates the allocations, not the computes.
