After failed evacuation the recovered source compute tries to delete the instance

Bug #1713783 reported by Elod Illes on 2017-08-29
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Fuel for OpenStack
High
Oleksiy Molchanov
OpenStack Compute (nova)
High
Elod Illes
Ocata
High
Balazs Gibizer
Pike
High
Matt Riedemann
OpenStack Security Advisory
Undecided
Unassigned

Bug Description

Description
===========
In case of a failed evacuation attempt the status of the migration is 'accepted' instead of 'failed' so when source compute is recovered the compute manager tries to delete the instance from the source host. However a secondary fault prevents deleting the allocation in placement so the actual deletion of the instance fails as well.

Steps to reproduce
==================
The following functional test reproduces the bug: https://review.openstack.org/#/c/498482/
What it does: initiate evacuation when no valid host is available and evacuation fails, but nova manager still tries to delete the instance.
Logs:

    2017-08-29 19:11:15,751 ERROR [oslo_messaging.rpc.server] Exception during message handling
    NoValidHost: No valid host was found. There are not enough hosts available.
    2017-08-29 19:11:16,103 INFO [nova.tests.functional.test_servers] Running periodic for compute1 (host1)
    2017-08-29 19:11:16,115 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/4e8e23ff-0c52-4cf7-8356-d9fa88536316/aggregates" status: 200 len: 18 microversion: 1.1
    2017-08-29 19:11:16,120 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/4e8e23ff-0c52-4cf7-8356-d9fa88536316/inventories" status: 200 len: 401 microversion: 1.0
    2017-08-29 19:11:16,131 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/4e8e23ff-0c52-4cf7-8356-d9fa88536316/allocations" status: 200 len: 152 microversion: 1.0
    2017-08-29 19:11:16,138 INFO [nova.compute.resource_tracker] Final resource view: name=host1 phys_ram=8192MB used_ram=1024MB phys_disk=1028GB used_disk=1GB total_vcpus=10 used_vcpus=1 pci_stats=[]
    2017-08-29 19:11:16,146 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/4e8e23ff-0c52-4cf7-8356-d9fa88536316/aggregates" status: 200 len: 18 microversion: 1.1
    2017-08-29 19:11:16,151 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/4e8e23ff-0c52-4cf7-8356-d9fa88536316/inventories" status: 200 len: 401 microversion: 1.0
    2017-08-29 19:11:16,152 INFO [nova.tests.functional.test_servers] Running periodic for compute2 (host2)
    2017-08-29 19:11:16,163 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/531b1ce8-def1-455d-95b3-4140665d956f/aggregates" status: 200 len: 18 microversion: 1.1
    2017-08-29 19:11:16,168 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/531b1ce8-def1-455d-95b3-4140665d956f/inventories" status: 200 len: 401 microversion: 1.0
    2017-08-29 19:11:16,176 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/531b1ce8-def1-455d-95b3-4140665d956f/allocations" status: 200 len: 54 microversion: 1.0
    2017-08-29 19:11:16,184 INFO [nova.compute.resource_tracker] Final resource view: name=host2 phys_ram=8192MB used_ram=512MB phys_disk=1028GB used_disk=0GB total_vcpus=10 used_vcpus=0 pci_stats=[]
    2017-08-29 19:11:16,192 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/531b1ce8-def1-455d-95b3-4140665d956f/aggregates" status: 200 len: 18 microversion: 1.1
    2017-08-29 19:11:16,197 INFO [nova.api.openstack.placement.requestlog] 127.0.0.1 "GET /placement/resource_providers/531b1ce8-def1-455d-95b3-4140665d956f/inventories" status: 200 len: 401 microversion: 1.0
    2017-08-29 19:11:16,198 INFO [nova.tests.functional.test_servers] Finished with periodics
    2017-08-29 19:11:16,255 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572/servers/5058200c-478e-4449-88c1-906fdd572662" status: 200 len: 1875 microversion: 2.53 time: 0.056198
    2017-08-29 19:11:16,262 INFO [nova.api.openstack.requestlog] 127.0.0.1 "GET /v2.1/6f70656e737461636b20342065766572/os-migrations" status: 200 len: 373 microversion: 2.53 time: 0.004618
    2017-08-29 19:11:16,280 INFO [nova.api.openstack.requestlog] 127.0.0.1 "PUT /v2.1/6f70656e737461636b20342065766572/os-services/c269bc74-4720-4de4-a6e5-889080b892a0" status: 200 len: 245 microversion: 2.53 time: 0.016442
    2017-08-29 19:11:16,281 INFO [nova.service] Starting compute node (version 16.0.0)
    2017-08-29 19:11:16,296 INFO [nova.compute.manager] Deleting instance as it has been evacuated from this host

description: updated
tags: added: evacuate placement
summary: - Failed evacuation also tries to delete the VM
+ After failed evacuation the recovered source compute tries to delete the
+ instance
Balazs Gibizer (balazs-gibizer) wrote :

A regression test was pushed for this bug https://review.openstack.org/#/c/498482/. (It seems that the automatic linking doesn't work in later patch sets)

Matt Riedemann (mriedem) wrote :

Somewhere in the evacuate cleanup flow on the source host once it comes back up, we should be comparing local instances and their instance.host to the current host and if they are the same, we shouldn't consider them evacuated. I don't see that happening anywhere though.

Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/491808/ could be a related recent regression, but I'm not sure yet.

Matt Riedemann (mriedem) wrote :

I'm also not sure why the _destroy_evacuated_instances method in the ComputeManager is including 'accepted' migrations, since that is set in the API when the migration record is created and doesn't mean the migration is done.

Looking at the functional recreate patch https://review.openstack.org/#/c/498482/, one way this could happen is that conductor fails to find a host and the migration record is never set to failed or error. It still seems odd that we don't filter the instance.host != self.host in the compute _destroy_evacuated_instances method.

Matt Riedemann (mriedem) wrote :

It looks like the _destroy_evacuated_instances method in the compute manager has always filtered migrations on the 'accepted' status since originally this code was just meant to cleanup local resources once an evacuation from the source host has started, which is fine. The problem is with removing the source node allocations if the evacuation failed, but if it failed in the conductor, we can fix that here:

https://review.openstack.org/#/c/499237/

If it failed in the destination compute service, the migration status should be set to 'failed' and the migration filter in _destroy_evacuated_instances would filter it out.

Changed in nova:
importance: Undecided → High
Matt Riedemann (mriedem) wrote :

We should backport this as it actually leads to deleting things in the source node...

Changed in nova:
status: New → Triaged
information type: Public → Private Security
Matt Riedemann (mriedem) wrote :

I've marked this as a potential security issue. If conductor fails to find a host for the evacuation, the migration record status is left in 'accepted' state and the compute service on the source node, if brought back up, will delete the guest from the source node because of this method pulling in 'accepted' state migration records:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L624

The guest is destroyed here:

https://github.com/openstack/nova/blob/16.0.0.0rc2/nova/compute/manager.py#L668

And this code goes back to at least Newton.

Matt Riedemann (mriedem) wrote :

We could maybe argue that if the source node is down, the guest is already down, but it's at least not deleted.

Also note that by default, evacuate is an admin-only operation, but is configurable via policy.

Sean Dague (sdague) wrote :

I'm really not clear why this is a security issue.

We're talking about evacuation, which is an admin action.

And the potential issue is the delete of a guest incorrectly.

Regular users can't trigger this (unless the admin gave them pretty broad privs, which have way other security issues). They aren't getting access to sensitive information. The worst case is a loss of information. Which in a cloud environment is not super unexpected.

I tend to agree with Sean, admin action based security issue are usually classed as type C1 (Not a vulnerability, just a bug with (some) security implications) according to VMT's taxonomy.

I've subscribed nova-coresec to double check the impact.

Matt Riedemann (mriedem) wrote :

OK after talking with a few people we can mark this as public again.

information type: Private Security → Public
Jeremy Stanley (fungi) wrote :

Agreeing with Tristan et al, and adding the "security" bug tag to indicate it's a hardening opportunity (C1).

tags: added: security
Changed in ossa:
status: New → Won't Fix
Changed in nova:
assignee: nobody → Illes Elod (elod-illes)
status: Triaged → In Progress
Changed in nova:
assignee: Illes Elod (elod-illes) → Matt Riedemann (mriedem)

Reviewed: https://review.openstack.org/500057
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5687c170ea003aafa2720fb00af9d84836bb41df
Submitter: Jenkins
Branch: master

commit 5687c170ea003aafa2720fb00af9d84836bb41df
Author: Előd Illés <email address hidden>
Date: Fri Sep 1 15:36:20 2017 +0200

    Functional test for regression bug #1713783

    Add functional test for evacuation, when no valid host available.
    Migration should end up in 'error' state.

    Change-Id: I1adc20f2a5261e6906a18b9aee5cd2c8ecf0cf4d
    Related-bug: #1713783

Reviewed: https://review.openstack.org/499237
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a8ebf5f1aac080854704e27146e8c98b053c6224
Submitter: Jenkins
Branch: master

commit a8ebf5f1aac080854704e27146e8c98b053c6224
Author: Előd Illés <email address hidden>
Date: Wed Aug 30 16:54:36 2017 +0200

    Set error state after failed evacuation

    When evacuation fails with NoValidHost, the migration status remains
    'accepted' instead of 'error'. This causes problem in case the compute
    service starts up again and looks for evacuations with status 'accepted',
    as it then removes the local instances for those evacuations even though
    the instance was never actually evacuated to another host.

    Change-Id: I06d78c744fa75ae5f34c5cfa76bc3c9460767b84
    Closes-Bug: #1713783

Changed in nova:
status: In Progress → Fix Released
Matt Riedemann (mriedem) on 2017-09-18
Changed in nova:
assignee: Matt Riedemann (mriedem) → Illes Elod (elod-illes)

Reviewed: https://review.openstack.org/504943
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e87d9f51abea0b1c97628ef02435157da4b334ec
Submitter: Jenkins
Branch: stable/pike

commit e87d9f51abea0b1c97628ef02435157da4b334ec
Author: Előd Illés <email address hidden>
Date: Fri Sep 1 15:36:20 2017 +0200

    Functional test for regression bug #1713783

    Add functional test for evacuation, when no valid host available.
    Migration should end up in 'error' state.

    Change-Id: I1adc20f2a5261e6906a18b9aee5cd2c8ecf0cf4d
    Related-bug: #1713783
    (cherry picked from commit 5687c170ea003aafa2720fb00af9d84836bb41df)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/504979
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a3f286f43d866cd343d26d9bafadecab1c225e4b
Submitter: Jenkins
Branch: stable/pike

commit a3f286f43d866cd343d26d9bafadecab1c225e4b
Author: Előd Illés <email address hidden>
Date: Wed Aug 30 16:54:36 2017 +0200

    Set error state after failed evacuation

    When evacuation fails with NoValidHost, the migration status remains
    'accepted' instead of 'error'. This causes problem in case the compute
    service starts up again and looks for evacuations with status 'accepted',
    as it then removes the local instances for those evacuations even though
    the instance was never actually evacuated to another host.

    Change-Id: I06d78c744fa75ae5f34c5cfa76bc3c9460767b84
    Closes-Bug: #1713783
    (cherry picked from commit a8ebf5f1aac080854704e27146e8c98b053c6224)

Reviewed: https://review.openstack.org/500144
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aec81e7d98c1ca082abd157dab5cc83b5bedfd89
Submitter: Jenkins
Branch: master

commit aec81e7d98c1ca082abd157dab5cc83b5bedfd89
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 1 13:56:38 2017 -0400

    Update docs for _destroy_evacuated_instances

    The docstring on this method no longer reflects reality,
    and hasn't since we started filtering on migration records
    starting back in Liberty. This change updates the docstring
    for what the method actually does now, and also includes a
    comment about why we include 'accepted' but not yet complete
    migrations when looking for things to remove from this source
    node.

    Change-Id: Ia0c67aeffe41e133de9dee5080be9c5ec1739aa6
    Related-Bug: #1713783

This issue was fixed in the openstack/nova 16.0.1 release.

This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.

Reviewed: https://review.openstack.org/498482
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f1215e07a5185a8590104235fcbde4b5a23c3545
Submitter: Zuul
Branch: master

commit f1215e07a5185a8590104235fcbde4b5a23c3545
Author: Előd Illés <email address hidden>
Date: Tue Aug 22 14:14:56 2017 +0200

    Functional test: evacuate with no compute

    Resource allocation test case:
    Initiate evacuation when no valid host to evacuate to and
    check all resource usages and allocations after.

    This testcase uncovered the bug below. Bug is fixed already, so the
    test contains asserts that cover the correct behavior.

    Related-Bug: #1713783
    Change-Id: I4ced19bd9259f0b5a50b89dd5908abe35ca73894

Matt Riedemann (mriedem) on 2018-04-16
no longer affects: nova/newton

Reviewed: https://review.openstack.org/505160
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=590fd6d2f8497af83307025c8cb796fb433ae80b
Submitter: Zuul
Branch: stable/ocata

commit 590fd6d2f8497af83307025c8cb796fb433ae80b
Author: Előd Illés <email address hidden>
Date: Fri Sep 1 15:36:20 2017 +0200

    Functional test for regression bug #1713783

    Add functional test for evacuation, when no valid host available.
    Migration should end up in 'error' state.

    Original patch is modified as in Ocata os-service API still uses
    the old way of identifing services.

    Also, due to bugfix of #1744325 the state of the server after a
    failed evacuation is no longer ACTIVE rather ERROR.

    Change-Id: I1adc20f2a5261e6906a18b9aee5cd2c8ecf0cf4d
    Related-bug: #1713783
    (cherry picked from commit 5687c170ea003aafa2720fb00af9d84836bb41df)
    (cherry picked from commit e87d9f51abea0b1c97628ef02435157da4b334ec)

tags: added: in-stable-ocata

Reviewed: https://review.openstack.org/518733
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=604954a70a5dbf6a1bf79d8b67e8d92c2bf46386
Submitter: Zuul
Branch: stable/ocata

commit 604954a70a5dbf6a1bf79d8b67e8d92c2bf46386
Author: Előd Illés <email address hidden>
Date: Wed Aug 30 16:54:36 2017 +0200

    Set error state after failed evacuation

    When evacuation fails with NoValidHost, the migration status remains
    'accepted' instead of 'error'. This causes problem in case the compute
    service starts up again and looks for evacuations with status 'accepted',
    as it then removes the local instances for those evacuations even though
    the instance was never actually evacuated to another host.

    Conflicts:
          nova/conductor/manager.py

    NOTE(mriedem): The conflict is due to not having change
    I6590f0eda4ec4996543ad40d8c2640b83fc3dd9d in Ocata.

    Change-Id: I06d78c744fa75ae5f34c5cfa76bc3c9460767b84
    Closes-Bug: #1713783
    (cherry picked from commit a8ebf5f1aac080854704e27146e8c98b053c6224)
    (cherry picked from commit a3f286f43d866cd343d26d9bafadecab1c225e4b)

This issue was fixed in the openstack/nova 15.1.1 release.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/561707

Reviewed: https://review.openstack.org/621203
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c5cd6c8e50201dfaa192cdfbbc64d714eeadcd95
Submitter: Zuul
Branch: stable/pike

commit c5cd6c8e50201dfaa192cdfbbc64d714eeadcd95
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 1 13:56:38 2017 -0400

    Update docs for _destroy_evacuated_instances

    The docstring on this method no longer reflects reality,
    and hasn't since we started filtering on migration records
    starting back in Liberty. This change updates the docstring
    for what the method actually does now, and also includes a
    comment about why we include 'accepted' but not yet complete
    migrations when looking for things to remove from this source
    node.

    Change-Id: Ia0c67aeffe41e133de9dee5080be9c5ec1739aa6
    Related-Bug: #1713783
    (cherry picked from commit aec81e7d98c1ca082abd157dab5cc83b5bedfd89)

Changed in fuel:
assignee: nobody → Oleksiy Molchanov (omolchanov)
importance: Undecided → High
status: New → Confirmed
milestone: none → 9.2-mu-11

Change abandoned by Oleksiy Molchanov <email address hidden> on branch: 9.0/mitaka
Review: https://review.fuel-infra.org/40162

Fix proposed to branch: mcp/1.0/mitaka
Change author: Előd Illés <email address hidden>
Review: https://review.fuel-infra.org/40217

Reviewed: https://review.fuel-infra.org/40217
Submitter: Pkgs Jenkins <email address hidden>
Branch: mcp/1.0/mitaka

Commit: b7db765eb4d9a21da73d0a92d92bfa2bf3bd88c2
Author: Előd Illés <email address hidden>
Date: Tue Feb 12 11:44:31 2019

Set error state after failed evacuation

When evacuation fails with NoValidHost, the migration status remains
'accepted' instead of 'error'. This causes problem in case the compute
service starts up again and looks for evacuations with status 'accepted',
as it then removes the local instances for those evacuations even though
the instance was never actually evacuated to another host.

Conflicts:
      nova/conductor/manager.py

NOTE(mriedem): The conflict is due to not having change
I6590f0eda4ec4996543ad40d8c2640b83fc3dd9d in Ocata.

Change-Id: I06d78c744fa75ae5f34c5cfa76bc3c9460767b84
Closes-Bug: #1713783
(cherry picked from commit a8ebf5f1aac080854704e27146e8c98b053c6224)
(cherry picked from commit a3f286f43d866cd343d26d9bafadecab1c225e4b)

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers