After failed evacuation the recovered source compute tries to delete the instance
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
| Fuel for OpenStack |
High
|
Oleksiy Molchanov | ||
| OpenStack Compute (nova) |
High
|
Elod Illes | ||
| Ocata |
High
|
Balazs Gibizer | ||
| Pike |
High
|
Matt Riedemann | ||
| OpenStack Security Advisory |
Undecided
|
Unassigned |
Bug Description
Description
===========
In case of a failed evacuation attempt the status of the migration is 'accepted' instead of 'failed' so when source compute is recovered the compute manager tries to delete the instance from the source host. However a secondary fault prevents deleting the allocation in placement so the actual deletion of the instance fails as well.
Steps to reproduce
==================
The following functional test reproduces the bug: https:/
What it does: initiate evacuation when no valid host is available and evacuation fails, but nova manager still tries to delete the instance.
Logs:
2017-08-29 19:11:15,751 ERROR [oslo_messaging
NoValidHost: No valid host was found. There are not enough hosts available.
2017-08-29 19:11:16,103 INFO [nova.tests.
2017-08-29 19:11:16,115 INFO [nova.api.
2017-08-29 19:11:16,120 INFO [nova.api.
2017-08-29 19:11:16,131 INFO [nova.api.
2017-08-29 19:11:16,138 INFO [nova.compute.
2017-08-29 19:11:16,146 INFO [nova.api.
2017-08-29 19:11:16,151 INFO [nova.api.
2017-08-29 19:11:16,152 INFO [nova.tests.
2017-08-29 19:11:16,163 INFO [nova.api.
2017-08-29 19:11:16,168 INFO [nova.api.
2017-08-29 19:11:16,176 INFO [nova.api.
2017-08-29 19:11:16,184 INFO [nova.compute.
2017-08-29 19:11:16,192 INFO [nova.api.
2017-08-29 19:11:16,197 INFO [nova.api.
2017-08-29 19:11:16,198 INFO [nova.tests.
2017-08-29 19:11:16,255 INFO [nova.api.
2017-08-29 19:11:16,262 INFO [nova.api.
2017-08-29 19:11:16,280 INFO [nova.api.
2017-08-29 19:11:16,281 INFO [nova.service] Starting compute node (version 16.0.0)
2017-08-29 19:11:16,296 INFO [nova.compute.
description: | updated |
tags: | added: evacuate placement |
summary: |
- Failed evacuation also tries to delete the VM + After failed evacuation the recovered source compute tries to delete the + instance |
Balazs Gibizer (balazs-gibizer) wrote : | #1 |
Matt Riedemann (mriedem) wrote : | #2 |
Somewhere in the evacuate cleanup flow on the source host once it comes back up, we should be comparing local instances and their instance.host to the current host and if they are the same, we shouldn't consider them evacuated. I don't see that happening anywhere though.
Matt Riedemann (mriedem) wrote : | #3 |
https:/
Matt Riedemann (mriedem) wrote : | #4 |
I'm also not sure why the _destroy_
Looking at the functional recreate patch https:/
Matt Riedemann (mriedem) wrote : | #5 |
It looks like the _destroy_
https:/
If it failed in the destination compute service, the migration status should be set to 'failed' and the migration filter in _destroy_
Changed in nova: | |
importance: | Undecided → High |
Matt Riedemann (mriedem) wrote : | #6 |
We should backport this as it actually leads to deleting things in the source node...
Changed in nova: | |
status: | New → Triaged |
information type: | Public → Private Security |
Matt Riedemann (mriedem) wrote : | #7 |
I've marked this as a potential security issue. If conductor fails to find a host for the evacuation, the migration record status is left in 'accepted' state and the compute service on the source node, if brought back up, will delete the guest from the source node because of this method pulling in 'accepted' state migration records:
https:/
The guest is destroyed here:
https:/
And this code goes back to at least Newton.
Matt Riedemann (mriedem) wrote : | #8 |
We could maybe argue that if the source node is down, the guest is already down, but it's at least not deleted.
Also note that by default, evacuate is an admin-only operation, but is configurable via policy.
Sean Dague (sdague) wrote : | #9 |
I'm really not clear why this is a security issue.
We're talking about evacuation, which is an admin action.
And the potential issue is the delete of a guest incorrectly.
Regular users can't trigger this (unless the admin gave them pretty broad privs, which have way other security issues). They aren't getting access to sensitive information. The worst case is a loss of information. Which in a cloud environment is not super unexpected.
I tend to agree with Sean, admin action based security issue are usually classed as type C1 (Not a vulnerability, just a bug with (some) security implications) according to VMT's taxonomy.
I've subscribed nova-coresec to double check the impact.
Matt Riedemann (mriedem) wrote : | #11 |
OK after talking with a few people we can mark this as public again.
information type: | Private Security → Public |
Jeremy Stanley (fungi) wrote : | #12 |
Agreeing with Tristan et al, and adding the "security" bug tag to indicate it's a hardening opportunity (C1).
tags: | added: security |
Changed in ossa: | |
status: | New → Won't Fix |
Related fix proposed to branch: master
Review: https:/
Changed in nova: | |
assignee: | nobody → Illes Elod (elod-illes) |
status: | Triaged → In Progress |
Changed in nova: | |
assignee: | Illes Elod (elod-illes) → Matt Riedemann (mriedem) |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit 5687c170ea003aa
Author: Előd Illés <email address hidden>
Date: Fri Sep 1 15:36:20 2017 +0200
Functional test for regression bug #1713783
Add functional test for evacuation, when no valid host available.
Migration should end up in 'error' state.
Change-Id: I1adc20f2a5261e
Related-bug: #1713783
Related fix proposed to branch: master
Review: https:/
Related fix proposed to branch: stable/pike
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit a8ebf5f1aac0808
Author: Előd Illés <email address hidden>
Date: Wed Aug 30 16:54:36 2017 +0200
Set error state after failed evacuation
When evacuation fails with NoValidHost, the migration status remains
'accepted' instead of 'error'. This causes problem in case the compute
service starts up again and looks for evacuations with status 'accepted',
as it then removes the local instances for those evacuations even though
the instance was never actually evacuated to another host.
Change-Id: I06d78c744fa75a
Closes-Bug: #1713783
Changed in nova: | |
status: | In Progress → Fix Released |
Changed in nova: | |
assignee: | Matt Riedemann (mriedem) → Illes Elod (elod-illes) |
Fix proposed to branch: stable/pike
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/pike
commit e87d9f51abea0b1
Author: Előd Illés <email address hidden>
Date: Fri Sep 1 15:36:20 2017 +0200
Functional test for regression bug #1713783
Add functional test for evacuation, when no valid host available.
Migration should end up in 'error' state.
Change-Id: I1adc20f2a5261e
Related-bug: #1713783
(cherry picked from commit 5687c170ea003aa
tags: | added: in-stable-pike |
Related fix proposed to branch: stable/ocata
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: stable/pike
commit a3f286f43d866cd
Author: Előd Illés <email address hidden>
Date: Wed Aug 30 16:54:36 2017 +0200
Set error state after failed evacuation
When evacuation fails with NoValidHost, the migration status remains
'accepted' instead of 'error'. This causes problem in case the compute
service starts up again and looks for evacuations with status 'accepted',
as it then removes the local instances for those evacuations even though
the instance was never actually evacuated to another host.
Change-Id: I06d78c744fa75a
Closes-Bug: #1713783
(cherry picked from commit a8ebf5f1aac0808
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit aec81e7d98c1ca0
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 1 13:56:38 2017 -0400
Update docs for _destroy_
The docstring on this method no longer reflects reality,
and hasn't since we started filtering on migration records
starting back in Liberty. This change updates the docstring
for what the method actually does now, and also includes a
comment about why we include 'accepted' but not yet complete
migrations when looking for things to remove from this source
node.
Change-Id: Ia0c67aeffe41e1
Related-Bug: #1713783
This issue was fixed in the openstack/nova 16.0.1 release.
This issue was fixed in the openstack/nova 17.0.0.0b1 development milestone.
Fix proposed to branch: stable/ocata
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit f1215e07a5185a8
Author: Előd Illés <email address hidden>
Date: Tue Aug 22 14:14:56 2017 +0200
Functional test: evacuate with no compute
Resource allocation test case:
Initiate evacuation when no valid host to evacuate to and
check all resource usages and allocations after.
This testcase uncovered the bug below. Bug is fixed already, so the
test contains asserts that cover the correct behavior.
Related-Bug: #1713783
Change-Id: I4ced19bd9259f0
no longer affects: | nova/newton |
Related fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 590fd6d2f8497af
Author: Előd Illés <email address hidden>
Date: Fri Sep 1 15:36:20 2017 +0200
Functional test for regression bug #1713783
Add functional test for evacuation, when no valid host available.
Migration should end up in 'error' state.
Original patch is modified as in Ocata os-service API still uses
the old way of identifing services.
Also, due to bugfix of #1744325 the state of the server after a
failed evacuation is no longer ACTIVE rather ERROR.
Change-Id: I1adc20f2a5261e
Related-bug: #1713783
(cherry picked from commit 5687c170ea003aa
(cherry picked from commit e87d9f51abea0b1
tags: | added: in-stable-ocata |
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/ocata
commit 604954a70a5dbf6
Author: Előd Illés <email address hidden>
Date: Wed Aug 30 16:54:36 2017 +0200
Set error state after failed evacuation
When evacuation fails with NoValidHost, the migration status remains
'accepted' instead of 'error'. This causes problem in case the compute
service starts up again and looks for evacuations with status 'accepted',
as it then removes the local instances for those evacuations even though
the instance was never actually evacuated to another host.
Conflicts:
NOTE(mriedem): The conflict is due to not having change
I6590f0eda4
Change-Id: I06d78c744fa75a
Closes-Bug: #1713783
(cherry picked from commit a8ebf5f1aac0808
(cherry picked from commit a3f286f43d866cd
This issue was fixed in the openstack/nova 15.1.1 release.
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
Related fix proposed to branch: stable/pike
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: stable/pike
commit c5cd6c8e50201df
Author: Matt Riedemann <email address hidden>
Date: Fri Sep 1 13:56:38 2017 -0400
Update docs for _destroy_
The docstring on this method no longer reflects reality,
and hasn't since we started filtering on migration records
starting back in Liberty. This change updates the docstring
for what the method actually does now, and also includes a
comment about why we include 'accepted' but not yet complete
migrations when looking for things to remove from this source
node.
Change-Id: Ia0c67aeffe41e1
Related-Bug: #1713783
(cherry picked from commit aec81e7d98c1ca0
Changed in fuel: | |
assignee: | nobody → Oleksiy Molchanov (omolchanov) |
importance: | Undecided → High |
status: | New → Confirmed |
milestone: | none → 9.2-mu-11 |
Fuel Devops McRobotson (fuel-devops-robot) wrote : Change abandoned on openstack/nova (9.0/mitaka) | #34 |
Change abandoned by Oleksiy Molchanov <email address hidden> on branch: 9.0/mitaka
Review: https:/
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/nova (mcp/1.0/mitaka) | #36 |
Fix proposed to branch: mcp/1.0/mitaka
Change author: Előd Illés <email address hidden>
Review: https:/
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/nova (mcp/1.0/mitaka) | #37 |
Reviewed: https:/
Submitter: Pkgs Jenkins <email address hidden>
Branch: mcp/1.0/mitaka
Commit: b7db765eb4d9a21
Author: Előd Illés <email address hidden>
Date: Tue Feb 12 11:44:31 2019
Set error state after failed evacuation
When evacuation fails with NoValidHost, the migration status remains
'accepted' instead of 'error'. This causes problem in case the compute
service starts up again and looks for evacuations with status 'accepted',
as it then removes the local instances for those evacuations even though
the instance was never actually evacuated to another host.
Conflicts:
nova/
NOTE(mriedem): The conflict is due to not having change
I6590f0eda4ec49
Change-Id: I06d78c744fa75a
Closes-Bug: #1713783
(cherry picked from commit a8ebf5f1aac0808
(cherry picked from commit a3f286f43d866cd
Reviewed: https:/
Submitter: Pkgs Jenkins <email address hidden>
Branch: 9.0/mitaka
Commit: 171d726d401dcbb
Author: Előd Illés <email address hidden>
Date: Mon Mar 25 17:27:27 2019
Set error state after failed evacuation
When evacuation fails with NoValidHost, the migration status remains
'accepted' instead of 'error'. This causes problem in case the compute
service starts up again and looks for evacuations with status 'accepted',
as it then removes the local instances for those evacuations even though
the instance was never actually evacuated to another host.
Conflicts:
nova/
NOTE(mriedem): The conflict is due to not having change
I6590f0eda4ec49
Change-Id: I06d78c744fa75a
Closes-Bug: #1713783
(cherry picked from commit a8ebf5f1aac0808
(cherry picked from commit a3f286f43d866cd
Changed in fuel: | |
status: | Confirmed → Fix Committed |
Pavel Glazov (pglazovv) wrote : | #39 |
Verified:
Instance is set to error status after a new evacuation. Instance is not deleted.
Changed in fuel: | |
status: | Fix Committed → Fix Released |
A regression test was pushed for this bug https:/ /review. openstack. org/#/c/ 498482/. (It seems that the automatic linking doesn't work in later patch sets)