Bug #1423952 “It is impossible to delete an instance that has fa...” : Bugs : OpenStack Compute (nova)

Dan Smith (danms) on 2015-02-20

Changed in nova:
importance:	Undecided → Medium
status:	New → Triaged
milestone:	none → kilo-3

Matt Riedemann (mriedem) wrote on 2015-03-11:

#1

This was actually 'fixed' with bug 1308342 in that you can do a force-delete call on the instance and that will clean it up. However, the fix for bug 1308342 introduced a regression in the cells RPC API, so we have bug 1430822 to fix that.

Once we revert https://review.openstack.org/#/c/121800/ then we'll make the fix for this in the compute API to allow force-delete of an instance stuck in 'deleting' task_state to resolve this bug.

Note that we should also have a @reverts_task_state decorator on whatever compute manager call failed on spawn, so we don't get stuck in this state to begin with.

Matt Riedemann (mriedem) wrote on 2015-03-11:

#2

Actually, what level of code was this bug reported against? Juno? Or Kilo? Because build_and_run_instance in the compute manager already has the @reverts_task_state decorator:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2015.1.0b2#n2007

Matt Riedemann (mriedem) wrote on 2015-03-11:

#3

I'm not really seeing how this is happening on master level code, but I'll write a unit test to try and recreate.

Lars Kellogg-Stedman (larsks) wrote on 2015-03-11:

#4

This was originally reported against Icehouse, but I believe the behavior can be reproduced with Juno code as well. I have not tried to reproduce it with anything more recent.

Matt Riedemann (mriedem) wrote on 2015-03-11:

#5

Note that since Juno you can force-delete an instance in any vm_state but the task_state must be None:

https://review.openstack.org/#/c/111157/

So if the vm_state is 'ERROR' and the task_state is 'deleting', you have to reset the task state using the reset-state admin actions API first to reset the task_state to None and then you can force-delete it.
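
The ordering described in this comment can be sketched with a toy model. This is only an illustration of the precondition, not nova code: the `Instance` class and the `force_delete` and `reset_state` functions are hypothetical stand-ins for the corresponding API actions.

```python
class Instance:
    """Hypothetical stand-in for a nova instance record."""

    def __init__(self, vm_state, task_state):
        self.vm_state = vm_state
        self.task_state = task_state
        self.deleted = False


def force_delete(instance):
    # Per the comment above: any vm_state is accepted, but task_state
    # must be None.
    if instance.task_state is not None:
        raise RuntimeError(
            'task_state is %r, expected None' % instance.task_state)
    instance.deleted = True


def reset_state(instance, vm_state='error'):
    # Modeled here as setting vm_state and clearing task_state.
    instance.vm_state = vm_state
    instance.task_state = None


stuck = Instance(vm_state='error', task_state='deleting')
try:
    force_delete(stuck)
    first_attempt = 'deleted'
except RuntimeError:
    first_attempt = 'rejected'

# Clear the task_state first, then the force-delete goes through.
reset_state(stuck)
force_delete(stuck)
```

In this model the first force-delete is rejected because task_state is still 'deleting'; only after the reset does the delete succeed.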

Matt Riedemann (mriedem) wrote on 2015-03-11:

#6

I think I see a possible bug in the @reverts_task_state decorator in the compute manager code. It assumes there is an 'instance' key in the kwargs passed to the method and gets the uuid from that for the state update. However, if kwargs['instance'] raises a KeyError, we swallow it and continue on our merry way.

There are other decorators that use utilities to collect all of the args into a kwargs dict, so we can be sure we're getting the right thing. I'm going to play with some patches to see how often we hit this in a normal Jenkins run while silently ignoring it.
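
A minimal sketch of the suspected failure mode, assuming a simplified decorator. `reverts_task_state_buggy`, `FakeManager`, `try_build` and the `reverted` list are hypothetical names for illustration and are not nova's implementation:

```python
import functools

reverted = []  # hypothetical record of instance UUIDs that were reverted


def reverts_task_state_buggy(function):
    """Simplified stand-in for the decorator described above."""
    @functools.wraps(function)
    def decorated_function(self, context, *args, **kwargs):
        try:
            return function(self, context, *args, **kwargs)
        except Exception:
            try:
                # Assumes the caller passed instance=... as a keyword.
                reverted.append(kwargs['instance']['uuid'])
            except Exception:
                # A positional call raises KeyError here, which is
                # swallowed, so the task_state is never reverted.
                pass
            raise
    return decorated_function


class FakeManager(object):
    @reverts_task_state_buggy
    def build(self, context, instance):
        raise RuntimeError('spawn failed')


def try_build(*args, **kwargs):
    """Run the failing build and return the UUIDs reverted so far."""
    try:
        FakeManager().build(*args, **kwargs)
    except RuntimeError:
        pass
    return list(reverted)


by_keyword = try_build('ctx', instance={'uuid': 'abc'})
by_position = try_build('ctx', {'uuid': 'def'})
```

Only the keyword call records a revert. The positional call fails the same way but leaves no trace of 'def' in `reverted`.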

Changed in nova:
assignee:	nobody → Matt Riedemann (mriedem)

Matt Riedemann (mriedem) wrote on 2015-03-11:

#7

Confirmed, we can't even pass unit tests with that KeyError:

nova.tests.unit.compute.test_compute_mgr.ComputeManagerUnitTestCase.test_set_admin_password_bad_state
-----------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/mock.py", line 1201, in patched
        return func(*args, **keywargs)
      File "nova/tests/unit/compute/test_compute_mgr.py", line 1889, in test_set_admin_password_bad_state
        self.context, instance, None)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 422, in assertRaises
        self.assertThat(our_callable, matcher)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 433, in assertThat
        mismatch_error = self._matchHelper(matchee, matcher, message, verbose)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 483, in _matchHelper
        mismatch = matcher.match(matchee)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/matchers/_exception.py", line 108, in match
        mismatch = self.exception_matcher.match(exc_info)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/matchers/_higherorder.py", line 62, in match
        mismatch = matcher.match(matchee)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 414, in match
        reraise(*matchee)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/matchers/_exception.py", line 101, in match
        result = matchee()
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 969, in __call__
        return self._callable_object(*self._args, **self._kwargs)
      File "nova/compute/manager.py", line 420, in decorated_function
        return function(self, context, *args, **kwargs)
      File "nova/exception.py", line 88, in wrapped
        payload)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
        six.reraise(self.type_, self.value, self.tb)
      File "nova/exception.py", line 71, in wrapped
        return f(self, context, *args, **kw)
      File "nova/compute/manager.py", line 296, in decorated_function
        instance_uuid = kwargs['instance']['uuid']
    KeyError: 'instance'

tags:

added: compute icehouse-backport-potential juno-backport-potential

Matt Riedemann (mriedem) on 2015-03-11

Changed in nova:
importance:	Medium → High
status:	Triaged → In Progress
importance:	High → Critical

OpenStack Infra (hudson-openstack) wrote on 2015-03-11: Fix proposed to nova (master)

#8

Fix proposed to branch: master
Review: https://review.openstack.org/163515

Matt Riedemann (mriedem) wrote on 2015-03-11:

#9

The thinking is https://review.openstack.org/#/c/130601/ introduced the bug since _do_build_and_run_instance is passed args rather than kwargs and since that's in a separate thread we lose the error.

Note that change was backported to stable/juno and stable/icehouse, so it's also there:

https://review.openstack.org/#/q/Ife712c43c5a61424bc68b2f5ab47cefdb46ac168,n,z

And we need to backport the fix.
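
As a rough analogy for losing the error in a separate thread, here is a sketch using the standard library thread pool. This is not how nova spawns the build; `do_build` is a hypothetical worker standing in for _do_build_and_run_instance.

```python
from concurrent.futures import ThreadPoolExecutor


def do_build(instance):
    # Hypothetical worker that fails the way the decorator did.
    raise KeyError('instance')


caller_saw_error = False
try:
    # Leaving the with-block waits for the worker to finish.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(do_build, {'uuid': 'abc'})
except KeyError:
    caller_saw_error = True

# The exception is stored on the future and only surfaces if someone
# asks for it; the submitting thread never saw it raised.
stored_error = future.exception()
```

Unless the caller inspects the result, the failure in the worker goes unnoticed, which matches the symptom described above.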

OpenStack Infra (hudson-openstack) wrote on 2015-03-11: Fix merged to nova (master)

#10

Reviewed: https://review.openstack.org/163515
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c43f2b0d708f0f4b37850d2917c0abcc13b8789b
Submitter: Jenkins
Branch: master

commit c43f2b0d708f0f4b37850d2917c0abcc13b8789b
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 11 09:21:29 2015 -0700

    Fix kwargs['instance'] KeyError in @reverts_task_state decorator

    We use @reverts_task_state everywhere in the compute manager and we
    ensure that 'instance' is a parameter to the decorator method via
    @utils.expects_func_args('instance'), however, that only ensures there
    is an instance argument, not that it's in args or kwargs (either is fine
    for what expects_func_args checks).

    The reverts_task_state decorator can get a KeyError when checking
    kwargs['instance'] and fail to revert the task_state on the instance,
    which can leave us in a bad state where non-admins can't delete the
    instance if the task_state is not None (and the reset-state API is
    admin-only).

    This fixes the KeyError in the decorator by normalizing the args/kwargs
    list into a single dict that we can pull the instance from.

    Also adds a warning log if we fail the instance update since it
    shouldn't happen and we want to know if it does because of the
    aforementioned problems with deleting orphaned instances.

    There isn't a specific unit test added for this since moving
    kwargs['instance'] above the try/except in reverts_task_state makes a
    lot of tests fail already if you don't have the normalize code.

    Closes-Bug: #1423952

    Change-Id: I70f464120c798422f9a3d601b7cdf3b0a8320690
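
The normalization idea in this commit message can be sketched as follows. This is an assumption-laden illustration rather than the merged patch: it uses `inspect.signature(...).bind(...)` from the Python 3 standard library to map positional and keyword arguments to parameter names, and `revert_task_state`, `FakeManager` and the `reverted` list are hypothetical placeholders.

```python
import functools
import inspect
import logging

LOG = logging.getLogger(__name__)
reverted = []  # hypothetical record of reverted instance UUIDs


def revert_task_state(instance_uuid):
    # Hypothetical placeholder for the real instance update.
    reverted.append(instance_uuid)


def reverts_task_state(function):
    @functools.wraps(function)
    def decorated_function(self, context, *args, **kwargs):
        try:
            return function(self, context, *args, **kwargs)
        except Exception:
            # Normalize args/kwargs into one name -> value mapping. This
            # sits outside the inner try/except so that a missing
            # 'instance' is not silently hidden.
            call_args = inspect.signature(function).bind(
                self, context, *args, **kwargs).arguments
            instance_uuid = call_args['instance']['uuid']
            try:
                revert_task_state(instance_uuid)
            except Exception:
                # Warn if the update itself fails, as the commit describes.
                LOG.warning('Failed to revert task state for instance %s',
                            instance_uuid)
            raise
    return decorated_function


class FakeManager(object):
    @reverts_task_state
    def build(self, context, instance):
        raise RuntimeError('spawn failed')


for call in (lambda m: m.build('ctx', {'uuid': 'abc'}),
             lambda m: m.build('ctx', instance={'uuid': 'def'})):
    try:
        call(FakeManager())
    except RuntimeError:
        pass
```

With the arguments bound by name, both the positional and the keyword call find the instance and record a revert.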

Changed in nova:
status:	In Progress → Fix Committed

OpenStack Infra (hudson-openstack) wrote on 2015-03-11: Fix proposed to nova (stable/juno)

#11

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/163623

Matt Riedemann (mriedem) on 2015-03-11

tags:

removed: juno-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2015-03-11: Fix proposed to nova (stable/icehouse)

#12

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/163633

Alan Pevec (apevec) on 2015-03-12

tags:

removed: icehouse-backport-potential

OpenStack Infra (hudson-openstack) wrote on 2015-03-12: Fix merged to nova (stable/juno)

#13

Reviewed: https://review.openstack.org/163623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8c377b2f50345107c96636ce903409b741801c86
Submitter: Jenkins
Branch: stable/juno

commit 8c377b2f50345107c96636ce903409b741801c86
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 11 09:21:29 2015 -0700

    Fix kwargs['instance'] KeyError in @reverts_task_state decorator

    We use @reverts_task_state everywhere in the compute manager and we
    ensure that 'instance' is a parameter to the decorator method via
    @utils.expects_func_args('instance'), however, that only ensures there
    is an instance argument, not that it's in args or kwargs (either is fine
    for what expects_func_args checks).

    The reverts_task_state decorator can get a KeyError when checking
    kwargs['instance'] and fail to revert the task_state on the instance,
    which can leave us in a bad state where non-admins can't delete the
    instance if the task_state is not None (and the reset-state API is
    admin-only).

    This fixes the KeyError in the decorator by normalizing the args/kwargs
    list into a single dict that we can pull the instance from.

    Also adds a warning log if we fail the instance update since it
    shouldn't happen and we want to know if it does because of the
    aforementioned problems with deleting orphaned instances.

    There isn't a specific unit test added for this since moving
    kwargs['instance'] above the try/except in reverts_task_state makes a
    lot of tests fail already if you don't have the normalize code.

    Closes-Bug: #1423952

    Change-Id: I70f464120c798422f9a3d601b7cdf3b0a8320690
    (cherry picked from commit c43f2b0d708f0f4b37850d2917c0abcc13b8789b)

OpenStack Infra (hudson-openstack) wrote on 2015-03-12: Fix merged to nova (stable/icehouse)

#14

Reviewed: https://review.openstack.org/163633
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c70e1fbebf5488ca9f6c0ce3658f7583df3bbea5
Submitter: Jenkins
Branch: stable/icehouse

commit c70e1fbebf5488ca9f6c0ce3658f7583df3bbea5
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 11 09:21:29 2015 -0700

    Fix kwargs['instance'] KeyError in @reverts_task_state decorator

    We use @reverts_task_state everywhere in the compute manager and we
    ensure that 'instance' is a parameter to the decorator method via
    @utils.expects_func_args('instance'), however, that only ensures there
    is an instance argument, not that it's in args or kwargs (either is fine
    for what expects_func_args checks).

    The reverts_task_state decorator can get a KeyError when checking
    kwargs['instance'] and fail to revert the task_state on the instance,
    which can leave us in a bad state where non-admins can't delete the
    instance if the task_state is not None (and the reset-state API is
    admin-only).

    This fixes the KeyError in the decorator by normalizing the args/kwargs
    list into a single dict that we can pull the instance from.

    Also adds a warning log if we fail the instance update since it
    shouldn't happen and we want to know if it does because of the
    aforementioned problems with deleting orphaned instances.

    There isn't a specific unit test added for this since moving
    kwargs['instance'] above the try/except in reverts_task_state makes a
    lot of tests fail already if you don't have the normalize code.

    Closes-Bug: #1423952

    Change-Id: I70f464120c798422f9a3d601b7cdf3b0a8320690
    (cherry picked from commit c43f2b0d708f0f4b37850d2917c0abcc13b8789b)

Thierry Carrez (ttx) on 2015-03-20

Changed in nova:
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-04-30

Changed in nova:
milestone:	kilo-3 → 2015.1.0

It is impossible to delete an instance that has failed due to neutron/nova notification problems

Affects                   Status        Importance  Assigned to      Milestone
OpenStack Compute (nova)  Fix Released  Critical    Matt Riedemann   OpenStack Compute (nova) 2015.1.0 "kilo"
Icehouse                  Fix Released  Critical    Ihar Hrachyshka  OpenStack Compute (nova) 2014.1.4
Juno                      Fix Released  Critical    Matt Riedemann   OpenStack Compute (nova) 2014.2.3