It is impossible to delete an instance that has failed due to neutron/nova notification problems

Bug #1423952 reported by Lars Kellogg-Stedman
This bug affects 3 people
| Affects                  | Status       | Importance | Assigned to     | Milestone |
| OpenStack Compute (nova) | Fix Released | Critical   | Matt Riedemann  |           |
| Icehouse                 | Fix Released | Critical   | Ihar Hrachyshka |           |
| Juno                     | Fix Released | Critical   | Matt Riedemann  |           |

Bug Description

If you attempt to boot a nova instance without Neutron properly configured for neutron/nova notifications, the instance will eventually fail to spawn:

  [-] [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] Instance failed to spawn
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] Traceback (most recent call last):
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2243, in _build_resources
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] yield resources
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2113, in _build_and_run_instance
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] block_device_info=block_device_info)
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2622, in spawn
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] block_device_info, disk_info=disk_info)
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4439, in _create_domain_and_network
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] raise exception.VirtualInterfaceCreateException()
  [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] VirtualInterfaceCreateException: Virtual Interface creation failed

If you try to delete this instance, the delete operation will fail. In the logs, you see:

  AUDIT nova.compute.manager [req-a4b30d0b-e6d3-429f-8f7a-b7788b79c86c None] [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] Terminating instance
  WARNING nova.virt.libvirt.driver [-] [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] During wait destroy, instance disappeared.
  INFO nova.virt.libvirt.driver [req-a4b30d0b-e6d3-429f-8f7a-b7788b79c86c None] [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] Deletion of /var/lib/nova/instances/1541a197-9f80-4ee5-a7d6-08e591aa83fd_del complete
  INFO nova.compute.manager [req-a4b30d0b-e6d3-429f-8f7a-b7788b79c86c None] [instance: 1541a197-9f80-4ee5-a7d6-08e591aa83fd] Instance disappeared during terminate

At this point, `nova list` will show:

  | 1541a197-9f80-4ee5-a7d6-08e591aa83fd | test0 | ERROR | deleting | NOSTATE | |

And it appears to be impossible to delete this instance. Running "nova reset-state <instance>" has no effect (with or without --active), nor does correctly configuring neutron.

The only way to get rid of this instance appears to be directly editing the database.

Tags: compute
Dan Smith (danms)
Changed in nova:
importance: Undecided → Medium
status: New → Triaged
milestone: none → kilo-3
Matt Riedemann (mriedem) wrote :

This was actually 'fixed' with bug 1308342 in that you can do a force-delete call on the instance and that will clean it up. However, the fix for bug 1308342 introduced a regression in the cells RPC API, so we have bug 1430822 to fix that.

Once we revert https://review.openstack.org/#/c/121800/ then we'll make the fix for this in the compute API to allow force-delete of an instance stuck in 'deleting' task_state to resolve this bug.

Note that we should also have a @reverts_task_state decorator on whatever compute manager call failed on spawn so we don't get stuck in this state to begin with.

Matt Riedemann (mriedem) wrote :

Actually, what level of code was this bug reported against? Juno? Or Kilo? Because build_and_run_instance in the compute manager already has the @reverts_task_state decorator:

http://git.openstack.org/cgit/openstack/nova/tree/nova/compute/manager.py?id=2015.1.0b2#n2007

Matt Riedemann (mriedem) wrote :

I'm not really seeing how this is happening on master level code, but I'll write a unit test to try and recreate.

Lars Kellogg-Stedman (larsks) wrote :

This was originally reported against Icehouse, but I believe the behavior can be reproduced with Juno code as well. I have not tried to reproduce it with anything more recent.

Matt Riedemann (mriedem) wrote :

Note that since Juno you can force-delete an instance in any vm_state but the task_state must be None:

https://review.openstack.org/#/c/111157/

So if the vm_state is 'ERROR' and the task_state is 'deleting', you have to reset the task state using the reset-state admin actions API first to reset the task_state to None and then you can force-delete it.
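The rule in this comment can be modelled as a small sketch. The helpers `can_force_delete` and `cleanup_steps` are hypothetical and only encode the precondition as stated here; they are not nova's API check.

```python
def can_force_delete(vm_state, task_state):
    """Hypothetical model of the rule described above, not nova code."""
    # Any vm_state is acceptable; only a pending task_state blocks the call.
    return task_state is None


def cleanup_steps(vm_state, task_state):
    """Return the admin actions needed to remove an instance, in order."""
    steps = []
    if not can_force_delete(vm_state, task_state):
        # The reset-state admin action clears the task_state first.
        steps.append('reset-state')
        task_state = None
    steps.append('force-delete')
    return steps


print(cleanup_steps('error', 'deleting'))
print(cleanup_steps('error', None))
```

For the instance in this report (vm_state 'ERROR', task_state 'deleting'), the sketch yields the two-step sequence described above.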

Matt Riedemann (mriedem) wrote :

I think I see a possible bug in the @reverts_task_state decorator in the compute manager code: it assumes there is an 'instance' key in the kwargs passed to the method and gets the uuid from that for the state update. However, if kwargs['instance'] raises a KeyError, we swallow it and continue on our merry way.

There are other decorators that use utilities to get all of the args into a kwargs dict, so we can be sure we're getting the right thing. I'm going to play with some patches to see how often we have been hitting this in a normal Jenkins run and ignoring it.
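A minimal sketch of the failure mode described in this comment, using a simplified stand-in for the decorator. The names `reverts_task_state_sketch`, `FakeManager`, `try_build` and `reverted` are hypothetical and not nova code. When `instance` is passed positionally, `kwargs['instance']` raises a KeyError, the cleanup path swallows it, and the revert never happens:

```python
import functools

reverted = []  # uuids of instances whose task_state was "reverted"


def reverts_task_state_sketch(function):
    """Simplified stand-in for the decorator; not the actual nova code."""
    @functools.wraps(function)
    def decorated_function(self, context, *args, **kwargs):
        try:
            return function(self, context, *args, **kwargs)
        except Exception:
            try:
                # Assumes 'instance' was passed by keyword.
                reverted.append(kwargs['instance']['uuid'])
            except Exception:
                # The KeyError is swallowed here, so the revert is skipped.
                pass
            raise
    return decorated_function


class FakeManager(object):
    @reverts_task_state_sketch
    def build(self, context, instance):
        raise RuntimeError('spawn failed')


def try_build(*args, **kwargs):
    try:
        FakeManager().build(None, *args, **kwargs)
    except RuntimeError:
        pass


try_build(instance={'uuid': 'by-keyword'})   # revert runs
try_build({'uuid': 'by-position'})           # revert silently skipped
print(reverted)
```

Only the keyword call ends up in `reverted`; the positional call fails in exactly the same way but leaves no trace.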

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) wrote :

Confirmed, we can't even pass unit tests with that KeyError:

nova.tests.unit.compute.test_compute_mgr.ComputeManagerUnitTestCase.test_set_admin_password_bad_state
-----------------------------------------------------------------------------------------------------

Captured traceback:
~~~~~~~~~~~~~~~~~~~
    Traceback (most recent call last):
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/mock.py", line 1201, in patched
        return func(*args, **keywargs)
      File "nova/tests/unit/compute/test_compute_mgr.py", line 1889, in test_set_admin_password_bad_state
        self.context, instance, None)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 422, in assertRaises
        self.assertThat(our_callable, matcher)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 433, in assertThat
        mismatch_error = self._matchHelper(matchee, matcher, message, verbose)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 483, in _matchHelper
        mismatch = matcher.match(matchee)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/matchers/_exception.py", line 108, in match
        mismatch = self.exception_matcher.match(exc_info)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/matchers/_higherorder.py", line 62, in match
        mismatch = matcher.match(matchee)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 414, in match
        reraise(*matchee)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/matchers/_exception.py", line 101, in match
        result = matchee()
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/testtools/testcase.py", line 969, in __call__
        return self._callable_object(*self._args, **self._kwargs)
      File "nova/compute/manager.py", line 420, in decorated_function
        return function(self, context, *args, **kwargs)
      File "nova/exception.py", line 88, in wrapped
        payload)
      File "/home/mriedem/git/nova/.tox/py27/local/lib/python2.7/site-packages/oslo_utils/excutils.py", line 85, in __exit__
        six.reraise(self.type_, self.value, self.tb)
      File "nova/exception.py", line 71, in wrapped
        return f(self, context, *args, **kw)
      File "nova/compute/manager.py", line 296, in decorated_function
        instance_uuid = kwargs['instance']['uuid']
    KeyError: 'instance'

tags: added: compute icehouse-backport-potential juno-backport-potential
Matt Riedemann (mriedem)
Changed in nova:
importance: Medium → High
status: Triaged → In Progress
importance: High → Critical
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/163515

Matt Riedemann (mriedem) wrote :

The thinking is that https://review.openstack.org/#/c/130601/ introduced the bug: _do_build_and_run_instance is passed args rather than kwargs, and since it runs in a separate thread we lose the error.

Note that the change was backported to stable/juno and stable/icehouse, so the bug is also there:

https://review.openstack.org/#/q/Ife712c43c5a61424bc68b2f5ab47cefdb46ac168,n,z

And we need to backport the fix.
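The "separate thread" point in this comment can be illustrated with a short sketch. It uses the standard library `threading` module (Python 3.8+ for `threading.excepthook`) as a stand-in for whatever threading mechanism the compute manager uses, which is an assumption. The names `worker`, `spawn_build` and `captured` are hypothetical. An exception raised in the worker never reaches the caller that started it:

```python
import threading

captured = []  # names of exceptions that escaped worker threads


def record_thread_exception(args):
    # threading.excepthook receives exceptions that a thread did not handle.
    captured.append(args.exc_type.__name__)


threading.excepthook = record_thread_exception


def worker(**kwargs):
    # Mirrors the lookup that fails when 'instance' is not a keyword argument.
    return kwargs['instance']['uuid']


def spawn_build():
    thread = threading.Thread(target=worker)  # no 'instance' keyword supplied
    thread.start()
    thread.join()
    # The KeyError was raised inside the thread, not here.
    return 'caller continued'


result = spawn_build()
print(result, captured)
```

The caller returns normally, and the KeyError is visible only through the hook, which is why a failure of this kind is easy to miss.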

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/163515
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c43f2b0d708f0f4b37850d2917c0abcc13b8789b
Submitter: Jenkins
Branch: master

commit c43f2b0d708f0f4b37850d2917c0abcc13b8789b
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 11 09:21:29 2015 -0700

    Fix kwargs['instance'] KeyError in @reverts_task_state decorator

    We use @reverts_task_state everywhere in the compute manager and we
    ensure that 'instance' is a parameter to the decorator method via
    @utils.expects_func_args('instance'), however, that only ensures there
    is an instance argument, not that it's in args or kwargs (either is fine
    for what expects_func_args checks).

    The reverts_task_state decorator can get a KeyError when checking
    kwargs['instance'] and fail to revert the task_state on the instance,
    which can leave us in a bad state where non-admins can't delete the
    instance if the task_state is not None (and the reset-state API is
    admin-only).

    This fixes the KeyError in the decorator by normalizing the args/kwargs
    list into a single dict that we can pull the instance from.

    Also adds a warning log if we fail the instance update since it
    shouldn't happen and we want to know if it does because of the
    aforementioned problems with deleting orphaned instances.

    There isn't a specific unit test added for this since moving
    kwargs['instance'] above the try/except in reverts_task_state makes a
    lot of tests fail already if you don't have the normalize code.

    Closes-Bug: #1423952

    Change-Id: I70f464120c798422f9a3d601b7cdf3b0a8320690
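The normalization described in the commit message above can be sketched as follows. This is an assumption-laden illustration, not the merged patch: it uses `inspect.signature(...).bind` from the Python 3 standard library, and the names `reverts_task_state_fixed`, `FakeManager`, `try_build` and `reverted` are hypothetical.

```python
import functools
import inspect

reverted = []  # uuids of instances whose task_state was "reverted"


def reverts_task_state_fixed(function):
    """Sketch of a decorator that finds 'instance' in args or kwargs."""
    @functools.wraps(function)
    def decorated_function(self, context, *args, **kwargs):
        try:
            return function(self, context, *args, **kwargs)
        except Exception:
            # Merge positional and keyword arguments into one
            # name -> value mapping keyed by parameter name.
            bound = inspect.signature(function).bind(
                self, context, *args, **kwargs)
            instance = bound.arguments['instance']
            reverted.append(instance['uuid'])
            raise
    return decorated_function


class FakeManager(object):
    @reverts_task_state_fixed
    def build(self, context, instance):
        raise RuntimeError('spawn failed')


def try_build(*args, **kwargs):
    try:
        FakeManager().build(None, *args, **kwargs)
    except RuntimeError:
        pass


try_build(instance={'uuid': 'by-keyword'})
try_build({'uuid': 'by-position'})
print(reverted)
```

With the arguments bound by name, the revert runs whether `instance` arrives positionally or by keyword.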

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/juno)

Fix proposed to branch: stable/juno
Review: https://review.openstack.org/163623

Matt Riedemann (mriedem)
tags: removed: juno-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/163633

Alan Pevec (apevec)
tags: removed: icehouse-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/juno)

Reviewed: https://review.openstack.org/163623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8c377b2f50345107c96636ce903409b741801c86
Submitter: Jenkins
Branch: stable/juno

commit 8c377b2f50345107c96636ce903409b741801c86
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 11 09:21:29 2015 -0700

    Fix kwargs['instance'] KeyError in @reverts_task_state decorator

    Closes-Bug: #1423952

    Change-Id: I70f464120c798422f9a3d601b7cdf3b0a8320690
    (cherry picked from commit c43f2b0d708f0f4b37850d2917c0abcc13b8789b)

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/163633
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c70e1fbebf5488ca9f6c0ce3658f7583df3bbea5
Submitter: Jenkins
Branch: stable/icehouse

commit c70e1fbebf5488ca9f6c0ce3658f7583df3bbea5
Author: Matt Riedemann <email address hidden>
Date: Wed Mar 11 09:21:29 2015 -0700

    Fix kwargs['instance'] KeyError in @reverts_task_state decorator

    Closes-Bug: #1423952

    Change-Id: I70f464120c798422f9a3d601b7cdf3b0a8320690
    (cherry picked from commit c43f2b0d708f0f4b37850d2917c0abcc13b8789b)

Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: kilo-3 → 2015.1.0