vm_state ERROR vm undeletable if first delete attempt does not succeed.

Bug #1281324 reported by Robert Collins
This bug affects 7 people
Affects | Status | Importance | Assigned to | Milestone
OpenStack Compute (nova) | Fix Released | Medium | Tiago Mello |

Bug Description

We had a neutron failure in our cloud, which led to a bunch of VMs in state ERROR. We've repaired neutron but now we can't delete them:

ERROR: Cannot 'forceDelete' while instance is in vm_state error (HTTP 409) (Request-ID: req-1c4a88c3-4ea1-45a8-b987-629a69b4af06)

Nor can we stop or start them:
nova stop 01199ed9-b3c3-4ee9-a482-bdfdc7347ce1
ERROR: Instance 01199ed9-b3c3-4ee9-a482-bdfdc7347ce1 in task_state deleting. Cannot stop while the instance is in this state. (HTTP 400) (Request-ID: req-18d58b8d-b360-4b37-b671-34624a6dade4)
(ci-overcloud)robertc@lifelesshp:~/work$ nova start 01199ed9-b3c3-4ee9-a482-bdfdc7347ce1
ERROR: Instance 01199ed9-b3c3-4ee9-a482-bdfdc7347ce1 in vm_state error. Cannot start while the instance is in this state. (HTTP 400) (Request-ID: req-b46c0ee6-8ed8-41c3-b400-72f76429209a)

A normal 'delete' doesn't error, but it doesn't delete the VM either.

The problem is that nothing is cancelling the task state, so the VMs are staying stuck indefinitely.

Robert Collins (lifeless) wrote :

I can't see anything in the logs for nova-api or nova-compute w.r.t.

Robert Collins (lifeless) wrote :

Ok, so it hits this:
                        LOG.info(_('Instance is already in deleting state, '
                                   'ignoring this request'), instance=instance)

But the nova-compute process for that VM has been restarted and the VM isn't being deleted. Also, the info level for that message is wrong: default logging won't show it, and this is IMO an unusual situation where admins will be scratching their heads.
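To illustrate the point about the log level, here is a minimal sketch using only Python's standard logging module. It assumes the effective threshold is WARNING (the stdlib default for the root logger); the logger name and the ListHandler helper are hypothetical and not part of nova.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper that keeps emitted records in a list."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append((record.levelname, record.getMessage()))


log = logging.getLogger("delete_sketch")  # hypothetical logger name
log.propagate = False
handler = ListHandler()
log.addHandler(handler)

# Assumption: the deployment's effective threshold is WARNING.
log.setLevel(logging.WARNING)

msg = "Instance is already in deleting state, ignoring this request"
log.info(msg)     # dropped: INFO is below the WARNING threshold
log.warning(msg)  # kept: this is the level an admin would actually see
```

Under that assumption only the warning-level record reaches the handler, which matches the observation that the message is invisible with default logging.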

summary: - vm_state ERROR vm undeletable
+ vm_state ERROR vm undeletable if first delete attempt does not succeed.
description: updated
Robert Collins (lifeless) wrote :

AHHA, and so here's how the problem happened in the first place:
 - the compute node wasn't reachable from the api when the delete was submitted: so when the API calls delete, task_state=deleting is set.
 - but the compute node never got the message from rabbit, so task_state=None is never set.
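The sequence described above can be sketched as a toy model. This is not nova's code: the Instance class, the api_delete function, and the compute_reachable flag are hypothetical stand-ins, used only to show why a lost message leaves the instance wedged and why a second delete is silently ignored.

```python
from dataclasses import dataclass
from typing import Optional

DELETING = "deleting"  # stand-in for task_states.DELETING


@dataclass
class Instance:
    uuid: str
    vm_state: str = "error"
    task_state: Optional[str] = None


def api_delete(instance, compute_reachable):
    """Hypothetical API-side delete: mark the instance, then cast to compute."""
    if instance.task_state == DELETING:
        # Mirrors the "already in deleting state, ignoring this request" path.
        return "ignored"
    instance.task_state = DELETING
    if compute_reachable:
        # Compute receives the message, deletes, and clears the task state.
        instance.vm_state = "deleted"
        instance.task_state = None
        return "deleted"
    # Message lost: nothing ever resets task_state.
    return "lost"


vm = Instance("01199ed9")
first = api_delete(vm, compute_reachable=False)   # compute node unreachable
second = api_delete(vm, compute_reachable=True)   # retry after the repair
```

In this model the retry is ignored even though the compute node is back, because task_state is still "deleting" from the first attempt.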

Robert Collins (lifeless) wrote :

| 01199ed9-b3c3-4ee9-a482-bdfdc7347ce1 | tripleo-fedora-1391918949.template.openstack.org | ERROR | deleting | Running | default-net=10.0.14.73; tripleo-bm-test=192.168.1.247 |

^ Example VM. Note that the status is ERROR while the power state is Running.

Robert Collins (lifeless) wrote :

And this if block -

        if (instance.vm_state == vm_states.SOFT_DELETED or
            (instance.vm_state == vm_states.ERROR and
            instance.task_state != task_states.RESIZE_MIGRATING)):
            LOG.debug(_("Instance is in %s state."),
                      instance.vm_state, instance=instance)

is the one that fails to delete these on startup, because they are in vm_state ERROR with a task_state other than RESIZE_MIGRATING.
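The effect of that condition can be sketched as a standalone predicate. The string constants below are stand-ins rather than nova's real vm_states and task_states values, and the exempt parameter is a hypothetical knob for illustration; it shows one possible way to let ERROR + deleting instances through, not necessarily what the merged change does.

```python
# Stand-in constants; the real ones live in nova's vm_states/task_states.
SOFT_DELETED = "soft-delete"
ERROR = "error"
RESIZE_MIGRATING = "resize_migrating"
DELETING = "deleting"


def skipped_at_startup(vm_state, task_state, exempt=(RESIZE_MIGRATING,)):
    """Return True if the early check would bail out before any cleanup."""
    return vm_state == SOFT_DELETED or (
        vm_state == ERROR and task_state not in exempt
    )


# Check as quoted: ERROR + deleting is skipped, so the delete never resumes.
stuck = skipped_at_startup(ERROR, DELETING)

# Hypothetical variant that also exempts DELETING, so later code can run.
resumed = not skipped_at_startup(
    ERROR, DELETING, exempt=(RESIZE_MIGRATING, DELETING)
)
```

With the quoted condition the ERROR + deleting instance is skipped at startup; with DELETING added to the exempt set it falls through to whatever handling follows.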

Tiago Mello (timello)
Changed in nova:
assignee: nobody → Tiago Rodrigues de Mello (timello)
Tiago Mello (timello) wrote :

The code further down in the same _init_instance function is supposed to handle the case where task_state is 'DELETING', but as you pointed out, the first 'if' block stops the process before it gets there.

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/75047

Changed in nova:
status: Confirmed → In Progress
Steve Kowalik (stevenk) wrote :

https://review.openstack.org/#/c/74240/ pre-dates your change, but I'm not certain why the bot did not update this bug.

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/74240
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=556ab844c823dd364032d59ab1b61780243cbfd1
Submitter: Jenkins
Branch: master

commit 556ab844c823dd364032d59ab1b61780243cbfd1
Author: Robert Collins <email address hidden>
Date: Tue Feb 18 16:03:23 2014 +1300

    Delete ERROR+DELETING VMs during compute startup.

    We should perhaps do this check during message bus reconnection as
    well.. Anyhow, if a compute node is offline during a nova API call
    to delete an instance, and the rabbit message is lost for some
    reason (or alternatively if the delete method throws an error)
    then the task state is not cleared and won't be cleared on compute
    restart, leaving it wedged forever.

    Change-Id: Ie0a47958eb0fb58307902437a95634d5f54f74f3
    Fixes-bug: #1281324
    Co-Authored-By: Steve Kowalik <email address hidden>

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → icehouse-rc1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-rc1 → 2014.1
Sacha Yunusic (sacha-m) wrote :

Is there any update on this? I have a similar problem, except that I don't want to delete the instance but turn it on.
This is my instance state:
[_ID_] | [_Name_] | ACTIVE | - | Shutdown | admin_net=10.10.0.13, 10.222.221.6 |
When I try to start it from the cli, this is what I get:
ERROR (Conflict): Instance [_ID_] in vm_state active. Cannot start while the instance is in this state. (HTTP 409)
Can I save my instance?
