Nova can't force-delete instance that is stuck in deleting process

Bug #1641523 reported by Aidin Alihodzic
This bug affects 1 person
Affects                   Status       Importance  Assigned to  Milestone
OpenStack Compute (nova)  In Progress  Medium      jichenjc
Newton                    In Progress  Medium      Lee Yarwood

Bug Description

I tried to delete an instance that went into the error state during creation, and it is now stuck with "Task State" deleting.
I checked with virsh and the instance is not present on the hypervisor. When I then tried to force-delete it, the nova client failed with a message saying that the error should be reported as a bug.

The installation is Newton on CentOS 7.

Here is part of nova-api.log:

2016-11-14 08:37:05.099 4189 INFO nova.osapi_compute.wsgi.server [req-37d453be-6f18-479d-bae8-6b24493a3366 24bef1856bf643c2bda92b710a7a6fcd 43dec04d0696450787fee3580b9780de - default default] 10.6.0.11 "GET /v2.1/ HTTP/1.1" status: 200 len: 714 time: 0.0065050
2016-11-14 08:37:05.569 4189 INFO nova.osapi_compute.wsgi.server [req-9a5d94a1-a2cf-45b8-bfac-5f231f3e07b7 24bef1856bf643c2bda92b710a7a6fcd 43dec04d0696450787fee3580b9780de - default default] 10.6.0.11 "GET /v2.1/43dec04d0696450787fee3580b9780de/servers/198b4ce6-84cc-4a45-8eda-d0b3262f3df4 HTTP/1.1" status: 200 len: 2243 time: 0.2764120
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions [req-fa12df45-bf9b-471b-8914-24a57427e358 24bef1856bf643c2bda92b710a7a6fcd 43dec04d0696450787fee3580b9780de - default default] Unexpected exception in API method
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/site-packages/nova/api/openstack/extensions.py", line 338, in wrapped
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/site-packages/nova/api/openstack/compute/deferred_delete.py", line 64, in _force_delete
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions self.compute_api.force_delete(context, instance)
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 165, in inner
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions return function(self, context, instance, *args, **kwargs)
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions File "/usr/lib/python2.7/site-packages/nova/compute/api.py", line 138, in inner
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions method=f.__name__)
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions InstanceInvalidState: Instance 198b4ce6-84cc-4a45-8eda-d0b3262f3df4 in task_state deleting. Cannot force_delete while the instance is in this state.
2016-11-14 08:37:05.651 4189 ERROR nova.api.openstack.extensions
2016-11-14 08:37:05.651 4189 INFO nova.api.openstack.wsgi [req-fa12df45-bf9b-471b-8914-24a57427e358 24bef1856bf643c2bda92b710a7a6fcd 43dec04d0696450787fee3580b9780de - default default] HTTP exception thrown: Unexpected API Error. Please report this at http://bugs.launchpad.net/nova/ and attach the Nova API log if possible.
<class 'nova.exception.InstanceInvalidState'>
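
The traceback shows the request being rejected by a wrapper (`inner`) in nova/compute/api.py before force_delete itself runs. The following is a minimal sketch of that kind of state guard, not nova's actual code: the decorator name `check_task_state`, the `Instance` and `API` classes, and the allowed-state tuple are assumptions made for illustration.

```python
import functools


class InstanceInvalidState(Exception):
    """Stand-in for nova.exception.InstanceInvalidState."""


class Instance:
    """Hypothetical minimal instance record."""

    def __init__(self, uuid, task_state=None):
        self.uuid = uuid
        self.task_state = task_state


def check_task_state(allowed=(None,)):
    """Reject the call unless instance.task_state is in `allowed`."""
    def outer(f):
        @functools.wraps(f)
        def inner(self, context, instance, *args, **kwargs):
            if instance.task_state not in allowed:
                raise InstanceInvalidState(
                    "Instance %s in task_state %s. Cannot %s while the "
                    "instance is in this state."
                    % (instance.uuid, instance.task_state, f.__name__))
            return f(self, context, instance, *args, **kwargs)
        return inner
    return outer


class API:
    """Hypothetical compute API with a guarded force_delete."""

    @check_task_state(allowed=(None,))
    def force_delete(self, context, instance):
        instance.task_state = "deleting"
        return "delete requested"
```

With a guard like this, an instance whose task_state is already "deleting" can never reach force_delete, which matches the message in the log above. If the API layer does not catch the exception, it surfaces as the "Unexpected API Error" seen here.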

Rajesh Tailor (ratailor)
Changed in nova:
assignee: nobody → Rajesh Tailor (ratailor)
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/397373

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/397542

Changed in nova:
status: New → In Progress
Aidin Alihodzic (ssbljk) wrote :

I need to add some more details, because my setup was unusual and this bug report may have been wrong in the first place.

I followed the official installation guide for Newton on CentOS from the documentation.
Along the way I wrote each step as an Ansible script so that I can easily adjust it to my infrastructure and reproduce it in the future.
The most important requirement for this setup was that everything works with SELinux and firewalld turned on.

Over many tests, I changed the setup bit by bit while testing as many things as I could.

I was intrigued by how many instances got stuck in the "Deleting" state, so I investigated a bit more today.

The problem appeared after I tested migration of instances between hosts (not live migration).
To implement migration I mostly followed these two guides:
https://www.sebastien-han.fr/blog/2015/01/06/openstack-configure-vm-migrate-nova-ssh/
https://twiki.cern.ch/twiki/bin/view/Sandbox/GettingStartedwithOpenStack

Since nova requires SSH keys to be generated and exchanged between hosts, I wrote Ansible scripts to do that. It turned out that they did not work, for this reason:
/var/lib/nova is the home directory of the nova user and had the SELinux context nova_var_lib_t, and audit.log showed that SELinux would not allow the nova user to log in because of that.
So I changed the context of /var/lib/nova to user_home_t, leaving the other directories at their default, nova_var_lib_t, except .ssh, which had to be ssh_home_t.

Everything worked until I rebooted the controller host (in this setup the controller host also runs compute, and there are two other compute nodes, with Storwize as the backend for Cinder). I suppose that during the various tests and try/fail/success scenarios I had turned SELinux off temporarily with "setenforce 0", and enforcing mode came back after the reboot.

Today I found in the logs that nova-api complains that it cannot access the /var/lib/nova/keys directory because of the wrong context on /var/lib/nova. I changed it back from user_home_t to nova_var_lib_t and set the Ansible scripts to do the same: generate and exchange the SSH keys, then restore the context of nova's home directory. I no longer get stuck instances, so I suppose this was the reason the nova client crashed when I tried to delete some of the stuck instances.
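
A check like the following could catch this kind of context drift before nova-api trips over it. It is only a sketch under assumptions: the helper names are made up, the full .ssh path is assumed to be /var/lib/nova/.ssh, and it reads the `security.selinux` extended attribute directly, which works on Linux only.

```python
import os


def selinux_type(context):
    """Return the type field of a 'user:role:type:level' SELinux context."""
    if isinstance(context, bytes):
        # The xattr value is NUL-terminated bytes.
        context = context.rstrip(b"\x00").decode()
    parts = context.split(":")
    if len(parts) < 3:
        raise ValueError("not an SELinux context: %r" % context)
    return parts[2]


# Expected types for the final, working state described above.
# The .ssh location under /var/lib/nova is an assumption.
EXPECTED = {
    "/var/lib/nova": "nova_var_lib_t",
    "/var/lib/nova/.ssh": "ssh_home_t",
}


def mismatched_paths(expected=EXPECTED):
    """Yield (path, actual_type, expected_type) for paths with the wrong type."""
    for path, want in expected.items():
        got = selinux_type(os.getxattr(path, "security.selinux"))
        if got != want:
            yield path, got, want
```

Running `list(mismatched_paths())` on the controller after a reboot would show whether /var/lib/nova still carries user_home_t instead of nova_var_lib_t.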

Rajesh Tailor (ratailor) wrote :

I think you can still reproduce this using the steps below:

1) Create an instance and wait for it to become active.
2) Delete the instance and, as soon as its task_state changes to 'deleting', stop rabbitmq-server. (To make this easier to hit, you can put a sleep in the delete API after it sets task_state to 'deleting'.)
3) Now, if you issue another delete/force-delete request for the same instance, nova will ignore it.

The instance task_state will remain 'deleting' forever, unless you restart the nova-compute service explicitly.
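
The steps above can be pictured with a toy model. This is not nova code: `Broker`, `Instance`, and `delete` are hypothetical stand-ins that only mimic the order of events (set task_state, then send the message to compute), assuming the message is lost when the broker is down.

```python
class Broker:
    """Hypothetical stand-in for the message queue (rabbitmq-server)."""

    def __init__(self):
        self.up = True
        self.queue = []

    def cast(self, message):
        if not self.up:
            raise ConnectionError("broker is down")
        self.queue.append(message)


class Instance:
    """Hypothetical minimal instance record."""

    def __init__(self, uuid):
        self.uuid = uuid
        self.task_state = None


def delete(instance, broker):
    """Toy delete API: mark the instance, then ask compute to terminate it."""
    if instance.task_state == "deleting":
        # A delete is already recorded, so the request is dropped.
        return "ignored"
    instance.task_state = "deleting"
    try:
        broker.cast(("terminate_instance", instance.uuid))
    except ConnectionError:
        # task_state was already saved; nothing resets it.
        return "cast lost"
    return "cast sent"


broker = Broker()
inst = Instance("198b4ce6-84cc-4a45-8eda-d0b3262f3df4")
broker.up = False                      # step 2: broker stopped mid-delete
first = delete(inst, broker)           # task_state is set, message is lost
broker.up = True
second = delete(inst, broker)          # step 3: later request is ignored
```

In this model the terminate message never reaches the queue, so the instance stays in 'deleting' until something outside the API path, such as a nova-compute restart, picks it up again.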

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/397373
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=01e2c5c05bb488f4d5c41a6f61bd4b3328cc5ed2
Submitter: Jenkins
Branch: master

commit 01e2c5c05bb488f4d5c41a6f61bd4b3328cc5ed2
Author: jichenjc <email address hidden>
Date: Tue Nov 15 02:38:39 2016 +0800

    Add handle for 2 exceptions in force_delete

    as force_delete is same to delete action, we need handle
    InstanceNotFound and InstanceCellUnknown exception.

    Change-Id: I1840f8f4ac1b793fd6348b4d056cb94e1333e596
    Related-Bug: 1641523
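
The shape of the change described in this commit message can be sketched as follows. This is an illustration under assumptions, not the merged patch: the exception and HTTP classes are local stand-ins, the exception names are taken from the commit message, and mapping both to a 404 response with a 202 on success is assumed by analogy with the regular delete action.

```python
class InstanceNotFound(Exception):
    """Stand-in for the nova exception named in the commit message."""


class InstanceCellUnknown(Exception):
    """Stand-in for the nova exception named in the commit message."""


class HTTPNotFound(Exception):
    """Stand-in for an HTTP 404 error response."""
    status = 404


def force_delete_action(compute_api, context, instance):
    """Hypothetical API handler: translate known errors into HTTP errors."""
    try:
        compute_api.force_delete(context, instance)
    except (InstanceNotFound, InstanceCellUnknown) as e:
        # Without this handler the exception would surface as an
        # "Unexpected API Error" (HTTP 500).
        raise HTTPNotFound(str(e))
    return 202  # assumed "accepted" status for the action
```

Note that this only covers the two exceptions named in the commit. The InstanceInvalidState error from the original report is a separate case that this sketch does not handle.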

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (stable/newton)

Related fix proposed to branch: stable/newton
Review: https://review.openstack.org/400651

Matt Riedemann (mriedem)
Changed in nova:
assignee: Rajesh Tailor (ratailor) → jichenjc (jichenjc)
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (stable/newton)

Reviewed: https://review.openstack.org/400651
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a339593469c6a929d49b25ae1e303dc1f7472a4b
Submitter: Jenkins
Branch: stable/newton

commit a339593469c6a929d49b25ae1e303dc1f7472a4b
Author: jichenjc <email address hidden>
Date: Tue Nov 15 02:38:39 2016 +0800

    Add handle for 2 exceptions in force_delete

    as force_delete is same to delete action, we need handle
    InstanceNotFound and InstanceCellUnknown exception.

    Change-Id: I1840f8f4ac1b793fd6348b4d056cb94e1333e596
    Related-Bug: 1641523
    (cherry picked from commit 01e2c5c05bb488f4d5c41a6f61bd4b3328cc5ed2)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/440536

Sean Dague (sdague) wrote :

Automatically discovered version newton in description. If this is incorrect, please update the description to include 'nova version: ...'

tags: added: openstack-version.newton
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Sean Dague (<email address hidden>) on branch: master
Review: https://review.openstack.org/397542
Reason: This review is > 4 weeks without comment, and is not mergeable in its current state. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Ghanshyam Mann (ghanshyammann) wrote :

This is a duplicate of https://bugs.launchpad.net/nova/+bug/1741000

It is fixed by https://review.openstack.org/#/c/530879/ and the fix has been backported as far back as stable/pike.

Marking as a duplicate of 1741000.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Balazs Gibizer (<email address hidden>) on branch: master
Review: https://review.opendev.org/440536
Reason: This is a pretty old patch with a negative review. Feel free to restore it (or ask gibi on IRC to restore it) if you are still working on it.
