Instances stuck in deleting task_state require n-cpu restart to remove

Bug #1299139 reported by Matt Riedemann
This bug affects 5 people
Affects: OpenStack Compute (nova) | Status: New | Importance: Undecided | Assigned to: Unassigned | Milestone: none

Bug Description

Bug 1248563 "Instance deletion is prevented when another component locks up" was partially fixed by https://review.openstack.org/#/c/55444/, but that change introduces another problem: subsequent delete requests are ignored.

When doing Tempest third-party CI runs, we see instances fail to build (a scheduling/resource problem, a timeout, whatever), get stuck in the 'deleting' task_state, and never get cleaned up.

The patch even says:

"Dealing with delete requests that never got executed is not in scope of this change and will be submitted separately."

That's the bug reported here. For example, this is several hours after our Tempest run finished:

http://paste.openstack.org/show/74584/

There is also some history after patch 55444 merged: we had this revert of a revert, https://review.openstack.org/#/c/70187/, which was itself reverted again later because it was causing race failures in the Hyper-V CI:

https://review.openstack.org/#/c/71363/

So there is a lot of half-baked code here. I haven't been able to get a response from Stan on bug 1248563, but it basically boils down to this: the original change 55444 depended on some later changes working, and those were ultimately reverted because they caused race conditions that broke the gate.

I would propose that, at least for icehouse-rc1, we revert the original patch, since it's not a complete solution and introduces another bug.
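To make the failure mode concrete, here is a minimal toy sketch (not nova's actual code; the names `request_delete` and `Instance` are invented for illustration) of the guard that change 55444 introduced: once task_state is 'deleting', later delete requests are silently ignored, so if the first delete is lost the instance is stuck forever.

```python
DELETING = 'deleting'


class Instance:
    """Toy stand-in for a nova instance record."""
    def __init__(self):
        self.task_state = None
        self.deleted = False


def request_delete(instance, compute_alive=True):
    # Guard (sketch of change 55444): assume a delete is already in
    # flight and ignore the request -- this is the reported bug.
    if instance.task_state == DELETING:
        return 'ignored'
    instance.task_state = DELETING
    if compute_alive:
        # terminate_instance runs normally on the compute node.
        instance.deleted = True
        instance.task_state = None
        return 'deleted'
    # The cast to compute was lost; nothing will ever finish the delete.
    return 'accepted'


inst = Instance()
print(request_delete(inst, compute_alive=False))  # 'accepted', but lost
print(request_delete(inst))                       # 'ignored'
print(request_delete(inst))                       # 'ignored' -- stuck
```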

Tags: api
Revision history for this message
Matt Riedemann (mriedem) wrote :

Further, this was the patch to cleanup instances stuck in 'deleting' task_state:

https://review.openstack.org/#/c/55660/

So the workaround here is that you have to restart the compute service, which is not an ideal solution.
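A rough sketch of why the restart works (hypothetical function name; this only approximates what nova's startup cleanup does): on startup the compute manager walks its instances and re-runs the termination path for any still marked 'deleting'.

```python
def init_host(instances):
    """Sketch: on service startup, finish any delete left in flight."""
    cleaned = []
    for inst in instances:
        if inst['task_state'] == 'deleting':
            # Re-run the termination path that the lost request never ran.
            inst['deleted'] = True
            inst['task_state'] = None
            cleaned.append(inst['uuid'])
    return cleaned


stuck = [{'uuid': 'abc', 'task_state': 'deleting', 'deleted': False},
         {'uuid': 'def', 'task_state': None, 'deleted': False}]
print(init_host(stuck))  # only the stuck instance gets cleaned up
```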

Changed in nova:
importance: Undecided → High
tags: added: icehouse-rc-potential
Revision history for this message
Matt Riedemann (mriedem) wrote :

Note that the reset-state API doesn't help: the instance just ends up in error/deleting state, and you still can't force the delete; you have to restart the compute service.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Tracing the first bad instance I have in that paste, 1d278e77-9900-4b2b-85eb-3a6aac4198c1, I see the delete request in the nova api log (attached):

2014-03-27 17:25:33.431 12851 INFO nova.osapi_compute.wsgi.server [req-06c80607-8eef-409e-a29c-152d693635ad 7bdd970e41c14774b4dd8f4e0e92d05c 5ab6bffec8ae44beb1b69750a0e8ea00] 9.5.48.212 "DELETE /v2/5ab6bffec8ae44beb1b69750a0e8ea00/servers/1d278e77-9900-4b2b-85eb-3a6aac4198c1 HTTP/1.1" status: 204 len: 198 time: 0.5162849

But I never see any terminate_instance happening in the compute log for that instance uuid (compute log attached).

The VM is no longer on the hypervisor either, so the driver did delete it, but something blew up somewhere; I'm not sure why I can't find that in the logs.

Revision history for this message
Matt Riedemann (mriedem) wrote :
tags: removed: icehouse-rc-potential
Changed in nova:
importance: High → Undecided
Revision history for this message
Matt Riedemann (mriedem) wrote :

This is also related to bug 1296414: if you bomb out hard and go over quota, you can delete the instances from the database and restart n-cpu to clear them automatically on startup, but the quotas aren't cleaned up to match, so you can't boot any more instances at that point.
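A toy sketch of that quota mismatch (hypothetical `Quotas` class; not nova's real quota API): deleting instance rows straight from the database bypasses the usage decrement, so the quota stays consumed even though no instances exist.

```python
class OverQuota(Exception):
    pass


class Quotas:
    """Toy quota tracker: usage is only decremented via release()."""
    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def reserve(self):
        if self.in_use >= self.limit:
            raise OverQuota('instances quota exceeded')
        self.in_use += 1

    def release(self):
        self.in_use -= 1


q = Quotas(limit=2)
q.reserve()
q.reserve()  # two instances booted; usage == limit
# Deleting the instance rows directly from the DB never calls release(),
# so usage stays at 2 and the next boot is rejected as over-quota.
try:
    q.reserve()
except OverQuota as e:
    print('boot rejected:', e)
```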

summary: - Instances stuck in deleting task_state never cleaned up
+ Instances stuck in deleting task_state require n-cpu restart to remove
Revision history for this message
jichenjc (jichenjc) wrote :

Take e927334d-7a07-487c-9125-d2ddab9a8441 as an example.

I can see the claim succeed, but I can't see 'Instance spawned successfully.' (which would indicate a successful spawn) in the compute log. It looks to me like something went wrong in the build process on the PowerVM, so the instance stays stuck in the build state. That is why you see 'During sync_power_state the instance has a pending task. Skip.' for this instance in every subsequent period in the log.

I don't have any other detailed info to refer to, so I don't know what happened during the spawn; that is what we need to look at and handle.

My guess is that because the spawn is stuck in the build state, the delete does not succeed and the instance.terminated_at field in the DB is never updated.

This is a rough guess from the logs I can read, so let me know your opinion.

Revision history for this message
Darren Carpenter (wdarrenc) wrote :

I've run into the same issue in a Newton environment. The delete was sent about an hour or two before I took a look; "virsh list --all" shows the instance shut off:

[1] instance completes build

nova-compute.log:2018-03-14 20:57:52.304 46562 INFO nova.compute.manager [req-22d52f8e-e85d-4f64-b4ba-e938f4c49ed7 71aa816389589b718658b342803f9e3e9ab0baaae13597cf918694e0dbd6c0b9 a918bbf8b58a4d5bb12d7287b56fbb87 - - -] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] Took 18.21 seconds to build instance.

[2] delete call is sent

/var/log/nova/nova-compute.log:2018-03-14 20:57:32.293 22377 INFO nova.compute.manager [req-22d52f8e-e85d-4f64-b4ba-e938f4c49ed7 71aa816389589b718658b342803f9e3e9ab0baaae13597cf918694e0dbd6c0b9 a918bbf8b58a4d5bb12d7287b56fbb87 - - -] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] Terminating instance

[3] View from the compute (shows the compute choking, and the restart of the compute service to complete the delete)

nova-compute.log:2018-03-15 14:35:00.533 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 14:45:18.541 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 14:55:27.558 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 14:56:10.768 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] VM Stopped (Lifecycle Event)
nova-compute.log:2018-03-15 14:56:10.899 46562 INFO nova.compute.manager [req-0af61757-8dc4-4770-b7e3-69404ed31acd - - - - -] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 15:05:31.645 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 15:15:58.511 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 15:26:07.537 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 15:36:10.533 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 15:46:39.554 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 15:57:12.604 46562 INFO nova.compute.manager [-] [instance: 2d404a4e-569f-4ad3-9f81-c00fd6045d2d] During sync_power_state the instance has a pending task (deleting). Skip.
nova-compute.log:2018-03-15 16:07:34.577...

