VMs cannot be terminated if compute host is dead

Bug #872899 reported by Gavin B
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Joe Gordon

Bug Description

We have seen this issue in a Diablo-2 setup. If a compute server is down (nova-compute not running / node crashed) the VMs hosted on that server cannot be terminated - hence are consuming instance/memory/floating_ip/ ... quota. A temporary crash / halt can be fixed easily enough by a host reboot, but a permanent host death is not so easy to fix.

We need to have some way of updating the DB to wipe an instance even if the appropriate host is not contactable - and of having nodes check on boot if "their" VMs are still all there.

Version = 2011.3-d2 + some bug fixes.

Tags: folsom-rc1 hp
Changed in nova:
status: New → Confirmed
importance: Undecided → Low
Joe Gordon (jogo) wrote :

What do you expect to happen when:

a) nova-compute stops working, but the physical machine is up

b) the nova-compute server dies

And we can't necessarily differentiate the two.

semy (semyazz) wrote :

Why does this bug have "Low" priority? I think it's critical for users. The instance hangs in Rebooting/Deleting. OpenStack should remove instances from dead hosts and show a proper warning to users, or even run snapshots of the deleted instances on other hosts. Or something like that.

Thierry Carrez (ttx)
Changed in nova:
importance: Low → Medium
Tiago Mello (timello)
Changed in nova:
assignee: nobody → Tiago Rodrigues de Mello (tmello)
Tiago Mello (timello)
Changed in nova:
assignee: Tiago Rodrigues de Mello (tmello) → nobody
Tong Li (litong01) wrote :

At the moment, the only way I can figure out is to remove the records related to the dead VM from the Nova DB.
I remember there was a thread a while back discussing this issue. The problem seems to be that there is no way to distinguish whether the VM itself died or its host died. There were also some problems with VM status; it was a long discussion. My proposal is to add a command to nova-manage that removes the DB records, so that removing a VM is entirely up to the person who runs the command, and he or she is responsible for determining the real cause.

Sam Stoelinga (sammiestoel) wrote :

Couldn't we do a check using nova.utils.service_is_up(service)? If the service is not up, remove the record from the DB.

service = db.service_get_by_host(instance['host'])
service_is_up(service)
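
A check of this kind typically compares the service's last heartbeat with the current time. The following is a self-contained sketch, not nova's actual code: the dict layout, the is_service_up name, and the 60-second threshold are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold: a service silent for longer than this is treated as down.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def is_service_up(service, now=None):
    """Return True if the service reported a heartbeat recently enough."""
    now = now or datetime.now(timezone.utc)
    # Fall back to the creation time if the service has never checked in.
    last_heartbeat = service.get("updated_at") or service["created_at"]
    return abs(now - last_heartbeat) <= SERVICE_DOWN_TIME
```

A stale heartbeat cannot tell a crashed host apart from a stopped nova-compute process, which is the ambiguity raised earlier in this thread.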

Joe Gordon (jogo) wrote :

Sam, if a compute node goes down for a finite period of time, we want to leave the record in the DB so that the VMs can potentially be recovered when the compute node powers up.

Tong, adding a command to nova-manage to remove records sounds like a good compromise.

Sam Stoelinga (sammiestoel) wrote :

Hi Joe,

Thanks for replying! I'm still a newbie with OpenStack and learning a lot about it, but looking at the terminate_instance functionality, I can see that the instance record is destroyed anyway:

self.db.instance_destroy(context, instance_uuid) is called in ComputeManager._delete_instance after the instance is shut down and its volumes are cleaned up.

_delete_instance is called from terminate_instance, so the record is in fact not left in the DB, or maybe it is. But I think we should also call self.db.instance_destroy if the host is not up; it just means we don't have to shut down the instance, because it wasn't functioning anyway.

My reasoning: the user is terminating the instance, so why not remove the record from the database? That is what happens anyway when you terminate an instance; the only difference is that we don't have to shut it down.

One flaw in this approach: if the host comes up again, the resources (volumes) would still need to be cleaned up, I guess?
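
The flow proposed in this comment could be sketched roughly as follows. This is an illustration only, not nova code: terminate_instance here is a standalone function, and shutdown and destroy_record are hypothetical callables standing in for the hypervisor call and the database delete.

```python
def terminate_instance(instance, host_is_up, shutdown, destroy_record):
    """Terminate an instance, skipping the shutdown step when its host is down.

    Returns True if the host-side cleanup ran, False if only the record
    was removed.
    """
    if host_is_up:
        # Normal path: stop the VM and clean up its volumes on the host first.
        shutdown(instance)
        cleaned = True
    else:
        # Host unreachable: nothing can be shut down, so only the record goes.
        cleaned = False
    destroy_record(instance["uuid"])
    return cleaned
```

A False return value marks exactly the case raised in the last paragraph above: resources on the host may still need cleanup if it ever comes back.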

Joe Gordon (jogo)
Changed in nova:
assignee: nobody → Joe Gordon (joe-gordon0)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/12231

Changed in nova:
status: Confirmed → In Progress
Joe Gordon (jogo) wrote :

Sam, I am working on a patch to do exactly what you outlined.

Sam Stoelinga (sammiestoel) wrote :

Hehe nice job, I also wanna get started on contributing :) Just looked at your patch.

Joe Gordon (jogo)
tags: added: folsom-rc1
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/12231
Committed: http://github.com/openstack/nova/commit/77dd6a0b37652bc163d4ad3083e29af55f2b9a5d
Submitter: Jenkins
Branch: master

commit 77dd6a0b37652bc163d4ad3083e29af55f2b9a5d
Author: Joe Gordon <email address hidden>
Date: Fri Aug 31 00:04:33 2012 +0000

    Allow for deleting VMs from down compute nodes.

    Fix bug 872899

    If compute node service_is_up returns false, just delete the VM from
    the database. If compute node recovers, setting
    running_deleted_instance_action=reap will clean up the node.

    Change-Id: Ibb5f1e22c2e482d304c59a485a04b882ead0c67d
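
The recovery half of this commit message can be pictured as follows. This is a sketch under assumptions, not the code from the commit: the function name, its arguments, and the "log" and "noop" values are illustrative; only the running_deleted_instance_action option and its reap value are named above.

```python
def handle_running_deleted(local_uuids, deleted_uuids, action, destroy, log=print):
    """Handle VMs still running on this host whose DB records are deleted.

    local_uuids: UUIDs the hypervisor reports as running on this host.
    deleted_uuids: UUIDs marked as deleted in the database.
    destroy: hypothetical callable that stops a VM and frees its resources.
    """
    orphans = sorted(set(local_uuids) & set(deleted_uuids))
    for uuid in orphans:
        if action == "reap":
            destroy(uuid)
        elif action == "log":
            log("instance %s is running but deleted in the database" % uuid)
        # Any other value (for example "noop") leaves the VM untouched.
    return orphans
```

With the action set to reap, a recovered node would remove the VMs that were deleted from the database while it was down.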

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → folsom-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: folsom-rc1 → 2012.2