vm would be stuck in unshelving when unshelve fails

Bug #1248799 reported by hougangliu
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Low
Assigned to: Guangya Liu (Jay Lau)

Bug Description

When unshelving a VM that has been offloaded, the process involves re-scheduling.
In nova/conductor/manager.py, in def unshelve_instance(self, context, instance):

        elif instance.vm_state == vm_states.SHELVED_OFFLOADED:
            try:
                with compute_utils.EventReporter(context, self.db,
                        'get_image_info', instance.uuid):
                    image = self._get_image(context,
                            sys_meta['shelved_image_id'])
            except exception.ImageNotFound:
                with excutils.save_and_reraise_exception():
                    LOG.error(_('Unshelve attempted but vm_state not SHELVED '
                                'or SHELVED_OFFLOADED'), instance=instance)
                    instance.vm_state = vm_states.ERROR
                    instance.save()

            filter_properties = {}
            # NOTE: this re-scheduling call can raise an exception; when it
            # does, the instance stays stuck in task_state "unshelving" forever.
            hosts = self._schedule_instances(context, image,
                                             filter_properties, instance)
            host = hosts.pop(0)['host']
            self.compute_rpcapi.unshelve_instance(context, instance, host,
                    image)
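
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not nova code: `FakeInstance`, `NoValidHost`, and `schedule_instances` are hypothetical stand-ins, and it assumes `task_state` was already set to "unshelving" before the conductor code runs.

```python
# Simplified stand-ins; none of these are nova's real classes or functions.
class NoValidHost(Exception):
    """Raised by the fake scheduler when no host can take the instance."""


class FakeInstance:
    def __init__(self):
        self.vm_state = "shelved_offloaded"
        # Assumption: task_state is set before the conductor is invoked.
        self.task_state = "unshelving"

    def save(self):
        pass  # a real object would persist its fields here


def schedule_instances(instance):
    # Hypothetical scheduler that never finds a host.
    raise NoValidHost("no host available")


def unshelve_instance(instance):
    # Mirrors the unguarded flow: nothing resets task_state on failure.
    hosts = schedule_instances(instance)
    return hosts.pop(0)["host"]


instance = FakeInstance()
try:
    unshelve_instance(instance)
except NoValidHost:
    pass

print(instance.task_state)  # unshelving
```

Because the exception propagates before any state is reset, the instance keeps its transient task state after the call has failed.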

hougangliu (liuhoug)
information type: Private Security → Public
description: updated
hougangliu (liuhoug) wrote :

I tried to fix it with a try-except-else block, as shown below:

        elif instance.vm_state == vm_states.SHELVED_OFFLOADED:
            try:
                with compute_utils.EventReporter(context, self.db,
                        'get_image_info', instance.uuid):
                    image = self._get_image(context,
                            sys_meta['shelved_image_id'])
            except exception.ImageNotFound:
                with excutils.save_and_reraise_exception():
                    LOG.error(_('Unshelve attempted but vm_state not SHELVED '
                                'or SHELVED_OFFLOADED'), instance=instance)
                    instance.vm_state = vm_states.ERROR
                    instance.save()

            try:
                filter_properties = {}
                hosts = self._schedule_instances(context, image,
                                             filter_properties, instance)
                host = hosts.pop(0)['host']
            except Exception:
                instance.task_state = None
                instance.save()
                return
            else:
                self.compute_rpcapi.unshelve_instance(context, instance, host,
                        image)

Changed in nova:
assignee: nobody → Jay Lau (jay-lau-513)
Andrew Laski (alaski) wrote :

hougangliu: There should be some better handling of scheduling failures, but I think your proposal needs to do a little more to make it visible to the user that something went wrong. Adding a 'with compute_utils.EventReporter(...)' block will at least expose the error to an end user.
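
To illustrate the idea of wrapping the scheduling step in an event-reporting block, here is a minimal sketch of a context manager that records the start and the outcome of an action. `EventRecorder` is a hypothetical stand-in and does not reproduce the interface or behaviour of nova's `compute_utils.EventReporter`; the event name and the in-memory `events` list are assumptions made for the example.

```python
# Hypothetical stand-in; not nova's compute_utils.EventReporter.
class EventRecorder:
    def __init__(self, events, name):
        self.events = events
        self.name = name

    def __enter__(self):
        self.events.append((self.name, "start"))
        return self

    def __exit__(self, exc_type, exc, tb):
        # Record the outcome so a failure is visible to whoever reads events.
        result = "success" if exc_type is None else "error: %s" % exc
        self.events.append((self.name, result))
        return False  # do not swallow the exception


def failing_schedule():
    raise RuntimeError("no valid host")


events = []
try:
    with EventRecorder(events, "schedule_instances"):
        failing_schedule()
except RuntimeError:
    pass

print(events)
# [('schedule_instances', 'start'), ('schedule_instances', 'error: no valid host')]
```

The point of the pattern is that the failure is recorded even when the caller goes on to handle the exception quietly.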

Changed in nova:
importance: Undecided → Low
Andrew Laski (alaski)
Changed in nova:
status: New → Confirmed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/58086

Changed in nova:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/58086
Committed: http://github.com/openstack/nova/commit/c8ded6429616b798947072a62cb1b5ee4ea51209
Submitter: Jenkins
Branch: master

commit c8ded6429616b798947072a62cb1b5ee4ea51209
Author: Jay Lau <email address hidden>
Date: Tue Nov 26 05:55:10 2013 +0800

    instance state will be stuck in unshelving when unshelve fails

    When unshelving an instance that has been offloaded, the conductor
    manager re-schedules the instance. If re-scheduling fails to find a
    target host for the unshelved instance, the instance is stuck in
    the unshelving state.

    This patch fixes the issue as follows: if re-scheduling fails to
    find a target host for unshelving the instance, the conductor
    manager will try to roll back the instance to unshelve state.

    Change-Id: If49b2d2c9263b853c745ce25fe146ade51948123
    Closes-Bug: #1248799
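
The rollback behaviour described in the commit message can be sketched as follows. This is not the merged patch. It assumes that "rollback" means clearing `task_state` and saving the instance, as in the proposal earlier in this report, and all names (`FakeInstance`, `NoValidHost`, `unshelve_offloaded`, `no_hosts`, `one_host`) are hypothetical.

```python
# Hypothetical stand-ins; this does not reproduce the merged nova change.
class NoValidHost(Exception):
    """Raised when scheduling cannot find a target host."""


class FakeInstance:
    def __init__(self):
        self.task_state = "unshelving"
        self.saved = False

    def save(self):
        self.saved = True  # a real object would persist its fields here


def unshelve_offloaded(instance, schedule):
    """Return the chosen host, or None after rolling back task_state."""
    try:
        hosts = schedule(instance)
        host = hosts.pop(0)["host"]
    except NoValidHost:
        # Assumed rollback: clear the transient task state and persist it.
        instance.task_state = None
        instance.save()
        return None
    return host


def no_hosts(instance):
    raise NoValidHost("no host available")


def one_host(instance):
    return [{"host": "compute-1"}]


failed = FakeInstance()
print(unshelve_offloaded(failed, no_hosts), failed.task_state)  # None None

ok = FakeInstance()
print(unshelve_offloaded(ok, one_host))  # compute-1
```

Catching only the scheduling failure, rather than every exception, keeps unrelated errors visible while still leaving the instance in a state from which the operation can be retried.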

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → 2014.1