task_state is not reset when instance fails to build

Bug #1241117 reported by Matt Riedemann
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Won't Fix
Importance: Undecided
Assigned to: Matt Riedemann
Milestone: (none listed)

Bug Description

We had an instance hit the instance_build_timeout during deploy.

The instance is put into the ERROR state, but its task state remains 'initializing'. Shouldn't the task state be reset in this case?

2013-10-16 07:38:40.306 5339 WARNING nova.compute.manager [-] [instance: 1f8049da-550a-4174-bed8-0d22f3fb0b0c] NV-81716D6 Instance build timed out. Set to error state.

    @periodic_task.periodic_task
    def _check_instance_build_time(self, context):
        """Ensure that instances are not stuck in build."""
        timeout = CONF.instance_build_timeout
        if timeout == 0:
            return

        filters = {'vm_state': vm_states.BUILDING,
                   'host': self.host}
        building_insts = self.conductor_api.instance_get_all_by_filters(
            context, filters, columns_to_join=[])

        for instance in building_insts:
            if timeutils.is_older_than(instance['created_at'], timeout):
                self._set_instance_error_state(context, instance['uuid'])
                LOG.warn(_("Instance build timed out. Set to error state."),
                         instance=instance)

http://paste.openstack.org/show/48656/

IRC discussion with some history on the code:

(12:36:46 PM) mriedem: mrodden: looks like it's been that way for a long long time
(12:36:46 PM) mriedem: https://github.com/openstack/nova/blame/master/nova/compute/manager.py#L485
(12:39:28 PM) mriedem: bnemec: any ideas? ^
(12:40:50 PM) mriedem: mrodden: looks like expected task state is None when you rebuild from an ERROR state
(12:40:50 PM) mriedem: https://github.com/openstack/nova/blob/master/nova/compute/api.py#L2033
(12:40:52 PM) mriedem: so looks like a bug
(12:40:59 PM) mriedem: 1 line fix at least :)
(12:43:19 PM) mriedem: mrodden: furthermore, looks like the unit tests aren't validating the vm_state/task_state passed to set_instance_error_state, they just take **kwargs
(12:43:20 PM) mriedem: def fake_set_instance_error_state(_ctxt, instance_uuid, **kwargs):
(12:43:26 PM) mriedem: easy fix
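
The test gap mentioned in the log (a fake that swallows **kwargs without checking them) could be closed with a fake that records its arguments. This is only a sketch: RecordingErrorStateFake and assert_error_state_call are hypothetical names, not nova's test code, and the vm_state/task_state keyword arguments are assumed from the discussion above.

```python
# Hypothetical replacement for a fake that silently accepts **kwargs.
# Not nova's actual test code; names and kwargs are illustrative only.
class RecordingErrorStateFake:
    """Records every call so a test can assert on the exact arguments."""

    def __init__(self):
        self.calls = []

    def __call__(self, _ctxt, instance_uuid, **kwargs):
        # Keep the uuid and keyword arguments instead of discarding them.
        self.calls.append((instance_uuid, kwargs))


def assert_error_state_call(fake, instance_uuid, expected_kwargs):
    """Fail unless the fake was called exactly once with these arguments."""
    assert len(fake.calls) == 1, fake.calls
    called_uuid, called_kwargs = fake.calls[0]
    assert called_uuid == instance_uuid
    assert called_kwargs == expected_kwargs
```

A test would install the fake in place of the real method, run the code under test, and then call assert_error_state_call with the states it expects to have been passed.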

Matt Riedemann (mriedem)
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/52519

Changed in nova:
status: New → In Progress
Matt Riedemann (mriedem)
Changed in nova:
status: In Progress → Won't Fix
Matt Riedemann (mriedem) wrote :

This is essentially working as designed based on this change:

https://github.com/openstack/nova/commit/99c51e34230394cadf0b82e364ea10c38e193979

You can only rebuild an instance in error state if it was successfully launched once before.

The task_state should be left as-is when a failure occurs so that we can tell what action was being performed when the instance errored, e.g. building/scheduling or migrating. If we set the task_state to None when the vm_state goes to ERROR, we lose that information and can't recover from it.
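
A toy model of that behaviour, for illustration only. The Instance class, set_error_state, and can_rebuild are hypothetical and are not nova's implementation; the rebuild rule is assumed from the IRC note that the expected task state is None when rebuilding from an ERROR state.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical stand-in for an instance record; not nova's model.
@dataclass
class Instance:
    vm_state: str
    task_state: Optional[str]


def set_error_state(instance: Instance) -> None:
    """Move the instance to ERROR but leave task_state untouched."""
    instance.vm_state = "error"


def can_rebuild(instance: Instance) -> bool:
    """Assumed rule: rebuild requires an allowed vm_state and no pending task."""
    return (instance.vm_state in ("active", "stopped", "error")
            and instance.task_state is None)


# A build that timed out keeps its task_state, so rebuild is refused
# and the preserved task_state shows what was in progress.
failed_build = Instance(vm_state="building", task_state="spawning")
set_error_state(failed_build)

# An instance that launched successfully and errored later has no
# pending task, so rebuild is allowed.
launched_once = Instance(vm_state="active", task_state=None)
set_error_state(launched_once)
```

In this sketch, can_rebuild(failed_build) is False and can_rebuild(launched_once) is True, which matches the rule that only an instance launched successfully at least once can be rebuilt from the error state.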

Word is that the vm_state/task_state system is going to be revisited in the Icehouse release. Several task-related summit sessions have been proposed, and this one looks related:

https://etherpad.openstack.org/p/IcehouseTaskAPI
