virt drivers' resume_state_on_host_boot don't handle migrating instances

Bug #1131588 reported by Chris Behrens
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Dan Smith
Milestone: 2013.1

Bug Description

nova-compute's init_host() can attempt to resume instances that should be in the RUNNING power state but are found not to be.

For xenapi, the driver just tries to 'power on' the instance.
For libvirt, it tries to do a hard reboot.

Neither method accounts for the 'resize_migrating' task state, where (at least in XenAPI) the instance would have been renamed to append '-orig' to it. The driver will raise NotFound because it can't find the instance.

nova-compute's init_host() will catch all Exceptions and set the instance to ERROR. But since we already have code to clean up the destination (it looks at all instances in the driver, checks whether their 'host' changed, and destroys them if so), we could potentially clean up the source as well. Rename the instance back, clear task_state, and restart the old instance?
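To make the proposal concrete, here is a rough sketch of the source-side cleanup. It is not nova code: FakeDriver, rename(), and init_instance() are hypothetical stand-ins, and instances are plain dicts. It only assumes what is described above, namely that a migrating instance is stored under its name plus '-orig' and that any failure during init marks the instance as ERROR.

```python
class NotFound(Exception):
    """Stand-in for the driver's 'VM not found' error."""


class FakeDriver:
    """Hypothetical in-memory driver that maps VM names to power states."""

    def __init__(self, vms):
        self.vms = dict(vms)

    def power_on(self, name):
        if name not in self.vms:
            raise NotFound("Could not find VM with name %s" % name)
        self.vms[name] = "running"

    def rename(self, old, new):
        if old not in self.vms:
            raise NotFound("Could not find VM with name %s" % old)
        self.vms[new] = self.vms.pop(old)


def init_instance(driver, instance):
    """Resume one instance at startup, reverting a half-done resize first."""
    name = instance["name"]
    try:
        if instance["task_state"] == "resize_migrating":
            # Assumed revert path: undo the '-orig' rename, then clear task_state.
            driver.rename(name + "-orig", name)
            instance["task_state"] = None
        driver.power_on(name)
        instance["vm_state"] = "active"
    except Exception:
        # Same catch-all as described above: mark ERROR instead of crashing.
        instance["vm_state"] = "error"
    return instance


driver = FakeDriver({"instance-1-orig": "halted"})
inst = {"name": "instance-1", "task_state": "resize_migrating", "vm_state": "active"}
init_instance(driver, inst)
```

Without the resize_migrating branch, power_on() would look up "instance-1", fail with NotFound, and the instance would end up in ERROR.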

Chris Behrens (cbehrens)
Changed in nova:
importance: Undecided → High
milestone: none → grizzly-rc1
status: New → Triaged
Chris Behrens (cbehrens) wrote :

Trace for XenAPI:

2013-02-22 05:29:04 9736 TRACE nova File "/opt/rackstack/1.31/nova/lib/python2.6/site-packages/nova-2013.1-py2.6.egg/nova/service.py", line 394, in start
2013-02-22 05:29:04 9736 TRACE nova self.manager.init_host()
2013-02-22 05:29:04 9736 TRACE nova File "/opt/rackstack/1.31/nova/lib/python2.6/site-packages/nova-2013.1-py2.6.egg/nova/compute/manager.py", line 468, in init_host
2013-02-22 05:29:04 9736 TRACE nova self._init_instance(context, instance)
2013-02-22 05:29:04 9736 TRACE nova File "/opt/rackstack/1.31/nova/lib/python2.6/site-packages/nova-2013.1-py2.6.egg/nova/compute/manager.py", line 442, in _init_instance
2013-02-22 05:29:04 9736 TRACE nova block_device_info)
2013-02-22 05:29:04 9736 TRACE nova File "/opt/rackstack/1.31/nova/lib/python2.6/site-packages/nova-2013.1-py2.6.egg/nova/virt/xenapi/driver.py", line 607, in resume_state_on_host_boot
2013-02-22 05:29:04 9736 TRACE nova self._vmops.power_on(instance)
2013-02-22 05:29:04 9736 TRACE nova File "/opt/rackstack/1.31/nova/lib/python2.6/site-packages/nova-2013.1-py2.6.egg/nova/virt/xenapi/vmops.py", line 1296, in power_on
2013-02-22 05:29:04 9736 TRACE nova vm_ref = self._get_vm_opaque_ref(instance)
2013-02-22 05:29:04 9736 TRACE nova File "/opt/rackstack/1.31/nova/lib/python2.6/site-packages/nova-2013.1-py2.6.egg/nova/virt/xenapi/vmops.py", line 691, in _get_vm_opaque_ref
2013-02-22 05:29:04 9736 TRACE nova instance['name'])
2013-02-22 05:29:04 9736 TRACE nova NotFound: Could not find VM with name instance-4834bbfa-cb0b-4b29-8da2-b55d9f4dea38

At this point the compute process dies... because it was just in the init phase.

The instance's name in the driver at this moment was actually instance-4834bbfa-cb0b-4b29-8da2-b55d9f4dea38-orig because it was being migrated when compute was restarted.

Chris Behrens (cbehrens) wrote :

Ignore the above traceback. It was based on old code that didn't have an 'except Exception' catch in init_host().
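For anyone comparing the two behaviours, a toy illustration of the difference follows. This is not the nova implementation: init_host_old(), init_host_new(), and resume() are hypothetical names, and instances are plain dicts. It only shows why the old code aborted startup while the newer code leaves the instance in ERROR.

```python
class NotFound(Exception):
    """Stand-in for the driver's 'VM not found' error."""


def resume(instance):
    # Simulates the driver failing to find a VM that was renamed mid-migration.
    raise NotFound("Could not find VM with name %s" % instance["name"])


def init_host_old(instances):
    # Older behaviour: nothing catches the error, so startup aborts.
    for instance in instances:
        resume(instance)


def init_host_new(instances):
    # Newer behaviour: a failure marks that instance ERROR and startup continues.
    for instance in instances:
        try:
            resume(instance)
        except Exception:
            instance["vm_state"] = "error"


instances = [
    {"name": "instance-a", "vm_state": "active"},
    {"name": "instance-b", "vm_state": "active"},
]
init_host_new(instances)
```

Either way the migrating instance is not recovered, which is what the updated summary is about.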

summary: - init_host() crash on instances in mid-migrate
+ virt drivers' resume_state_on_host_boot don't handle migrating instances
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/22897

Changed in nova:
assignee: nobody → Dan Smith (danms)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/22897
Committed: http://github.com/openstack/nova/commit/a9d0313cc7d36a9e5e60aa7710b9ce8ec2f37f19
Submitter: Jenkins
Branch: master

commit a9d0313cc7d36a9e5e60aa7710b9ce8ec2f37f19
Author: Dan Smith <email address hidden>
Date: Mon Feb 25 16:16:19 2013 -0500

    Make compute manager revert crashed migrations on init_host()

    This should provide a path to cleanup partial migrations by reverting
    them back to the original state instead of leaving them in limbo. Note
    that libvirt needs a change to make this work, in the case where the
    unconfirmed instance's directory is still present and needs to be
    removed before the rollback.

    The libvirt, xenapi and powervm drivers are modified here, and tests
    are added to confirm their compliance with the proper behavior.

    Addresses bug 1131588

    Change-Id: I85bc0f6e9cda10aa85328199d107a3ff6e240b96

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: grizzly-rc1 → 2013.1