instance artefacts are not removed by libvirt driver if it fails to spawn

Bug #1626230 reported by Paul Carlton
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Low
Assigned to: Vladyslav Drok

Bug Description

When an instance is evacuated, an attempt is made to rebuild it on a different host. If the driver's spawn method fails and raises an exception, the instance is placed in an error state. However, the instance is still recorded as being on the source node and, depending on how far through the spawn the driver got, instance-related files may be present on the target and the instance may even be running there.

The XenAPI driver cleans up the instance artefacts if spawn fails, but the
libvirt driver does not.

In the case where compute nodes do not use shared storage, a subsequent attempt to evacuate the instance to the same target will fail because the instance directory is already present.
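
For illustration only, the retry failure boils down to a check along these lines; the function name and the exception are assumptions for the sketch, not the actual Nova code:

    import os

    # Illustrative sketch: a leftover instance directory on the target makes a
    # repeated evacuation attempt fail before anything is rebuilt.
    def check_no_leftover_instance_dir(instances_path, instance_uuid):
        instance_dir = os.path.join(instances_path, instance_uuid)
        if os.path.exists(instance_dir):
            # Nova raises a driver-specific exception here; a generic error
            # stands in for it in this sketch.
            raise RuntimeError('instance directory %s already exists on the '
                               'target host' % instance_dir)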

Using reset-state and then evacuating to another node will allow the instance to be evacuated successfully. However, the 'orphaned' files and the running instance on the original target will need to be cleaned up manually.

We could update the instance's host field once the claim is complete on the target. In that case, in the event of a failure to spawn, the instance will effectively have been evacuated, so the files on the original host will be cleaned up when that node is restored.
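
A minimal sketch of that idea, assuming the rebuild/evacuate path in the compute manager; the helper function and its arguments are illustrative, not the actual patch:

    # Sketch only: after the resource claim on the target succeeds, record the
    # instance as belonging to the target before calling the driver, so that a
    # later spawn failure leaves the source host free to clean up its files.
    def reassign_instance_to_target(instance, target_host, target_node):
        # 'instance' is expected to be a nova.objects.Instance; host, node and
        # save() are existing fields/methods on that object.
        instance.host = target_host
        instance.node = target_node
        instance.save()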

However, maybe we should address this by bringing the libvirt driver into line
with the XenAPI driver and getting it to clean up resources associated with
an instance that fails to spawn? Will raise a blueprint for this.
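
A rough sketch of what that blueprint could look like, assuming the libvirt driver's existing cleanup() method is reusable for this purpose (illustrative only, not a reviewed change):

    # Sketch only: wrap spawn so any failure triggers the driver's own
    # cleanup, mirroring what the XenAPI driver already does on spawn failure.
    def spawn_with_cleanup(driver, context, instance, image_meta,
                           injected_files, admin_password,
                           network_info=None, block_device_info=None):
        try:
            driver.spawn(context, instance, image_meta, injected_files,
                         admin_password, network_info=network_info,
                         block_device_info=block_device_info)
        except Exception:
            # Whether to pass destroy_disks=True is exactly the open question
            # discussed below: at this point the driver cannot tell if the
            # disks live on shared storage.
            driver.cleanup(context, instance, network_info,
                           block_device_info=block_device_info,
                           destroy_disks=True)
            raise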

Changed in nova:
assignee: nobody → Paul Carlton (paul-carlton2)
status: New → In Progress
summary: - evacuate leaves instance on target compute node if it fails to spawn
+ instance artefacts are not removed by libvirt driver if it fails to
+ spawn
description: updated
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Okay, to be honest, at first glance I thought it was not really a bugfix, because the evacuate operation is an admin policy and reset-state exists exactly to reconcile the VM state so that an operator can still fix it.

The real problem I see with setting the instance.host field before calling the driver is that all of our API actions do that *after* the driver is called, which would mean a different behaviour.

See, I'm torn. Sure, we could fix that specific issue and create tech debt, but I'm more in favor of making the libvirt driver more robust so that it can delete the temporary resources it created in case of any error. Like you said, that rather looks like a blueprint to me.

Changed in nova:
status: In Progress → Opinion
Revision history for this message
Paul Carlton (paul-carlton2) wrote :

Looking at this further, I think there is a way to fix the immediate issue with evacuate by catching the exception raised by spawn in the compute manager and calling the driver again to destroy the instance and delete the files (if they are not on shared storage). This can be achieved by passing the on_shared_storage setting to _rebuild_default_impl in the compute manager, catching the spawn error, and cleaning up if it is an evacuate, as follows:

            try:
                self.driver.spawn(context, instance, image_meta, injected_files,
                                  admin_password, network_info=network_info,
                                  block_device_info=new_block_device_info)
            except Exception:
                # Only clean up when this is an evacuate (recreate), and only
                # delete the disks when they are not on shared storage.
                if recreate:
                    self.driver.destroy(context, instance, network_info,
                                        new_block_device_info,
                                        destroy_disks=not on_shared_storage)
                raise

Changing the libvirt driver to clean up instances when spawn fails will not work. The trouble is that the driver's spawn doesn't know whether it is part of a rebuild for evacuate, a normal rebuild, a boot or an unshelve, and worse still it doesn't know whether the instance files are on shared storage. Also, libvirt users are currently used to a failed boot/rebuild leaving the instance partially booted. A valid operator action when an instance is in the error state after a spawn failure is to look at the compute manager log and, if the instance is defined but not running, try to start it; in the past I've found this has helped me work out why it failed to spawn.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/375623

Changed in nova:
status: Opinion → In Progress
Revision history for this message
Sylvain Bauza (sylvain-bauza) wrote :

Given your last paragraph, I'm wondering whether it's really a bug, but we could probably just discuss that in the Gerrit change. For the moment, setting the importance to Low.

Changed in nova:
importance: Undecided → Low
tags: added: libvirt rebuild
tags: added: live-migration
tags: removed: live-migration
Changed in nova:
assignee: Paul Carlton (paul-carlton2) → Vladyslav Drok (vdrok)