[Libvirt] Evacuate failure may cause disk image to be deleted

Bug #1550919 reported by leehom on 2016-02-28
This bug affects 11 people
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Matthew Booth

Bug Description

I checked the latest nova source on the master branch; this problem still exists.

When we evacuate an instance, _do_rebuild_instance is eventually called.
Because rebuild is not implemented in the libvirt driver, _rebuild_default_impl is what actually runs.

        try:
            with instance.mutated_migration_context():
                self.driver.rebuild(**kwargs)
        except NotImplementedError:
            # NOTE(rpodolyaka): driver doesn't provide specialized version
            # of rebuild, fall back to the default implementation
            self._rebuild_default_impl(**kwargs)
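
The fallback shown above can be illustrated in isolation. The following is a minimal sketch, not nova code: BaseDriver, ComputeManager and do_rebuild are hypothetical stand-ins, and only the try/except NotImplementedError shape is taken from the excerpt.

```python
class BaseDriver:
    """Hypothetical driver that, like libvirt here, has no rebuild()."""

    def rebuild(self, **kwargs):
        # A driver without a specialized rebuild signals it this way.
        raise NotImplementedError()

    def spawn(self, instance):
        return f"spawned {instance}"


class ComputeManager:
    """Hypothetical manager mirroring the try/except fallback above."""

    def __init__(self, driver):
        self.driver = driver
        self.calls = []  # records which rebuild path was taken

    def _rebuild_default_impl(self, instance):
        # The default path ends up booting the instance through spawn().
        self.calls.append("default")
        return self.driver.spawn(instance)

    def do_rebuild(self, instance):
        try:
            result = self.driver.rebuild(instance=instance)
            self.calls.append("driver")
            return result
        except NotImplementedError:
            # Driver has no specialized rebuild: use the default path.
            return self._rebuild_default_impl(instance)


manager = ComputeManager(BaseDriver())
result = manager.do_rebuild("vm-1")
```

In this sketch every rebuild on such a driver goes through spawn(), so any cleanup that spawn() performs on failure also applies to evacuate.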

_rebuild_default_impl calls self.driver.spawn to boot the instance, and spawn in turn calls _create_domain_and_network.
When a VirtualInterfaceCreateException or Timeout occurs there, self.cleanup is called:

        except exception.VirtualInterfaceCreateException:
            # Neutron reported failure and we didn't swallow it, so
            # bail here
            with excutils.save_and_reraise_exception():
                if guest:
                    guest.poweroff()
                self.cleanup(context, instance, network_info=network_info,
                             block_device_info=block_device_info)
        except eventlet.timeout.Timeout:
            # We never heard from Neutron
            LOG.warn(_LW('Timeout waiting for vif plugging callback for '
                         'instance %(uuid)s'), {'uuid': instance.uuid},
                     instance=instance)
            if CONF.vif_plugging_is_fatal:
                if guest:
                    guest.poweroff()
                self.cleanup(context, instance, network_info=network_info,
                             block_device_info=block_device_info)
                raise exception.VirtualInterfaceCreateException()

The default value of the destroy_disks parameter is True:
    def cleanup(self, context, instance, network_info, block_device_info=None,
                destroy_disks=True, migrate_data=None, destroy_vifs=True):

So if an error occurs during evacuate while nova is waiting for Neutron's event, the instance's disk file is deleted unexpectedly.
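
The failure path described in this report can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions: FakeDriver, VifPlugError, try_evacuate and the destroy_disks_on_failure argument are hypothetical names invented for illustration, a Python set stands in for disk files on shared storage, and nothing here is claimed to be the fix that was eventually proposed.

```python
class VifPlugError(Exception):
    """Stand-in for VirtualInterfaceCreateException (hypothetical)."""


class FakeDriver:
    def __init__(self, disks):
        # 'disks' models shared storage: names of instances whose
        # disk files currently exist.
        self.disks = disks

    def cleanup(self, instance, destroy_disks=True):
        # Same default as the cleanup() signature quoted in the report.
        if destroy_disks:
            self.disks.discard(instance)

    def spawn(self, instance, destroy_disks_on_failure=True):
        # Simulate the vif-plugging failure: clean up, then re-raise.
        try:
            raise VifPlugError("no vif-plugged event received")
        except VifPlugError:
            self.cleanup(instance, destroy_disks=destroy_disks_on_failure)
            raise


def try_evacuate(driver, instance, **spawn_kwargs):
    """Return True if spawn succeeded, False if vif plugging failed."""
    try:
        driver.spawn(instance, **spawn_kwargs)
    except VifPlugError:
        return False
    return True


# Case 1: default cleanup arguments, as in the reported code path.
shared = {"vm-1"}
evacuated_default = try_evacuate(FakeDriver(shared), "vm-1")
disk_survives_default = "vm-1" in shared

# Case 2: hypothetical caller that asks cleanup to keep the disks.
shared_kept = {"vm-1"}
evacuated_kept = try_evacuate(
    FakeDriver(shared_kept), "vm-1", destroy_disks_on_failure=False)
disk_survives_kept = "vm-1" in shared_kept
```

In both cases the evacuation fails, but only the first one loses the disk. On shared storage that disk is the instance's only copy, which is why the default is dangerous in this path.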

leehom (feli5) on 2016-02-28
Changed in nova:
assignee: nobody → leehom (feli5)
Matt Riedemann (mriedem) on 2016-03-03
tags: added: evacuate libvirt rebuild
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

Are you using shared storage or local disks? In the case of evacuate/rebuild, we're completely rebuilding the instance from the old image ref, flavor, network information and attached volumes (if any). So I'm not sure there is any guarantee (or point) that if nova fails to spawn the rebuilt instance it should keep the disk it created as part of the rebuild.

tags: added: compute
Changed in nova:
status: Triaged → Incomplete
Matt Riedemann (mriedem) wrote :

In fact, there is a preserve_ephemeral flag passed to the _rebuild_default_impl method, and if it is true the rebuild fails, because by design rebuild can't rebuild ephemeral storage. That storage is presumably gone, which is why you're rebuilding the instance on another host (in the case of evacuate).

Fix proposed to branch: master
Review: https://review.openstack.org/288109

Changed in nova:
assignee: leehom (feli5) → Matt Riedemann (mriedem)
status: Incomplete → In Progress
Matt Riedemann (mriedem) wrote :

I've posted a patch here: https://review.openstack.org/#/c/288109/

Please let me know if that resolves your issue (it assumes you have disks on shared storage).

leehom (feli5) wrote :

I'm using shared storage, and I will verify the patch.

Matt Riedemann (mriedem) wrote :

Can you be more specific? Shared storage with Ceph, NFS, GlusterFS, other?

leehom (feli5) wrote :

Hi Matt.

It's NFS.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/288109

Matt Riedemann (mriedem) on 2016-05-13
Changed in nova:
assignee: Matt Riedemann (mriedem) → nobody
status: In Progress → New

Solving inconsistency: changing bug status from "New" to "Confirmed" as it has assignee.

Changed in nova:
status: New → Confirmed
Shunli Zhou (shunliz) wrote :

The disk image is deleted on Ceph as well; I tested it on Ceph.

Fix proposed to branch: master
Review: https://review.openstack.org/578846

Changed in nova:
assignee: nobody → Matthew Booth (mbooth-9)
status: Confirmed → In Progress

Related fix proposed to branch: master
Review: https://review.openstack.org/602174

Related fix proposed to branch: master
Review: https://review.openstack.org/604400

Reviewed: https://review.openstack.org/591733
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bbe88786fc90c2106f9fae0156ee7b09ece9a83b
Submitter: Zuul
Branch: master

commit bbe88786fc90c2106f9fae0156ee7b09ece9a83b
Author: Matthew Booth <email address hidden>
Date: Tue Aug 14 16:05:11 2018 +0100

    Add regression test for bug 1550919

    This adds a failing test, which we fix in change I76448196.

    Related-Bug: #1550919
    Change-Id: I5619728d5bd684e9167495dd4550ee4f5fbb87a7
