libvirt migrate/resize on shared storage can cause data loss

Bug #1177247 reported by Rafi Khardalian
This bug affects 2 people
Affects                   Status        Importance  Assigned to      Milestone
OpenStack Compute (nova)  Fix Released  High        Dan Smith
Grizzly                   Fix Released  High        Rafi Khardalian

Bug Description

When hypervisors share storage, the libvirt driver's resize/migrate operations can destroy instance data. Most of the steps that create a copy of the instance run inside a single try/except block, so if any of them fails, the exception handler runs the following cleanup:

=== code ===

        except Exception:
            with excutils.save_and_reraise_exception():
                self._cleanup_remote_migration(dest, inst_base,
                                               inst_base_resize)

    def _cleanup_remote_migration(self, dest, inst_base, inst_base_resize):
        """Used only for cleanup in case migrate_disk_and_power_off fails."""
        try:
            if os.path.exists(inst_base_resize):
                utils.execute('rm', '-rf', inst_base)
                utils.execute('mv', inst_base_resize, inst_base)
                utils.execute('ssh', dest, 'rm', '-rf', inst_base)
        except Exception:
            pass

=== end ===

It does not take long to see why this is a problem on shared storage. There, the destination host sees the same inst_base path as the source, so the final ssh command in the block above deletes the original instance directory that the preceding mv has just restored.
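
To make the failure mode concrete, here is a minimal local sketch, not the actual nova code. It assumes the "remote" rm can be modelled as a callable (the hypothetical `remote_rm`, standing in for `utils.execute('ssh', dest, 'rm', '-rf', inst_base)`) and that on shared storage this callable acts on the very same path. The `shared_storage` parameter and the `demo` helper are illustrative names introduced here.

```python
import os
import shutil
import tempfile


def cleanup_failed_migration(inst_base, inst_base_resize, remote_rm,
                             shared_storage=False):
    """Restore the original instance directory after a failed migration.

    remote_rm is a hypothetical stand-in for the rm executed on the
    destination host over ssh.
    """
    if not os.path.exists(inst_base_resize):
        return
    # Drop the partial copy and move the original back into place.
    shutil.rmtree(inst_base, ignore_errors=True)
    os.rename(inst_base_resize, inst_base)
    # On shared storage the destination's inst_base is the directory we
    # just restored, so the remote rm must be skipped.
    if not shared_storage:
        remote_rm(inst_base)


def demo(shared_storage):
    """Simulate a failed migration where both hosts see the same path."""
    root = tempfile.mkdtemp()
    inst_base = os.path.join(root, "instance-0001")
    inst_base_resize = inst_base + "_resize"
    os.makedirs(inst_base_resize)
    with open(os.path.join(inst_base_resize, "disk"), "w") as f:
        f.write("original data")
    os.makedirs(inst_base)  # partially copied new instance directory

    # Shared storage: the "remote" rm hits the same local path.
    def remote_rm(path):
        shutil.rmtree(path, ignore_errors=True)

    cleanup_failed_migration(inst_base, inst_base_resize, remote_rm,
                             shared_storage=shared_storage)
    survived = os.path.exists(os.path.join(inst_base, "disk"))
    shutil.rmtree(root, ignore_errors=True)
    return survived
```

Calling `demo(False)` mimics the reported behaviour (the restored disk is deleted), while `demo(True)` shows the data surviving once the cleanup knows the storage is shared.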

The issue can be easily reproduced by issuing a resize of an instance with a large root disk. In the middle of the resize, kill the ssh process created from the following call (https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L3508) and observe the exception handler destroying everything.

Rafi Khardalian (rkhardalian) wrote :

I've got a patch ready to be submitted.

description: updated
Changed in nova:
assignee: nobody → Rafi Khardalian (rkhardalian)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/28424

Changed in nova:
status: New → In Progress
Changed in nova:
assignee: Rafi Khardalian (rkhardalian) → Dan Smith (danms)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/28424
Committed: http://github.com/openstack/nova/commit/9290bddd9f270d8ea4fbd6d953a8634473979cd5
Submitter: Jenkins
Branch: master

commit 9290bddd9f270d8ea4fbd6d953a8634473979cd5
Author: Rafi Khardalian <email address hidden>
Date: Sun May 5 22:18:33 2013 +0000

    Make resize/migrated shared storage aware

    Fixes bug 1177247

    Added some logic to check for whether or not we are on a shared
    filesystem and set shared_storage accordingly. We perform similar
    checks in other areas of the code, typically through RPC calls.
    However, all the resize/migrate code is slated to be refactored for
    Havana, so the idea was to keep this patch as minimally intrusive as
    possible.

    When shared_storage is true, we pass that on to the cleanup call
    so that it no longer executes an rm via SSH, which was ultimately
    destroying the original instance directory.

    Change-Id: Ie9decedd373c000211c171df64e1e96fe78e5081
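
One way the shared-filesystem check described in the commit message could work is to create a marker file from the destination side and see whether it becomes visible locally. The sketch below is an assumption about the approach, not the code from the commit: `is_shared_storage`, `remote_touch` and `remote_rm` are hypothetical names, with the two callables standing in for commands run on the destination host (for example over ssh). Errors from the remote commands are left to propagate so the caller can decide what to do.

```python
import os
import tempfile
import uuid


def is_shared_storage(inst_base, remote_touch, remote_rm):
    """Return True if a file created on the destination shows up locally."""
    marker = os.path.join(inst_base, uuid.uuid4().hex + ".tmp")
    remote_touch(marker)
    if os.path.exists(marker):
        # Same filesystem on both hosts: clean up locally.
        os.unlink(marker)
        return True
    # The marker only exists on the destination: remove it there.
    remote_rm(marker)
    return False


def demo():
    """Exercise both cases using local directories as stand-in hosts."""
    local = tempfile.mkdtemp()
    other = tempfile.mkdtemp()  # pretend this is the destination's disk

    def touch(path):
        open(path, "w").close()

    def remap(path):
        # Map a source path onto the separate "destination" directory.
        return os.path.join(other, os.path.basename(path))

    shared = is_shared_storage(local, touch, os.unlink)
    separate = is_shared_storage(local,
                                 lambda p: touch(remap(p)),
                                 lambda p: os.unlink(remap(p)))
    return shared, separate
```

The result would then be passed to the cleanup call so that the rm over ssh is skipped when storage is shared.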

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-1
status: Fix Committed → Fix Released
tags: added: grizzly-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/32768

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/grizzly)

Reviewed: https://review.openstack.org/32768
Committed: http://github.com/openstack/nova/commit/d34d4cacf7b20f72c67f7873dcf2c372abc60ecd
Submitter: Jenkins
Branch: stable/grizzly

commit d34d4cacf7b20f72c67f7873dcf2c372abc60ecd
Author: Rafi Khardalian <email address hidden>
Date: Sun May 5 22:18:33 2013 +0000

    Make resize/migrated shared storage aware

    Fixes bug 1177247 (for stable/grizzly)

    Added some logic to check for whether or not we are on a shared
    filesystem and set shared_storage accordingly. We perform similar
    checks in other areas of the code, typically through RPC calls.
    However, all the resize/migrate code is slated to be refactored for
    Havana, so the idea was to keep this patch as minimally intrusive as
    possible.

    When shared_storage is true, we pass that on to the cleanup call
    so that it no longer executes an rm via SSH, which was ultimately
    destroying the original instance directory.

    Change-Id: Ie9decedd373c000211c171df64e1e96fe78e5081
    Cherry-Pick: 9290bddd9f270d8ea4fbd6d953a8634473979cd5

tags: added: in-stable-grizzly
Alan Pevec (apevec)
tags: removed: grizzly-backport-potential in-stable-grizzly
Changed in nova:
importance: Undecided → High
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-1 → 2013.2