Block migrate with attached volumes copies volumes to themselves

Bug #1398999 reported by Chris St. Pierre on 2014-12-03
This bug affects 14 people
Affects                       Importance   Assigned to
OpenStack Compute (nova)      High         Pawel Koniszewski
  Juno                        Undecided    Unassigned
libvirt (Ubuntu)              High         Unassigned
  Trusty                      Undecided    Unassigned
  Utopic                      Undecided    Unassigned
  Vivid                       Undecided    Unassigned
  Wily                        High         Unassigned
nova (Ubuntu)                 High         Unassigned
  Trusty                      Medium       Unassigned
  Utopic                      Undecided    Unassigned
  Vivid                       High         Unassigned
  Wily                        High         Unassigned

Bug Description

When an instance with attached Cinder volumes is block migrated, the Cinder volumes are block migrated along with it. If they exist on shared storage, then they end up being copied, over the network, from themselves to themselves. At a minimum, this is horribly slow and de-sparses a sparse volume; at worst, this could cause massive data corruption.

More details at http://lists.openstack.org/pipermail/openstack-dev/2014-June/038152.html

Fix proposed to branch: master
Review: https://review.openstack.org/139085

Changed in nova:
assignee: nobody → Chris St. Pierre (stpierre)
status: New → In Progress

Fix proposed to branch: master
Review: https://review.openstack.org/141832

Change abandoned by Chris St. Pierre (<email address hidden>) on branch: master
Review: https://review.openstack.org/141832

Dr. Jens Harbott (j-harbott) wrote :

Instead of disabling live migration in this case, as proposed by your patch, it may be an option to set the volumes on shared storage as "shareable" in the libvirt definition. We have been using that approach for our RBD backed volumes for some months now quite successfully, see https://github.com/cloudbau/nova/commit/b5e2a8ecd53341f7ad16fcc789cc40222272e72c for our patch.

We did some basic performance comparison and there does not seem to be any major impact, though this may need some further analysis.
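For illustration only: in libvirt domain XML, the "shareable" marking is a <shareable/> child of the <disk> element. The sketch below is not the linked cloudbau patch. It uses Python's standard xml.etree.ElementTree on a made-up domain definition; the function name mark_network_disks_shareable, the sample XML, and the choice to target every disk with type='network' are assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down domain XML: one local disk and one RBD volume.
DOMAIN_XML = """
<domain type='kvm'>
  <devices>
    <disk type='file' device='disk'>
      <source file='/var/lib/nova/instances/example/disk'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    <disk type='network' device='disk'>
      <source protocol='rbd' name='volumes/volume-example'/>
      <target dev='vdb' bus='virtio'/>
    </disk>
  </devices>
</domain>
"""


def mark_network_disks_shareable(domain_xml):
    """Return domain XML with <shareable/> added to every network disk."""
    root = ET.fromstring(domain_xml)
    for disk in root.findall("./devices/disk"):
        # Skip local disks and disks that already carry the marker.
        if disk.get("type") == "network" and disk.find("shareable") is None:
            ET.SubElement(disk, "shareable")
    return ET.tostring(root, encoding="unicode")


patched_xml = mark_network_disks_shareable(DOMAIN_XML)
```

Whether the hypervisor then skips such disks during block migration depends on the libvirt and qemu versions in use, which is the point debated in the rest of this thread.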

Chris St. Pierre (stpierre) wrote :

I'd still be hesitant about that since Berrangé addressed that in his post to the ML: "Even that distinction [sharable vs. exclusive] is somewhat dubious and so not reliably what you would want."

I really think that at this point the important thing is to ensure that we don't copy volumes onto themselves over the network. Once we've removed the opportunity for extremely slow data corruption, then we can consider optional/possible ways to handle live migrations with volumes attached. But I think that we can demonstrate that, for now at least, the only solution that will work for everyone using libvirt is to disable these live migrations entirely.

The proposed solution seems to block not just libvirt but all other hypervisors from being able to live-migrate with volumes. I feel that the solution has to be in the hypervisor/volume driver space.

I suggest a flag that enables your patch by default but gives people an opportunity to override if desired.

Sean Dague (sdague) on 2015-01-27
tags: added: libvirt

Reviewed: https://review.openstack.org/139085
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d667b6a63e80b2f8d6311c2cf224ba32628eed84
Submitter: Jenkins
Branch: master

commit d667b6a63e80b2f8d6311c2cf224ba32628eed84
Author: Chris St. Pierre <email address hidden>
Date: Wed Dec 3 16:16:34 2014 -0600

    libvirt: Fail when live block migrating instance with volumes

    This raises an exception when attempting to live block migrate (nova
    live-migration --block-migrate) an instance with attached volumes.
    libvirt copies these volumes from themselves to themselves. At a
    minimum, this is horribly slow and de-sparses a sparse volume; at
    worst, this could cause massive data corruption.

    Closes-Bug: 1398999
    Change-Id: Ibcd423976bb9fea46e3e1cb23cc8e5cd944d8fc2
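A minimal sketch of the kind of guard this commit message describes, assuming a pre-migration check that knows whether block migration was requested and which volumes are attached. The exception class and function name are stand-ins for illustration and are not taken from the nova change itself.

```python
class MigrationPreCheckError(Exception):
    """Stand-in for the error a compute driver would raise (assumed name)."""


def check_block_migration_allowed(block_migration, attached_volumes):
    """Refuse a block migration when the instance has attached volumes.

    block_migration: True when --block-migrate was requested.
    attached_volumes: list of attached volume identifiers (may be empty).
    """
    if block_migration and attached_volumes:
        raise MigrationPreCheckError(
            "Cannot block migrate an instance with %d attached volume(s)"
            % len(attached_volumes)
        )
```

Failing early, before any data moves, is what removes the risk of a volume being copied onto itself; the cost is that the migration is refused even on setups where it could have worked.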

Changed in nova:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/156666
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a390a2f257402d6b380774acaa0607a65abb4df4
Submitter: Jenkins
Branch: master

commit a390a2f257402d6b380774acaa0607a65abb4df4
Author: Daniel P. Berrange <email address hidden>
Date: Tue Feb 17 17:10:49 2015 +0000

    libvirt: switch LibvirtConnTestCase back to NoDBTestCase

    The following commit changed LibvirtConnTestCase to inherit
    from TestCase

      commit d667b6a63e80b2f8d6311c2cf224ba32628eed84
      Author: Chris St. Pierre <email address hidden>
      Date: Wed Dec 3 16:16:34 2014 -0600

        libvirt: Fail when live block migrating instance with volumes

    This caused database setup to be performed once more, doubling
    the test execution time.

    Related-bug: #1398999
    Change-Id: Ibad5bf4704a424f73d0e28e7f889ca3db24f7b7e

Joe Gordon (jogo) wrote :

I think there is a valid case for doing block migrate with a cinder volume attached to an instance:

* Cloud isn't using a shared filesystem for ephemeral storage
* Instance is booted from an image, and a volume is attached afterwards. An admin wants to take the box the instance is running on offline for maintenance with a minimal impact to the instances running on it.

The 'fix' was a workaround, not an actual fix. It sounds like a fix is needed in libvirt first.

http://lists.openstack.org/pipermail/openstack-dev/2015-March/059324.html

Changed in nova:
status: Fix Committed → Confirmed
Chris Friesen (cbf123) wrote :

Would this also affect an instance that is boot-from-volume but where the instance files are on local storage? Or do we even support that scenario?

Pavel Boldin (pboldin) wrote :

Neither `libvirt' nor `qemu' copies block devices marked as `shared'. This is either nova misbehaving by not marking shared block devices as such, or a libvirt bug that ignores the mark.

Dr. Jens Harbott (j-harbott) wrote :

@Pavel: The flag is called "shareable", and there was some discussion in which some people claimed that this would be misusing the flag. We do run pretty well with a patch setting that flag in our local setup though (see comment #4), targeted only at the ceph/rbd case.

Jacek Nykis (jacekn) wrote :

Is there a bug we can track where root cause is being worked on?

Jacek Nykis (jacekn) wrote :

Sorry for 2nd comment. Will you update icehouse as well?

Dr. Jens Harbott (j-harbott) wrote :

https://bugzilla.redhat.com/show_bug.cgi?id=1203032 is the bug for adding support to libvirt.

Icehouse is pretty near to EOL and I don't think that this issue will be deemed critical enough for a backport even to Juno.

Jacek Nykis (jacekn) wrote :

The ubuntu wiki says icehouse will be supported for 4 more years:
https://wiki.ubuntu.com/ServerTeam/CloudArchive

If there is a chance of data loss I think it's completely justified to have the workaround backported to LTS

Pavel Boldin (pboldin) wrote :

@DrJens, I'm already working on an implementation for that bug.

I have a few tests to run before sending the patchset to the mailing list for review.

Yet, there is a problem that NBD tunnelled migration is not supported.

Jacek Nykis (jacekn) wrote :

I raised LP1449096 asking for Ubuntu nova package to get the workaround

Reviewed: https://review.openstack.org/176768
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=98834ab9f745d53dd3bf40e486e4b8f14f1fd47e
Submitter: Jenkins
Branch: stable/juno

commit 98834ab9f745d53dd3bf40e486e4b8f14f1fd47e
Author: Chris St. Pierre <email address hidden>
Date: Wed Dec 3 16:16:34 2014 -0600

    libvirt: Fail when live block migrating instance with volumes

    This raises an exception when attempting to live block migrate (nova
    live-migration --block-migrate) an instance with attached volumes.
    libvirt copies these volumes from themselves to themselves. At a
    minimum, this is horribly slow and de-sparses a sparse volume; at
    worst, this could cause massive data corruption.

    (cherry picked from commit d667b6a63e80b2f8d6311c2cf224ba32628eed84)

    Closes-Bug: 1398999
    Change-Id: Ibcd423976bb9fea46e3e1cb23cc8e5cd944d8fc2

tags: added: in-stable-juno
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
James Page (james-page) on 2015-07-15
Changed in nova (Ubuntu Vivid):
status: New → Triaged
Changed in nova (Ubuntu Wily):
status: Confirmed → Triaged
Changed in nova (Ubuntu Utopic):
status: New → Triaged
Changed in nova (Ubuntu Trusty):
status: New → Triaged
tags: added: live-migrate
Pavel Boldin (pboldin) wrote :

The libvirt code providing selective block migration for NBD (non-tunnelled) migration has been merged: https://bugzilla.redhat.com/show_bug.cgi?id=1203032

How can we progress on this issue?

Tony Breeds (o-tony) wrote :

@pboldin Thanks so much for doing that work.

I think we can now check the libvirt version and only raise the exception if libvirt < 1.2.17
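A rough sketch of such a version gate. It assumes the integer encoding libvirt uses when reporting its version (major * 1,000,000 + minor * 1,000 + release); the function names and the constant are hypothetical and not taken from nova.

```python
# First libvirt release with selective block migration, per this thread.
MIN_SELECTIVE_BLOCK_MIGRATION = (1, 2, 17)


def decode_libvirt_version(encoded):
    """Decode major * 1,000,000 + minor * 1,000 + release into a tuple."""
    major, rest = divmod(encoded, 1000000)
    minor, release = divmod(rest, 1000)
    return (major, minor, release)


def must_refuse_block_migration(encoded_version, has_volumes):
    """Return True when the migration should still be rejected."""
    too_old = decode_libvirt_version(encoded_version) < MIN_SELECTIVE_BLOCK_MIGRATION
    return bool(has_volumes and too_old)
```

A plain version comparison like this cannot see distribution backports, so a package that carries the feature under an older version number would still be refused.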

James Page (james-page) on 2015-09-01
Changed in nova (Ubuntu Trusty):
importance: Undecided → High
Changed in nova (Ubuntu Wily):
importance: Undecided → High
Changed in nova (Ubuntu Vivid):
importance: Undecided → High
James Page (james-page) on 2015-09-01
Changed in nova (Ubuntu Utopic):
status: Triaged → Won't Fix
Changed in libvirt (Ubuntu Utopic):
status: New → Won't Fix
Serge Hallyn (serge-hallyn) wrote :

Looks like the patches to fix this (at
https://bugzilla.redhat.com/show_bug.cgi?id=1203032#c11) are in
1.2.17, but not in 1.2.16 which is currently in wily.

Changed in libvirt (Ubuntu Wily):
assignee: nobody → Serge Hallyn (serge-hallyn)
importance: Undecided → High
status: New → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.2.16-2ubuntu9

---------------
libvirt (1.2.16-2ubuntu9) wily; urgency=medium

  * Add upstream patches implementing a '--migrate-disks' option to virsh
    migrate to specify block devices to migrate. (LP: #1398999)

 -- Serge Hallyn <email address hidden> Fri, 04 Sep 2015 09:29:52 -0500

Changed in libvirt (Ubuntu Wily):
status: In Progress → Fix Released
Bartosz Fic (bartosz-fic) wrote :

I've tested this libvirt fix on a simple multinode devstack setup with 1 controller and 2 compute nodes.
Both compute nodes run libvirt 1.2.16.

After removing the check added by this patch, https://review.openstack.org/#/c/176768/, I still cannot successfully complete block migration of a VM with an attached volume.

However, the same instance booted from an image without any volume attached is block migrated successfully.

Serge Hallyn (serge-hallyn) wrote :

@bartosz-fic

So the libvirt bug for wily should still be marked as not fix released?

You said you are on "1.2.16" - to be sure, were you on 1.2.16-2ubuntu9 or later?

If so, do you have any idea which patches are still missing? The upstream patchset which was supposed to fix this was included with that release, so I wonder whether the bug is actually still present upstream.

Dr. Jens Harbott (j-harbott) wrote :

I think there is some confusion here. As I understand it, the part that was fixed in libvirt was changing the API so that it is now possible to define a subset of block devices to be copied during migration. To fix the original issue, another patch in nova will be needed that uses this extended API to avoid copying shared block devices to themselves.
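To make the nova-side step concrete, here is a small sketch of choosing which block devices to hand to the extended API. The input format (a list of dicts with "target" and "is_volume" keys) and both function names are invented for illustration. The comma-separated form is an assumption about how a disk list would be passed to an option like virsh's --migrate-disks.

```python
def disks_to_migrate(disks):
    """Return target device names of disks that are not attached volumes.

    Attached volumes are reachable from the destination host already,
    so only local (ephemeral) disks need to be copied.
    """
    return [disk["target"] for disk in disks if not disk["is_volume"]]


def as_disk_list_argument(disks):
    """Join the selected device names into one comma-separated string."""
    return ",".join(disks_to_migrate(disks))


# Hypothetical instance: a local root disk, one volume, a local swap disk.
EXAMPLE_DISKS = [
    {"target": "vda", "is_volume": False},
    {"target": "vdb", "is_volume": True},
    {"target": "vdc", "is_volume": False},
]
```

With this input only vda and vdc are selected, and the volume on vdb is left out of the copy.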

Serge Hallyn (serge-hallyn) wrote :

Ah right - thanks.

Fix proposed to branch: master
Review: https://review.openstack.org/227278

Changed in nova:
assignee: Chris St. Pierre (stpierre) → Bartosz Fic (bartosz-fic)
status: Confirmed → In Progress
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libvirt (Ubuntu Trusty):
status: New → Confirmed
Changed in libvirt (Ubuntu Vivid):
status: New → Confirmed
Changed in nova:
importance: Undecided → High
Bartosz Fic (bartosz-fic) wrote :

The selective block device migration feature was backported to libvirt 1.2.16 for Ubuntu Wily.
This patch provides block live migration of a VM booted from an image with attached devices on libvirt 1.2.16.

The attachment "Patch for Ubuntu willy with libvirt 1.2.16" seems to be a patch. If it isn't, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are a member of the ~ubuntu-reviewers, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issues please contact him.]

tags: added: patch
Changed in nova:
assignee: Bartosz Fic (bartosz-fic) → Pawel Koniszewski (pawel-koniszewski)
Paul Murray (pmurray) on 2015-11-06
tags: added: live-migration
removed: live-migrate
Changed in libvirt (Ubuntu):
assignee: Serge Hallyn (serge-hallyn) → nobody

Reviewed: https://review.openstack.org/252506
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f0d5fc61916f41214da580097a09136e4ed2c99a
Submitter: Jenkins
Branch: master

commit f0d5fc61916f41214da580097a09136e4ed2c99a
Author: Pawel Koniszewski <email address hidden>
Date: Fri Dec 11 03:28:50 2015 +0100

    Get list of disks to copy early to avoid multiple DB hits

    To support selective block migration we need to read block devices
    from nova block device mappings instead of libvirt block info.
    It means that in current implementation we would call
    _live_migration_copy_disk_paths two times - from
    live_migration_operations and from live_migration_monitor.
    To avoid that, this change gets disk paths early and passes them as
    an additional parameter to live migration monitor.

    Change-Id: Ic894cfc7374ba06b436b2a76a5984012d1dba3a5
    Related-bug: #1398999

Changed in libvirt (Ubuntu Wily):
assignee: Serge Hallyn (serge-hallyn) → nobody

Reviewed: https://review.openstack.org/227278
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=23fd0389f0e23e7969644079f4b1ad8504cbb8cb
Submitter: Jenkins
Branch: master

commit 23fd0389f0e23e7969644079f4b1ad8504cbb8cb
Author: Pawel Koniszewski <email address hidden>
Date: Wed Feb 10 13:09:44 2016 +0100

    Allow block live migration of an instance with attached volumes

    Since libvirt 1.2.17 it is possible to select which block devices
    should be migrated to destination host. Block devices that are not
    provided will not be migrated. It means that it is possible to
    exclude volumes from block migration and therefore prevent volumes
    from being copied to themselves.

    This patch implements new check of libvirt version. If version is
    higher or equal to 1.2.17 it is possible to block live migrate vm
    with attached volumes.

    Co-Authored-By: Bartosz Fic <email address hidden>

    Change-Id: I8fcc3ef3cb5d9fd3a95067929c496fdb5976fd41
    Closes-Bug: #1398999
    Partially implements: blueprint block-live-migrate-with-attached-volumes

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 13.0.0.0b3 development milestone.

James Page (james-page) on 2016-09-08
Changed in nova (Ubuntu Vivid):
status: Triaged → Won't Fix
James Page (james-page) on 2016-09-08
Changed in nova (Ubuntu Wily):
status: Triaged → Won't Fix

Reviewed: https://review.openstack.org/459316
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1032c79238e87259cef6963e43c78c39eeeb2cde
Submitter: Jenkins
Branch: master

commit 1032c79238e87259cef6963e43c78c39eeeb2cde
Author: Matt Riedemann <email address hidden>
Date: Mon Apr 24 09:54:21 2017 -0400

    Enable test_iscsi_volume in live migration job

    The block_migrate_cinder_iscsi config option in Tempest says
    the libvirt driver doesn't support live migration with an attached
    volume because of bug 1398999 where volumes live on a network share
    like RBD. However, I8fcc3ef3cb5d9fd3a95067929c496fdb5976fd41 in
    nova says that this is possible with libvirt >= 1.2.17. Since we are
    using libvirt 2.5.0 from the Ubuntu Cloud Archive on Xenial nodes
    now, we should be able to enable this test.

    Change-Id: I7d7a708b231070468616ae852d81d2f8b01ba568
    Related-Bug: #1398999

Reviewed: https://review.openstack.org/504143
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1328a50e2cd493fde44b7cac98393d70a432e3c0
Submitter: Jenkins
Branch: master

commit 1328a50e2cd493fde44b7cac98393d70a432e3c0
Author: Matt Riedemann <email address hidden>
Date: Thu Sep 14 17:30:18 2017 +0000

    Revert "Enable test_iscsi_volume in live migration job"

    This reverts commit 1032c79238e87259cef6963e43c78c39eeeb2cde.

    This wasn't actually ready to merge, and now that it has
    we're seeing a spike in failures of test_iscsi_volume.

    Change-Id: I74649dd63ef82a356b829ea01b2e74640dc6f11c
    Related-Bug: #1398999

James Page (james-page) on 2017-10-20
Changed in libvirt (Ubuntu Vivid):
status: Confirmed → Won't Fix
Changed in nova (Ubuntu Trusty):
importance: High → Medium