Resize a boot-from-volume instance with NFS destroys instance

Bug #1728603 reported by Jay Pipes on 2017-10-30
22
This bug affects 4 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Matt Riedemann
Ocata
High
Matt Riedemann
Pike
High
Matt Riedemann

Bug Description

Turns out that the fix for https://bugs.launchpad.net/nova/+bug/1666831 accidentally broke boot-from-volume setups that use NFS. In particular, this line:

https://github.com/openstack/nova/blob/stable/ocata/nova/virt/libvirt/driver.py#L1149

        if os.path.exists(inst_base) and not root_disk.exists():
            try:
                shutil.rmtree(inst_base)
            except OSError as e:
                if e.errno != errno.ENOENT:
                    raise

Causes the instance basedir which includes the instances libvirt.XML file to be deleted.

The above needs to be changed to this in order to prevent BFV instances from being destroyed on resize...

 if os.path.exists(inst_base) and not root_disk.exists() and not compute_utils.is_volume_backed_instance(instance._context, instance):

This bug was reported and the fix confirmed by Joris S'heeren

Matt Riedemann (mriedem) wrote :

The referenced change was backported to ocata so marking this as affecting pike and ocata:

https://review.openstack.org/#/c/441037/

Matt Riedemann (mriedem) wrote :

Note that we do run an NFS-based job in the nova experimental CI queue but I'd have to check if it runs any resize tests - it probably should, and those are likely broken.

Matt Riedemann (mriedem) wrote :

So we do run resize tests in the NFS job:

http://logs.openstack.org/01/510201/10/check/legacy-tempest-dsvm-full-devstack-plugin-nfs/5911710/job-output.txt.gz#_2017-10-21_04_39_02_777135

But they don't fail on the regression, I'm assuming because the Tempest test doesn't actually try to do anything with the instance once the resize is confirmed? Or because it's not a boot from volume scenario?

tags: added: libvirt
Jay Pipes (jaypipes) wrote :

Yeah, it's the boot-from-volume thing that's the difference I believe.

description: updated
Matt Riedemann (mriedem) wrote :

I can't say I understand this really - why would the instance base directory that contains the libvirt domain xml files for the guest, which are ephemeral per compute host, be the same location as the root disk that is backed by the Cinder volume? Is the local ephemeral disk on the compute host the same NFS storage as the Cinder volumes?

Matt Riedemann (mriedem) wrote :

Oh nevermind I think I understand, the root disk in the volume isn't the same storage, the problem is that because the root disk is in a cinder volume, "root_disk.exists()" is False so we assume it's not shared storage and we remove the instance base directory files, which are on shared storage between the source and destination hosts.

Fix proposed to branch: master
Review: https://review.openstack.org/516395

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/516395
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=f02afc6569bd930114a4b56413dbd6becc5e7e75
Submitter: Zuul
Branch: master

commit f02afc6569bd930114a4b56413dbd6becc5e7e75
Author: Matt Riedemann <email address hidden>
Date: Mon Oct 30 12:49:31 2017 -0400

    libvirt: do not remove inst_base when volume-backed during resize

    When confirming a resize, the libvirt driver on the source host checks
    to see if the instance base directory (which contains the domain xml
    files, etc) exists and if the root disk image does not, it removes the
    instance base directory.

    However, the root image disk won't exist on local storage for a
    volume-backed instance and if the instance base directory is on shared
    storage, e.g. NFS or Ceph, between the source and destination host, the
    instance base directory is incorrectly deleted.

    This adds a check to see if the instance is volume-backed when checking
    to see if the instance base directory should be removed from the source
    host when confirming a resize.

    Change-Id: I29fac80d08baf64bf69e54cf673e55123174de2a
    Closes-Bug: #1728603

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/517388
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=d5d81a29050966f47050f0f77001273b1aa55116
Submitter: Zuul
Branch: stable/pike

commit d5d81a29050966f47050f0f77001273b1aa55116
Author: Matt Riedemann <email address hidden>
Date: Mon Oct 30 12:49:31 2017 -0400

    libvirt: do not remove inst_base when volume-backed during resize

    When confirming a resize, the libvirt driver on the source host checks
    to see if the instance base directory (which contains the domain xml
    files, etc) exists and if the root disk image does not, it removes the
    instance base directory.

    However, the root image disk won't exist on local storage for a
    volume-backed instance and if the instance base directory is on shared
    storage, e.g. NFS or Ceph, between the source and destination host, the
    instance base directory is incorrectly deleted.

    This adds a check to see if the instance is volume-backed when checking
    to see if the instance base directory should be removed from the source
    host when confirming a resize.

    Change-Id: I29fac80d08baf64bf69e54cf673e55123174de2a
    Closes-Bug: #1728603
    (cherry picked from commit f02afc6569bd930114a4b56413dbd6becc5e7e75)

Reviewed: https://review.openstack.org/517390
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=64f773ab96619b97de2bc521fdfc8400fdb7bd3d
Submitter: Zuul
Branch: stable/ocata

commit 64f773ab96619b97de2bc521fdfc8400fdb7bd3d
Author: Matt Riedemann <email address hidden>
Date: Mon Oct 30 12:49:31 2017 -0400

    libvirt: do not remove inst_base when volume-backed during resize

    When confirming a resize, the libvirt driver on the source host checks
    to see if the instance base directory (which contains the domain xml
    files, etc) exists and if the root disk image does not, it removes the
    instance base directory.

    However, the root image disk won't exist on local storage for a
    volume-backed instance and if the instance base directory is on shared
    storage, e.g. NFS or Ceph, between the source and destination host, the
    instance base directory is incorrectly deleted.

    This adds a check to see if the instance is volume-backed when checking
    to see if the instance base directory should be removed from the source
    host when confirming a resize.

    Change-Id: I29fac80d08baf64bf69e54cf673e55123174de2a
    Closes-Bug: #1728603
    (cherry picked from commit f02afc6569bd930114a4b56413dbd6becc5e7e75)
    (cherry picked from commit d5d81a29050966f47050f0f77001273b1aa55116)

This issue was fixed in the openstack/nova 15.0.8 release.

This issue was fixed in the openstack/nova 16.0.3 release.

This issue was fixed in the openstack/nova 17.0.0.0b2 development milestone.

Reviewed: https://review.openstack.org/566367
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=8e3385707cb1ced55cd12b1314d8c0b68d354c38
Submitter: Zuul
Branch: master

commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603

Reviewed: https://review.openstack.org/567623
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=174764340d3c965d31143b39af4ab2e8ecefe594
Submitter: Zuul
Branch: stable/queens

commit 174764340d3c965d31143b39af4ab2e8ecefe594
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Conflicts:
          nova/tests/unit/virt/libvirt/test_driver.py

    NOTE(mriedem): The conflict is due to not having change
    Icdd039bb4374269d9da38e7f8d2e15e05ca8aadb in Queens.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603
    (cherry picked from commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/567625
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c72a0a7665e96219f0301525edc513dda07b320b
Submitter: Zuul
Branch: stable/pike

commit c72a0a7665e96219f0301525edc513dda07b320b
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603
    (cherry picked from commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38)
    (cherry picked from commit 174764340d3c965d31143b39af4ab2e8ecefe594)

tags: added: in-stable-pike

Reviewed: https://review.openstack.org/567630
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=a16fa14ce47bd2d3a5189047e9bd330a607ef3cc
Submitter: Zuul
Branch: stable/ocata

commit a16fa14ce47bd2d3a5189047e9bd330a607ef3cc
Author: Matt Riedemann <email address hidden>
Date: Fri May 4 12:58:07 2018 -0400

    libvirt: check image type before removing snapshots in _cleanup_resize

    Change Ic683f83e428106df64be42287e2c5f3b40e73da4 added some disk
    cleanup logic to _cleanup_resize because some image backends (Qcow2,
    Flat and Ploop) will re-create the instance directory and disk.info
    file when initializing the image backend object.

    However, that change did not take into account volume-backed instances
    being resized will not have a root disk *and* if the local disk is
    shared storage, removing the instance directory effectively deletes
    the instance files, like the console.log, on the destination host
    as well. Change I29fac80d08baf64bf69e54cf673e55123174de2a was made
    to resolve that issue.

    However (see the pattern?), if you're doing a resize of a
    volume-backed instance that is not on shared storage, we won't remove
    the instance directory from the source host in _cleanup_resize. If the
    admin then later tries to live migrate the instance back to that host,
    it will fail with DestinationDiskExists in the pre_live_migration()
    method.

    This change is essentially a revert of
    I29fac80d08baf64bf69e54cf673e55123174de2a and alternate fix for
    Ic683f83e428106df64be42287e2c5f3b40e73da4. Since the root problem
    is that creating certain imagebackend objects will recreate the
    instance directory and disk.info on the source host, we simply need
    to avoid creating the imagebackend object. The only reason we are
    getting an imagebackend object in _cleanup_resize is to remove
    image snapshot clones, which is only implemented by the Rbd image
    backend. Therefore, we can check to see if the image type supports
    clones and if not, don't go through the imagebackend init routine
    that, for some, will recreate the disk.

    Change-Id: Ib10081150e125961cba19cfa821bddfac4614408
    Closes-Bug: #1769131
    Related-Bug: #1666831
    Related-Bug: #1728603
    (cherry picked from commit 8e3385707cb1ced55cd12b1314d8c0b68d354c38)
    (cherry picked from commit 174764340d3c965d31143b39af4ab2e8ecefe594)
    (cherry picked from commit c72a0a7665e96219f0301525edc513dda07b320b)

tags: added: in-stable-ocata
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers