nova snapshot_volume_backed failure does not thaw filesystems

Bug #1731986 reported by Eric M Gonzalez on 2017-11-13
This bug affects 1 person
Affects                    Importance  Assigned to
OpenStack Compute (nova)   High        Eric M Gonzalez
Ocata                      High        Matt Riedemann
Pike                       High        Matt Riedemann
Queens                     High        Matt Riedemann

Bug Description

Noticed in OpenStack Mitaka (commit 9825c80), but the function (`snapshot_volume_backed()`) is unchanged as of commit a4fc1bcd. Backends: libvirt + Ceph.

When Nova creates an image / snapshot of a volume-backed instance, `snapshot_volume_backed()` first quiesces the instance and then loops over all of the block devices associated with it. However, the for loop has no exception handling, so any failure on the Cinder side propagates up and out of `snapshot_volume_backed()`. As a result, the required `unquiesce()` is never called and the instance is left with frozen (read-only) filesystems. This can cause operational errors in the instance and leave it unusable.

In my case, the steps for reproduction are:

1) nova create image (or "create snapshot" via Horizon)
2) nova/compute/api snapshot_volume_backed() calls quiesce
3) "qemu-ga: info: guest-fsfreeze called" is seen in instance
4) cinder fails snapshot of volume due to OverLimit
5) cinder raises OverLimit
6) snapshot_volume_backed() never finishes due to OverLimit
7) filesystem is never thawed
8) instance unusable
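
The failure mode in the steps above can be sketched with a small simulation. This is not Nova code: `FakeInstance`, `quiesce`, `unquiesce`, `create_volume_snapshot` and the local `OverLimit` class are hypothetical stand-ins, used only to show how an exception raised inside an unguarded loop skips the thaw.

```python
class OverLimit(Exception):
    """Hypothetical stand-in for the quota error raised by Cinder."""


class FakeInstance:
    """Hypothetical instance that only tracks whether its filesystems are frozen."""

    def __init__(self):
        self.frozen = False


def quiesce(instance):
    instance.frozen = True


def unquiesce(instance):
    instance.frozen = False


def create_volume_snapshot(volume_id):
    # Simulate Cinder rejecting the request because the snapshot quota is full.
    raise OverLimit("snapshot quota exceeded for %s" % volume_id)


def snapshot_volume_backed_unguarded(instance, volume_ids):
    quiesce(instance)
    for volume_id in volume_ids:
        # The exception leaves the function here, so unquiesce() is skipped.
        create_volume_snapshot(volume_id)
    unquiesce(instance)


instance = FakeInstance()
try:
    snapshot_volume_backed_unguarded(instance, ["vol-1", "vol-2"])
except OverLimit:
    pass

print(instance.frozen)  # prints True: the filesystems are still frozen
```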

I am in the process of writing and testing a patch and will have a review for it soon.

Matt Riedemann (mriedem) on 2017-11-13
tags: added: api volumes
Changed in nova:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Eric M Gonzalez (egrh3)

Fix proposed to branch: master
Review: https://review.openstack.org/519464

Changed in nova:
status: Triaged → In Progress
Changed in nova:
assignee: Eric M Gonzalez (egrh3) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2017-11-15
Changed in nova:
assignee: Matt Riedemann (mriedem) → Eric M Gonzalez (egrh3)

Related fix proposed to branch: master
Review: https://review.openstack.org/520122

Changed in nova:
assignee: Eric M Gonzalez (egrh3) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2017-11-15
Changed in nova:
assignee: Matt Riedemann (mriedem) → Eric M Gonzalez (egrh3)

Related fix proposed to branch: master
Review: https://review.openstack.org/520158

Changed in nova:
assignee: Eric M Gonzalez (egrh3) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2018-02-19
Changed in nova:
assignee: Matt Riedemann (mriedem) → Eric M Gonzalez (egrh3)

Reviewed: https://review.openstack.org/519464
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=bca425a33f52584051348a3ace832be8151299a7
Submitter: Zuul
Branch: master

commit bca425a33f52584051348a3ace832be8151299a7
Author: Eric M Gonzalez <email address hidden>
Date: Mon Nov 13 14:02:27 2017 -0600

    unquiesce instance on volume snapshot failure

    This patch adds an exception catch to "snapshot_volume_backed()" of
    compute/api.py that catches (at the moment) _all_ exceptions from the
    underlying cinderclient. Previously, if the instance is quiesced ( frozen
    filesystem ) then the exception will break execution of the function,
    skipping the needed unquiesce, and leave the instance in a frozen state.

    Now, the exception catch will unquiesce the instance if it was prior to
    the failure.

    Got a unit test in place with the help of Matt Riedemann.
        test_snapshot_volume_backed_with_quiesce_create_snap_fails

    Change-Id: I60de179c72eede6746696f29462ee9d805dace47
    Closes-bug: #1731986
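
A minimal sketch of the pattern this commit message describes, assuming the caught exception is re-raised after the thaw. It is not the actual patch: `FakeInstance`, `quiesce`, `unquiesce`, `failing_create` and `snapshot_volume_backed_guarded` are hypothetical names chosen for illustration.

```python
class FakeInstance:
    """Hypothetical instance that only tracks whether its filesystems are frozen."""

    def __init__(self):
        self.frozen = False


def quiesce(instance):
    instance.frozen = True
    return True  # report that the instance really was quiesced


def unquiesce(instance):
    instance.frozen = False


def snapshot_volume_backed_guarded(instance, volume_ids, create_snapshot):
    quiesced = quiesce(instance)
    snapshots = []
    try:
        for volume_id in volume_ids:
            snapshots.append(create_snapshot(volume_id))
    except Exception:
        # Thaw before propagating the error, but only if we froze the instance.
        if quiesced:
            unquiesce(instance)
        raise
    if quiesced:
        unquiesce(instance)
    return snapshots


def failing_create(volume_id):
    # Hypothetical failure standing in for any error from the volume service.
    raise RuntimeError("snapshot failed for %s" % volume_id)


instance = FakeInstance()
try:
    snapshot_volume_backed_guarded(instance, ["vol-1"], failing_create)
except RuntimeError:
    pass

print(instance.frozen)  # prints False: the instance was thawed despite the failure
```

Catching `Exception` broadly mirrors the commit message's note that all exceptions from the underlying client are caught for now; a `try`/`finally` would be an alternative way to get the same thaw on both paths.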

Changed in nova:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/520122
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ad389244ba5737d01f0c28e6704fc93b83dace9e
Submitter: Zuul
Branch: master

commit ad389244ba5737d01f0c28e6704fc93b83dace9e
Author: Matt Riedemann <email address hidden>
Date: Wed Nov 15 11:00:12 2017 -0500

    Add the ability to get absolute limits from Cinder

    This will be used in a later patch to check quota usage
    for volume snapshots before attempting to create new
    volume snapshots, so we can avoid an OverLimit error.

    Change-Id: Ica7c087708e86494d285fc3905a5740fd1356e5f
    Related-Bug: #1731986

Related fix proposed to branch: stable/queens
Review: https://review.openstack.org/546158

Reviewed: https://review.openstack.org/520158
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=289d2703c75123216d0a6802f2fe5f41aa84407c
Submitter: Zuul
Branch: master

commit 289d2703c75123216d0a6802f2fe5f41aa84407c
Author: Matt Riedemann <email address hidden>
Date: Wed Nov 15 12:45:08 2017 -0500

    Check quota before creating volume snapshots

    When creating a snapshot of a volume-backed instance, we
    create a snapshot of every volume BDM associated with the
    instance. The default volume snapshot limit is 10, so if
    you have a volume-backed instance with several volumes attached
    and snapshot it a few times, you're likely to fail the
    volume snapshot at some point with an OverLimit error from
    Cinder. This can lead to orphaned volume snapshots in Cinder
    that the user then has to cleanup.

    This change makes the snapshot operation a bit more robust by
    first checking the quota limit and current usage for the given
    project before attempting to create any volume snapshots.

    It's not fail-safe since we could still fail with racing snapshot
    requests for the same project, but it's a simple improvement to
    avoid this issue in the general case.

    Change-Id: I4e7b46deb43c0c2430b480f1a498a52fc4a9daf0
    Related-Bug: #1731986
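
The pre-check described in this commit message can be sketched roughly as follows. This is an illustration, not the committed code: `check_snapshot_quota` and `SnapshotQuotaExceeded` are hypothetical names, and the `limits` dictionary with `maxTotalSnapshots` and `totalSnapshotsUsed` keys is an assumption about the shape of the absolute limits returned by Cinder, with a negative maximum treated as unlimited.

```python
class SnapshotQuotaExceeded(Exception):
    """Hypothetical error raised before any volume snapshot is attempted."""


def check_snapshot_quota(limits, requested):
    """Fail early if `requested` new snapshots would exceed the project quota.

    `limits` is assumed to be a dict of absolute limits for the project.
    """
    max_snapshots = limits["maxTotalSnapshots"]
    used = limits["totalSnapshotsUsed"]
    if max_snapshots < 0:
        return  # assumed convention: a negative limit means unlimited
    if used + requested > max_snapshots:
        raise SnapshotQuotaExceeded(
            "need %d snapshots, but only %d of %d remain"
            % (requested, max_snapshots - used, max_snapshots)
        )


# Example: default limit of 10, 8 already used, instance has 3 volume BDMs.
limits = {"maxTotalSnapshots": 10, "totalSnapshotsUsed": 8}
try:
    check_snapshot_quota(limits, requested=3)
    allowed = True
except SnapshotQuotaExceeded:
    allowed = False

print(allowed)  # prints False: the request is rejected before quiescing
```

As the commit message notes, a check like this is not fail-safe: two concurrent snapshot requests for the same project can both pass it and still hit the limit.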

Reviewed: https://review.openstack.org/545961
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=7ab98b5345f4a023bd209e714cd0aa60b3a31d48
Submitter: Zuul
Branch: stable/queens

commit 7ab98b5345f4a023bd209e714cd0aa60b3a31d48
Author: Eric M Gonzalez <email address hidden>
Date: Mon Nov 13 14:02:27 2017 -0600

    unquiesce instance on volume snapshot failure

    This patch adds an exception catch to "snapshot_volume_backed()" of
    compute/api.py that catches (at the moment) _all_ exceptions from the
    underlying cinderclient. Previously, if the instance is quiesced ( frozen
    filesystem ) then the exception will break execution of the function,
    skipping the needed unquiesce, and leave the instance in a frozen state.

    Now, the exception catch will unquiesce the instance if it was prior to
    the failure.

    Got a unit test in place with the help of Matt Riedemann.
        test_snapshot_volume_backed_with_quiesce_create_snap_fails

    Change-Id: I60de179c72eede6746696f29462ee9d805dace47
    Closes-bug: #1731986
    (cherry picked from commit bca425a33f52584051348a3ace832be8151299a7)

tags: added: in-stable-queens

Reviewed: https://review.openstack.org/546157
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=02acd2d1bc4ad6ba4863cfd35dbaa40510d8454d
Submitter: Zuul
Branch: stable/queens

commit 02acd2d1bc4ad6ba4863cfd35dbaa40510d8454d
Author: Matt Riedemann <email address hidden>
Date: Wed Nov 15 11:00:12 2017 -0500

    Add the ability to get absolute limits from Cinder

    This will be used in a later patch to check quota usage
    for volume snapshots before attempting to create new
    volume snapshots, so we can avoid an OverLimit error.

    Change-Id: Ica7c087708e86494d285fc3905a5740fd1356e5f
    Related-Bug: #1731986
    (cherry picked from commit ad389244ba5737d01f0c28e6704fc93b83dace9e)

Reviewed: https://review.openstack.org/546158
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=c65dfeecaec23c0fa1cefd5f72c56faa7589216b
Submitter: Zuul
Branch: stable/queens

commit c65dfeecaec23c0fa1cefd5f72c56faa7589216b
Author: Matt Riedemann <email address hidden>
Date: Wed Nov 15 12:45:08 2017 -0500

    Check quota before creating volume snapshots

    When creating a snapshot of a volume-backed instance, we
    create a snapshot of every volume BDM associated with the
    instance. The default volume snapshot limit is 10, so if
    you have a volume-backed instance with several volumes attached
    and snapshot it a few times, you're likely to fail the
    volume snapshot at some point with an OverLimit error from
    Cinder. This can lead to orphaned volume snapshots in Cinder
    that the user then has to cleanup.

    This change makes the snapshot operation a bit more robust by
    first checking the quota limit and current usage for the given
    project before attempting to create any volume snapshots.

    It's not fail-safe since we could still fail with racing snapshot
    requests for the same project, but it's a simple improvement to
    avoid this issue in the general case.

    Change-Id: I4e7b46deb43c0c2430b480f1a498a52fc4a9daf0
    Related-Bug: #1731986
    (cherry picked from commit 289d2703c75123216d0a6802f2fe5f41aa84407c)

This issue was fixed in the openstack/nova 17.0.1 release.

Reviewed: https://review.openstack.org/545966
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=17b9b900a249f6f432552fe27a9cdd54c1495b99
Submitter: Zuul
Branch: stable/pike

commit 17b9b900a249f6f432552fe27a9cdd54c1495b99
Author: Eric M Gonzalez <email address hidden>
Date: Mon Nov 13 14:02:27 2017 -0600

    unquiesce instance on volume snapshot failure

    This patch adds an exception catch to "snapshot_volume_backed()" of
    compute/api.py that catches (at the moment) _all_ exceptions from the
    underlying cinderclient. Previously, if the instance is quiesced ( frozen
    filesystem ) then the exception will break execution of the function,
    skipping the needed unquiesce, and leave the instance in a frozen state.

    Now, the exception catch will unquiesce the instance if it was prior to
    the failure.

    Got a unit test in place with the help of Matt Riedemann.
        test_snapshot_volume_backed_with_quiesce_create_snap_fails

    Conflicts:
          nova/compute/api.py
          nova/tests/unit/compute/test_compute_api.py

    NOTE(mriedem): The conflicts are due to not having change
    I9ce48e768cc67543f27a6c87c57b47501fff38c2 in Pike.

    Change-Id: I60de179c72eede6746696f29462ee9d805dace47
    Closes-bug: #1731986
    (cherry picked from commit bca425a33f52584051348a3ace832be8151299a7)
    (cherry picked from commit 7ab98b5345f4a023bd209e714cd0aa60b3a31d48)

This issue was fixed in the openstack/nova 16.1.1 release.

Reviewed: https://review.openstack.org/545973
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=fb54ccf55ce9b3dbeefde03b68464bc0e450684e
Submitter: Zuul
Branch: stable/ocata

commit fb54ccf55ce9b3dbeefde03b68464bc0e450684e
Author: Eric M Gonzalez <email address hidden>
Date: Mon Nov 13 14:02:27 2017 -0600

    unquiesce instance on volume snapshot failure

    This patch adds an exception catch to "snapshot_volume_backed()" of
    compute/api.py that catches (at the moment) _all_ exceptions from the
    underlying cinderclient. Previously, if the instance is quiesced ( frozen
    filesystem ) then the exception will break execution of the function,
    skipping the needed unquiesce, and leave the instance in a frozen state.

    Now, the exception catch will unquiesce the instance if it was prior to
    the failure.

    Got a unit test in place with the help of Matt Riedemann.
        test_snapshot_volume_backed_with_quiesce_create_snap_fails

    NOTE(mriedem): There is a small change in Ocata since we have to use
    the _LI translation markers for the new INFO log level messages.

    Change-Id: I60de179c72eede6746696f29462ee9d805dace47
    Closes-bug: #1731986
    (cherry picked from commit bca425a33f52584051348a3ace832be8151299a7)
    (cherry picked from commit 7ab98b5345f4a023bd209e714cd0aa60b3a31d48)
    (cherry picked from commit 17b9b900a249f6f432552fe27a9cdd54c1495b99)

This issue was fixed in the openstack/nova 18.0.0.0b1 development milestone.

This issue was fixed in the openstack/nova 15.1.1 release.
