snapshot delete fails on shutdown VM

Bug #1885528 reported by anon
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Fix Released
Undecided
Balazs Gibizer
Queens
New
Undecided
Unassigned
Rocky
In Progress
Undecided
Unassigned
Stein
In Progress
Undecided
Unassigned
Train
Fix Committed
Undecided
Unassigned
Ussuri
Fix Released
Undecided
Lee Yarwood
Victoria
Fix Released
Undecided
Lee Yarwood

Bug Description

Description:
When we try to delete the last snapshot of a VM in shutdown state, this snapshot_delete will fail (and be stuck in state error-deleting). When setting state==available and redeleting the snapshot, the volume will be corrupted and the VM will never start again. Volumes are stored on NFS.
(for root cause and fix, see the bottom of this post)

To reproduce:
- storage on NFS
- create a VM and some snapshots
- shut down the VM (ie volume is still considered "attached" but vm is no longer "active")
- delete the last snapshot

Expected Result:
snapshot is deleted, vm still works

Actual result:
The snapshot is stuck on error deleting. After setting the snapshot state==available and deleting the snapshot again, the volume will be corrupted and the VM will never start again. (non-existing backing_file in qcow on disk)

Environment:
- openstack version: stein, deployed via kolla-ansible. I suspect this downloads from git but i don't know the exact version.
- hypervisor: Libvirt + KVM
- storage: NFS
- networking: Neutron with OpenVSwitch

Nova debug Logs:
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [req-d38b5ec8-afdb-4dfe-af12-0c47598c6a47 6dd1c995b2ea4ddfbeb0685bc52e5fbf 6bebb564667d4a75b9281fd826b32ecf - d
efault default] [instance: 711651a3-8440-42dd-a210-e7e550a8624e] Error occurred during volume_snapshot_delete, sending error status to Cinder.: DiskNotFound: No disk at
 volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] Traceback (most recent call last):
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/dri
ver.py", line 2726, in volume_snapshot_delete
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] snapshot_id, delete_info=delete_info)
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/dri
ver.py", line 2686, in _volume_snapshot_delete
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] rebase_base)
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/dri
ver.py", line 2519, in _rebase_with_qemu_img
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] b_file_fmt = images.qemu_img_info(backing_file).file_forma
t
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] File "/usr/lib/python2.7/site-packages/nova/virt/images.py",
 line 58, in qemu_img_info
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] raise exception.DiskNotFound(location=path)
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e] DiskNotFound: No disk at volume-86c06b12-699c-4b54-8bca-fb92c9
9a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
2020-02-06 12:20:10.713 6 ERROR nova.virt.libvirt.driver [instance: 711651a3-8440-42dd-a210-e7e550a8624e]
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server [req-d38b5ec8-afdb-4dfe-af12-0c47598c6a47 6dd1c995b2ea4ddfbeb0685bc52e5fbf 6bebb564667d4a75b9281fd826b32ecf -
default default] Exception during message handling: DiskNotFound: No disk at volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 166, in _process_incoming
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 265, in dispatch
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 194, in _do_dispatch
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 229, in inner
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server return func(*args, **kwargs)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 79, in wrapped
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server function_name, call_dict, binary, tb)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/exception_wrapper.py", line 69, in wrapped
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server return f(self, context, *args, **kw)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 3686, in volume_snapshot_delete
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server snapshot_id, delete_info)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2734, in volume_snapshot_delete
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server context, snapshot_id, 'error_deleting')
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server self.force_reraise()
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2726, in volume_snapshot_delete
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server snapshot_id, delete_info=delete_info)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2686, in _volume_snapshot_delete
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server rebase_base)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2519, in _rebase_with_qemu_img
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server b_file_fmt = images.qemu_img_info(backing_file).file_format
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server File "/usr/lib/python2.7/site-packages/nova/virt/images.py", line 58, in qemu_img_info
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server raise exception.DiskNotFound(location=path)
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server DiskNotFound: No disk at volume-86c06b12-699c-4b54-8bca-fb92c99a2bf0.63d1585e-eb76-4e8f-bc96-93960e9c9692
2020-02-06 12:20:10.780 6 ERROR oslo_messaging.rpc.server

Root Cause:
When you look at the first line in the debug log: "No disk at volume-volid.snapid", you will notice that this is a relative path. When the code does an `os.path.exists("volume-volid.snapid")` this will obviously fail (as nova has a cwd="/") and the volumes are stored under "/usr/lib/nova/mnt/xxxxx/". If an absolute path were used, the code would be much happier :)

Code path:
  - cinder->snapshot_delete
  - In remotefs.py (called for nfs): if volume is attached -> call nova snapshot_delete
  - nova-> snapshot_delete: If snapshot_to_merge == active -> do rebase
  - if VM is not active: call _rebase_with_qemu_img (virt/libvirt/driver.py)
  - rebase_base/backing_file is now usually a relative file (depends on wether qemu_img_info is called with json or human and can also depend on the cwd). In all instances i saw, this was a relative path. When _rebase_with_qemu_img calls images.qemu_img_info(backing_file), this will fail unless supplied with and absolute path (first line in qemu_img_info does an os.path.exists which will obviously fail on a relative path). This will cause the snapshot delete to fail.

Potential, cinder_based fix:
In cinder, qemu_img_info is typically wrapped to always convert the backing_file to a relative path. In the code there are typically 2 variables: backing_file (relative) and backing_path(absolute). The code then decides when to use what variable:

Patch:
--- /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py.orig 2020-06-22 14:48:09.760186268 +0000
+++ /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py 2020-06-25 15:26:42.344963030 +0000
@@ -2515,8 +2515,10 @@
             # If the rebased image is going to have a backing file then
             # explicitly set the backing file format to avoid any security
             # concerns related to file format auto detection.
- backing_file = rebase_base
- b_file_fmt = images.qemu_img_info(backing_file).file_format
+ backing_file = os.path.basename(rebase_base) # relative path
+ volume_path = os.path.dirname(active_disk_object.source_path)
+ backing_path = os.path.join(volume_path, backing_file) # absolute
+ b_file_fmt = images.qemu_img_info(backing_path).file_format
             qemu_img_extra_arg = ['-F', b_file_fmt]

         qemu_img_extra_arg.append(active_disk_object.source_path)

Note:
We've only been able to test on Stein. However based on a code analysis of the current code in git, this bug is probably still present.

Revision history for this message
anon (anon1024) wrote :

Seems the identation of the patch was corrupted in the original bug report Here's the patch as attachment.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/739246

Changed in nova:
assignee: nobody → Balazs Gibizer (balazs-gibizer)
status: New → In Progress
Changed in nova:
assignee: Balazs Gibizer (balazs-gibizer) → Lee Yarwood (lyarwood)
Changed in nova:
assignee: Lee Yarwood (lyarwood) → Stephen Finucane (stephenfinucane)
Changed in nova:
assignee: Stephen Finucane (stephenfinucane) → Balazs Gibizer (balazs-gibizer)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/739246
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b9333125790682f9d60bc74fdbb12a098565e7c2
Submitter: Zuul
Branch: master

commit b9333125790682f9d60bc74fdbb12a098565e7c2
Author: Balazs Gibizer <email address hidden>
Date: Thu Jul 2 12:13:29 2020 +0200

    Use absolute path during qemu img rebase

    During an assisted volume snapshot delete request from Cinder nova
    removes the snapshot from the backing file chain. During that nova
    checks the existence of such file. However in some cases (see the bug
    report) the path is relative and therefore os.path.exists fails.

    This patch makes sure that nova uses the volume absolute path to make
    the backing file path absolute as well.

    Closes-Bug #1885528

    Change-Id: I58dca95251b607eaff602783fee2fc38e2421944

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/757083

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/757084

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/757085

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/757098

Revision history for this message
melanie witt (melwitt) wrote :
melanie witt (melwitt)
no longer affects: nova/trunk
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/c/openstack/nova/+/757084
Committed: https://opendev.org/openstack/nova/commit/e926ec75e29dcdf3b671811533587bba246a8c45
Submitter: "Zuul (22348)"
Branch: stable/train

commit e926ec75e29dcdf3b671811533587bba246a8c45
Author: Balazs Gibizer <email address hidden>
Date: Thu Jul 2 12:13:29 2020 +0200

    Use absolute path during qemu img rebase

    During an assisted volume snapshot delete request from Cinder nova
    removes the snapshot from the backing file chain. During that nova
    checks the existence of such file. However in some cases (see the bug
    report) the path is relative and therefore os.path.exists fails.

    This patch makes sure that nova uses the volume absolute path to make
    the backing file path absolute as well.

    Closes-Bug #1885528

    Change-Id: I58dca95251b607eaff602783fee2fc38e2421944
    (cherry picked from commit b9333125790682f9d60bc74fdbb12a098565e7c2)
    (cherry picked from commit 831abc9f83a2d3f517030f881e7da724417fea93)
    (cherry picked from commit c2044d4bd0919860aa2d49687ba9c6ef6f7d37e8)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/queens)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/queens
Review: https://review.opendev.org/c/openstack/nova/+/780790
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/780788
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/stein)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/757085
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.