Cinder

online snapshot deletion breaks backing chain with remotefs drivers

Bug #1716920 reported by Silvan Kaiser on 2017-09-13

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Cinder	Invalid	High	Unassigned
	OpenStack Compute (nova)	Invalid	Undecided	Unassigned

Bug Description

The deletion of online snapshots of remotefs based volumes breaks the .info file/backing chain of these volumes. Logs can be seen in any current Quobyte CI run in Cinder/Nova/OS-Brick . Afaics the the other driver using this (VZstorage) has it's CI skip the affected tests (e.g. test_snapshot_create_delete_with_volume_in_use).

I ran a lot of tests and so far i can say that the first deletion of a member in the backing chain works (snapshot is deleted) but seemingly leaves the .info files content and/or the backing chain of the volume file in a broken state. The error can be identified e.g. by the following log pattern:

This is the first snapshot deletion that runs successfully (the snapshots id is 91755e5f-e573-4ddb-84af-3712d69a). The ID of the snapshot and its snapshot_file name do match their uuids:

2017-09-13 08:28:59.436 20467 DEBUG cinder.volume.drivers.remotefs [req-eda7ddf5-217d-490d-a8d4-1813df68d8db tempest-VolumesSnapshotTestJSON-708947401 -] Deleting online snapshot 91755e5f-e573-4ddb-84af-3712d69a
fc89 of volume 94598844-418c-4b5d-b034-5330e24e7421 _delete_snapshot /opt/stack/cinder/cinder/volume/drivers/remotefs.py:1099
2017-09-13 08:28:59.487 20467 DEBUG cinder.volume.drivers.remotefs [req-eda7ddf5-217d-490d-a8d4-1813df68d8db tempest-VolumesSnapshotTestJSON-708947401 -] snapshot_file for this snap is: volume-94598844-418c-4b5d-b034-5330e24e7421.91755e5f-e573-4ddb-84af-3712d69afc89 _delete_snapshot /opt/stack/cinder/cinder/volume/drivers/remotefs.py:1124

The next snapshot to be deleted (138a1f62-7582-4aaa-9d72-9eada34b) shows that a wrong snapshot_file is read from the volumes .info file. In fact it shows the file of the previous snapshot :

2017-09-13 08:29:01.857 20467 DEBUG cinder.volume.drivers.remotefs [req-6ad4add9-34b8-41b9-a1f0-7dc2d6bb1862 tempest-VolumesSnapshotTestJSON-708947401 -] Deleting online snapshot 138a1f62-7582-4aaa-9d72-9eada34b
eeaf of volume 94598844-418c-4b5d-b034-5330e24e7421 _delete_snapshot /opt/stack/cinder/cinder/volume/drivers/remotefs.py:1099
2017-09-13 08:29:01.872 20467 DEBUG cinder.volume.drivers.remotefs [req-6ad4add9-34b8-41b9-a1f0-7dc2d6bb1862 tempest-VolumesSnapshotTestJSON-708947401 -] snapshot_file for this snap is: volume-94598844-418c-4b5d-b034-5330e24e7421.91755e5f-e573-4ddb-84af-3712d69afc89 _delete_snapshot /opt/stack/cinder/cinder/volume/drivers/remotefs.py:1124

Now this second snapshot deletion fails because the snapshot file for 138a1f62-7582-4aaa-9d72-9eada34b no longer exists:

2017-09-13 08:29:02.674 20467 ERROR oslo_messaging.rpc.server ProcessExecutionError: Unexpected error while running command.
2017-09-13 08:29:02.674 20467 ERROR oslo_messaging.rpc.server Command: /usr/bin/python -m oslo_concurrency.prlimit --as=1073741824 --cpu=8 -- env LC_ALL=C qemu-img info /opt/stack/data/cinder/mnt/a1e3635ffba9fce1b854369f1a255d7b/volume-94598844-418c-4b5d-b034-5330e24e7421.138a1f62-7582-4aaa-9d72-9eada34beeaf
2017-09-13 08:29:02.674 20467 ERROR oslo_messaging.rpc.server Exit code: 1
2017-09-13 08:29:02.674 20467 ERROR oslo_messaging.rpc.server Stdout: u''
2017-09-13 08:29:02.674 20467 ERROR oslo_messaging.rpc.server Stderr: u"qemu-img: Could not open '/opt/stack/data/cinder/mnt/a1e3635ffba9fce1b854369f1a255d7b/volume-94598844-418c-4b5d-b034-5330e24e7421.138a1f62-7582-4aaa-9d72-9eada34beeaf': Could not open '/opt/stack/data/cinder/mnt/a1e3635ffba9fce1b854369f1a255d7b/volume-94598844-418c-4b5d-b034-5330e24e7421.138a1f62-7582-4aaa-9d72-9eada34beeaf': No such file or directory\n"

The referenced tempest test fails 100% of the time in our CIs. I manually tested the scenario and found the same results. Furthermore i was able, by creating three consecutive snapshots from a single volume and deleting them one after the other, to create a snapshot file with a broken backing file link. In the end i was left with a volume file and an overlay file referencing a removed backing file (previous snapshot of the same volume).

I was able to run the scenario without issues when using offline snapshots. Thus this seems to be related to the usage of the online snapshot deletion via the Nova API.

See original description

Tags:

Silvan Kaiser (2-silvan) on 2017-09-13

tags:	added: quobyte remotefs
description:	updated

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-09-13:

This issue started up around the 28th of August but definitely not earlier than the 24th of August.

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-09-14:

I'm adding a shell log that shows a manual reproduction of breaking the backing chain of a volume with three snapshots via deleting the latest snapshot while the volume is attached to a running VM. This was run in a current trunk setup with DevStack.

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-09-14:

buglog-1716920.txt Edit (21.8 KiB, text/plain)

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-09-14:

I did more manual testing and followed up into the details of what Nova does during the assisted snapshot deletaion.
Afaics Nova starts a block rebase that never returns which probably is the reason why the backing chain is not adapted and the whole test times out.

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-09-14:

I'd be happy to get hints on how to trace the block rebase process, i've no idea which binary does that and how check it. There's no further log output in the Nova log...

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-10-11:

Update:
I followed up with manual testing & patching the code. The Nova block commit does terminate fine (it's just that there's no message provided about that). So i'm back on square one following up on why the backing chain breaks. Will dig further...

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-10-12:

Download full text (6.1 KiB)

1) correction regarding my last comment (#6): block commit -> block rebase

2) I'm still wondering why the active file ends up with a broken backing file entry. I checked into the libvirt logs and found that libvirt seems to run the correct rebasing in the end but checking the files afterwards shows the active file still having the backing file pointing to the original (now deleted) file and therefore being broken.
If i'm reading this log correctly libvirt sets the new backing file in the end (watch out for volume-669cfd0c-5b5c-470b-8b25-4c15cc655373.a755e1fc-f5ff-4dc1-937f-c03fb5ecc2d8, this file should be pointed to. Afaiu this libvirt log shows exactly that. But checking the volumes files afterwards via qemu-img info shows instead that the active files backing file link points to volume-669cfd0c-5b5c-470b-8b25-4c15cc655373.a24f9a80-9858-478f-871b-7a5f11278c8, which is the former backing file that is deleted during this assisted snapshot deletion and no longer exists.

Libvirt log excerpt:

1) correction regarding my last comment (#6): block commit -> block rebase

Libvirt log excerpt:

2017-10-12 08:06:59.321+0000: 4390: debug : qemuMonitorJSONCommandWithFd:296 : Send command '{"execute":"query-block-jobs","id":"libvirt-25"}' for write with FD -1
2017-10-12 08:06:59.321+0000: 4390: info : qemuMonitorSend:1035 : QEMU_MONITOR_SEND_MSG: mon=0x7f0338011150 msg={"execute":"query-block-jobs","id":"libvirt-25"}
 fd=-1
2017-10-12 08:06:59.321+0000: 4388: info : qemuMonitorIOWrite:536 : QEMU_MONITOR_IO_WRITE: mon=0x7f0338011150 buf={"execute":"query-block-jobs","id":"libvirt-25"}
 len=50 ret=50 errno=0
2017-10-12 08:06:59.321+0000: 4388: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f0338011150 buf={"return": [{"io-status": "ok", "device": "drive-virtio-disk1", "busy": false, "len": 3221225
472, "offset": 250609664, "paused": false, "speed": 0, "ready": false, "type": "stream"}], "id": "libvirt-25"}
 len=205
2017-10-12 08:06:59.321+0000: 4388: debug : qemuMonitorJSONIOProcessLine:191 : Line [{"return": [{"io-status": "ok", "device": "drive-virtio-disk1", "busy": false, "len": 3221225472, "offset": 250609664, "paused
": false, "speed": 0, "ready": false, "type": "stream"}], "id": "libvirt-25"}]
2017-10-12 08:06:59.321+0000: 4388: info : qemuMonitorJSONIOProcessLine:211 : QEMU_MONITOR_RECV_REPLY: mon=0x7f0338011150 reply={"return": [{"io-status": "ok", "device": "drive-virtio-disk1", "busy": false, "len
": 3221225472, "offset": 250609664, "paused": false, "speed": 0, "ready": false, "type": "stream"}], "id": "libvirt-25"}
2017-10-12 08:06:59.321+0000: 4388: debug : qemuMonitorJSONIOProcess:260 : Total used 205 bytes out of 205 available in buffer
2017-10-12 08:06:59.321+0000: 4390: debug : qemuMonitorJSONCommandWithFd:301 : Receive command reply ret=0 rxObject=0x55f8a29153d0
2017-10-12 08:06:59.321+0000: 4390: debug : qemuDomainObjExitMonitorInternal:3555 : Exited monitor (mon=0x7f0338011150 vm=0x7f03280031d0 name=instance-00000006)
2017-10-12 08:06:59.321+0000: 4390: debug : qemuDomainObjEndJob:3463 : Stopping job: query (async=none vm=0x7f03280031d0 name=instance-00000006)
2017-10-12 08:06:59.321+0000: 4390: debug : virThreadJobClear:121 : Thread 4390 (virNetServerHandleJob) finished job remoteDispatchDomainGetBlockJobInfo with ret=0
2017-10-12 08:06:59.377+0000: 4388: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f0338011150 buf={"timestamp": {"seconds": 1507795619, "microseconds": 377412}, "event": "BLOCK_JOB_COMPLETED"
, "data": {"device": "drive-virtio-disk1", "len": 3221225472, "offset": 3221225472, "speed": 0, "type": "stream"}}
 len=209
2017-10-12 08:06:59.377+0000: 4388: debug : qemuMonitorJSONIOProcessLine:191 : Line [{"timestamp": {"seconds": 1507795619, "microseconds": 377412}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive-virti
o-disk1", "len": 3221225472, "offset": 3221225472, "speed": 0, "type": "stream"}}]
2017-10-12 08:06:59.377+0000: 4388: info : qemuMonitorJSONIOProcessLine:206 : QEMU_MONITOR_RECV_EVENT: mon=0x7f0338011150 event={"timestamp": {"seconds": 1507795619, "microseconds": 377412}, "event": "BLOCK_JOB_
COMPLETED", "data": {"device": "drive-virtio-disk1", "len": 3221225472, "offset": 3221225472, "speed": 0, "type": "stream"}}
2017-10-12 08:06:59.377+0000: 4388: debug : qemuMonitorJSONIOProcessEvent:152 : mon=0x7f0338011150 obj=0x55f8a2942ec0
2017-10-12 08:06:59.377+0000: 4388: debug : qemuMonitorEmitEvent:1298 : mon=0x7f0338011150 event=BLOCK_JOB_COMPLETED
2017-10-12 08:06:59.377+0000: 4388: debug : qemuProcessHandleEvent:616 : vm=0x7f03280031d0
2017-10-12 08:06:59.377+0000: 4388: debug : virObjectEventNew:640 : obj=0x55f8a2943160
2017-10-12 08:06:59.377+0000: 4388: debug : qemuMonitorJSONIOProcessEvent:177 : handle BLOCK_JOB_COMPLETED handler=0x7f031255f980 data=0x55f8a290ccc0
2017-10-12 08:06:59.377+0000: 4388: debug : qemuMonitorEmitBlockJob:1490 : mon=0x7f0338011150
2017-10-12 08:06:59.377+0000: 4388: debug : qemuProcessHandleBlockJob:986 : Block job for device drive-virtio-disk1 (domain: 0x7f03280031d0,instance-00000006) type 1 status 0
2017-10-12 08:06:59.377+0000: 4388: debug : qemuMonitorJSONIOProcess:260 : Total used 209 bytes out of 209 available in buffer
2017-10-12 08:06:59.377+0000: 4388: debug : virObjectEventDispose:134 : obj=0x55f8a2943160
2017-10-12 08:06:59.377+0000: 4726: debug : qemuProcessEventHandler:4571 : vm=0x7f03280031d0, event=5
2017-10-12 08:06:59.377+0000: 4726: debug : qemuDomainObjBeginJobInternal:3268 : Starting job: modify (vm=0x7f03280031d0 name=instance-00000006, current job=none async=none)
2017-10-12 08:06:59.377+0000: 4726: debug : qemuDomainObjBeginJobInternal:3309 : Started job: modify (async=none vm=0x7f03280031d0 name=instance-00000006)
2017-10-12 08:06:59.377+0000: 4726: debug : qemuBlockJobEventProcess:103 : disk=vdb, mirrorState=none, type=1, status=0
2017-10-12 08:06:59.377+0000: 4726: debug : virObjectEventNew:640 : obj=0x7f02f022d2a0
2017-10-12 08:06:59.377+0000: 4726: debug : virObjectEventNew:640 : obj=0x7f02f0191be0
2017-10-12 08:06:59.394+0000: 4726: debug : qemuDomainObjEndJob:3463 : Stopping job: modify (async=none vm=0x7f03280031d0 name=instance-00000006)
2017-10-12 08:06:59.394+0000: 4388: debug : virObjectEventDispose:134 : obj=0x7f02f022d2a0
2017-10-12 08:06:59.394+0000: 4388: debug : virObjectEventDispose:134 : obj=0x7f02f0191be0

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-10-12:

OS is Ubuntu 16.04, Libvirt version is 2.5.0 .
If i'm reading release dates correctly the libvirt version change from 1.3.1 to 2.5.0 might be a cause for this issue?

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2017-10-12:

console.html Edit (422.7 KiB, text/html)

NFS CI seems to have similar issues: Logs from test run some minutes ago show similar tests failing:

2017-10-12 10:12:34.162751 | {3} tempest.api.volume.test_volumes_snapshots.VolumesSnapshotTestJSON.test_snapshot_create_delete_with_volume_in_use [14.256258s] ... FAILED
2017-10-12 10:12:59.799040 | {3} tempest.api.volume.test_volumes_snapshots.VolumesSnapshotTestJSON.test_snapshot_create_offline_delete_online [20.797931s] ... FAILED

I'm attaching the console log for this test as an example.
Further examples can be seen in any recent run of gate-tempest-dsvm-full-devstack-plugin-nfs-nv .

Silvan Kaiser (2-silvan) on 2017-10-13

tags:	added: nfs
summary:	- online snapshot deletion breaks volume info and backing chain (with - remotefs drivers?) + online snapshot deletion breaks backing chain with remotefs drivers

Eric Harney (eharney) on 2017-11-07

Changed in cinder:
importance:	Undecided → High

Silvan Kaiser (2-silvan) on 2017-11-16

tags:

added: bugsmash

Revision history for this message

Sean McGinnis (sean-mcginnis) wrote on 2018-08-05:

#10

Is this still an issue?

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2018-08-17:

#11

@Sean Yep, it is, just re-tested for verification.

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2018-08-17:

#12

I just found that the NFS driver in Cinder seems to test this successfully. I need to find out why this is fixed for NFS now, will update with my findings.

Changed in nova:
status:	New → Incomplete
Changed in cinder:
status:	New → Incomplete

Revision history for this message

Launchpad Janitor (janitor) wrote on 2018-10-17:

#13

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status:	Incomplete → Expired

Revision history for this message

Silvan Kaiser (2-silvan) wrote on 2018-12-17:

#14

Finally i was able to attribute this to an issue in Quobyte client side caching, not an OpenStack or libvirt issue.

Changed in cinder:
status:	Incomplete → Invalid

Silvan Kaiser (2-silvan) on 2018-12-17

Changed in nova:
status:	Expired → Invalid

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.