Cinder fails to create image-based volume if mirroring is enabled

Bug #1900775 reported by Andrey Grebennikov on 2020-10-20
Affects                  Importance   Assigned to
Cinder                   Undecided    Corey Bryant
Ubuntu Cloud Archive     (status tracked in Victoria)
  Mitaka                 High         Unassigned
  Queens                 High         Unassigned
  Stein                  High         Unassigned
  Train                  High         Unassigned
  Ussuri                 High         Unassigned
  Victoria               High         Unassigned
charm-ceph-rbd-mirror    Undecided    Unassigned
cinder (Ubuntu)          High         Unassigned
  Xenial                 High         Unassigned
  Bionic                 High         Unassigned
  Focal                  High         Unassigned
  Groovy                 High         Unassigned

Bug Description

OpenStack Train, Ceph Nautilus, ceph-rbd-mirror deployed in two-way mode.
Cinder uses Ceph as the backend for volumes.

Creating a volume from a qcow2 image.
The current flow is the following:
1. Cinder creates an empty volume in Ceph.
2. Cinder downloads the image.
3. The image is converted to raw.
4. The volume is deleted: https://github.com/openstack/cinder/blob/master/cinder/volume/drivers/rbd.py#L1611
5. Cinder performs "rbd import" using the unpacked raw image as the source.

Apparently the rbd-mirror daemon creates a snapshot upon image creation in Ceph (seemingly for mirroring purposes), which for an empty image lasts for about a second.
Step 4 very often runs during the window in which that snapshot exists, and it fails with a "Cannot delete the volume with snapshots" error.
The only way to avoid this behaviour is to disable mirroring of the backend pool, which is not desirable.
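
For illustration, here is a minimal sketch (not the cinder code itself) of why step 4 fails, assuming the python-rbd bindings; the pool and volume names are hypothetical:

import rados
import rbd

# Step 4 in the flow above: cinder deletes the freshly created empty volume
# before re-importing the converted raw image with "rbd import".
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('cinder-ceph')
    try:
        rbd.RBD().remove(ioctx, 'volume-00000000-0000-0000-0000-000000000000')
    except rbd.ImageHasSnapshots:
        # Raised while the short-lived "rbd_mirror.<volume_id>" snapshot
        # created by the rbd-mirror daemon still exists; cinder surfaces this
        # as "Cannot delete the volume with snapshots".
        print('volume still has a snapshot; delete failed')
    finally:
        ioctx.close()
finally:
    cluster.shutdown()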

Vern Hart (vern) wrote :

Our environment has two ceph clusters (az1 and az2) with a separate ceph-rbd-mirror charm deployed to each. The ceph-rbd-mirror charm enables pool mirroring on all rbd pools.

Glance is backed by only one of the ceph clusters, az1.

We have volume types named after the ceph cluster, arbor-az1 and arbor-az2.

When we copy an image to cinder using the following openstack cli, it's fast and reliable when copying from az1 to az1, but we have greater than 50% failure when copying from az1 to az2.

  openstack volume create --image bionic --size 3 --type arbor-az1 volume-az1 # works
  openstack volume create --image bionic --size 3 --type arbor-az2 volume-az2 # fails >50%

When the az2 image upload fails, it's always with "cinder.exception.VolumeIsBusy: deleting volume volume-<id> that has snapshot"

The full cinder-volume log follows.

2020-10-19 10:15:35.871 45458 INFO cinder.volume.flows.manager.create_volume [req-a4905f43-c6f8-42e4-a602-a00238068492 f2ee9e8060e54d058060d220fab84088 0c0bb65d03f446ca8a845d1ae9229790 - 1e59a4057fb24003a902bc4df6247d3c 1e59a4057fb24003a902bc4df6247d3c] Volume 63e4d69c-0722-40fc-b44f-cb59fca0d435: being created as image with specification: {'status': 'creating', 'volume_name': 'volume-63e4d69c-0722-40fc-b44f-cb59fca0d435', 'volume_size': 20, 'image_id': 'e93f97e0-f514-435d-a277-0ac288ed0c6c', 'image_location': ('rbd://36bd979e-0511-11eb-bced-ecebb88db476/glance/e93f97e0-f514-435d-a277-0ac288ed0c6c/snap', [{'url': 'rbd://36bd979e-0511-11eb-bced-ecebb88db476/glance/e93f97e0-f514-435d-a277-0ac288ed0c6c/snap', 'metadata': {}}]), 'image_meta': {'name': 'UBUNTU-18.04', 'disk_format': 'qcow2', 'container_format': 'bare', 'visibility': 'public', 'size': 359923712, 'virtual_size': None, 'status': 'active', 'checksum': '9aa011b2b79b1fe42a7c306555923b1b', 'protected': False, 'min_ram': 0, 'min_disk': 0, 'owner': '86fcc0b3839b45029dd325641ddc2a09', 'os_hidden': False, 'os_hash_algo': 'sha512', 'os_hash_value': '03786c51866c1c6c50ef671502b265afb71084e0d018748ca9b29f871ca5445c9eefffffdf927d540bddaf0e8597adc673e218e785fcf16013fba7e7d9898e6e', 'id': 'e93f97e0-f514-435d-a277-0ac288ed0c6c', 'created_at': datetime.datetime(2020, 10, 15, 16, 49, tzinfo=<iso8601.Utc>), 'updated_at': datetime.datetime(2020, 10, 15, 16, 49, 18, tzinfo=<iso8601.Utc>), 'locations': [{'url': 'rbd://36bd979e-0511-11eb-bced-ecebb88db476/glance/e93f97e0-f514-435d-a277-0ac288ed0c6c/snap', 'metadata': {}}], 'direct_url': 'rbd://36bd979e-0511-11eb-bced-ecebb88db476/glance/e93f97e0-f514-435d-a277-0ac288ed0c6c/snap', 'tags': [], 'file': '/v2/images/e93f97e0-f514-435d-a277-0ac288ed0c6c/file', 'properties': {}}, 'image_service': <cinder.image.glance.GlanceImageService object at 0x7fb92712cda0>}
2020-10-19 10:15:40.810 45458 INFO cinder.image.image_utils [req-a4905f43-c6f8-42e4-a602-a00238068492 f2ee9e8060e54d058060d220fab84088 0c0bb65d03f446ca8a845d1ae9229790 - 1e59a4057fb24003a902bc4df6247d3c 1e59a4057fb24003a902bc4df6247d3c] Image download 343.25 MB at 70.20 MB/s
2020-10-19 10:15:44.685 45458 INFO cinder.image.image_utils [req-a4905f43-c6f8-42e4-a602-a00238068492 f2ee9e8060e54d058060d220fab84088 0c0bb65d03f446ca8a845d1ae9229790 - 1e59a4057fb24003a902bc...

Changed in charm-ceph-rbd-mirror:
assignee: nobody → Aurelien Lourot (aurelien-lourot)

IMHO this is a bug in https://opendev.org/openstack/cinder . And I agree with the previous analysis:

1. volume/flows/manager/create_volume.py:_create_from_image_download() creates a volume, then
2. calls volume_utils.copy_image_to_volume(), which
    a. fetches the image,
    b. deletes the volume, and
    c. runs `rbd import`.

If a snapshot of the volume got created between 1 and 2b (e.g. by rbd mirror), 2b will fail. So I think we need one of these solutions:

1. Instead of creating a volume and immediately deleting it, let's not create it in the first place.
2. Is there a way to delete a volume and all its snapshots in an atomic way?
3. Can we force the deletion of a volume even if it has snapshots? (A rough sketch of this is below.)
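
A rough sketch of what option 3 (and, loosely, option 2) could look like with the python-rbd bindings follows; it is an illustration, not a proposed cinder patch, and it ignores snapshot protection for brevity:

import rbd

def force_delete_volume(ioctx, volume_name, attempts=3):
    """Remove all snapshots of an RBD image, then remove the image.

    Retries because a new transient mirror snapshot may appear between
    the snapshot purge and the image removal.
    """
    for attempt in range(attempts):
        try:
            with rbd.Image(ioctx, volume_name) as image:
                for snap in image.list_snaps():
                    image.remove_snap(snap['name'])
            rbd.RBD().remove(ioctx, volume_name)
            return
        except (rbd.ImageBusy, rbd.ImageHasSnapshots):
            # Another snapshot showed up in the meantime; try again.
            if attempt == attempts - 1:
                raise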

Changed in charm-ceph-rbd-mirror:
status: New → Invalid
assignee: Aurelien Lourot (aurelien-lourot) → nobody
Corey Bryant (corey.bryant) wrote :

It's worth noting that the ceph-rbd-mirror charm uses RBD journaling to mirror RBD images between two Ceph clusters. [1] The ceph-rbd-mirror charm doesn't use the snapshot-based approach to RBD Mirroring.

[1] https://jaas.ai/ceph-rbd-mirror

Corey Bryant (corey.bryant) wrote :

Documentation on ceph RBD mirroring modes (journal-based and snapshot-based): https://docs.ceph.com/en/latest/rbd/rbd-mirroring/

This is correct; however, while watching the snap list filtered by the volume ID, it is clear that a snapshot does get created for this empty volume, with a name of "rbd_mirror.<volume_id>", and it stays there for about a second. Maybe this is how the volume's image gets created on the other side of the mirror (like a metadata sync); it is more a question about the behaviour of the ceph-rbd-mirror daemon itself.

Corey Bryant (corey.bryant) wrote :

Andrey/Vern, can you help us understand where the snapshot is coming from? Perhaps we could also get confirmation that journaling is in fact configured for the rbd mirroring. I'm not sure how to get that information; perhaps 'rbd mirror pool status' will show details.

Corey Bryant (corey.bryant) wrote :

I have a bionic-train ceph-rbd-mirror deployment up (the smoke test from charm-ceph-rbd-mirror). I need to pick this up tomorrow but the next step is to create a volume from an image and hopefully I can add a sleep to the cinder code to recreate the reported bug. Listing some relevant commands below that I think will be useful, in case someone else picks this up or someone has better commands. Note: I'm listing the glance pool below because it has a snapshot.

ceph-mon/0
----------
$ sudo ceph osd lspools
1 glance
2 cinder-ceph

ceph-rbd-mirror/0
-----------------
$ ls /etc/ceph
ceph.client.rbd-mirror.juju-2e2313-zaza-940825664f98-12.keyring ceph.conf rbdmap remote.client.rbd-mirror.juju-2e2313-zaza-940825664f98-12.keyring remote.conf
$ sudo rbd --id rbd-mirror.juju-2e2313-zaza-940825664f98-12 pool stats glance
Total Images: 1
Total Snapshots: 1
Provisioned Size: 342 MiB
$ sudo rbd --id rbd-mirror.juju-2e2313-zaza-940825664f98-12 pool stats cinder-ceph
Total Images: 1
Total Snapshots: 0
Provisioned Size: 8 GiB
$ sudo rbd --id rbd-mirror.juju-2e2313-zaza-940825664f98-12 mirror pool status glance
health: OK
images: 1 total
    1 stopped
$ sudo rbd --id rbd-mirror.juju-2e2313-zaza-940825664f98-12 mirror pool status cinder-ceph
health: OK
images: 1 total
    1 stopped
$ sudo rbd --id rbd-mirror.juju-2e2313-zaza-940825664f98-12 ls -l glance
NAME SIZE PARENT FMT PROT LOCK
db355184-09a7-4ecc-869e-5a99ef0ef97e 342 MiB 2
db355184-09a7-4ecc-869e-5a99ef0ef97e@snap 342 MiB 2 yes
$ sudo rbd --id rbd-mirror.juju-2e2313-zaza-940825664f98-12 ls -l cinder-ceph
NAME SIZE PARENT FMT PROT LOCK
volume-b0ea5477-d218-409e-b608-dc992a659c0a 8 GiB 2
$ sudo rbd --id rbd-mirror.juju-2e2313-zaza-940825664f98-12 snap ls glance/db355184-09a7-4ecc-869e-5a99ef0ef97e
SNAPID NAME SIZE PROTECTED TIMESTAMP
    24 snap 342 MiB yes Wed Oct 21 20:29:56 2020

Corey, as I mentioned, the snapshot lives for only about 1 second. In order to reproduce that:
1. Create a pool
2. Set up mirror
3. Create a volume from the image
4. IMMEDIATELY grab the volume id
5. run "watch -n0.2 rbd -p <pool> snap ls|grep <vol_id>" (alternatively just run watch without grep permanently). You will see the snapshot come and go.

Adding a sleep to the cinder code could only help to work around the issue (it would be a dirty hack and would never be accepted upstream), and it won't help to watch the snapshot lifecycle.
The replication is indeed configured with journals because this is Nautilus (no snapshot-based mirroring yet).
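
For completeness, a Python equivalent of the watch loop above (assuming the python-rbd bindings; the pool name is hypothetical) that prints any transient "rbd_mirror.*" snapshots as they appear:

import time

import rados
import rbd

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
try:
    ioctx = cluster.open_ioctx('cinder-ceph')
    try:
        for _ in range(100):  # poll for roughly 20 seconds at 0.2 s intervals
            for name in rbd.RBD().list(ioctx):
                try:
                    with rbd.Image(ioctx, name, read_only=True) as image:
                        for snap in image.list_snaps():
                            if snap['name'].startswith('rbd_mirror'):
                                print(name, snap['name'])
                except rbd.ImageNotFound:
                    pass  # the image was deleted between list() and open
            time.sleep(0.2)
    finally:
        ioctx.close()
finally:
    cluster.shutdown()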

James Page (james-page) wrote :

The exception is raised in the following code block:

https://github.com/openstack/cinder/blob/master/cinder/volume/drivers/rbd.py#L1145

(maybe).

As this is an expected situation, having some sort of retry around this might work OK.

Corey Bryant (corey.bryant) wrote :

@Andrey you're right, I think the sleep would have to be in Ceph. @James yes, we might be able to do something like:

diff --git a/cinder/volume/drivers/rbd.py b/cinder/volume/drivers/rbd.py
index 5b868e4fe..d0b556a94 100644
--- a/cinder/volume/drivers/rbd.py
+++ b/cinder/volume/drivers/rbd.py
@@ -1594,6 +1594,9 @@ class RBDDriver(driver.CloneableImageVD, driver.MigrateVD,
             finally:
                 fileutils.delete_if_exists(dest_image_path)

+    @utils.retry(exception.VolumeIsBusy,
+                 self.configuration.rados_connection_interval,
+                 self.configuration.rados_connection_retries)
     def _copy_image_to_volume(self, context, volume, image_service, image_id,
                               encrypted=False):
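
For clarity, here is a self-contained sketch of that retry idea (an illustration, not the actual cinder patch); fixed interval/retry values stand in for the rados_connection_* options, since self is not available when decorator arguments are evaluated at class-definition time:

import functools
import time


class VolumeIsBusy(Exception):
    """Stand-in for cinder.exception.VolumeIsBusy."""


def retry(exc_type, interval=1, retries=3):
    """Minimal retry decorator in the spirit of cinder's utils.retry."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(retries):
                try:
                    return func(*args, **kwargs)
                except exc_type:
                    if attempt == retries - 1:
                        raise
                    # Give the transient rbd-mirror snapshot time to go away.
                    time.sleep(interval)
        return wrapper
    return decorator


@retry(VolumeIsBusy, interval=1, retries=3)
def copy_image_to_volume(context, volume, image_service, image_id):
    # Placeholder body: fetch the image, delete the empty volume, and
    # "rbd import" the converted raw image, as in the flow described above.
    pass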

Corey Bryant (corey.bryant) wrote :

and the sleep was just a way to force recreation, not a fix :)

Fix proposed to branch: master
Review: https://review.opendev.org/759315

Changed in cinder:
assignee: nobody → Corey Bryant (corey.bryant)
status: New → In Progress
Changed in cinder (Ubuntu Groovy):
status: New → Triaged
Changed in cinder (Ubuntu Focal):
status: New → Triaged
Changed in cinder (Ubuntu Bionic):
status: New → Triaged
Changed in cinder (Ubuntu Xenial):
status: New → Triaged
importance: Undecided → High
Changed in cinder (Ubuntu Focal):
importance: Undecided → High
Changed in cinder (Ubuntu Groovy):
importance: Undecided → High
Changed in cinder (Ubuntu Bionic):
importance: Undecided → High
Vern Hart (vern) wrote :

I tested Corey's fix (via ppa) side-by-side with a cinder unit with version 2:15.4.0-0ubuntu1~cloud0 (from bionic-proposed/train).

I used a bionic qcow2 image with:
  openstack volume create --image bionic --size 3 --type stable2-az2 vernvol

Again, glance is backed by the az1 ceph cluster, so this is a cross-cluster copy. In our previous tests, qcow2 images tend to fail at a much higher rate than raw images.

On the unpatched cinder, 77% failure, with 23 out of 30 tries failing with:
  Exception during message handling: cinder.exception.ImageCopyFailure: Failed to copy image to volume: deleting volume volume-1b4d5231-2baf-4880-a380-f1dc937b20bd that has snapshot

On Corey's patched version, 0% failure. 30 out of 30 volumes succeeded.

So, in this deployment, this simple retry solves the problem.

Corey Bryant (corey.bryant) wrote :

Vern, Thanks very much for testing. I'm glad it worked. I just wanted to give you an update on this. I've been monitoring the upstream gate status which has been broken and blocking all new patches. We want to land patches upstream first so they can get a good upstream review before starting SRUs. This helps reduce regression potential. I'll continue to monitor the gate status and once the gate is back in order I'll work on pushing to get a review of the patch.

Michael Skalka (mskalka) wrote :

Bumping this down to High as a workaround exists.
