Documentation of Ceph auth caps for RBD clients used by Cinder / Glance / Nova is missing or inconsistent

Bug #2051244 reported by Christian Rohmann
This bug affects 1 person
Affects                   Status       Importance  Assigned to  Milestone
Ceph                      New          Undecided   Unassigned
Cinder                    New          Undecided   Unassigned
Glance                    New          Undecided   Unassigned
OpenStack Compute (nova)  New          Undecided   Unassigned
OpenStack-Ansible         New          Undecided   Unassigned
glance_store              In Progress  Undecided   Unassigned

Bug Description

This bug originates from my post to the openstack-discuss ML - https://<email address hidden>/thread/E3VYY24HUGBNH7626ALOGZMJRVX5VOSZ/
which was discussed at a cinder-weekly (https://meetings.opendev.org/meetings/cinder/2024/cinder.2024-01-24-14.01.log.html#l-43).

In short: The Ceph auth (cephx) caps documented as correct and required for the RBD clients in Cinder, Glance and Nova appear to be inconsistent.
While it's nice to have the various deployment tools like openstack-ansible ([4]) or the charms ([5]) do it somewhat "properly",
first and foremost this needs to be properly documented in the source documentation of Glance, Cinder and Nova.

And achieving this is what this bug report is intended to do.
The proposed steps are ...

 * determine and discuss the correct caps (least privileges, caps via profiles where possible, ...)
 * update the documentation / install guides and the devstack code. Those should all serve as references for the correct way of doing things.
 * add an upgrade note to the release notes for Caracal, asking operators to check and align their caps
 * spread the word / open bugs for the deployment tools for them to update their config / code accordingly
 * send a PR to have Ceph update their docs

The long story about the various Ceph (RBD) clients and their uses within Glance, Cinder and Nova:

1) Glance

First there was a simple issue reported for Glance [3].

When Glance is asked to delete an image, it checks whether this image has dependent children, see https://opendev.org/openstack/glance_store/src/commit/6f5011d1f05c99894fb8b909d33ad23a20bf83a9/glance_store/_drivers/rbd.py#L459.
The children of Glance images are usually (Cinder) volumes, which live in a different RBD pool ("volumes"). If such children exist, the Glance API returns a 500 error.

Manually using the RBD client shows the same error:

> # rbd -n client.glance -k /etc/ceph/ceph.client.glance.keyring -p images children $IMAGE_ID
>
> 2023-12-13T16:51:48.131+0000 7f198cf4e640 -1 librbd::image::OpenRequest: failed to retrieve name: (1) Operation not permitted
> 2023-12-13T16:51:48.131+0000 7f198d74f640 -1 librbd::ImageState: 0x5639fdd5af60 failed to open image: (1) Operation not permitted
> rbd: listing children failed: (1) Operation not permitted
> 2023-12-13T16:51:48.131+0000 7f1990c474c0 -1 librbd::api::Image: list_descendants: failed to open descendant b7078ed7ace50d from pool instances:(1) Operation not permitted

So it's a permission error. Neither the Glance documentation [1] nor the Ceph documentation [2] on configuring the Ceph auth caps mentions granting Glance anything on the volumes pool.
So this is what I currently have configured:

> client.cinder
> key: REDACTED
> caps: [mgr] profile rbd pool=volumes, profile rbd-read-only pool=images
> caps: [mon] profile rbd
> caps: [osd] profile rbd pool=volumes, profile rbd-read-only pool=images
>
> client.glance
> key: REDACTED
> caps: [mgr] profile rbd pool=images
> caps: [mon] profile rbd
> caps: [osd] profile rbd pool=images
>
> client.nova
> key: REDACTED
> caps: [mgr] profile rbd pool=instances, profile rbd pool=images
> caps: [mon] profile rbd
> caps: [osd] profile rbd pool=instances, profile rbd pool=images
>

When granting the glance client e.g. "rbd-read-only" to the volumes pool via:
>
> # ceph auth caps client.glance mon 'profile rbd' osd 'profile rbd pool=images, profile rbd-read-only pool=volumes' mgr 'profile rbd pool=images, profile rbd-read-only pool=volumes'
>
the error is gone.
This is the wrong approach though, as was established during the discussion on the ML:

a) Commit [10] introduced the method "_snapshot_has_external_reference" to the yoga
release to fix [11]. The commit message also briefly states:
```

    NOTE: To check this dependency glance osd needs 'read' access to
    cinder and nova side RBD pool.
```

However, this requirement is not mentioned in the release notes for
Yoga, only in those for glance_store [13]. The (temporary, Yoga-only) requirement to grant Glance read-only access to the volumes pool
was also never revoked, so operators likely missed it.

b) The call to this method to check for snapshot references was removed again with [12]; this change was also backported to the 2023.1 release.
Again, the release notes did not tell operators about the change, although they could now remove the read access to the volumes pool from the Glance user.

c) Neither of the changes in a) and b) came with an update to the actual documentation on how to configure the Ceph caps of the glance user.

d) The "_snapshot_has_external_reference" method is currently just dangling and unused [14].

e) I am still wondering what the caps that allow reading "rbd_children"-prefixed RADOS objects are or were used for. Especially with the managed profiles such as "rbd" or "rbd-read-only",
things should be pretty well covered.

And finally: The Glance documentation at [18] is outdated.

2) DevStack

I also wondered why no unit tests fail in CI because of this [3].
Looking at what devstack does at [6], it appears that

a) it actually applies "allow class-read object_prefix rbd_children",
which is not what is currently documented in the setup guide(s) (see [7]
and [2])

b) it unnecessarily grants read permissions to NOVA_CEPH_POOL ("vms")
and CINDER_CEPH_POOL ("volumes") also for the Glance user

c) it does NOT use the managed capabilities called "profiles", such as "rbd"
or "rbd-read-only", but raw ACLs such as "rwx", see [9].

This also differs from the Cinder / Glance documentation and makes a big
difference, because the privileges granted by the profiles "include the ability to blocklist other
client users", which is required to remove the locks of stale RBD clients from images, see
https://docs.ceph.com/en/latest/rbd/rbd-exclusive-locks/#rbd-exclusive-locks.

This might not matter for CI / DevStack environments as such. But since those are used for validation,
they should ideally use the default / documented settings where possible, so that those settings are validated as well.
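To illustrate the distinction between raw ACLs and managed profiles, here is a minimal sketch of how an operator might classify a caps string as shown by `ceph auth get`. The function name `caps_style` is hypothetical, and the matching is deliberately simple: any grant starting with "allow" marks the whole string as raw.

```shell
# Hypothetical helper: classify one caps string (e.g. the [osd] value) as
# profile-based or as containing raw ACL grants.
caps_style() {
    case "$1" in
        # Any grant that starts with "allow", first or after a comma.
        "allow "*|*", allow "*|*",allow "*) echo "raw" ;;
        # Only "profile ..." grants, e.g. "profile rbd pool=images".
        "profile "*) echo "profile" ;;
        *) echo "unknown" ;;
    esac
}

caps_style "profile rbd pool=images"
caps_style "allow class-read object_prefix rbd_children, allow rwx pool=images"
```

A string mixing both styles is reported as "raw", since it contains at least one grant that bypasses the profiles.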

3) Cinder

There seems to be no documented caps when using the ceph-rbd volume driver [19].

4) Cinder-Backup

If cinder-backup is used with the Ceph driver [17], a keyring is required that allows creating snapshots of volumes (RBD images), which then serve as the source for backups.
Deletion of those snapshots must also be allowed, as cinder-backup removes them once they are no longer needed. While full "profile rbd" access to the volumes pool works,
cinder-backup likely does not need to be able to modify or even delete volumes. There could also be user snapshots, which cinder-backup does not need to be able to delete either.
Then there are the caps to store and retrieve backups via rbd import / rbd import-diff from another pool (potentially on a different cluster).

The caps required for cinder-backup currently do not seem to be documented anywhere, e.g. not in [17].

5) Nova

There are lots of RBD-related options, e.g. for libvirt [8], covering ...

 * instance storage (if `images_type=rbd`)
 * volumes
 * interaction with Glance images ([glance] -> enable_rbd_download)

However, there seems to be no list of the capabilities actually required, nor any recommendations, for the various interactions with RBD.

6) OpenStack-Ansible

OpenStack-Ansible uses ceph-ansible, but actively overrides the keyrings and their caps.
Overriding managed code should really just be a temporary fix (it was done for Stein, if I read this correctly).
Once the proper caps are defined, the openstack_keys in [15] should be converted into a PR towards ceph-ansible [16] to fix things globally there as well.

Likely there are other deployment tools applying their home-grown set of caps and Ceph users/keyrings, as there are no references to rely on.

[1] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[2] https://docs.ceph.com/en/latest/rbd/rbd-openstack/#setup-ceph-client-authentication
[3] https://bugs.launchpad.net/glance/+bug/2045158
[4] Openstack-Ansible: https://opendev.org/openstack/openstack-ansible/src/branch/master/inventory/group_vars/all/ceph.yml#L53-L60
[5] Charm: https://review.opendev.org/q/topic:%22bug/1696073%22 // https://bugs.launchpad.net/charm-glance/+bug/1696073
[6] https://opendev.org/openstack/devstack-plugin-ceph/src/commit/4c22c3d0905589d676bf4865ca5cf57994eb426d/devstack/lib/ceph#L712
[7] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[8] https://docs.openstack.org/nova/queens/configuration/config.html#libvirt.rbd_user
[9] https://docs.ceph.com/en/latest/rados/operations/user-management/#authorization-capabilities
[10] https://github.com/openstack/glance_store/commit/3d221ec529862d43ab303644e74ee9ad6ce8cd40
[11] https://bugs.launchpad.net/glance-store/+bug/1954883
[12] https://review.opendev.org/q/I34dcd90a09d43127ff2e8b477750c70f3cc01113
[13] https://docs.openstack.org/releasenotes/glance_store/yoga.html#relnotes-3-0-0-stable-yoga
[14] https://opendev.org/openstack/glance_store/src/commit/054bd5ddf5d4d255076bd5f44296f2521e899394/glance_store/_drivers/rbd.py#L455
[15] https://opendev.org/openstack/openstack-ansible/commit/0f92985608c0f6ff941ea0445ae25eab20e94fb4
[16] https://github.com/ceph/ceph-ansible/blob/b6102975549d8f870b0c20a01edda59d6ceac422/group_vars/all.yml.sample#L642
[17] https://docs.openstack.org/cinder/latest/configuration/block-storage/backup/ceph-backup-driver.html
[18] https://docs.openstack.org/glance/latest/configuration/configuring.html#configuring-the-rbd-storage-backend
[19] https://docs.openstack.org/cinder/latest/configuration/block-storage/drivers/ceph-rbd-volume-driver.html

summary: - Documentation of caps for Ceph auth of RBD clients used by Cinder /
- Glance / Nova is missing or inconsistent
+ Documentation of Ceph auth caps for RBD clients used by Cinder / Glance
+ / Nova is missing or inconsistent
Christian Rohmann (christian-rohmann) wrote :

Depending on whether Cinder should make use of the native CoW upload of an image from volume (https://review.opendev.org/c/openstack/cinder/+/809523) the caps might need further adjustments in this regard.

This should then be least privilege (read: "only allow creating new images, not changing or deleting existing ones").

OpenStack Infra (hudson-openstack) wrote : Fix proposed to glance_store (master)
Changed in glance-store:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to glance_store (master)

Reviewed: https://review.opendev.org/c/openstack/glance_store/+/907317
Committed: https://opendev.org/openstack/glance_store/commit/4bdba0cae30c9d52717e83d68dda68ec33bcddf7
Submitter: "Zuul (22348)"
Branch: master

commit 4bdba0cae30c9d52717e83d68dda68ec33bcddf7
Author: Christian Rohmann <email address hidden>
Date: Wed Jan 31 13:38:49 2024 +0100

    Remove _snapshot_has_external_reference from rbd driver

    With the implementation of the trash feature in [1] checking for external
    references is not done anymore, so this code is unused.

    [1] https://review.opendev.org/c/openstack/glance_store/+/884524

    Closes-Bug: #1959186
    Partial-Bug: #2051244
    Change-Id: I8e2b37441b5bb3675ebbc438f0c37d57df103ec7
