Manila ceph configuration won't work in HA mode

Bug #1905542 reported by David Aikema
Affects: kolla-ansible
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned

Bug Description

**Bug Report**

What happened:

When more than one manila_share instance is set up with external Ceph, problems occur because in the current kolla-ansible configuration the containers all share the same auth ID. (With each restart of the manila_share container on a controller, the Ceph sessions for the other controllers are evicted.)

In the manila-share logs:
In the chunk of logs below, each instance of manila-share appears to run code that evicts all other Ceph clients named manila; this happens at 14:49:16.939 (06-37), 14:49:17.031 (06-39), and 14:49:17.022 (06-41).
$ ansible -i /etc/kolla/multinode control -m shell -a "grep -iR '2020-11-24 14:49:' /var/log/kolla/manila/manila-share.log"
B-06-37-openstack-ctl.maas | CHANGED | rc=0 >>
2020-11-24 14:49:16.459 6 INFO oslo_service.periodic_task [-] Skipping periodic task update_share_usage_size because it is disabled
2020-11-24 14:49:16.475 6 INFO oslo_service.service [req-f32e40b4-3566-49d6-957f-6fb480c42db6 - - - - -] Starting 1 workers
2020-11-24 14:49:16.483 19 INFO manila.service [-] Starting manila-share node (version 10.0.2)
2020-11-24 14:49:16.918 19 INFO manila.share.drivers.cephfs.driver [req-71c01790-6d75-4cf6-97f8-904d795727fe - - - - -] [CEPHFS1}] Ceph client found, connecting...
2020-11-24 14:49:16.939 19 INFO ceph_volume_client [req-71c01790-6d75-4cf6-97f8-904d795727fe - - - - -] evict clients with auth_name=manila
2020-11-24 14:49:16.944 19 INFO ceph_volume_client [req-71c01790-6d75-4cf6-97f8-904d795727fe - - - - -] evict: joined all
2020-11-24 14:49:16.949 19 INFO manila.share.drivers.cephfs.driver [req-71c01790-6d75-4cf6-97f8-904d795727fe - - - - -] [CEPHFS1] Ceph client connection complete.
2020-11-24 14:49:16.962 19 INFO manila.share.manager [req-71c01790-6d75-4cf6-97f8-904d795727fe - - - - -] Updating share status
2020-11-24 14:49:16.977 19 INFO manila.share.manager [req-71c01790-6d75-4cf6-97f8-904d795727fe - - - - -] Finished initialization of driver: 'CephFSDriver@B-06-37-openstack-ctl@cephfsnative1'
B-06-39-openstack-ctl.maas | CHANGED | rc=0 >>
2020-11-24 14:49:16.584 6 INFO oslo_service.periodic_task [-] Skipping periodic task update_share_usage_size because it is disabled
2020-11-24 14:49:16.599 6 INFO oslo_service.service [req-498d5964-a9e3-4484-8a7e-f2d894fc4e9f - - - - -] Starting 1 workers
2020-11-24 14:49:16.606 19 INFO manila.service [-] Starting manila-share node (version 10.0.2)
2020-11-24 14:49:17.012 19 INFO manila.share.drivers.cephfs.driver [req-10e0d7ef-bf38-4352-95ea-80edab67e0bb - - - - -] [CEPHFS1}] Ceph client found, connecting...
2020-11-24 14:49:17.031 19 INFO ceph_volume_client [req-10e0d7ef-bf38-4352-95ea-80edab67e0bb - - - - -] evict clients with auth_name=manila
2020-11-24 14:49:17.590 19 INFO ceph_volume_client [req-10e0d7ef-bf38-4352-95ea-80edab67e0bb - - - - -] evict: joined all
2020-11-24 14:49:17.595 19 INFO manila.share.drivers.cephfs.driver [req-10e0d7ef-bf38-4352-95ea-80edab67e0bb - - - - -] [CEPHFS1] Ceph client connection complete.
2020-11-24 14:49:17.615 19 INFO manila.share.manager [req-10e0d7ef-bf38-4352-95ea-80edab67e0bb - - - - -] Updating share status
2020-11-24 14:49:17.633 19 INFO manila.share.manager [req-10e0d7ef-bf38-4352-95ea-80edab67e0bb - - - - -] Finished initialization of driver: 'CephFSDriver@B-06-39-openstack-ctl@cephfsnative1'
B-06-41-openstack-ctl.maas | CHANGED | rc=0 >>
2020-11-24 14:49:16.569 7 INFO oslo_service.periodic_task [-] Skipping periodic task update_share_usage_size because it is disabled
2020-11-24 14:49:16.585 7 INFO oslo_service.service [req-79fa62e5-dcff-4c19-a516-143844096da7 - - - - -] Starting 1 workers
2020-11-24 14:49:16.591 20 INFO manila.service [-] Starting manila-share node (version 10.0.2)
2020-11-24 14:49:17.000 20 INFO manila.share.drivers.cephfs.driver [req-d9be838e-8a1e-4b14-b43b-979d97a7e5dd - - - - -] [CEPHFS1}] Ceph client found, connecting...
2020-11-24 14:49:17.022 20 INFO ceph_volume_client [req-d9be838e-8a1e-4b14-b43b-979d97a7e5dd - - - - -] evict clients with auth_name=manila
2020-11-24 14:49:17.589 20 INFO ceph_volume_client [req-d9be838e-8a1e-4b14-b43b-979d97a7e5dd - - - - -] evict: joined all
2020-11-24 14:49:17.594 20 INFO manila.share.drivers.cephfs.driver [req-d9be838e-8a1e-4b14-b43b-979d97a7e5dd - - - - -] [CEPHFS1] Ceph client connection complete.
2020-11-24 14:49:17.613 20 INFO manila.share.manager [req-d9be838e-8a1e-4b14-b43b-979d97a7e5dd - - - - -] Updating share status
2020-11-24 14:49:17.633 20 INFO manila.share.manager [req-d9be838e-8a1e-4b14-b43b-979d97a7e5dd - - - - -] Finished initialization of driver: 'CephFSDriver@B-06-41-openstack-ctl@cephfsnative1'

That maps to https://github.com/openstack/manila/blob/stable/ussuri/manila/share/drivers/cephfs/driver.py#L220, specifically this chunk of code that runs when a volume client is created:
```
        self._volume_client = ceph_volume_client.CephFSVolumeClient(
            auth_id, conf_path, cluster_name, volume_prefix=volume_prefix)
        LOG.info("[%(be)s}] Ceph client found, connecting...",
                 {"be": self.backend_name})
        if auth_id != CEPH_DEFAULT_AUTH_ID:
            # Evict any other manila sessions. Only do this if we're
            # using a client ID that isn't the default admin ID, to avoid
            # rudely disrupting anyone else.
            premount_evict = auth_id
        else:
            premount_evict = None
        try:
            self._volume_client.connect(premount_evict=premount_evict)
```
that calls https://github.com/ceph/ceph/blob/octopus/src/pybind/ceph_volume_client.py#L488 ... which specifically notes:
```
        :param premount_evict: Optional auth_id to evict before mounting the filesystem: callers
                               may want to use this to specify their own auth ID if they expect
                               to be a unique instance and don't want to wait for caps to time
                               out after failure of another instance of themselves.
```

https://documentation.suse.com/external-tree/en-us/soc/9/openstack/user/html/manila/admin/cephfs_driver.html also mentions:
```
A CephFS driver instance, represented as a backend driver section in manila.conf, requires a Ceph auth ID unique to the backend Ceph Filesystem. Using a non-unique Ceph auth ID will result in the driver unintentionally evicting other CephFS clients using the same Ceph auth ID to connect to the backend.
```

The kolla-ansible configuration for Manila uses the same auth ID for all of the manila-share instances rather than the unique IDs that appear to be required, resulting in this pattern of evictions.
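
For illustration only, a minimal sketch of creating one Ceph auth ID per controller (the hostnames are hypothetical, and the capability set should be copied from whatever the existing shared client.manila user already has):

```
# Inspect the existing shared user and note its mon/mds/osd caps:
ceph auth get client.manila

# Create one client per controller, reusing those caps (the caps below are
# placeholders for illustration, not the exact set required by the driver):
ceph auth get-or-create client.manila-ctl01 \
    mon 'allow r' mgr 'allow rw' \
    -o ceph.client.manila-ctl01.keyring
```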

Manually setting each of the manila-share instances to use a separate auth ID seems to allow multiple manila_share containers to function without interfering with each other (updating config.json to stage a different key on each controller, and updating manila.conf to reference that separate auth ID).
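
As a sketch of that workaround, the backend section in manila.conf on each controller could point at a per-host auth ID (the section and backend names match the `cephfsnative1`/`CEPHFS1` backend from the logs above; the per-host ID naming scheme is only an assumption):

```
[cephfsnative1]
share_backend_name = CEPHFS1
share_driver = manila.share.drivers.cephfs.driver.CephFSDriver
driver_handles_share_servers = False
cephfs_conf_path = /etc/ceph/ceph.conf
# Unique per controller, e.g. "manila-b-06-37-openstack-ctl" on B-06-37:
cephfs_auth_id = manila-{{ ansible_hostname }}
```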

(Note that we don't have `enable_manila_backend_cephfs_nfs` enabled. It's unclear whether that would also require a separate keyring for each of those, i.e. going from one auth ID per controller to two.)

What you expected to happen:

Multiple `manila_share` containers function without interfering with each other.

How to reproduce it (minimal and precise):

Deploy with external Ceph and the Manila CephFS native backend enabled:

```
enable_manila: "yes"
enable_manila_backend_cephfs_native: "yes"
```

**Environment**:
* OS (e.g. from /etc/os-release):
* Kernel (e.g. `uname -a`): Linux B-06-39-openstack-ctl 5.4.0-54-generic #60~18.04.1-Ubuntu SMP Fri Nov 6 17:25:16 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
* Docker version if applicable (e.g. `docker version`): 19.03.13
* Kolla-Ansible version (e.g. `git head or tag or stable branch` or pip package version if using release): stable/ussuri
* Docker image Install type (source/binary): source
* Docker image distribution:
* Are you using official images from Docker Hub or self built? self-built
* If self built - Kolla version and environment used to build: stable/ussuri
* Share your inventory file, globals.yml and other configuration files if relevant

Revision history for this message
Mark Goddard (mgoddard) wrote :

I'm no Manila expert, but one thing we are missing in the configuration is the [coordination] section. This is used for synchronisation between services, and could be a factor here. It requires a functioning key/value store such as etcd or redis.

I would suggest raising this with the Manila team in #openstack-manila on IRC.
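
For reference, a minimal sketch of what that could look like in manila.conf (the etcd endpoint below is an example address only; a redis:// URL should work equally well via tooz):

```
[coordination]
# Requires a reachable key/value store; this address is illustrative.
backend_url = etcd3+http://192.0.2.10:2379
```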

Revision history for this message
David Aikema (david-aikema) wrote :

The attached patch does enable Manila to function with a separate key per container, although it also returns the keyring filenames to a semi-hardcoded approach.

There are still some references to `ceph_manila_keyring` left after this change that I'm not quite sure what to do about. Instead of the default `ceph.client.manila.keyring`, the keyring references on each controller are now set to `ceph.client.manila-{{ ansible_hostname }}.keyring`.

I haven't made changes to the coordination section mentioned by Mark Goddard, and agree those would likely be worthwhile, but the changes made here seem to alleviate the symptoms that were being encountered.
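
As a rough sketch (not taken from the attached patch), the per-controller keyring staging in the manila-share container's config.json would be a config_files entry along these lines:

```
{
    "source": "/var/lib/kolla/config_files/ceph.client.manila-{{ ansible_hostname }}.keyring",
    "dest": "/etc/ceph/ceph.client.manila-{{ ansible_hostname }}.keyring",
    "owner": "manila",
    "perm": "0600"
}
```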

Revision history for this message
Michal Nasiadka (mnasiadka) wrote :

It seems this has been solved by https://review.opendev.org/c/openstack/kolla-ansible/+/877413 and will be available in the Bobcat release.

Changed in kolla-ansible:
status: New → Fix Committed