During deployment manila-ganesha may leave zombie ceph mds clients locking the ceph fs cluster until clients time out.

Bug #2073498 reported by Alex Kavanagh
This bug affects 2 people
Affects: OpenStack Ceph-FS Charm (Status: New, Importance: Undecided, Assigned to: Unassigned)
Affects: OpenStack Manila-Ganesha Charm (Status: New, Importance: Undecided, Assigned to: Unassigned)

Bug Description

(Tested on stable/2024.1 - caracal, but I've definitely seen this on previous versions).

The gate test for manila-ganesha has been unstable for a while (see [1] for example).

In the process of trying to resolve why the test is unstable, I noticed that at some point during the deployment, the number of ceph fs (mds) clients was too high:

# ceph fs status
ceph-fs - 3 clients
=======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
 0 clientreplay juju-d898ae-zaza-c985b42adc26-13 43 20 19 1
      POOL TYPE USED AVAIL
ceph-fs_metadata metadata 20.4M 9575M
  ceph-fs_data data 0 9575M
          STANDBY MDS
juju-d898ae-zaza-c985b42adc26-12
MDS version: ceph version 19.2.0~git20240301.4c76c50 (4c76c50a73f63ba48ccdf0adccce03b00d1d80c7) squid (dev)

There should be, at most, one client, corresponding to the single ganesha.nfsd daemon that might be running to hand out shares.
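
To see where the extra clients come from, the sessions held by the active MDS can be listed directly. This is a hedged sketch rather than output from the deployment above: the daemon name is the active MDS from the 'ceph fs status' output, and each entry in the reply carries the session "id" and the client address:

# ceph tell mds.juju-d898ae-zaza-c985b42adc26-13 session ls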

Indeed, the ceph status output indicates that the cluster is unhealthy:

# ceph status
  cluster:
    id: 6ec54076-4471-11ef-9348-2d0c7811fa71
    health: HEALTH_WARN
            1 filesystem is degraded

  services:
    mon: 3 daemons, quorum juju-d898ae-zaza-c985b42adc26-7,juju-d898ae-zaza-c985b42adc26-6,juju-d898ae-zaza-c985b42adc26-8 (age 14h)
    mgr: juju-d898ae-zaza-c985b42adc26-6(active, since 14h), standbys: juju-d898ae-zaza-c985b42adc26-7, juju-d898ae-zaza-c985b42adc26-8
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 3 up (since 14h), 3 in (since 14h)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools: 4 pools, 113 pgs
    objects: 29 objects, 7.2 MiB
    usage: 445 MiB used, 30 GiB / 30 GiB avail
    pgs: 113 active+clean

When this occurs, the ceph fs service is effectively blocked from accepting new shares. The ceph fs state cycles between "joining -> clientreplay -> joining" whilst waiting for the zombie clients to be timed out, which (in my testing) can take 10 or 15 minutes. That is also very odd, as the timeout for zombie clients is 300 seconds (5 minutes).
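
For reference, the timeouts involved can be read back from the MDS map for the filesystem: 'session_timeout' (default 60s) is when a client is considered stale, and 'session_autoclose' (default 300s) is when a stale client should be evicted. A hedged example, assuming the filesystem name 'ceph-fs' shown in the status output above:

# ceph fs get ceph-fs | grep -E 'session_(timeout|autoclose)'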

The upshot is that, for the gate test, the share creation fails.

If you evict the zombie clients, the ceph fs state returns to "active", ceph status returns to "HEALTH_OK", and the test then passes.
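
Evicting by hand looks roughly like the following (a hedged sketch: the MDS name is the one from the deployment above, and <session-id> stands in for each id reported by 'session ls'). Note that evicted clients are also added to the OSD blocklist by default:

# ceph tell mds.juju-d898ae-zaza-c985b42adc26-13 session evict id=<session-id>
# ceph osd blocklist ls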

Whilst we work out *why* manila-ganesha breaks during installation (in production it will clear after 10-15 minutes), I've modified the manila tests to detect the 'lock' situation and forcibly evict all the mds clients. This enables ceph to return to health and the test to pass. This is in PR#1246 on zaza-openstack-tests [2].

I'm also going to mark bug [1] as a duplicate of this one, as I think this report has the better description of the issue.

[1] https://bugs.launchpad.net/charm-manila-ganesha/+bug/1935022
[2] https://github.com/openstack-charmers/zaza-openstack-tests/pull/1246

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote:

Added the charm-ceph-fs project, as the bug may originate there. See this comment [1] from the review for the manila-ganesha charm [2]:

> Looking through the logs of the latest change, I can see that removing the ceph share is failing due to a timeout (error -110 == ETIMEDOUT). Looking at the Ceph FS logs, I can see that the mds server is restarted very regularly on both instances - each time a config-changed or assess-status call is invoked - and I think this is playing into the failures seen here.

> Looking at the change for https://review.opendev.org/c/openstack/charm-ceph-fs/+/916693, I can see that the mds service will be restarted every time the config-changed hook is run, without regard for whether it is actually needed. Essentially, this change is trying to make sure that the mds keys are updated, without considering whether the ceph keys actually changed or not. This causes very frequent restarts and recoveries of the mds servers.

[1] https://review.opendev.org/c/openstack/charm-manila-ganesha/+/917334/comments/c9e32574_add8dc07
[2] https://review.opendev.org/c/openstack/charm-manila-ganesha/+/917334
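
A hedged way to confirm the restart behaviour described in the quoted comment is to look at the systemd journal for the mds daemon on a ceph-fs unit; the unit and service names below are illustrative and will vary with the deployment:

# juju ssh ceph-fs/0 'sudo journalctl -u "ceph-mds@*" --since "1 hour ago"' | grep -iE 'started|stopped'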
