The filesystem frequently goes offline and fails over to another unit

Bug #2074349 reported by Yoshi Kadokawa
This bug affects 1 person
Affects: OpenStack Ceph-FS Charm
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Observed the following log messages in /var/log/ceph/ceph.log on one of the ceph-mon units:

2024-07-29T06:40:42.313271+0000 mon.juju-6618b8-2-lxd-1 (mon.0) 884 : cluster [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
2024-07-29T06:40:42.313323+0000 mon.juju-6618b8-2-lxd-1 (mon.0) 885 : cluster [ERR] Health check failed: 1 filesystem is offline (MDS_ALL_DOWN)
2024-07-29T06:40:42.332571+0000 mon.juju-6618b8-2-lxd-1 (mon.0) 887 : cluster [INF] Standby daemon mds.juju-6618b8-2-lxd-0 assigned to filesystem ceph-fs as rank 0
2024-07-29T06:40:42.332934+0000 mon.juju-6618b8-2-lxd-1 (mon.0) 888 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)
2024-07-29T06:40:44.364311+0000 mon.juju-6618b8-2-lxd-1 (mon.0) 895 : cluster [INF] daemon mds.juju-6618b8-2-lxd-0 is now active in filesystem ceph-fs as rank 0
2024-07-29T06:40:45.363716+0000 mon.juju-6618b8-2-lxd-1 (mon.0) 896 : cluster [INF] Health check cleared: FS_DEGRADED (was: 1 filesystem is degraded)

This is observed quite often, roughly every 5 minutes or less.
This seems to be linked to the update-status hook in the ceph-fs charm. I haven't identified the exact code, but the ceph-mds@<unit-host-name>.service on the ceph-fs unit appears to be restarted every time the update-status hook runs, i.e. every 5 minutes, which takes the filesystem offline.
Not a workaround, but I can confirm that this issue stops when you stop the jujud-machine-<id> service on the ceph-fs units.
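
For illustration, a minimal sketch of the pattern I suspect (the flag name matches the trace later in this bug, but the handler body and the restart call are my assumptions, not the charm's actual code). charms.reactive re-evaluates every handler whose flags match on every hook dispatch, including update-status, so an unguarded handler fires every 5 minutes:

    import socket

    from charmhelpers.core.host import service_restart
    from charms.reactive import when

    @when('ceph-mds.pools.available')
    def config_changed():
        # Illustrative body: re-rendering config and restarting the MDS on
        # every dispatch. Since update-status also triggers dispatch, this
        # runs every ~5 minutes and each restart drops rank 0, forcing a
        # fail-over to a standby MDS.
        service_restart('ceph-mds@{}'.format(socket.gethostname()))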

This is reproducible with this basic bundle [0].

[0] https://pastebin.ubuntu.com/p/4CwvmYD5cc/

Revision history for this message
Yoshi Kadokawa (yoshikadokawa) wrote :

As we don't have a workaround for this issue, subscribing this to field-critical.

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

I believe this to be a duplicate of bug #2071780; we're working on a fix.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-fs (master)

Changed in charm-ceph-fs:
status: New → In Progress

Revision history for this message
Nobuto Murata (nobuto) wrote :

For the record from https://bugs.launchpad.net/charm-ceph-fs/+bug/2071780/comments/3

The run-default-update-status and is-update-status-hook flags are set at the same time. However, run-default-update-status is cleared before config_changed is invoked and before is-update-status-hook is cleared.

unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:64:default_update_status
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer: set flag run-default-update-status
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer: set flag is-update-status-hook
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/ceph_fs.py:80:storage_ceph_connected
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer: cleared flag ceph-mds.pools.available
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:82:check_really_is_update_status
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:93:run_default_update_status
tracer: cleared flag run-default-update-status
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:170:default_config_rendered
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:117:broken:certificates
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: hooks/relations/ceph-mds/requires.py:31:joined:ceph-mds
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: hooks/relations/ceph-mds/requires.py:35:changed:ceph-mds
tracer: set flag ceph-mds.pools.available
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/ceph_fs.py:42:config_changed
tracer: cleared flag is-update-status-hook
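
Condensing the trace above into code form (the flag names are taken from the trace; the handler structure is my paraphrase of layer-openstack, not its exact source):

    from charms.reactive import clear_flag, hook, set_flag, when

    @hook('update-status')
    def default_update_status():
        # Both flags are set together at the start of the hook.
        set_flag('run-default-update-status')
        set_flag('is-update-status-hook')

    @when('run-default-update-status')
    def run_default_update_status():
        # Cleared mid-dispatch, while other handlers (including
        # config_changed) are still queued to run in the same hook.
        clear_flag('run-default-update-status')

    # 'is-update-status-hook' is only cleared at the very end of dispatch,
    # so it is the flag a handler must check to detect update-status.

The consequence is that a handler gating on run-default-update-status can still fire during an update-status hook; only is-update-status-hook spans the whole dispatch.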

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-fs (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-fs/+/925095
Committed: https://opendev.org/openstack/charm-ceph-fs/commit/032949448fc6d1a472ff5d3040b0862b893c85d8
Submitter: "Zuul (22348)"
Branch: master

commit 032949448fc6d1a472ff5d3040b0862b893c85d8
Author: Nobuto Murata <email address hidden>
Date: Mon Jul 29 22:36:14 2024 +0900

    Don't make any changes during update-status hooks

    Previously the config_changed function was invoked during the
    update-status hooks. And it made unnecessary changes to the system.
    Guard reactive functions properly.

    > INFO unit.ceph-fs/0.juju-log Invoking reactive handler:
    > reactive/ceph_fs.py:42:config_changed

    Closes-Bug: #2074349
    Related-Bug: #2071780
    Change-Id: If6cd061fef4c3625d6d498942949e31f243622df

Changed in charm-ceph-fs:
status: In Progress → Fix Committed
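
The guard pattern the commit message describes, in rough form (the decorators are standard charms.reactive; the exact handlers changed are in the review linked above, and the handler body here is illustrative):

    from charms.reactive import when, when_not

    @when('ceph-mds.pools.available')
    @when_not('is-update-status-hook')  # no-op during update-status hooks
    def config_changed():
        # Render configuration and (re)start the MDS only when invoked
        # outside of an update-status hook.
        ...
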
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-fs (stable/squid-jammy)

Fix proposed to branch: stable/squid-jammy
Review: https://review.opendev.org/c/openstack/charm-ceph-fs/+/925135

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-fs (stable/reef)

Fix proposed to branch: stable/reef
Review: https://review.opendev.org/c/openstack/charm-ceph-fs/+/925154

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-fs (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-fs/+/925155
