cephfs service gets restarted unnecessarily

Bug #2071780 reported by Ponnuvel Palaniyappan
Affects: OpenStack Ceph-FS Charm
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

The ceph-fs charm restarts the ceph-fs service frequently (about 12-15 times per hour).

The restart behaviour was introduced in the charm recently to support rotating keys:

https://github.com/openstack/charm-ceph-fs/commit/8655c1b81a79a7995c094736f28bb4f1919a690a

However, the service gets restarted even when there is no key change, i.e. every time the 'config-changed' hook is triggered.

Restarting the service every 5 minutes or so can also impact clients.
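The missing guard can be sketched as follows. This is a hypothetical illustration, not the charm's actual code: the helper name and the idea of comparing the previously rendered key against the new relation data are assumptions about how such a fix could look.

```python
import subprocess


def restart_mds_if_key_changed(svc, old_key, new_key, dry_run=True):
    """Restart the systemd unit only when the Ceph key actually rotated.

    Hypothetical sketch: 'old_key' would come from the previously
    rendered keyring, 'new_key' from the current relation data. With
    dry_run=True no systemctl command is executed, so the logic can be
    exercised safely anywhere.
    """
    if old_key == new_key:
        return False  # no rotation; leave the MDS running
    if not dry_run:
        subprocess.check_call(['sudo', 'systemctl', 'restart', svc])
    return True
```

With a check like this, a 'config-changed' hook that carries no key change would be a no-op instead of a restart.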

Reproducer: a simple cluster setup with quincy/stable charms (including ceph-fs charm) exhibits the issue.

Oddly, this doesn't happen if latest/edge is used (ceph-fs charm).

Tags: sts
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-fs (master)
Changed in charm-ceph-fs:
status: New → In Progress
Nobuto Murata (nobuto) wrote :

This is blocking one handover so I'm re-subscribing ~field-critical from the duplicate one:
https://bugs.launchpad.net/charm-ceph-fs/+bug/2074349

One patchset could fix both, but I believe there are two distinct issues here:

1. incorrect logic to restart the ceph-mds service
2. invoking config_changed function in each update-status hook

The patch set review is taking some time because getting the logic right for issue 1 is not trivial. Fixing issue 2, however, should be fairly straightforward if I'm not mistaken: it is low-hanging fruit and a good mitigation to buy time while issue 1 is fixed properly.
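Issue 2 comes down to the flags guarding the config_changed handler. A simplified model of the `@reactive.when_none`/`@reactive.when` semantics (this is a rough sketch for illustration, not charms.reactive itself): the handler runs only when none of the blocking flags are set and all required flags are, so blocking on 'is-update-status-hook' instead of 'run-default-update-status' keeps update-status hooks read-only.

```python
def config_changed_should_run(flags):
    """Simplified model of the decorators guarding config_changed.

    Mirrors @reactive.when_none('charm.paused', 'is-update-status-hook')
    combined with @reactive.when('ceph-mds.pools.available'): no blocking
    flag may be set, and every required flag must be.
    """
    blocked = {'charm.paused', 'is-update-status-hook'}
    required = {'ceph-mds.pools.available'}
    return not (flags & blocked) and required <= flags
```

As the logs below show, 'is-update-status-hook' is set early in every update-status hook, so guarding on it reliably skips the handler there.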

Nobuto Murata (nobuto) wrote :

[un-patched but some logs added charm from quincy/stable rev. 65]

-> config_changed function is called in every update-status hook.

diff --git a/charm.orig/reactive/ceph_fs.py b/charm/reactive/ceph_fs.py
index 8dc9898..b820b36 100644
--- a/charm.orig/reactive/ceph_fs.py
+++ b/charm/reactive/ceph_fs.py
@@ -68,6 +68,8 @@ def config_changed():
                 subprocess.check_call(['sudo', 'systemctl',
                                        'reset-failed', svc])
                 subprocess.check_call(['sudo', 'systemctl', 'restart', svc])
+                ch_core.hookenv.log(f'⚠️ systemctl restart {svc}',
+                                    ch_core.hookenv.WARNING)
             except subprocess.CalledProcessError as exc:
                 # The service can be temporarily masked when booting, so
                 # skip that class of errors.

unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer>
tracer: hooks phase, 1 handlers queued
tracer: ++ queue handler reactive/layer_openstack.py:64:default_update_status
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:64:default_update_status
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer: set flag run-default-update-status
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer: set flag is-update-status-hook
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer>
tracer: main dispatch loop, 7 handlers queued
tracer: ++ queue handler hooks/relations/ceph-mds/requires.py:31:joined:ceph-mds
tracer: ++ queue handler hooks/relations/ceph-mds/requires.py:35:changed:ceph-mds
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:117:broken:certificates
tracer: ++ queue handler reactive/ceph_fs.py:80:storage_ceph_connected
tracer: ++ queue handler reactive/layer_openstack.py:170:default_config_rendered
tracer: ++ queue handler reactive/layer_openstack.py:82:check_really_is_update_status
tracer: ++ queue handler reactive/layer_openstack.py:93:run_default_update_status
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/ceph_fs.py:80:storage_ceph_connected
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Request already sent but not complete, not sending new request
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer: cleared flag ceph-mds.pools.available
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Request already sent but not complete, not sending new request
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Request already sent but not complete, not sending new request
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:82:check_really_is_update_status
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:93:run_default_update_status
unit-ceph-fs-0: 11:57:15 DEBUG unit.ceph-fs/0.juju-log tracer>
tracer: cleared flag run-default-update-status
tracer: -- dequeue handler reactive/layer_openstack.py:93:run_default_update_status
unit-ceph-fs-0: 11:57:15 INFO unit.ceph-fs/0.juju-log Invoking reactive handler:...


Nobuto Murata (nobuto) wrote :

[patched charm on top of quincy/stable rev. 65]

-> no "Invoking reactive handler: reactive/ceph_fs.py:42:config_changed" line in config_changed.

diff --git a/charm.orig/reactive/ceph_fs.py b/charm/reactive/ceph_fs.py
index 8dc9898..7e94d83 100644
--- a/charm.orig/reactive/ceph_fs.py
+++ b/charm/reactive/ceph_fs.py
@@ -39,7 +39,7 @@ charm.use_defaults(
 )

-@reactive.when_none('charm.paused', 'run-default-update-status')
+@reactive.when_none('charm.paused', 'is-update-status-hook')
 @reactive.when('ceph-mds.pools.available')
 def config_changed():
     ceph_mds = reactive.endpoint_from_flag('ceph-mds.pools.available')
@@ -68,6 +68,8 @@ def config_changed():
                 subprocess.check_call(['sudo', 'systemctl',
                                        'reset-failed', svc])
                 subprocess.check_call(['sudo', 'systemctl', 'restart', svc])
+                ch_core.hookenv.log(f'⚠️ systemctl restart {svc}',
+                                    ch_core.hookenv.WARNING)
             except subprocess.CalledProcessError as exc:
                 # The service can be temporarily masked when booting, so
                 # skip that class of errors.

unit-ceph-fs-0: 22:08:46 DEBUG unit.ceph-fs/0.juju-log tracer>
tracer: hooks phase, 1 handlers queued
tracer: ++ queue handler reactive/layer_openstack.py:64:default_update_status
unit-ceph-fs-0: 22:08:46 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:64:default_update_status
unit-ceph-fs-0: 22:08:46 DEBUG unit.ceph-fs/0.juju-log tracer: set flag run-default-update-status
unit-ceph-fs-0: 22:08:46 DEBUG unit.ceph-fs/0.juju-log tracer: set flag is-update-status-hook
unit-ceph-fs-0: 22:08:46 DEBUG unit.ceph-fs/0.juju-log tracer>
tracer: main dispatch loop, 4 handlers queued
tracer: ++ queue handler hooks/relations/tls-certificates/requires.py:117:broken:certificates
tracer: ++ queue handler reactive/layer_openstack.py:161:default_config_not_rendered
tracer: ++ queue handler reactive/layer_openstack.py:82:check_really_is_update_status
tracer: ++ queue handler reactive/layer_openstack.py:93:run_default_update_status
unit-ceph-fs-0: 22:08:46 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:82:check_really_is_update_status
unit-ceph-fs-0: 22:08:46 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:93:run_default_update_status
unit-ceph-fs-0: 22:08:46 DEBUG unit.ceph-fs/0.juju-log tracer>
tracer: cleared flag run-default-update-status
tracer: -- dequeue handler reactive/layer_openstack.py:93:run_default_update_status
unit-ceph-fs-0: 22:08:46 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: reactive/layer_openstack.py:161:default_config_not_rendered
unit-ceph-fs-0: 22:08:46 DEBUG unit.ceph-fs/0.update-status inactive
unit-ceph-fs-0: 22:08:46 WARNING unit.ceph-fs/0.update-status Unit /<email address hidden> is masked, ignoring.
unit-ceph-fs-0: 22:08:46 INFO unit.ceph-fs/0.juju-log Invoking reactive handler: hooks/relations/tls-certificates/requires.py:117:broken:certificates
unit-ceph-fs-0: 22:08:46 DEBUG unit.ceph-fs/0.juju-log Running _assess_...


OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-fs (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-fs/+/925095

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-fs (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-fs/+/925095
Committed: https://opendev.org/openstack/charm-ceph-fs/commit/032949448fc6d1a472ff5d3040b0862b893c85d8
Submitter: "Zuul (22348)"
Branch: master

commit 032949448fc6d1a472ff5d3040b0862b893c85d8
Author: Nobuto Murata <email address hidden>
Date: Mon Jul 29 22:36:14 2024 +0900

    Don't make any changes during update-status hooks

    Previously the config_changed function was invoked during the
    update-status hooks. And it made unnecessary changes to the system.
    Guard reactive functions properly.

    > INFO unit.ceph-fs/0.juju-log Invoking reactive handler:
    > reactive/ceph_fs.py:42:config_changed

    Closes-Bug: #2074349
    Related-Bug: #2071780
    Change-Id: If6cd061fef4c3625d6d498942949e31f243622df

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-fs (stable/squid-jammy)

Related fix proposed to branch: stable/squid-jammy
Review: https://review.opendev.org/c/openstack/charm-ceph-fs/+/925135
