"juju refresh ceph-osd" triggers ceph-osd@.service restart in all the cluster at the same time

Bug #2068020 reported by Jorge Rodríguez
This bug affects 5 people
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Ceph OSD Charm | Status tracked in Trunk | | | |
| Quincy.2 | Fix Released | High | Unassigned | |
| Reef | Fix Committed | High | Unassigned | |
| Squid-jammy | Fix Committed | High | Unassigned | |
| Trunk | Fix Released | High | Unassigned | |

Bug Description

* An overview of your environment
The environment is a Focal-Yoga cloud running Ceph Quincy 17.2.7. There are 15 ceph-osd units hosting 6 disks per unit.

* A description of the problem and how it differs from your expected outcome
Running "juju refresh ceph-osd" in the cloud (upgrading from revision 564 to 585 on the quincy/stable channel) restarts every ceph-osd@ service on every ceph-osd unit at the same time. As a result, PGs go from Active to Inactive/Down simultaneously, which disrupts performance across the Ceph cluster.
In earlier tests, upgrading the ceph-osd charm from revision 576 to 585 (quincy/stable) did not show this behavior: the ceph-osd@ services were not restarted.
The expected outcome of upgrading the ceph-osd charm is no impact on the cluster, or at most a minimal one (e.g. a rolling restart of the ceph-osd services).
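
To make the "rolling restart" expectation concrete, here is a minimal Python sketch of the idea: restart one unit, wait until the cluster reports healthy, then move on. It is not charm code. `rolling_restart`, `restart_unit`, `cluster_healthy` and `wait` are hypothetical names introduced for illustration; the caller would supply the real restart and health-check logic.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    units: Iterable[str],
    restart_unit: Callable[[str], None],
    cluster_healthy: Callable[[], bool],
    max_checks: int = 30,
    wait: Callable[[], None] = lambda: None,
) -> List[str]:
    """Restart units one at a time, waiting for health after each one.

    All callables are hypothetical hooks supplied by the caller:
    - restart_unit(unit): restarts the OSD services on a single unit
    - cluster_healthy(): returns True once PGs are active again
    - wait(): pauses between health checks (e.g. time.sleep)
    """
    restarted: List[str] = []
    for unit in units:
        restart_unit(unit)
        restarted.append(unit)
        # Poll until the cluster recovers before touching the next unit.
        for _ in range(max_checks):
            if cluster_healthy():
                break
            wait()
        else:
            # Never proceed to the next unit while the cluster is degraded.
            raise RuntimeError(
                f"cluster not healthy after restarting {unit}; stopping"
            )
    return restarted
```

The important property is that a failed health check stops the loop, so at most one unit's OSDs are down at any time.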

* Software versions for: Ubuntu, OpenStack, Juju, and MAAS
Ubuntu 20.04.6 LTS
OpenStack Yoga (yoga/stable)
Juju 2.9.44
MAAS 3.3

* Command output errors or log messages directly related to the problem
From /var/log/juju/unit-ceph-osd-<id>.log on the ceph-osd units:
```
2024-06-04 07:30:33 INFO unit.ceph-osd/0.juju-log server.go:316 Installing apparmor profile for usr.bin.ceph-osd
2024-06-04 07:30:33 INFO unit.ceph-osd/0.juju-log server.go:316 Installing apparmor utils.
2024-06-04 07:30:34 INFO unit.ceph-osd/0.juju-log server.go:316 Setting up the apparmor profile for usr.bin.ceph-osd in disable mode.
2024-06-04 07:30:34 WARNING unit.ceph-osd/0.config-changed logger.go:60
2024-06-04 07:30:34 WARNING unit.ceph-osd/0.config-changed logger.go:60 ERROR: /sbin/apparmor_parser: Unable to remove "/usr/bin/ceph-osd". Profile doesn't exist
2024-06-04 07:30:34 WARNING unit.ceph-osd/0.config-changed logger.go:60
2024-06-04 07:30:34 INFO unit.ceph-osd/0.juju-log server.go:316 Manually disabling the apparmor profile for usr.bin.ceph-osd.
2024-06-04 07:30:34 INFO unit.ceph-osd/0.juju-log server.go:316 Loading new AppArmor profile
2024-06-04 07:30:34 INFO unit.ceph-osd/0.juju-log server.go:316 Restarting ceph-osd services with new AppArmor profile
2024-06-04 07:31:22 INFO juju.worker.uniter.operation runhook.go:159 ran "config-changed" hook (via explicit, bespoke hook script)
```
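
To check whether the restarts really coincided across units, the unit logs can be scanned for the "Restarting ceph-osd services with new AppArmor profile" line and the timestamps compared. The following Python sketch assumes the log line format shown above; `restart_times` and `restart_spread_seconds` are hypothetical helper names, not part of Juju or the charm.

```python
import re
from datetime import datetime
from typing import Dict, Iterable

# Matches the juju-log line emitted just before the OSD services restart.
RESTART_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) INFO "
    r"unit\.(?P<unit>ceph-osd/\d+)\.juju-log .*"
    r"Restarting ceph-osd services with new AppArmor profile"
)


def restart_times(lines: Iterable[str]) -> Dict[str, datetime]:
    """Map each unit name to the timestamp of its last restart message."""
    times: Dict[str, datetime] = {}
    for line in lines:
        match = RESTART_LINE.match(line)
        if match:
            times[match.group("unit")] = datetime.strptime(
                match.group("ts"), "%Y-%m-%d %H:%M:%S"
            )
    return times


def restart_spread_seconds(times: Dict[str, datetime]) -> float:
    """Seconds between the earliest and latest restart across units."""
    if not times:
        return 0.0
    return (max(times.values()) - min(times.values())).total_seconds()
```

A spread of only a few seconds across all units would indicate a simultaneous restart rather than a rolling one.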

* Steps to reproduce the problem
Run "juju refresh ceph-osd" in the cloud, upgrading from revision 564 to 585 (quincy/stable channel). This restarts all ceph-osd@{id}.service instances across the cluster at the same time.

tags: added: sts
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/921704

Changed in charm-ceph-osd:
status: New → Confirmed
importance: Undecided → High
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/921704
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/b4a870af2c74638f139d73a09dc61cd43865c4f5
Submitter: "Zuul (22348)"
Branch: master

commit b4a870af2c74638f139d73a09dc61cd43865c4f5
Author: Rodrigo Barbieri <email address hidden>
Date: Mon Jun 10 16:26:14 2024 -0300

    Defer restart of services on AppArmor change

    AppArmor changes cause restarts of all OSD
    services in all units at the same time, which
    can cause ceph-cluster outage.

    Defer actual update and set a status message prompting
    to run an action for each unit separately.

    Related-bug: #2068020
    Change-Id: I32398a0525b7098503de36d72e593c14207102a1
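
The pattern this commit message describes (defer the restart, set a status message, let the operator trigger the restart per unit) can be sketched as below. This is an illustration only and not the charm's actual implementation: `UnitState`, `on_apparmor_profile_changed`, `run_restart_action` and the status strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UnitState:
    """Hypothetical per-unit state; not the charm's real data model."""
    pending_restart: bool = False
    status: str = "Unit is ready"


def on_apparmor_profile_changed(state: UnitState) -> None:
    """Record that a restart is needed instead of restarting right away."""
    state.pending_restart = True
    # Hypothetical status text prompting the operator to act per unit.
    state.status = (
        "AppArmor profile changed; run the restart action "
        "on this unit to apply it"
    )


def run_restart_action(
    state: UnitState, restart_services: Callable[[], None]
) -> bool:
    """Operator-triggered action: restart only if a restart is pending."""
    if not state.pending_restart:
        return False
    restart_services()
    state.pending_restart = False
    state.status = "Unit is ready"
    return True
```

Because the restart only happens inside the operator-triggered action, a config change that reaches every unit at once no longer restarts every OSD at once.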

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (stable/squid-jammy)

Related fix proposed to branch: stable/squid-jammy
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923675

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (stable/jammy-caracal)

Related fix proposed to branch: stable/jammy-caracal
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923735

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (stable/squid-jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923675
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/e2a6a33f9ae56d0e3fa5cc510683ea3c84da6986
Submitter: "Zuul (22348)"
Branch: stable/squid-jammy

commit e2a6a33f9ae56d0e3fa5cc510683ea3c84da6986
Author: Rodrigo Barbieri <email address hidden>
Date: Mon Jun 10 16:26:14 2024 -0300

    Defer restart of services on AppArmor change

    AppArmor changes cause restarts of all OSD
    services in all units at the same time, which
    can cause ceph-cluster outage.

    Defer actual update and set a status message prompting
    to run an action for each unit separately.

    Related-bug: #2068020
    Change-Id: I32398a0525b7098503de36d72e593c14207102a1
    (cherry picked from commit b4a870af2c74638f139d73a09dc61cd43865c4f5)

tags: added: in-stable-squid-jammy
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (stable/reef)

Related fix proposed to branch: stable/reef
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923753

OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-osd (stable/jammy-caracal)

Change abandoned by "Rodrigo Barbieri <email address hidden>" on branch: stable/jammy-caracal
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923735
Reason: wrong branch

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (stable/reef)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923753
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/a5f13cac333e714fd03e0b44f7b33d0be3f88371
Submitter: "Zuul (22348)"
Branch: stable/reef

commit a5f13cac333e714fd03e0b44f7b33d0be3f88371
Author: Rodrigo Barbieri <email address hidden>
Date: Mon Jun 10 16:26:14 2024 -0300

    Defer restart of services on AppArmor change

    AppArmor changes cause restarts of all OSD
    services in all units at the same time, which
    can cause ceph-cluster outage.

    Defer actual update and set a status message prompting
    to run an action for each unit separately.

    Related-bug: #2068020
    Change-Id: I32398a0525b7098503de36d72e593c14207102a1
    (cherry picked from commit b4a870af2c74638f139d73a09dc61cd43865c4f5)
    (cherry picked from commit e2a6a33f9ae56d0e3fa5cc510683ea3c84da6986)

tags: added: in-stable-reef
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-ceph-osd (stable/quincy.2)

Related fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923786

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-ceph-osd (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/923786
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/659a88eba76c4967d8706b7497a0dceaac416018
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit 659a88eba76c4967d8706b7497a0dceaac416018
Author: Rodrigo Barbieri <email address hidden>
Date: Mon Jun 10 16:26:14 2024 -0300

    Defer restart of services on AppArmor change

    AppArmor changes cause restarts of all OSD
    services in all units at the same time, which
    can cause ceph-cluster outage.

    Defer actual update and set a status message prompting
    to run an action for each unit separately.

    Related-bug: #2068020
    Change-Id: I32398a0525b7098503de36d72e593c14207102a1
    (cherry picked from commit b4a870af2c74638f139d73a09dc61cd43865c4f5)
    (cherry picked from commit e2a6a33f9ae56d0e3fa5cc510683ea3c84da6986)
    (cherry picked from commit a5f13cac333e714fd03e0b44f7b33d0be3f88371)
