Ceph Monitor Charm

NRPE should alert when the OSD count reduces

Bug #1952985 reported by Xav Paice on 2021-12-02

This bug affects 1 person

Affects             Status         Importance  Assigned to  Milestone
Ceph Monitor Charm  Fix Committed  Undecided   Unassigned

Bug Description

When a disk in a host fails, operators rightly remove it as quickly as possible so that attempts to access the device do not cause a high count of D-state processes. That removal should include removing the OSD from Ceph, so once rebalancing completes, Ceph reports healthy.

Unfortunately, a healthy cluster will not alert us that capacity is now reduced, or that the disk still has to be replaced, both in Ceph and on the host.

There's a suggestion to create "a new NRPE check that has an OSD counter that is configurable but auto-increments and never auto-decrements. That way, if you remove an OSD from a cluster, it could go into warning (as current-osd-count << expected-osd-count) that could escalate to critical as time goes on."

We would want to accompany such a check with an action that resets the expected OSD count to whatever is current.
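To make the suggestion concrete, here is a minimal sketch of such a counter, not the charm's implementation. The class name `OsdCounter`, its methods, and the `critical_after` threshold are all hypothetical; the exit codes follow the usual Nagios plugin convention.

```python
from dataclasses import dataclass
from typing import Optional

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class OsdCounter:
    """Hypothetical expected-OSD counter: grows automatically, never shrinks."""
    expected: int = 0
    # Timestamp at which the current count first dropped below expected.
    shortfall_since: Optional[float] = None

    def observe(self, current: int, now: float) -> None:
        """Record the current OSD count seen at time `now` (seconds)."""
        if current >= self.expected:
            # Auto-increment (or stay level) and clear any shortfall.
            self.expected = current
            self.shortfall_since = None
        elif self.shortfall_since is None:
            # First time we see fewer OSDs than expected.
            self.shortfall_since = now

    def status(self, current: int, now: float,
               critical_after: float = 86400.0) -> int:
        """WARNING on a shortfall, CRITICAL once it has lasted long enough.

        `critical_after` (seconds) is an assumed, configurable threshold.
        """
        if current >= self.expected:
            return OK
        started = now if self.shortfall_since is None else self.shortfall_since
        if now - started >= critical_after:
            return CRITICAL
        return WARNING

    def reset(self, current: int) -> None:
        """The accompanying action: accept the current count as expected."""
        self.expected = current
        self.shortfall_since = None
```

Calling `observe()` on each check run and `reset()` from an operator action would give the behaviour described above: removing an OSD raises a warning that escalates over time until the disk is replaced or the count is deliberately reset.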

OpenStack Infra (hudson-openstack) wrote on 2021-12-17: Fix proposed to charm-ceph-mon (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/822061

Changed in charm-ceph-mon:
status:	New → In Progress

Alvaro Uria (aluria) on 2022-04-15

tags: added: bseng-68

OpenStack Infra (hudson-openstack) wrote on 2022-10-07: Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/822061
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/b8af44aefacacb286da1c56a5fcde6f5a5511ffb
Submitter: "Zuul (22348)"
Branch: master

commit b8af44aefacacb286da1c56a5fcde6f5a5511ffb
Author: Edin Sarajlic <email address hidden>
Date: Fri Dec 17 08:45:02 2021 +0800

Add nagios check for expected number of OSDs

This check does not require manually setting the number of expected
OSDs.

Initially, the charm sets the count (per-host) to that of what's
present in the OSD tree. The count will be updated (on a per-host
basis) when the number of OSDs grows, but not when it shrinks. There
is a charm action to reset the expected count using information from
the OSD tree.

Closes-Bug: #1952985
Change-Id: Ia6a060bf151908c1d4159e6bdffa7bfe1f0a7988

Changed in charm-ceph-mon:
status:	In Progress → Fix Committed
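For readers who want a feel for the per-host logic the commit message describes, the following is a rough sketch only and is not taken from the merged change. The function names are hypothetical, and the input is assumed to have the shape of `ceph osd tree --format json` output, that is, a `nodes` list in which entries of type `host` list their OSD ids under `children`.

```python
from typing import Dict, Tuple


def count_osds_per_host(tree: dict) -> Dict[str, int]:
    """Count OSDs under each host node of an (assumed) OSD tree dict."""
    return {
        node["name"]: len(node.get("children", []))
        for node in tree.get("nodes", [])
        if node.get("type") == "host"
    }


def update_expected(expected: Dict[str, int],
                    current: Dict[str, int]) -> Dict[str, int]:
    """Raise per-host expected counts when OSDs are added; never lower them."""
    updated = dict(expected)
    for host, count in current.items():
        if count > updated.get(host, 0):
            updated[host] = count
    return updated


def check(expected: Dict[str, int],
          current: Dict[str, int]) -> Tuple[int, str]:
    """Return a Nagios-style (exit code, message) pair."""
    missing = {
        host: want - current.get(host, 0)
        for host, want in sorted(expected.items())
        if current.get(host, 0) < want
    }
    if missing:
        detail = ", ".join(
            f"{host} is missing {n} OSD(s)" for host, n in missing.items()
        )
        return 2, f"CRITICAL: {detail}"
    return 0, "OK: all hosts have the expected number of OSDs"


def reset_expected(current: Dict[str, int]) -> Dict[str, int]:
    """Sketch of the reset action: take expected counts from the current tree."""
    return dict(current)
```

In this sketch a host that loses an OSD keeps its old expected count, so `check()` keeps reporting the shortfall until the OSD returns or `reset_expected()` is applied. Whether the real check reports warning or critical in that case is not stated in this bug.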
