NRPE should alert when the OSD count reduces

Bug #1952985 reported by Xav Paice
This bug affects 1 person
Affects: Ceph Monitor Charm
Status: Fix Committed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

When a disk in a host fails, operators rightly remove it as quickly as possible, so that attempts to access the device do not cause a high count of D-state processes. That removal should include removing the OSD from Ceph, so once rebalancing completes, Ceph reports healthy.

Unfortunately, a healthy report does not alert us that capacity is now reduced, or that we must still replace the disk, and do so completely: in Ceph as well as on the host.

There's a suggestion to create "a new NRPE check that has an OSD counter that is configurable but auto-increments and never auto-decrements. That way, if you remove an OSD from a cluster, it could go into warning (as current-osd-count << expected-osd-count) that could escalate to critical as time goes on."

We would want to accompany such a check with an action that resets the expected OSD count to whatever the current count is.
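
A minimal sketch of the suggested behaviour, to make the proposal concrete. It is not the charm's implementation: the names `OsdCountState`, `check_osd_count`, `reset_expected`, and the 24-hour `critical_after` default are all assumptions made for illustration. Only the 0/1/2 status codes follow the standard Nagios plugin convention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


@dataclass
class OsdCountState:
    """Hypothetical persisted state for the check."""
    expected: int = 0                        # highest OSD count seen so far
    shortfall_since: Optional[float] = None  # when the count first dropped


def check_osd_count(state: OsdCountState, current: int, now: float,
                    critical_after: float = 24 * 3600) -> Tuple[int, str]:
    """Compare the current OSD count with the expected count.

    The expected count auto-increments but never auto-decrements.
    A shortfall starts as WARNING and escalates to CRITICAL once it has
    lasted `critical_after` seconds (an assumed threshold).
    """
    if current >= state.expected:
        state.expected = current
        state.shortfall_since = None
        return OK, f"OK: {current} OSDs present"

    if state.shortfall_since is None:
        state.shortfall_since = now
    age = now - state.shortfall_since
    status = CRITICAL if age >= critical_after else WARNING
    label = "CRITICAL" if status == CRITICAL else "WARNING"
    return status, f"{label}: {current} OSDs present, expected {state.expected}"


def reset_expected(state: OsdCountState, current: int) -> None:
    """The accompanying action: accept the current count as expected."""
    state.expected = current
    state.shortfall_since = None
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the escalation logic easy to test.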

Tags: bseng-68
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-mon (master)
Changed in charm-ceph-mon:
status: New → In Progress
Alvaro Uria (aluria)
tags: added: bseng-68
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-mon (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-mon/+/822061
Committed: https://opendev.org/openstack/charm-ceph-mon/commit/b8af44aefacacb286da1c56a5fcde6f5a5511ffb
Submitter: "Zuul (22348)"
Branch: master

commit b8af44aefacacb286da1c56a5fcde6f5a5511ffb
Author: Edin Sarajlic <email address hidden>
Date: Fri Dec 17 08:45:02 2021 +0800

    Add nagios check for expected number of OSDs

    This check does not require manually setting the number of expected
    OSDs.

    Initially, the charm sets the count (per-host) to that of what's
    present in the OSD tree. The count will be updated (on a per-host
    basis) when the number of OSDs grows, but not when it shrinks. There
    is a charm action to reset the expected count using information from
    the OSD tree.

    Closes-Bug: #1952985
    Change-Id: Ia6a060bf151908c1d4159e6bdffa7bfe1f0a7988
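
The per-host bookkeeping described in the commit message could look roughly like the sketch below. This is an illustration, not the merged code: the function names are hypothetical, and the input is assumed to be JSON shaped like the output of `ceph osd tree --format json` (a `nodes` list whose entries carry `id`, `name`, `type`, and, for buckets, `children`).

```python
import json
from typing import Dict, Tuple


def osds_per_host(osd_tree_json: str) -> Dict[str, int]:
    """Count the OSDs under each host bucket in an OSD tree dump."""
    tree = json.loads(osd_tree_json)
    nodes = {node["id"]: node for node in tree.get("nodes", [])}
    counts = {}
    for node in nodes.values():
        if node.get("type") != "host":
            continue
        counts[node["name"]] = sum(
            1 for child in node.get("children", [])
            if nodes.get(child, {}).get("type") == "osd"
        )
    return counts


def update_expected(expected: Dict[str, int],
                    current: Dict[str, int]) -> Dict[str, int]:
    """Raise each host's expected count when it grows; never lower it."""
    updated = dict(expected)
    for host, count in current.items():
        updated[host] = max(updated.get(host, 0), count)
    return updated


def hosts_below_expected(expected: Dict[str, int],
                         current: Dict[str, int]) -> Dict[str, Tuple[int, int]]:
    """Map each short host to (current count, expected count)."""
    return {
        host: (current.get(host, 0), want)
        for host, want in expected.items()
        if current.get(host, 0) < want
    }
```

Under these assumptions, the reset action amounts to replacing the stored expected counts with a fresh `osds_per_host(...)` result.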

Changed in charm-ceph-mon:
status: In Progress → Fix Committed