NRPE should alert when the OSD count reduces
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph Monitor Charm | Fix Committed | Undecided | Unassigned |
Bug Description
There is a situation where a disk in a host fails, and operators rightly remove it as quickly as possible to prevent attempts to access the device from causing a high count of D-state processes. This removal should include removing the OSD from Ceph, so after rebalancing Ceph will report healthy.
This is, unfortunately, not going to alert us to the fact that the capacity is now reduced and that we must not only replace the disk, but do so completely in Ceph as well as on the host.
There's a suggestion to create "a new NRPE check that has an OSD counter that is configurable but auto-increments and never auto-decrements. That way, if you remove an OSD from a cluster, it could go into warning (as current-osd-count << expected-osd-count) that could escalate to critical as time goes on."
We would want to accompany such a check with an action to reset the expected OSD count to whatever is current.
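To make the suggestion concrete, here is a minimal sketch of the counter logic in Python. It is not the code from the proposed fix. All names (`check_osd_count`, `reset_expected`, the state file layout, the `crit_after` threshold) are hypothetical, and it assumes the current OSD count is obtained elsewhere, for example by counting the lines printed by `ceph osd ls`.

```python
import json
from pathlib import Path

OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios/NRPE exit codes


def load_state(path):
    """Return the stored state, or an empty one if no state file exists yet."""
    p = Path(path)
    if p.exists():
        return json.loads(p.read_text())
    return {"expected": 0, "deficit_since": None}


def save_state(path, state):
    """Persist the state between check runs (hypothetical JSON layout)."""
    Path(path).write_text(json.dumps(state))


def check_osd_count(current, state, now, crit_after=86400):
    """Compare the current OSD count with the stored high-water mark.

    Returns (exit_code, message, new_state). The expected count only ever
    rises here; lowering it requires an explicit reset_expected() call.
    crit_after is an assumed escalation delay in seconds.
    """
    expected = max(state.get("expected", 0), current)
    if current >= expected:
        new_state = {"expected": expected, "deficit_since": None}
        return OK, f"OK: {current} OSDs", new_state

    # Remember when the shortfall was first seen so it can escalate over time.
    since = state.get("deficit_since")
    if since is None:
        since = now
    new_state = {"expected": expected, "deficit_since": since}

    code = CRITICAL if now - since >= crit_after else WARNING
    label = "CRITICAL" if code == CRITICAL else "WARNING"
    missing = expected - current
    message = f"{label}: {current} OSDs, expected {expected} ({missing} missing)"
    return code, message, new_state


def reset_expected(current):
    """What the reset action would do: accept the current count as expected."""
    return {"expected": current, "deficit_since": None}
```

In this sketch the check would call `load_state`, pass the result to `check_osd_count` together with the current count and time, then `save_state` the returned state and exit with the returned code. The reset action would simply save the result of `reset_expected`.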
Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-mon/+/822061