Need more granular ceph_health checks
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph Monitor Charm | Triaged | Wishlist | Unassigned |
OpenStack Ceph Charm (Retired) | Won't Fix | Wishlist | Unassigned |
Bug Description
Currently, the ceph and ceph-mon charms install a single check_ceph_status NRPE check that alerts on HEALTH_WARN and HEALTH_ERR. These two states are triggered by quite a number of different situations. Some examples:
OSD near Full
PG stuck for >30s
OSD Full
MON down
OSD down/out
PG recovery
PG backfill
PG backfill-wait
PG backfill-
We are finding that having one service monitor in nagios limits our ability to acknowledge and silence an alert for something like "OSD Near Full" without also affecting our ability to pick up on other errors such as "PG Backfill-
Imagine a scenario where you have a ceph-mon disk near-full situation where it's 1% over ceph alerting threshold and you're waiting for a client or coworker/
The purpose of this bug is to track a wishlist of separate NRPE checks that would filter the detailed health status to track these "critical" vs "impending" vs "informative" reasons that ceph may go out of HEALTH_OK state.
I'd like to see separate checks, such as the ones below, added as extended checks. They would let us keep visibility into the more critical issues in ceph while quiescing alerts caused by "impending" or "informative" issues.
check_ceph_
check_ceph_toofull
check_ceph_pgstuck
check_ceph_osd_down
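One way the per-condition checks listed above could be sketched is as a single filter over the JSON form of `ceph health detail`. This is an illustration only, not the charm's implementation. It assumes the Luminous-and-later JSON layout, where active conditions appear under a `checks` key with a `severity` and a `summary.message`. The mapping from the proposed check names to Ceph health codes (`CHECK_CODES`) and the helper names `load_health` and `evaluate` are hypothetical choices made for this example.

```python
import json
import subprocess

# Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical mapping from the proposed check names to Ceph health codes.
# The grouping is an assumption for illustration, not taken from the charm.
CHECK_CODES = {
    "check_ceph_toofull": {"OSD_NEARFULL", "OSD_BACKFILLFULL", "OSD_FULL"},
    "check_ceph_pgstuck": {"PG_AVAILABILITY", "PG_DEGRADED"},
    "check_ceph_osd_down": {"OSD_DOWN"},
}


def load_health():
    """Fetch cluster health as a dict (requires a working ceph client)."""
    out = subprocess.run(
        ["ceph", "health", "detail", "--format", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)


def evaluate(health, codes):
    """Return (exit_code, message), looking only at the given health codes."""
    active = health.get("checks", {})
    hits = {code: info for code, info in active.items() if code in codes}
    if not hits:
        return OK, "OK: no matching health checks active"
    worst = WARNING
    parts = []
    for code in sorted(hits):
        info = hits[code]
        # Escalate to CRITICAL only when Ceph itself reports HEALTH_ERR.
        if info.get("severity") == "HEALTH_ERR":
            worst = CRITICAL
        message = info.get("summary", {}).get("message", "")
        parts.append(f"{code}: {message}")
    label = "CRITICAL" if worst == CRITICAL else "WARNING"
    return worst, f"{label}: " + "; ".join(parts)
```

Each NRPE check would then call `evaluate(load_health(), CHECK_CODES[name])`, print the message, and exit with the returned code. Because every check looks only at its own codes, acknowledging a near-full alert in Nagios would not hide an OSD-down or stuck-PG alert.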
Changed in charm-ceph: | |
importance: | Undecided → Critical |
importance: | Critical → Wishlist |
Changed in charm-ceph-mon: | |
importance: | Undecided → Wishlist |
Changed in charm-ceph: | |
status: | New → Triaged |
Changed in charm-ceph-mon: | |
status: | New → Triaged |
Also important to add - OSDs that are up, but not 'IN'
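The "up but not in" case could be detected from the OSD map instead of the health summary. The sketch below assumes the JSON output of `ceph osd dump --format json`, where each entry in `osds` carries an `osd` id plus `up` and `in` flags as 0/1 integers. The function names and the choice of WARNING as the alert level are hypothetical.

```python
# Nagios plugin exit codes used by this sketch.
OK, WARNING = 0, 1


def osds_up_but_out(osd_dump):
    """Return sorted ids of OSDs that are up (running) but not in (no data mapped)."""
    return sorted(
        osd["osd"]
        for osd in osd_dump.get("osds", [])
        if osd.get("up") == 1 and osd.get("in") == 0
    )


def check_osd_up_not_in(osd_dump):
    """Return (exit_code, message) for a hypothetical 'up but not in' check."""
    stray = osds_up_but_out(osd_dump)
    if stray:
        ids = ", ".join(f"osd.{i}" for i in stray)
        return WARNING, f"WARNING: up but not in: {ids}"
    return OK, "OK: all up OSDs are in"
```

The input dict would come from parsing the `ceph osd dump --format json` output with `json.loads`. An OSD that is both down and out is deliberately ignored here, since that condition belongs to a separate OSD-down check.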