Need more granular ceph_health checks

Bug #1735579 reported by Drew Freiberger on 2017-11-30
This bug affects 3 people
Affects                    Importance   Assigned to
OpenStack ceph charm       Wishlist     Unassigned
OpenStack ceph-mon charm   Wishlist     Unassigned

Bug Description

Currently, the ceph and ceph-mon charms install a single check_ceph_status NRPE check that alerts on HEALTH_WARN and HEALTH_ERR. Both states are triggered by quite a number of different situations. Some examples:

OSD near Full
PG stuck for >30s
OSD Full
MON down
OSD down/out
PG recovery
PG backfill
PG backfill-wait
PG backfill-wait-toofull

We are finding that having one service monitor in nagios limits our ability to acknowledge and silence an alert for something like "OSD Near Full" without also affecting our ability to pick up on other errors such as "PG Backfill-wait-toofull" or "OSD Down" or "MON DOWN".

Imagine a scenario where a ceph-mon disk is near full, 1% over the Ceph alerting threshold, and you're waiting for a client or a coworker/escalation point of contact to approve some cleanup you've identified as possible (maybe old images filling the root disk). You want to acknowledge or downtime that alert for the next 2 hours; however, there is now a risk that you'll miss an OSD down error coming out of that same check.

The purpose of this bug is to track a wishlist of separate NRPE checks that would filter the detailed health status to track these "critical" vs "impending" vs "informative" reasons that ceph may go out of HEALTH_OK state.

I'd like to see separate checks such as the ones below, so that we keep visibility into the more critical issues in Ceph while quiescing alerts caused by "impending" or "informative" issues.

check_ceph_mon_missing
check_ceph_toofull
check_ceph_pgstuck
check_ceph_osd_down

Changed in charm-ceph:
importance: Undecided → Critical
importance: Critical → Wishlist
Changed in charm-ceph-mon:
importance: Undecided → Wishlist
Changed in charm-ceph:
status: New → Triaged
Changed in charm-ceph-mon:
status: New → Triaged
Xav Paice (xavpaice) wrote :

Also important to add - OSDs that are up, but not 'IN'

Reviewed: https://review.openstack.org/648941
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=8e4fe5729583bd6a229d2688ea53f0198483b279
Submitter: Zuul
Branch: master

commit 8e4fe5729583bd6a229d2688ea53f0198483b279
Author: Marian Gasparovic <email address hidden>
Date: Mon Apr 1 11:32:10 2019 +0200

    Creates additional nrpe checks which parse warning/error messages

    When Ceph is in a warning state for reason1 and in the meantime
    new reason2 appears, operator is not alerted and also cannot mute
    alarms selectively (as described in bug #1735579)
    This patch allows specifying a dictionary of 'name':'regex' where
    'name' will become ceph-$name check in nrpe and $regex will be
    searched for in warning/error messages. It is specified via
    charm nagios_additional_checks parameter.
    There is also nagios_additional_checks_critical parameter which
    specifies if those checks are reported as warning or error.

    Change-Id: I73a7c15db88793bb78841d8395535c97ca2af872
    Partial-Bug: 1735579
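
The name/regex approach described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the charm's actual code: the function name run_additional_checks, the example patterns in ADDITIONAL_CHECKS, and the sample health messages are all assumptions made for the example. Only the idea of a 'name':'regex' dictionary producing ceph-$name checks, with a switch for warning vs. critical, comes from the commit message.

```python
import re

# Hypothetical name -> regex mapping, in the spirit of the
# nagios_additional_checks charm option. The patterns are examples only.
ADDITIONAL_CHECKS = {
    "osd_down": r"osds? down",
    "toofull": r"backfill_toofull|near ?full|full osd",
    "pgstuck": r"pgs? stuck",
    "mon_missing": r"mons? down",
}


def run_additional_checks(messages, checks, critical=False):
    """Match each regex against the health messages.

    Returns {"ceph-<name>": (nagios_exit_code, text)}.
    Nagios exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL.
    """
    bad_code = 2 if critical else 1
    results = {}
    for name, pattern in checks.items():
        rx = re.compile(pattern, re.IGNORECASE)
        hits = [msg for msg in messages if rx.search(msg)]
        if hits:
            results["ceph-" + name] = (bad_code, "; ".join(hits))
        else:
            results["ceph-" + name] = (0, "OK")
    return results


if __name__ == "__main__":
    # Sample messages are made up for illustration.
    sample = ["1 osds down", "2 nearfull osd(s)"]
    for check, (code, text) in sorted(run_additional_checks(sample, ADDITIONAL_CHECKS).items()):
        print(check, code, text)
```

Because each pattern yields its own Nagios service, an operator could acknowledge ceph-toofull while ceph-osd_down keeps alerting independently.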

Reviewed: https://review.opendev.org/656764
Committed: https://git.openstack.org/cgit/openstack/charm-ceph-mon/commit/?id=caa1cd8d6a2859a8d33abedd4a5f409591aa005e
Submitter: Zuul
Branch: master

commit caa1cd8d6a2859a8d33abedd4a5f409591aa005e
Author: Marian Gasparovic <email address hidden>
Date: Fri Apr 26 16:54:02 2019 +0200

    Creates nrpe check for number of OSDs

    Alert is triggered when number of known OSDs in osdmap is different
    than number of "in" or "up" OSDs.

    Change-Id: Id3d43f0146452d0bbd73e1ce98616a994eaee090
    Partial-Bug: 1735579
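
The comparison described in this commit can be sketched as below. This is an assumption-laden illustration rather than the committed check: the function name check_osd_counts is hypothetical, the input is assumed to be a dictionary carrying num_osds, num_up_osds and num_in_osds counters (where that dictionary sits in the status JSON may differ between Ceph releases), and reporting the mismatch as CRITICAL is a choice made for the example.

```python
def check_osd_counts(osdmap):
    """Compare known OSDs with the number that are up and in.

    `osdmap` is assumed to be a dict with num_osds, num_up_osds and
    num_in_osds keys. Returns (nagios_exit_code, text).
    """
    total = osdmap["num_osds"]
    up = osdmap["num_up_osds"]
    in_ = osdmap["num_in_osds"]
    if up != total or in_ != total:
        # Severity is an assumption for this sketch.
        return 2, "CRITICAL: {} OSDs known, {} up, {} in".format(total, up, in_)
    return 0, "OK: all {} OSDs are up and in".format(total)


if __name__ == "__main__":
    code, text = check_osd_counts(
        {"num_osds": 6, "num_up_osds": 6, "num_in_osds": 5}
    )
    print(code, text)
```

This also covers the earlier comment about OSDs that are up but not 'in', since either count differing from the total triggers the alert.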

Drew Freiberger (afreiberger) wrote :

This appeared on a cloud I was upgrading today.

It appears that the xenial-pike version of luminous still uses health.overall_status.

# ceph status -f json-pretty

{
    "fsid": "mysecret",
    "health": {
        "summary": [],
        "overall_status": "HEALTH_OK",
        "detail": []
    },

*snip*

# ceph version
ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous (stable)
# apt-cache policy ceph-common
ceph-common:
  Installed: 12.2.4-0ubuntu0.17.10.1~cloud0
  Candidate: 12.2.4-0ubuntu0.17.10.1~cloud0
  Version table:
 *** 12.2.4-0ubuntu0.17.10.1~cloud0 500
        500 http://ubuntu-cloud.archive.canonical.com/ubuntu xenial-updates/pike/main amd64 Packages
        100 /var/lib/dpkg/status
     10.2.11-0ubuntu0.16.04.1 500
        500 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 Packages
     10.1.2-0ubuntu1 500
        500 http://archive.ubuntu.com/ubuntu xenial/main amd64 Packages

$ sudo /usr/local/lib/nagios/plugins/check_ceph_status.py -f /var/lib/nagios/cat-ceph-status.txt
WARNING:

Mimic (13.x) is even more interesting in that it has both variables, but overall_status shows HEALTH_WARN even when status = HEALTH_OK. This may be a holdover in the mon state database.
Mimic health output when ceph status shows HEALTH_OK:

    "health": {
        "checks": {},
        "status": "HEALTH_OK",
        "overall_status": "HEALTH_WARN"
    },

I think we either need to find out which minor version of ceph introduced the change to status, or change the charm to check whether status exists as an element of the health dictionary and use it, falling back to overall_status otherwise.
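
A minimal sketch of the second option, assuming the parsed output of `ceph status -f json` is available as a Python dictionary. The function name get_health_status and the "UNKNOWN" default are assumptions made for this example, not part of the charm.

```python
import json


def get_health_status(status):
    """Prefer health.status, fall back to health.overall_status."""
    health = status.get("health", {})
    if "status" in health:
        return health["status"]
    # Older releases (e.g. the 12.2.4 output above) only have overall_status.
    return health.get("overall_status", "UNKNOWN")


if __name__ == "__main__":
    pike = json.loads(
        '{"health": {"summary": [], "overall_status": "HEALTH_OK", "detail": []}}'
    )
    mimic = json.loads(
        '{"health": {"checks": {}, "status": "HEALTH_OK",'
        ' "overall_status": "HEALTH_WARN"}}'
    )
    print(get_health_status(pike))
    print(get_health_status(mimic))
```

Checking for the key rather than the Ceph version avoids having to know exactly which point release made the switch, and it ignores the stale overall_status value seen on Mimic.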

Drew Freiberger (afreiberger) wrote :

It turns out that Pike luminous 12.2.4 uses 'overall_status', but Queens luminous has 12.2.11 which uses 'status'.

This deserves a new bug.
