Allow for adjustable failcount for nrpe check_crm

Bug #1796400 reported by Drew Freiberger
This bug affects 4 people
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

On several occasions we have received alerts because a simple service restart tripped a CRM resource monitor, for services such as nova-consoleauth or haproxy.

This pops a failcount=1 for the service on a given node, but the service starts right back up and is shown as Started in 'crm status' shortly after.

We would like to be able to customize the failcount CRITICAL threshold in the hacluster service charm. That would avoid having to clean up cluster resources after services were merely restarted for upgrades or reconfiguration, and would alert only on resources that have failed for, say, 2 or 3 monitor cycles.

If you check the Nagios plugin check_crm v0.7 (Jan 2013), there is a --failcount (-f) option that defaults to 1 but could be set to --failcount=3, or whatever your tolerance for fail counters might be.

$np->add_arg(
    spec => 'failcount|failcounts|f=i',
    help => 'resource fail count to start warning on [default = 1].',
    required => 0,
    default => 1,
);
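
To illustrate what raising the threshold would change, here is a minimal Python sketch (not the plugin's Perl code). It assumes the plugin starts warning once a resource's fail count reaches the configured value, as the help text above suggests; the function name and the resource names are hypothetical.

```python
def resources_over_threshold(failcounts, threshold=1):
    """Return the resources whose fail count has reached the threshold.

    `failcounts` maps a resource name to its current fail count.
    The >= comparison is an assumption based on the plugin's help text.
    """
    return sorted(name for name, count in failcounts.items() if count >= threshold)


# Hypothetical state after a single service restart tripped one monitor.
observed = {"res_nova_consoleauth": 1, "res_haproxy": 0}

print(resources_over_threshold(observed))     # default threshold of 1
print(resources_over_threshold(observed, 3))  # tolerate up to two failures
```

With the default threshold the restarted service is reported; with a threshold of 3 a single restart would no longer be flagged.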

James Page (james-page)
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Medium
Andrea Ieri (aieri) wrote :

check_crm v0.7 already supports specifying a threshold for failcounts, and in any case the plugin never sends a CRITICAL for failcounts, only a WARNING.
The actual problem is that check_crm sends a CRITICAL for failed actions, and that takes precedence over the warnings for failcounts.

I submitted a patch for optionally ignoring (or warning) on the failed actions block (https://review.openstack.org/#/c/615965/).
IMHO failed actions should always be ignored - they are useful to understand what's happened, but they don't tell you how the cluster is doing *right now* - but the new option I proposed defaults to CRITICAL to be backward compatible.
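
The precedence issue can be sketched roughly as follows, in Python rather than the plugin's Perl. It assumes the check reports the worst severity found, using the standard Nagios exit codes, and models the proposed failed-actions setting as a mode string; the function names are hypothetical and this is not the plugin's actual code.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def failed_actions_severity(has_failed_actions, mode="critical"):
    """Severity contributed by the failed actions block.

    Any mode other than 'warning' or 'critical' is treated as 'ignore'.
    """
    if not has_failed_actions:
        return OK
    return {"warning": WARNING, "critical": CRITICAL}.get(mode, OK)


def overall_status(failcount_warning, has_failed_actions, mode="critical"):
    """Worst severity wins, so a CRITICAL masks a failcount WARNING."""
    failcount_severity = WARNING if failcount_warning else OK
    return max(failcount_severity, failed_actions_severity(has_failed_actions, mode))


print(overall_status(True, True))            # failed actions dominate
print(overall_status(True, True, "ignore"))  # only the failcount warning remains
```

Under these assumptions, tuning the failcount threshold alone cannot silence the alert while a failed action is still listed.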

Andrea Ieri (aieri) wrote :

Actually... I marked my commit as 'closes-bug' but that's wrong: it only allows check_crm to be configured to ignore failed actions, but does not change anything in the way the charm configures it.
Now we should set --failedactions=ignore and either expose the --failcount option or set it to a sane default. I think something like 5 would make sense (monitor actions for openstack services are set to once every 5s).

Chris Sanders (chris.sanders) wrote :

Subscribed field-medium

Chris MacNaughton (chris.macnaughton) wrote :

Andrea:

I'm looking at your existing related review and wanted to ask whether you intend to continue working on this feature.

Andrea Ieri (aieri) wrote :

Chris:

I could do that, yes. After the new check_crm code lands, all that's needed is to add a new charm option so we can ignore failed actions.

Chris MacNaughton (chris.macnaughton) wrote :

I've just merged https://review.opendev.org/#/c/615965. Andrea: if you want to continue working on this change, that would be great, otherwise, I believe that the field-medium subscription should be removed.

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/615965
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=9483383555c181e7efa99619fad6f7641020bc5a
Submitter: Zuul
Branch: master

commit 9483383555c181e7efa99619fad6f7641020bc5a
Author: Andrea Ieri <email address hidden>
Date: Tue Nov 6 18:43:06 2018 +0100

    Choose whether to ignore/warn/crit on failed actions

    This commit adds a new option to check_crm named --failedactions
    Possible options are 'warning', 'critical', or anything else (which is
    considered equivalent to 'ignore').
    The default is 'critical' to be backward compatible.

    Change-Id: I5908f5f4b7d77219280dfe896ea938459c6b23bd
    Partial-Bug: #1796400

Andrea Ieri (aieri) wrote :

@Chris:

Please see https://review.opendev.org/#/c/658825/ for a patch that would close this bug.
Also note that closing #1802310 would make that patch a lot more useful.

OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/658825
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=4d391e8107f63eefbc143624d78100389fd12a92
Submitter: Zuul
Branch: master

commit 4d391e8107f63eefbc143624d78100389fd12a92
Author: Andrea Ieri <email address hidden>
Date: Mon May 13 15:58:46 2019 +0200

    Allow tuning for check_crm failure handling

    This commit adds two new options, failed_actions_alert_type and
    failed_actions_threshold, which map onto the check_crm options
    --failedactions and --failcounts, respectively.
    The default option values make check_crm generate critical alerts if
    actions failed once.
    The actions check can be entirely bypassed if failed_actions_alert_type
    is set to 'ignore'.

    Closes-Bug: #1796400
    Change-Id: I72f65bacba8bf17a13db19d2a3472f760776019a
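
As a rough illustration of the mapping described in this commit message, the Python sketch below turns the two charm options into check_crm arguments. The option names and flags come from the commit message; the function name, the config dictionary, and the default values ('critical' and 1, inferred from the stated default behaviour) are assumptions, and the charm's real code may build the command differently.

```python
def build_check_crm_args(config):
    """Translate charm config values into check_crm command-line arguments.

    `config` is a plain dict standing in for the charm's configuration.
    Defaults are assumed from the commit message: critical alerts after
    a single failure.
    """
    alert_type = config.get("failed_actions_alert_type", "critical")
    threshold = int(config.get("failed_actions_threshold", 1))
    return [
        "--failedactions={}".format(alert_type),
        "--failcounts={}".format(threshold),
    ]


# Assumed defaults, then a configuration that bypasses the actions check.
print(build_check_crm_args({}))
print(build_check_crm_args({"failed_actions_alert_type": "ignore",
                            "failed_actions_threshold": 5}))
```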

Changed in charm-hacluster:
status: Triaged → Fix Committed
Changed in charm-hacluster:
milestone: none → 19.07
David Ames (thedac)
Changed in charm-hacluster:
status: Fix Committed → Fix Released