OpenStack HA Cluster Charm

Allow for adjustable failcount for nrpe check_crm

Bug #1796400 reported by Drew Freiberger on 2018-10-05

This bug affects 4 people

Affects	Status	Importance	Assigned to	Milestone
OpenStack HA Cluster Charm	Fix Released	Medium	Unassigned	OpenStack HA Cluster Charm 19.07

Bug Description

On several occasions we have received alerts because a simple service restart tripped a CRM resource monitor, for example for nova-consoleauth or haproxy.

This pops a failcount=1 for the service on a given node, but the service starts right back up and is shown as Started in 'crm status' shortly after.

We would like to be able to customize the failcount CRITICAL threshold in the hacluster service charm. That would avoid having to clean up cluster resources after services were merely restarted for upgrades or reconfiguration, and would alert only on resources that have failed for, say, 2 or 3 monitor cycles.

If you check the nagios plugin check_crm v0.7 (Jan 2013), there is a --failcount (-f) option that defaults to 1 but could be set to --failcount=3, or whatever your tolerance for fail counters is.

$np->add_arg(
    spec => 'failcount|failcounts|f=i',
    help => 'resource fail count to start warning on [default = 1].',
    required => 0,
    default => 1,
);
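
As a rough illustration of what raising the threshold would change, the Python sketch below mimics the comparison. It is not the plugin's Perl code: the function name failcount_state is hypothetical, and the rule that a warning starts once the fail count is greater than or equal to the threshold is an assumption inferred from the help text above.

```python
def failcount_state(fail_count: int, threshold: int = 1) -> str:
    """Return a Nagios-style state for one resource's fail count.

    Assumption (not taken from the plugin source): the warning starts
    as soon as fail_count reaches the configured threshold.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    return "WARNING" if fail_count >= threshold else "OK"


if __name__ == "__main__":
    # A single restart during an upgrade bumps the fail count to 1.
    print(failcount_state(1))               # default threshold of 1
    print(failcount_state(1, threshold=3))  # more tolerant threshold
```

With a threshold of 3, a resource that failed once and recovered would no longer be reported, which is the behaviour requested here.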

James Page (james-page) on 2018-10-18

Changed in charm-hacluster:
status:	New → Triaged
importance:	Undecided → Medium

Andrea Ieri (aieri) wrote on 2018-11-06:

check_crm v0.7 already supports specifying a threshold for failcounts, and in any case the plugin never sends a CRITICAL for failcounts, only a WARNING.
The actual problem is that check_crm sends a CRITICAL for failed actions, and that takes precedence over the warnings for failcounts.

I submitted a patch to optionally ignore, or only warn on, the failed actions block (https://review.openstack.org/#/c/615965/).
IMHO failed actions should always be ignored - they are useful to understand what's happened, but they don't tell you how the cluster is doing *right now* - but the new option I proposed defaults to CRITICAL to be backward compatible.
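
To make the precedence point concrete, here is a minimal Python sketch of how a check that collects several messages typically reports only the most severe one. The names SEVERITY and overall_state are hypothetical, and this models the general Nagios convention (CRITICAL outranks WARNING, which outranks OK) rather than the plugin's actual Perl code.

```python
# Hypothetical ranking that follows the usual Nagios convention.
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}


def overall_state(states):
    """Return the most severe state, or OK if nothing was reported."""
    return max(states, key=SEVERITY.__getitem__, default="OK")


if __name__ == "__main__":
    # Failcount warning plus a failed-actions critical: the critical wins.
    print(overall_state(["WARNING", "CRITICAL"]))
    # With failed actions ignored, only the failcount warning remains.
    print(overall_state(["WARNING"]))
```

This is why tuning the failcount threshold alone would not silence the alert while failed actions are still reported as CRITICAL.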

Andrea Ieri (aieri) wrote on 2018-11-08:

Actually... I marked my commit as 'closes-bug' but that's wrong: it only allows check_crm to be configured to ignore failed actions, but does not change anything in the way the charm configures it.
Now we should set --failedactions=ignore and either expose the --failcount option or set it to a sane default. I think something like 5 would make sense (monitor actions for openstack services are set to once every 5s).

Chris Sanders (chris.sanders) wrote on 2018-12-10:

Subscribed field-medium

Chris MacNaughton (chris.macnaughton) wrote on 2019-05-13:

Andrea:

I'm looking at your existing, related review and wanted to ask whether you intend to continue working on this feature.

Andrea Ieri (aieri) wrote on 2019-05-13:

Chris:

I could do that, yes. After the new check_crm code lands, all that's needed is to add a new charm option so we can ignore failed actions.

Chris MacNaughton (chris.macnaughton) wrote on 2019-05-13:

I've just merged https://review.opendev.org/#/c/615965. Andrea: if you want to continue working on this change, that would be great, otherwise, I believe that the field-medium subscription should be removed.

OpenStack Infra (hudson-openstack) wrote on 2019-05-13: Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/615965
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=9483383555c181e7efa99619fad6f7641020bc5a
Submitter: Zuul
Branch: master

commit 9483383555c181e7efa99619fad6f7641020bc5a
Author: Andrea Ieri <email address hidden>
Date: Tue Nov 6 18:43:06 2018 +0100

Choose whether to ignore/warn/crit on failed actions

    This commit adds a new option to check_crm named --failedactions
    Possible options are 'warning', 'critical', or anything else (which is
    considered equivalent to 'ignore').
    The default is 'critical' to be backward compatible.

Change-Id: I5908f5f4b7d77219280dfe896ea938459c6b23bd
Partial-Bug: #1796400
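
The option handling described in this commit message can be sketched in Python as follows. This is an illustration only, not the merged Perl change: the function name failed_actions_state is hypothetical, and exact lowercase matching of the option value is an assumption.

```python
from typing import Optional


def failed_actions_state(option: str = "critical") -> Optional[str]:
    """Map a --failedactions value to the state raised for failed actions.

    Returns None when failed actions should be ignored. Per the commit
    message, any value other than 'warning' or 'critical' means ignore.
    Assumption: values are compared as exact lowercase strings.
    """
    if option == "warning":
        return "WARNING"
    if option == "critical":
        return "CRITICAL"
    return None


if __name__ == "__main__":
    for value in ("critical", "warning", "ignore", "something-else"):
        print(value, "->", failed_actions_state(value))
```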

Andrea Ieri (aieri) wrote on 2019-05-13:

@Chris:

Please see https://review.opendev.org/#/c/658825/ for a patch that would close this bug.
Also note that closing #1802310 would make that patch a lot more useful.

OpenStack Infra (hudson-openstack) wrote on 2019-06-06:

Reviewed: https://review.opendev.org/658825
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=4d391e8107f63eefbc143624d78100389fd12a92
Submitter: Zuul
Branch: master

commit 4d391e8107f63eefbc143624d78100389fd12a92
Author: Andrea Ieri <email address hidden>
Date: Mon May 13 15:58:46 2019 +0200

Allow tuning for check_crm failure handling

    This commit adds two new options, failed_actions_alert_type and
    failed_actions_threshold, which map onto the check_crm options
    --failedactions and --failcounts, respectively.
    The default option values make check_crm generate critical alerts if
    actions failed once.
    The actions check can be entirely bypassed if failed_actions_alert_type
    is set to 'ignore'.

Closes-Bug: #1796400
Change-Id: I72f65bacba8bf17a13db19d2a3472f760776019a
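
As a sketch of the mapping this commit message describes, the Python below turns the two charm options into check_crm arguments. It is not the charm's actual code: the function name build_check_crm_command is hypothetical, the bare program name "check_crm" stands in for whatever path the charm really uses, and the default values are an assumption based on the statement that the defaults produce critical alerts after one failure.

```python
def build_check_crm_command(failed_actions_alert_type: str = "critical",
                            failed_actions_threshold: int = 1) -> list:
    """Build an argument list for check_crm from the two charm options.

    failed_actions_alert_type maps onto --failedactions and
    failed_actions_threshold maps onto --failcounts, as stated in the
    commit message. Defaults here are assumed, not read from the charm.
    """
    return [
        "check_crm",  # placeholder program name, the real path may differ
        "--failedactions={}".format(failed_actions_alert_type),
        "--failcounts={}".format(int(failed_actions_threshold)),
    ]


if __name__ == "__main__":
    # Assumed defaults: critical alert after a single failure.
    print(" ".join(build_check_crm_command()))
    # Tuned as suggested earlier in this bug: ignore failed actions,
    # tolerate up to 5 failures before the failcount warning.
    print(" ".join(build_check_crm_command("ignore", 5)))
```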

Changed in charm-hacluster:
status:	Triaged → Fix Committed

Chris MacNaughton (chris.macnaughton) on 2019-07-09

Changed in charm-hacluster:
milestone:	none → 19.07

David Ames (thedac) on 2019-08-12

Changed in charm-hacluster:
status:	Fix Committed → Fix Released
