Allow for adjustable failcount for nrpe check_crm
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
OpenStack HA Cluster Charm | Fix Released | Medium | Unassigned | | |
Bug Description
We regularly get alerts when a simple service restart trips a CRM resource monitor, for example for nova-consoleauth or haproxy.
This sets failcount=1 for the service on the affected node, but the service starts right back up and is shown as Started in 'crm status' shortly after.
We would like to be able to customize the failcount CRITICAL threshold in the hacluster service charm. That would avoid having to clean up cluster resources for services that were merely restarted for upgrades or reconfiguration, and would let us alert only on things that have failed for, say, 2 or 3 monitor cycles.
The nagios plugin check_crm v0.7 (Jan 2013) has a --failcount (-f) option that defaults to 1, but it could be set to --failcount=3 or whatever your tolerance for fail counters might be:
    $np->add_arg(
        spec     => 'failcount|f=i',
        help     => 'resource fail count to start warning on [default = 1].',
        required => 0,
        default  => 1,
    );
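To illustrate what an adjustable threshold would mean in practice, here is a minimal Python sketch of the comparison. It is not the plugin's actual code. The line format matched by the regular expression, the resource names in `sample`, and the function name `failcount_warnings` are all assumptions made for illustration, and it assumes a warning is raised once the fail count reaches the threshold.

```python
import re

# Assumed shape of a fail-count line; real cluster status output may differ.
FAILCOUNT_RE = re.compile(r"^\s*(\S+?):.*\bfail-count=(\d+)")


def failcount_warnings(lines, threshold=1):
    """Return a warning message for each resource whose fail count
    has reached the threshold (hypothetical helper, default 1)."""
    warnings = []
    for line in lines:
        match = FAILCOUNT_RE.search(line)
        if match and int(match.group(2)) >= threshold:
            warnings.append(
                f"{match.group(1)} failure detected, fail-count={match.group(2)}"
            )
    return warnings


# Illustrative input only; resource names are made up.
sample = [
    "   res_nova_consoleauth: migration-threshold=1000000 fail-count=1",
    "   res_ks_haproxy: migration-threshold=1000000 fail-count=3",
]
```

With the default threshold of 1 both sample resources would be reported. With a threshold of 3, the single restart of the first resource would no longer raise an alert.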
Changed in charm-hacluster: | |
status: | New → Triaged |
importance: | Undecided → Medium |
Changed in charm-hacluster: | |
milestone: | none → 19.07 |
Changed in charm-hacluster: | |
status: | Fix Committed → Fix Released |
check_crm v0.7 already supports specifying a threshold for failcounts, and in any case the plugin never sends a CRITICAL for failcounts, only a WARNING.
The problem is actually that check_crm sends a CRITICAL for failed actions, and that takes precedence over the warnings for failcounts.
I submitted a patch for optionally ignoring (or warning on) the failed actions block (https://review.openstack.org/#/c/615965/).
IMHO failed actions should always be ignored - they are useful to understand what's happened, but they don't tell you how the cluster is doing *right now* - but the new option I proposed defaults to CRITICAL to be backward compatible.
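The precedence issue described in this comment can be sketched as follows. This is only an illustration of how a worst-status-wins check behaves, not the code from the patch. The function name, the `failed_actions_mode` parameter, and its values `ignore`/`warning`/`critical` are hypothetical, since the option name used in the actual review is not given here.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL = 0, 1, 2

# Hypothetical option values mapped to the severity they would produce.
FAILED_ACTIONS_SEVERITY = {"ignore": OK, "warning": WARNING, "critical": CRITICAL}


def overall_status(failcount_warnings, failed_actions, failed_actions_mode="critical"):
    """Combine both findings; the most severe status wins."""
    status = OK
    if failcount_warnings:
        # Fail counts only ever contribute a WARNING.
        status = max(status, WARNING)
    if failed_actions:
        # Failed actions contribute whatever severity the mode selects.
        status = max(status, FAILED_ACTIONS_SEVERITY[failed_actions_mode])
    return status
```

With the default mode, any entry in the failed actions block yields CRITICAL regardless of the failcount threshold, which matches the behaviour described above. Setting the mode to `ignore` lets the failcount warnings determine the result.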