don't alert on paused units
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack HA Cluster Charm | Fix Released | Medium | Martin Kalcok |
Bug Description
When a unit is paused, the cluster node is set in standby mode.
root@juju-
Stack: corosync
Current DC: juju-3565e5-47 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon May 25 15:12:56 2020
Last change: Wed Apr 22 22:49:50 2020 by root via crm_attribute on juju-3565e5-47
3 nodes configured
4 resources configured
Node juju-3565e5-46: standby
Node juju-3565e5-47: standby
Online: [ juju-3565e5-48 ]
Full list of resources:
Resource Group: grp_mysql_vips
res_
Clone Set: cl_mysql_monitor [res_mysql_monitor]
Started: [ juju-3565e5-48 ]
Stopped: [ juju-3565e5-46 juju-3565e5-47 ]
Migration Summary:
* Node juju-3565e5-47:
* Node juju-3565e5-46:
* Node juju-3565e5-48:
This causes the nrpe alert to fire, even though it's a false positive.
The check_crm script has a -s flag that can be used to ignore standby nodes, but since a standby node will have its resources stopped, the script will still alert on those (with an unclear message):
root@juju-
check_crm CRITICAL - : juju-3565e5-46 juju-3565e5-47 Stopped
I think the following should be done:
* check_crm should not alert on stopped resources that belong to nodes in standby, if the -s option has been provided
* the hacluster charm should invoke check_crm with the -s option by default
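The first proposal can be illustrated with a minimal sketch. This is not the actual check_crm code: the function names (`parse_standby_nodes`, `parse_stopped_nodes`, `check`) and the OK message are hypothetical, and the sketch assumes the plain-text crm_mon layout shown in this report (`Node <name>: standby` and `Stopped: [ <names> ]` lines).

```python
import re


def parse_standby_nodes(crm_mon_output):
    """Return the set of node names reported as 'standby'."""
    pattern = re.compile(r"^\s*Node\s+(\S+):\s+standby\b", re.MULTILINE)
    return set(pattern.findall(crm_mon_output))


def parse_stopped_nodes(crm_mon_output):
    """Return the set of node names listed on 'Stopped: [ ... ]' lines."""
    names = set()
    pattern = re.compile(r"^\s*Stopped:\s*\[\s*(.*?)\s*\]", re.MULTILINE)
    for match in pattern.finditer(crm_mon_output):
        names.update(match.group(1).split())
    return names


def check(crm_mon_output, ignore_standby=False):
    """Return a (nagios_exit_code, message) tuple.

    With ignore_standby=True (the behaviour proposed for -s), stopped
    resources on standby nodes are not treated as failures.
    """
    stopped = parse_stopped_nodes(crm_mon_output)
    if ignore_standby:
        stopped -= parse_standby_nodes(crm_mon_output)
    if stopped:
        # Message shape mirrors the alert quoted in this report.
        return 2, "check_crm CRITICAL - : %s Stopped" % " ".join(sorted(stopped))
    # Hypothetical OK message; the real script's wording may differ.
    return 0, "check_crm OK"
```

With the crm_mon output from this report, `check(output)` would still return CRITICAL for the two standby nodes, while `check(output, ignore_standby=True)` would return OK, because every stopped resource belongs to a standby node.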
tags: added: canonical-bootstack
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Medium
Changed in charm-hacluster:
assignee: nobody → Martin Kalcok (martin-kalcok)
status: Triaged → In Progress
Changed in charm-hacluster:
milestone: none → 21.01
Changed in charm-hacluster:
status: Fix Committed → Fix Released
This also affects the OpenStack charms, which can also be paused. In such a case, the host check (by Nagios) and the service checks (nrpe checks run by Nagios) should be disabled. We don't want to get socket timeout or host down alerts during known maintenance operations.