don't alert on paused units

Bug #1880576 reported by Andrea Ieri
Affects: OpenStack HA Cluster Charm
Status: Fix Released
Importance: Medium
Assigned to: Martin Kalcok

Bug Description

When a unit is paused, the cluster node is set in standby mode.

root@juju-3565e5-48:~# crm_mon -1rf
Stack: corosync
Current DC: juju-3565e5-47 (version 1.1.18-2b07d5c5a9) - partition with quorum
Last updated: Mon May 25 15:12:56 2020
Last change: Wed Apr 22 22:49:50 2020 by root via crm_attribute on juju-3565e5-47

3 nodes configured
4 resources configured

Node juju-3565e5-46: standby
Node juju-3565e5-47: standby
Online: [ juju-3565e5-48 ]

Full list of resources:

 Resource Group: grp_mysql_vips
     res_mysql_18e5f60_vip (ocf::heartbeat:IPaddr2): Started juju-3565e5-48
 Clone Set: cl_mysql_monitor [res_mysql_monitor]
     Started: [ juju-3565e5-48 ]
     Stopped: [ juju-3565e5-46 juju-3565e5-47 ]

Migration Summary:
* Node juju-3565e5-47:
* Node juju-3565e5-46:
* Node juju-3565e5-48:

This causes the nrpe alert to fire, even though it's a false positive.

The check_crm script has a -s flag that can be used to ignore standby nodes, but since a standby node will have its resources stopped, the script will still alert on those (with an unclear message):

root@juju-3565e5-48:~# /usr/local/lib/nagios/plugins/check_crm -s
check_crm CRITICAL - : juju-3565e5-46 juju-3565e5-47 Stopped

I think the following should be done:
* check_crm should not alert on stopped resources that belong to nodes in standby, if the -s option has been provided
* the hacluster charm should invoke check_crm with the -s option by default
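
The first proposal can be sketched in a few lines. This is only an illustration in Python, not the actual check_crm plugin: the function name `unexpected_stopped` and the `ignore_standby` parameter are hypothetical, and the parsing assumes the crm_mon output format pasted above.

```python
import re

# Matches the "Node <name>: standby" lines shown in the crm_mon output above.
STANDBY_RE = re.compile(r"^Node[ \t]+(\S+?):[ \t]+standby\b", re.MULTILINE)
# Matches "Stopped: [ node-a node-b ]" lines listed under a clone set.
STOPPED_RE = re.compile(r"^[ \t]*Stopped:[ \t]*\[[ \t]*(.*?)[ \t]*\]", re.MULTILINE)


def unexpected_stopped(crm_mon_output, ignore_standby=True):
    """Return nodes with stopped resources, optionally skipping standby nodes.

    Hypothetical helper: ignore_standby plays the role of the -s option.
    """
    standby = set(STANDBY_RE.findall(crm_mon_output)) if ignore_standby else set()
    stopped = []
    for group in STOPPED_RE.findall(crm_mon_output):
        for node in group.split():
            # Keep first-seen order and drop nodes that are knowingly in standby.
            if node not in standby and node not in stopped:
                stopped.append(node)
    return stopped


# Excerpt of the crm_mon output from this report.
SAMPLE = """\
Node juju-3565e5-46: standby
Node juju-3565e5-47: standby
Online: [ juju-3565e5-48 ]

 Clone Set: cl_mysql_monitor [res_mysql_monitor]
     Started: [ juju-3565e5-48 ]
     Stopped: [ juju-3565e5-46 juju-3565e5-47 ]
"""

if __name__ == "__main__":
    print(unexpected_stopped(SAMPLE))
    print(unexpected_stopped(SAMPLE, ignore_standby=False))
```

With standby nodes ignored, nothing is left to alert on for the output in this report. Without the filter, both paused nodes are reported as stopped, which is the false positive described above.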

Alvaro Uria (aluria) wrote :

This also affects the OpenStack charms, which can also be paused. In such cases, the host check (by Nagios) and the service checks (NRPE checks run by Nagios) should be disabled. We don't want to get socket timeout or host down alerts during known maintenance operations.

James Page (james-page)
tags: added: canonical-bootstack
Changed in charm-hacluster:
status: New → Triaged
importance: Undecided → Medium
Changed in charm-hacluster:
assignee: nobody → Martin Kalcok (martin-kalcok)
status: Triaged → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/758416

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/758416
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=c385fef7b0659f673844a317eb6ba14f8f1821c8
Submitter: Zuul
Branch: master

commit c385fef7b0659f673844a317eb6ba14f8f1821c8
Author: Martin Kalcok <email address hidden>
Date: Fri Nov 6 12:24:57 2020 +0100

    NRPE: Don't report paused hacluster nodes as CRITICAL error

    Previously, paused hacluster units showed up CRITICAL error
    in nagios even though they were only in the 'standby' mode
    in corosync.
    The hacluster charm now uses the '-s' option of the check_crm
    nrpe script to ignore alerts of the standby units.

    Change-Id: I976d5ff01d0156fbaa91f9028ac81b44c96881af
    Closes-Bug: #1880576

Changed in charm-hacluster:
status: In Progress → Fix Committed
David Ames (thedac)
Changed in charm-hacluster:
milestone: none → 21.01
David Ames (thedac)
Changed in charm-hacluster:
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/840227
Committed: https://opendev.org/openstack/charm-hacluster/commit/a0b419519cd438affb24ff80c0221cc33d884c9a
Submitter: "Zuul (22348)"
Branch: master

commit a0b419519cd438affb24ff80c0221cc33d884c9a
Author: Gabriel Cocenza <email address hidden>
Date: Mon May 2 19:17:36 2022 -0300

    Fix standby node regex for check_crm

    Pacemaker has changed the output format of crm_mon and this broke
    the regex to catch nodes that are on standby mode. This change
    updates the regex for not alerting on paused units.

    Change-Id: I137acad076bff58506fea6e1618a00765adacd9b
    Closes-Bug: #1971182
    Related-Bug: #1880576
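
For illustration only, here is a Python sketch of a standby-node pattern that tolerates more than one crm_mon layout. It is not the regex from this commit. The old layout is the one pasted in the bug description. The "new" layout, with the node line indented and prefixed by a `*` bullet, is an assumption about newer Pacemaker output and is not taken from this report. The names `STANDBY_NODE_RE` and `standby_nodes` are hypothetical.

```python
import re

# Optional indentation and an optional "*" bullet before "Node".
# The bullet prefix is an assumed newer crm_mon layout, not confirmed here.
STANDBY_NODE_RE = re.compile(
    r"^[ \t]*(?:\*[ \t]*)?Node[ \t]+(\S+?):[ \t]+standby\b",
    re.MULTILINE,
)


def standby_nodes(crm_mon_output):
    """Return the names of nodes that crm_mon reports as standby."""
    return STANDBY_NODE_RE.findall(crm_mon_output)


# Layout seen in the bug description (Pacemaker 1.1.18).
OLD_FORMAT = (
    "Node juju-3565e5-46: standby\n"
    "Node juju-3565e5-47: standby\n"
    "Online: [ juju-3565e5-48 ]\n"
)

# Assumed newer layout, written by hand for this example.
ASSUMED_NEW_FORMAT = (
    "  * Node juju-3565e5-46: standby\n"
    "  * Online: [ juju-3565e5-48 ]\n"
)

if __name__ == "__main__":
    print(standby_nodes(OLD_FORMAT))
    print(standby_nodes(ASSUMED_NEW_FORMAT))
```

Anchoring on the `Node <name>: standby` part and making the prefix optional keeps the match from depending on the exact decoration around the line, so a cosmetic change in crm_mon output is less likely to break it again.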

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-hacluster (stable/focal)

Related fix proposed to branch: stable/focal
Review: https://review.opendev.org/c/openstack/charm-hacluster/+/841589

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to charm-hacluster (stable/jammy)

Related fix proposed to branch: stable/jammy
Review: https://review.opendev.org/c/openstack/charm-hacluster/+/841590

OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-hacluster (stable/jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/841590
Committed: https://opendev.org/openstack/charm-hacluster/commit/da1e0ff22b9e3960acf77de13f3637fe63873b3a
Submitter: "Zuul (22348)"
Branch: stable/jammy

commit da1e0ff22b9e3960acf77de13f3637fe63873b3a
Author: Gabriel Cocenza <email address hidden>
Date: Mon May 2 19:17:36 2022 -0300

    Fix standby node regex for check_crm

    Pacemaker has changed the output format of crm_mon and this broke
    the regex to catch nodes that are on standby mode. This change
    updates the regex for not alerting on paused units.

    Change-Id: I137acad076bff58506fea6e1618a00765adacd9b
    Closes-Bug: #1971182
    Related-Bug: #1880576
    (cherry picked from commit a0b419519cd438affb24ff80c0221cc33d884c9a)

tags: added: in-stable-jammy
OpenStack Infra (hudson-openstack) wrote : Related fix merged to charm-hacluster (stable/focal)

Reviewed: https://review.opendev.org/c/openstack/charm-hacluster/+/841589
Committed: https://opendev.org/openstack/charm-hacluster/commit/7eba2bfeb0059fcaeb8cca3bd9526cb18debc2c5
Submitter: "Zuul (22348)"
Branch: stable/focal

commit 7eba2bfeb0059fcaeb8cca3bd9526cb18debc2c5
Author: Gabriel Cocenza <email address hidden>
Date: Mon May 2 19:17:36 2022 -0300

    Fix standby node regex for check_crm

    Pacemaker has changed the output format of crm_mon and this broke
    the regex to catch nodes that are on standby mode. This change
    updates the regex for not alerting on paused units.

    Change-Id: I137acad076bff58506fea6e1618a00765adacd9b
    Closes-Bug: #1971182
    Related-Bug: #1880576
    (cherry picked from commit a0b419519cd438affb24ff80c0221cc33d884c9a)

tags: added: in-stable-focal