Nagios reports CRITICAL status with information "check_cororings CRITICAL - OK"

Bug #1902919 reported by Przemyslaw Hausman
This bug affects 1 person
Affects                     Status        Importance  Assigned to  Milestone
OpenStack HA Cluster Charm  Fix Released  High        Billy Olsen

Bug Description

For each OpenStack control plane service, Nagios reports CRITICAL status with Status Information "check_cororings CRITICAL - OK".

The service itself appears to be healthy, but the command that checks its status returns an incorrect result. See the example below for the aodh service.

ubuntu@juju-496933-0-lxd-0:/etc/nagios/nrpe.d$ cat check_corosync_rings.cfg
# check corosync_rings
# The following header was added automatically by juju
# Modifying it will affect nagios monitoring and alerting
# servicegroups: juju
command[check_corosync_rings]=/usr/local/lib/nagios/plugins/check_corosync_rings

ubuntu@juju-496933-0-lxd-0:/etc/nagios/nrpe.d$ sudo /usr/local/lib/nagios/plugins/check_corosync_rings
check_cororings CRITICAL - OK

ubuntu@juju-496933-0-lxd-0:/etc/nagios/nrpe.d$ sudo /usr/sbin/corosync-cfgtool -s
Printing link status.
Local node ID 1001
LINK ID 0
        addr = 10.100.20.56
        status = OK
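
For reference, the status values that the check has to interpret can be pulled out of this output with a few lines of Python. This is a minimal sketch, not the plugin's actual code (the plugin is a Perl script); the function name `parse_link_status` and the assumption that every status line has the form `status = <value>` are illustrative only.

```python
import re

# Sample output copied from the `corosync-cfgtool -s` run above.
CFGTOOL_OUTPUT = """\
Printing link status.
Local node ID 1001
LINK ID 0
        addr = 10.100.20.56
        status = OK
"""


def parse_link_status(output):
    """Return the value of every 'status = ...' line, in order of appearance."""
    # Hypothetical helper: assumes one 'status = <value>' line per link.
    return re.findall(r"^[ \t]*status[ \t]*=[ \t]*(.+?)[ \t]*$", output, flags=re.MULTILINE)


print(parse_link_status(CFGTOOL_OUTPUT))
```

On this deployment the only value extracted is the bare string `OK`, which is what ends up in the CRITICAL message.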

Przemyslaw Hausman (phausman) wrote :

Deployment versions:
- OpenStack Ussuri on Ubuntu Focal
- Charms revisions 20.10

no longer affects: charm-openstack-service-checks
no longer affects: charm-nrpe
Przemyslaw Hausman (phausman) wrote :

This is the offending line: https://github.com/openstack/charm-hacluster/blob/3689c551c272d36d4a53437293531c2c43b2c1d2/files/nrpe/check_corosync_rings#L109

The following patch fixes the problem:

< if ( $status =~ m/^ring (\d+) active with no faults/ ) {
---
> if ( $status =~ m/^ring (\d+) active with no faults|OK/ ) {
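
To illustrate why the original pattern fails and what the patch changes, here is a minimal Python sketch applying both regular expressions to the old-style and new-style status strings. The patterns are transcribed from the Perl lines above; the sample string `ring 0 active with no faults` stands in for the older wording that the original pattern expects, and the variable names are illustrative. Note that the alternation in the patched pattern splits the whole expression, so the `OK` branch is unanchored and leaves the ring-number capture group unset.

```python
import re

OLD_PATTERN = re.compile(r"^ring (\d+) active with no faults")
PATCHED_PATTERN = re.compile(r"^ring (\d+) active with no faults|OK")

old_style = "ring 0 active with no faults"  # wording the original check expects
new_style = "OK"                            # status printed in the report above

# Perl's `=~ m/.../` searches anywhere in the string, so use search(), not match().
for label, pattern in (("old", OLD_PATTERN), ("patched", PATCHED_PATTERN)):
    for status in (old_style, new_style):
        match = pattern.search(status)
        ring = match.group(1) if match else None
        print(f"{label:8} {status!r:32} matched={bool(match)} ring={ring}")
```

With the patched pattern the bare `OK` status is accepted, but no ring number is captured for it, and any status that merely contains `OK` would also be accepted.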

Przemyslaw Hausman (phausman) wrote :

Subscribing ~field-critical as this issue occurs on a customer deployment and no workaround is known yet.

Billy Olsen (billy-olsen) wrote :

Upstream corosync removed the "active with no faults" status with the addition of the Kronosnet transport in commit https://github.com/corosync/corosync/commit/268cde6ee48ae18004e9de5469f0be97a46e10a0. Per the commit message of https://github.com/corosync/corosync/commit/d7f5478b322354799787401873e4b6aedc2d621d, the udpu status will always be "OK".

One option is to change the check per comment #2 from Przemyslaw, but I don't think it's actually of any use, and the check should likely just not be used on focal.

Billy Olsen (billy-olsen) wrote :

I should also add that since corosync 2.99, when the Kronosnet transport commit was introduced, udp[u] always reports a status of "OK" and does not show any faults.
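
The version cut-off described here could be expressed as a simple guard. The following is a hypothetical sketch only: `ring_check_is_meaningful` and the version-string parsing are assumptions for illustration, not the charm's code, and the example version strings are arbitrary inputs. The fix that was eventually merged removes the check by Ubuntu release (eoan and later) rather than by inspecting the corosync version.

```python
import re


def ring_check_is_meaningful(corosync_version):
    """Return True if ring status can still report faults (corosync < 2.99).

    `corosync_version` is a dotted version string such as "2.4.3".
    Only the leading major.minor numbers are compared.
    """
    match = re.match(r"(\d+)\.(\d+)", corosync_version)
    if match is None:
        raise ValueError(f"unrecognised corosync version: {corosync_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # From 2.99 onwards the udp/udpu status is always "OK", so checking it is pointless.
    return (major, minor) < (2, 99)


for version in ("2.4.3", "2.99.0", "3.0.3"):
    print(version, ring_check_is_meaningful(version))
```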

Xav Paice (xavpaice) wrote :

Seems reasonable to me that if the status is always reported as OK there's no need to check it at all, since it has no meaning whatsoever.

Are there alternatives we should check instead, or is it reasonable to just rely on other existing checks?

Billy Olsen (billy-olsen) wrote :

I haven't found others upon my initial glance. The nrpe pieces still have the crm status check, so my proposal would simply be to remove the ring check. If another piece is relevant later, we should add that as a separate PR.

Changed in charm-hacluster:
assignee: nobody → Billy Olsen (billy-olsen)
importance: Undecided → High
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/761795

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (master)

Reviewed: https://review.opendev.org/761795
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=3080d64281afa5e65f39bb47d224d66d25bf702c
Submitter: Zuul
Branch: master

commit 3080d64281afa5e65f39bb47d224d66d25bf702c
Author: Billy Olsen <email address hidden>
Date: Fri Nov 6 14:20:00 2020 -0700

    Remove the corosync_rings check in eoan+

    Corosync 2.99 altered the status output for udp/udpu rings to
    be hardcoded to 'OK'. This breaks the check_corosync_rings nrpe
    check which is looking for 'ring $number active with no faults'.
    Since the value has been hardcoded to show 'OK', the check itself
    does not provide any real meaningful value.

    Change-Id: I642ecf11946b1ea791a27c54f0bec54adbfecb83
    Closes-Bug: #1902919

Changed in charm-hacluster:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-hacluster (stable/20.10)

Fix proposed to branch: stable/20.10
Review: https://review.opendev.org/761804

Przemyslaw Hausman (phausman) wrote :

I confirm that with the -next version of the charm the issue is no longer present.

Przemyslaw Hausman (phausman) wrote :

Thank you for working on it, @billy-olsen!

OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-hacluster (stable/20.10)

Reviewed: https://review.opendev.org/761804
Committed: https://git.openstack.org/cgit/openstack/charm-hacluster/commit/?id=506d432d315825ffa038c8c2e50e8b68518a9440
Submitter: Zuul
Branch: stable/20.10

commit 506d432d315825ffa038c8c2e50e8b68518a9440
Author: Billy Olsen <email address hidden>
Date: Fri Nov 6 14:20:00 2020 -0700

    Remove the corosync_rings check in eoan+

    Corosync 2.99 altered the status output for udp/udpu rings to
    be hardcoded to 'OK'. This breaks the check_corosync_rings nrpe
    check which is looking for 'ring $number active with no faults'.
    Since the value has been hardcoded to show 'OK', the check itself
    does not provide any real meaningful value.

    Change-Id: I642ecf11946b1ea791a27c54f0bec54adbfecb83
    Closes-Bug: #1902919
    (cherry picked from commit 3080d64281afa5e65f39bb47d224d66d25bf702c)

Michael Skalka (mskalka) wrote :

Dropping the crit subscription as a workaround has been found.

Changed in charm-hacluster:
status: Fix Committed → Fix Released
milestone: none → 20.10