NRPE check for ceph-osd fails with: File '/var/lib/nagios/ceph-osd-checks' doesn't exist

Bug #2019251 reported by Pedro Castillo
This bug affects 3 people
Affects         Status         Importance  Assigned to  Milestone
Ceph OSD Charm  Fix Committed  Medium      Unassigned

Bug Description

The NRPE check removes the temporary output file created by the collector cron job. This introduces a race condition: the check fails whenever it runs while the file is absent. The commit that added the check (faefe90ce6beb5d2b3721cfb637dd7d661cbc21f) says the file is removed so that the check always has fresh data, but in practice the check can fail, and it has been failing in one environment for 10+ hours straight. I think minute-old stale data is better than no data at all and a failing check. I therefore believe that removing the section of the check that deletes the output file [0] would fix the issue, since the collector cron job already handles an existing output file by opening it in write+truncate mode [1].

[0] https://opendev.org/openstack/charm-ceph-osd/src/branch/master/files/nagios/check_ceph_osd_services.py#L41-L45
[1] https://opendev.org/openstack/charm-ceph-osd/src/branch/master/files/nagios/collect_ceph_osd_services.py#L72
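
The race described above can be reproduced with a minimal sketch. This is not the charm's code: the function names (collect, check_and_delete, check_keep, demo) and the "OK" payload are hypothetical, and a temporary directory stands in for /var/lib/nagios. It only illustrates that a check which deletes the file fails on its next run unless the collector has run in between, while mode "w" makes a leftover file harmless for the collector.

```python
import os
import tempfile


def collect(path, status="OK"):
    # Stand-in for the collector cron job. Mode "w" creates the file or
    # truncates an existing one, so a leftover file is harmless.
    with open(path, "w") as f:
        f.write(status + "\n")


def check_and_delete(path):
    # Behaviour described in the report: read the result, then remove the file.
    if not os.path.exists(path):
        return "CRITICAL: File '{}' doesn't exist".format(path)
    with open(path) as f:
        result = f.read().strip()
    os.remove(path)
    return result


def check_keep(path):
    # Proposed behaviour: read the result and leave the file in place.
    if not os.path.exists(path):
        return "CRITICAL: File '{}' doesn't exist".format(path)
    with open(path) as f:
        return f.read().strip()


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "ceph-osd-checks")
        collect(path)
        first = check_and_delete(path)
        # A second check that runs before the next collector run finds no file.
        second = check_and_delete(path)
        collect(path)
        # Without deletion, repeated checks keep seeing the last result.
        kept = [check_keep(path), check_keep(path)]
        return first, second, kept
```

Calling demo() returns the result of the first deleting check, the failure of the second one, and the two results from the non-deleting variant.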

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Agreed that the race this could create can lead to false positives, which cause alert fatigue and can ultimately render a check useless. (As a side note, I can't help but wonder whether there's an underlying environment issue if it's failing for hours on end.)

Changed in charm-ceph-osd:
status: New → Triaged
importance: Undecided → Medium
Revision history for this message
Danny Cocks (dannycocks) wrote :

Adding to this, we now have a common root cause for this bug: we are deploying the COS stack while maintaining an old LMA stack. This means we have two upstream users of the NRPE check, which can trigger it in the same minute. In our case we find:

- collect_ceph_osd_services.py runs every minute at 01s
- nagios causes check_ceph_osd_services.py to run every 05:10s
- prometheus causes check_ceph_osd_services.py to run every 05:18s.

This means that prometheus will not see the file before it is regenerated. And it is not possible for us to ensure that these checks run at least one minute apart.

A simple "fix" is to not delete the file. But to preserve the implicit check for staleness, could we add a check for modification time of the file? Or use a second file to record the timestamp of the last check?
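
A minimal sketch of the modification-time idea, assuming a hypothetical helper read_if_fresh and a hypothetical 120-second threshold (neither is taken from the charm, and this is not necessarily what the merged fix does). It leaves the file in place and treats a missing or too-old file as a failure, using the standard Nagios plugin exit codes.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK = 0
CRITICAL = 2

# Hypothetical threshold: two collector intervals of one minute each.
MAX_AGE_SECONDS = 120


def read_if_fresh(path, max_age=MAX_AGE_SECONDS, now=None):
    """Return (exit_code, message) without deleting the file."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
        with open(path) as f:
            content = f.read().strip()
    except OSError:
        # Covers a missing file as well as one that vanishes mid-read.
        return CRITICAL, "CRITICAL: File '{}' doesn't exist".format(path)
    if age > max_age:
        return CRITICAL, "CRITICAL: {} is stale ({:.0f}s old)".format(path, age)
    return OK, content
```

With this approach several consumers can read the same file within one minute, and staleness is still detected if the collector stops running.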

Changed in charm-ceph-osd:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (master)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/900563
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/1494d9a24527769064b819d5e1da4fdaebf6dcd2
Submitter: "Zuul (22348)"
Branch: master

commit 1494d9a24527769064b819d5e1da4fdaebf6dcd2
Author: Pedro Castillo <email address hidden>
Date: Thu Nov 9 15:50:44 2023 -0600

    Refactor cache validation for the ceph-osd NRPE check

    Closes-Bug: #2019251
    Closes-Bug: #2021507
    Change-Id: Ib50414756165f2587f0127e572675c7ca8e31ef9

Changed in charm-ceph-osd:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-ceph-osd (stable/quincy.2)

Fix proposed to branch: stable/quincy.2
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/902885

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on charm-ceph-osd (master)

Change abandoned by "Danny Cocks <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-ceph-osd/+/899013
Reason: Similar fix was merged from Ib50414756165f2587f0127e572675c7ca8e31ef9

Revision history for this message
Aymen Frikha (aym-frikha) wrote :

Can we please backport this to stable/quincy ?
subscribed ~field-high

Revision history for this message
Peter Sabaini (peter-sabaini) wrote :

Will do -- thanks for the reminder

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-ceph-osd (stable/quincy.2)

Reviewed: https://review.opendev.org/c/openstack/charm-ceph-osd/+/902885
Committed: https://opendev.org/openstack/charm-ceph-osd/commit/e0459f1e2851a592180aa304bb56230fa9a98f38
Submitter: "Zuul (22348)"
Branch: stable/quincy.2

commit e0459f1e2851a592180aa304bb56230fa9a98f38
Author: Pedro Castillo <email address hidden>
Date: Thu Nov 9 15:50:44 2023 -0600

    Refactor cache validation for the ceph-osd NRPE check

    Also update test env, remove zed testing

    Closes-Bug: #2019251
    Closes-Bug: #2021507
    Change-Id: Ib50414756165f2587f0127e572675c7ca8e31ef9
    (cherry picked from commit 1494d9a24527769064b819d5e1da4fdaebf6dcd2)
