check space generate only part of the checks definition.

Bug #1986654 reported by Nikita Koltsov
22
This bug affects 5 people
Affects Status Importance Assigned to Milestone
NRPE Charm
Fix Released
Medium
Nikita Koltsov

Bug Description

Having two nested children sections in lsblk output caused charm to skip several definitions.

Here is the json, with this json only 6 check generated.

{
   "blockdevices": [
      {"name": "loop0", "maj:min": "7:0", "rm": "0", "size": "9M", "ro": "1", "type": "loop", "mountpoint": "/snap/canonical-livepatch/146"},
      {"name": "loop2", "maj:min": "7:2", "rm": "0", "size": "113.9M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/13308"},
      {"name": "loop3", "maj:min": "7:3", "rm": "0", "size": "9M", "ro": "1", "type": "loop", "mountpoint": "/snap/canonical-livepatch/138"},
      {"name": "loop4", "maj:min": "7:4", "rm": "0", "size": "114M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/13425"},
      {"name": "sda", "maj:min": "8:0", "rm": "0", "size": "372.6G", "ro": "0", "type": "disk", "mountpoint": null,
         "children": [
            {"name": "sda1", "maj:min": "8:1", "rm": "0", "size": "572M", "ro": "0", "type": "part", "mountpoint": "/boot/efi"},
            {"name": "sda2", "maj:min": "8:2", "rm": "0", "size": "18.6G", "ro": "0", "type": "part", "mountpoint": "/"},
            {"name": "sda3", "maj:min": "8:3", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/home"},
            {"name": "sda4", "maj:min": "8:4", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/tmp"},
            {"name": "sda5", "maj:min": "8:5", "rm": "0", "size": "37.3G", "ro": "0", "type": "part", "mountpoint": "/var"},
            {"name": "sda6", "maj:min": "8:6", "rm": "0", "size": "93.1G", "ro": "0", "type": "part", "mountpoint": "/var/log"},
            {"name": "sda7", "maj:min": "8:7", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/var/tmp"},
            {"name": "sda8", "maj:min": "8:8", "rm": "0", "size": "14G", "ro": "0", "type": "part", "mountpoint": "/var/log/audit"},
            {"name": "sda9", "maj:min": "8:9", "rm": "0", "size": "186.3G", "ro": "0", "type": "part", "mountpoint": null,
               "children": [
                  {"name": "bcache0", "maj:min": "252:0", "rm": "0", "size": "7.3T", "ro": "0", "type": "disk", "mountpoint": "/var/lib"}
               ]
            }
         ]
      },
      {"name": "sdb", "maj:min": "8:16", "rm": "0", "size": "7.3T", "ro": "0", "type": "disk", "mountpoint": null,
         "children": [
            {"name": "bcache0", "maj:min": "252:0", "rm": "0", "size": "7.3T", "ro": "0", "type": "disk", "mountpoint": "/var/lib"}
         ]
      }
   ]
}

Tags: bseng-1067

Related branches

Revision history for this message
Nikita Koltsov (nkoltsov) wrote :

Second children section is not handled correctly so None is returned as mountpoint which cause executing replace agains None object in https://git.launchpad.net/charm-nrpe/tree/hooks/nrpe_helpers.py#n570

Revision history for this message
Andreas Hamacher (andreashamacher) wrote :

likely related to 1979860

Revision history for this message
Andrea Ieri (aieri) wrote :

hi Nikita, I have tried running your json lsblk output through the get_partitions_to_check() code and I cannot reproduce the issue you describe. .get("children", []) is run for every blockdevice, and I thus get 9 mountpoints by line 842 (although "/boot/efi" is subsequently skipped):

```
In [9]: partitions_to_check
Out[9]:
{'/',
 '/boot/efi',
 '/home',
 '/tmp',
 '/var',
 '/var/lib',
 '/var/log',
 '/var/log/audit',
 '/var/tmp'}
```

I suggest running the config-changed hook through debug-hooks or debug-code to identify where the failure occurs in your environment.

Please mark the bug back to new once you have more information.

Also jfyi this upcoming merge proposal will slightly alter the listing of mountpoints that need to be checked: https://code.launchpad.net/~nikolay.vinogradov/charm-nrpe/+git/charm-nrpe/+merge/438754

Changed in charm-nrpe:
status: New → Incomplete
Revision history for this message
Andrea Ieri (aieri) wrote :

Also please ensure you are running the latest stable charm (rev 97 as of this writing) in case this bug has already been fixed.

Revision history for this message
Paul Goins (vultaire) wrote (last edit ):
Download full text (3.3 KiB)

I may be hitting the same issue here. I'm on revision 96 (deliberately for the sake of alignment with another cloud), but I've pulled the deployed sources for rev 96 and compared to a downloaded rev 97 nrpe charm and the sources appear the same.

"lsblk -J" output is as follows:

# lsblk -J
{
   "blockdevices": [
      {"name": "loop0", "maj:min": "7:0", "rm": "0", "size": "55.6M", "ro": "1", "type": "loop", "mountpoint": "/snap/core18/2714"},
      {"name": "loop1", "maj:min": "7:1", "rm": "0", "size": "6M", "ro": "1", "type": "loop", "mountpoint": "/snap/prometheus-grok-exporter/1"},
      {"name": "loop2", "maj:min": "7:2", "rm": "0", "size": "116.8M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/14784"},
      {"name": "loop3", "maj:min": "7:3", "rm": "0", "size": "21.8M", "ro": "1", "type": "loop", "mountpoint": "/snap/prometheus-libvirt-exporter/36"},
      {"name": "loop4", "maj:min": "7:4", "rm": "0", "size": "9M", "ro": "1", "type": "loop", "mountpoint": "/snap/canonical-livepatch/164"},
      {"name": "loop5", "maj:min": "7:5", "rm": "0", "size": "116.8M", "ro": "1", "type": "loop", "mountpoint": "/snap/core/14946"},
      {"name": "loop6", "maj:min": "7:6", "rm": "0", "size": "55.6M", "ro": "1", "type": "loop", "mountpoint": "/snap/core18/2721"},
      {"name": "sda", "maj:min": "8:0", "rm": "0", "size": "894.2G", "ro": "0", "type": "disk", "mountpoint": null,
         "children": [
            {"name": "crypt-<redacted>", "maj:min": "253:0", "rm": "0", "size": "894.2G", "ro": "0", "type": "crypt", "mountpoint": "/var/lib/nova/instances"}
         ]
      },
      {"name": "sdb", "maj:min": "8:16", "rm": "0", "size": "447.1G", "ro": "0", "type": "disk", "mountpoint": null,
         "children": [
            {"name": "sdb1", "maj:min": "8:17", "rm": "0", "size": "572M", "ro": "0", "type": "part", "mountpoint": "/boot/efi"},
            {"name": "sdb2", "maj:min": "8:18", "rm": "0", "size": "18.6G", "ro": "0", "type": "part", "mountpoint": "/"},
            {"name": "sdb3", "maj:min": "8:19", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/home"},
            {"name": "sdb4", "maj:min": "8:20", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/tmp"},
            {"name": "sdb5", "maj:min": "8:21", "rm": "0", "size": "223.5G", "ro": "0", "type": "part", "mountpoint": "/var"},
            {"name": "sdb6", "maj:min": "8:22", "rm": "0", "size": "93.1G", "ro": "0", "type": "part", "mountpoint": "/var/log"},
            {"name": "sdb7", "maj:min": "8:23", "rm": "0", "size": "4.7G", "ro": "0", "type": "part", "mountpoint": "/var/tmp"},
            {"name": "sdb8", "maj:min": "8:24", "rm": "0", "size": "93.1G", "ro": "0", "type": "part", "mountpoint": "/var/log/audit"}
         ]
      }
   ]
}

Upon upgrading to this version of the nrpe charm, I have 4 check_space alerts in nagios which return an UNKNOWN status with status message "NRPE: Command 'check_space_$(path)' not defined.

space_check is set to the default, which is:

    default: |-
      check: auto
      auto_params: -w 25% -c 20% -K 5%

With that, I get these checks coming through OK:

check_space_home (/home)
check_space_root...

Read more...

Changed in charm-nrpe:
status: Incomplete → Confirmed
Revision history for this message
Paul Goins (vultaire) wrote :

Hrm... I'm not sure this exactly matches the description after re-reading the original bug. But maybe it's pointing at the same root cause? I'm not sure.

Revision history for this message
Paul Goins (vultaire) wrote :

Noting the above - all my missing checks are sub-checks of an existing parent.
And to be clear - the checks are generated and shared via relation data - and I've confirmed that get_partitions_to_check would be returning the right set of values - but the files on disk for the subdirs are either not being created *or* are being cleaned up by mistake, perhaps after being created previously.

Revision history for this message
Paul Goins (vultaire) wrote :

I'm going to go out on a limb and suggest the problem is caused by the behavior of render_nrped_files in hooks/nrpe_utils.py. What it does is, prior to creating files for a check, it removes all "matching files" the check may have had.

That list of matching files is created in hooks/nrpe_helpers.py in SubordinateCheckDefinitions.__init__(). For each check created, a set of "matching files" is created using these patterns:

* /etc/nagios/nrpe.d/{}.cfg (direct match)
* /etc/nagios/nrpe.d/{}_*.cfg (subordinate match)

This means that, for the check_space checks, a check on /var (i.e. check_space_var), would match on files /etc/nagios/nrpe.d/check_space_var.cfg as well as the glob /etc/nagios/nrpe.d/check_space_var_*.cfg, which would also match the files used by the subdir checks.

At that point, it basically becomes a quirk of ordering. If we want this type of cleanup behavior, then we need to ensure that the checks are properly ordered so that if one check deletes the files from other checks, the other checks will then re-create them afterwards.

**I believe this is where the bug is happening:** get_partitions_to_check() is returning a set, which is inherently unordered. In order for this to work reliably, this needs to be reworked to return a list, with subordinate checks listed after the parent checks. ...If I'm not mistaken, this *may* be as simple as just wrapping the result of that function in list(sorted()).

Revision history for this message
Andrea Ieri (aieri) wrote :

Thanks, this is enough for us to set up a reproducer and a regression test case.
FTR, I think relying on ordering would be quite fragile, I'd rather look into making the removal of old checks more precise, possibly by altering the string substitution that causes character "_" to have multiple meanings.

Changed in charm-nrpe:
importance: Undecided → Medium
status: Confirmed → Triaged
tags: added: bseng-1067
Revision history for this message
Paul Goins (vultaire) wrote :

I'm all in favor for that; I agree with your sentiment.

Revision history for this message
Andrea Ieri (aieri) wrote :

One thing of note is that with the release of the grafana-agent machine charm to the edge channel (which exposes filesystem metrics via node-exporter by default), development work should concentrate on adding suitable alerts to our metrics-based solution. I'll keep this bug as medium priority within the context of the nrpe charm, although it should rather be considered a low in the wider scope of charmed monitoring solutions

Revision history for this message
Andrea Ieri (aieri) wrote :

I've filed https://github.com/canonical/grafana-agent-k8s-operator/issues/167 as a related issue as the grafana-agent charm does not currently ship an alert rule that covers space checks.

Revision history for this message
Przemyslaw Lal (przemeklal) wrote :

I'm on revision 97, also affected by this bug.

Revision history for this message
Marcus Boden (marcusboden) wrote :

To anyone who needs a quick fix immediately (can't wait for an upgrade, or can't upgrade), the following will fix it:

juju run -a nrpe -- sed -i 's/services.manager.reconfigure_services("nrpe-config")/services.get_manager().reconfigure_services("nrpe-config")/' actions/ack-reboot*

Revision history for this message
Pedro Castillo (peterctl) wrote :

I believe the above message should be moved to bug LP#2008046
https://bugs.launchpad.net/charm-nrpe/+bug/2008046

Andrea Ieri (aieri)
Changed in charm-nrpe:
status: Triaged → Fix Committed
Revision history for this message
Andrea Ieri (aieri) wrote :

This fix is now available in revision 110 (channel edge for now)

Revision history for this message
Nikita Koltsov (nkoltsov) wrote (last edit ):

This bug is not fixed actually, in the discussion we located another issue and fixed it. However nested children sections are still not handled correctly

I noticed this bug in the code, but yes it was not the root cause in the reported case, but it is a valid issue. Here is the link to the branch with unit test which could be used for reproduce the issue.
https://code.launchpad.net/~nkoltsov/charm-nrpe/+git/charm-nrpe/+ref/lp1986654

And output was captured from the real customer environment

Revision history for this message
Andrea Ieri (aieri) wrote :

Available in edge (rev 111)

Changed in charm-nrpe:
milestone: none → 23.10
Eric Chen (eric-chen)
Changed in charm-nrpe:
assignee: nobody → Nikita Koltsov (nkoltsov)
Tianqi Xiao (txiao)
Changed in charm-nrpe:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.