check_all_disks includes squashfs /snap/* which are 100%

Bug #1827159 reported by Xav Paice
This bug affects 39 people
Affects | Status | Importance | Assigned to
Nagios Charm | Fix Released | Undecided | Unassigned
coreutils (Ubuntu) | Fix Released | Undecided | Unassigned
monitoring-plugins (Ubuntu) | Fix Released | Low | Bryce Harrington

Bug Description

[Impact]
Monitoring tools generate false-positive reports when artificial filesystems are mounted: these filesystems show 100% disk utilization and so add unnecessary (but dire-sounding) "DISK CRITICAL" noise.

[Test Case]
$ lxc launch ubuntu-daily:19.10/amd64 lp1827159
$ lxc exec lp1827159 bash
# apt-get -y update
# apt-get install monitoring-plugins
# snap install gnome-calculator
[...]
# /usr/lib/nagios/plugins/check_disk -w 10 -c 10
DISK CRITICAL - free space: / 1903 MB (1% inode=78%); /dev 0 MB (100% inode=99%); /dev/full 16018 MB (100% inode=99%); /dev/null 16018 MB (100% inode=99%); /dev/random 16018 MB (100% inode=99%); /dev/tty 16018 MB (100% inode=99%); /dev/urandom 16018 MB (100% inode=99%); /dev/zero 16018 MB (100% inode=99%); /dev/fuse 16018 MB (100% inode=99%); /dev/net/tun 16018 MB (100% inode=99%); /dev/lxd 0 MB (100% inode=99%); /dev/.lxd-mounts 0 MB (100% inode=99%); /dev/shm 16041 MB (100% inode=99%); /run 3208 MB (99% inode=99%); /run/lock 5 MB (100% inode=99%); /sys/fs/cgroup 16041 MB (100% inode=99%); /snap 1903 MB (1% inode=78%); /run/snapd/ns 3208 MB (99% inode=99%);| /=111171MB;119160;119160;0;119170 /dev=0MB;-10;-10;0;0 /dev/full=0MB;16008;16008;0;16018 /dev/null=0MB;16008;16008;0;16018 /dev/random=0MB;16008;16008;0;16018 /dev/tty=0MB;16008;16008;0;16018 /dev/urandom=0MB;16008;16008;0;16018 /dev/zero=0MB;16008;16008;0;16018 /dev/fuse=0MB;16008;16008;0;16018 /dev/net/tun=0MB;16008;16008;0;16018 /dev/lxd=0MB;-10;-10;0;0 /dev/.lxd-mounts=0MB;-10;-10;0;0 /dev/shm=0MB;16031;16031;0;16041 /run=0MB;3198;3198;0;3208 /run/lock=0MB;-5;-5;0;5 /sys/fs/cgroup=0MB;16031;16031;0;16041 /snap=111171MB;119160;119160;0;119170 /run/snapd/ns=0MB;3198;3198;0;3208

# /usr/lib/nagios/plugins/check_disk -w 10 -c 10 -e -X squashfs
DISK CRITICAL - free space: /dev 0 MB (100% inode=99%); /dev/lxd 0 MB (100% inode=99%); /dev/.lxd-mounts 0 MB (100% inode=99%); /run/lock 5 MB (100% inode=99%);| /=111392MB;119160;119160;0;119170 /dev=0MB;-10;-10;0;0 /dev/full=0MB;16008;16008;0;16018 /dev/null=0MB;16008;16008;0;16018 /dev/random=0MB;16008;16008;0;16018 /dev/tty=0MB;16008;16008;0;16018 /dev/urandom=0MB;16008;16008;0;16018 /dev/zero=0MB;16008;16008;0;16018 /dev/fuse=0MB;16008;16008;0;16018 /dev/net/tun=0MB;16008;16008;0;16018 /dev/lxd=0MB;-10;-10;0;0 /dev/.lxd-mounts=0MB;-10;-10;0;0 /dev/shm=0MB;16031;16031;0;16041 /run=0MB;3198;3198;0;3208 /run/lock=0MB;-5;-5;0;5 /sys/fs/cgroup=0MB;16031;16031;0;16041 /snap=111392MB;119160;119160;0;119170 /run/snapd/ns=0MB;3198;3198;0;3208

# /usr/lib/nagios/plugins/check_disk -w 10 -c 10 -e -X tmpfs
DISK OK| /=111171MB;119160;119160;0;119170 /dev/full=0MB;16008;16008;0;16018 /dev/null=0MB;16008;16008;0;16018 /dev/random=0MB;16008;16008;0;16018 /dev/tty=0MB;16008;16008;0;16018 /dev/urandom=0MB;16008;16008;0;16018 /dev/zero=0MB;16008;16008;0;16018 /dev/fuse=0MB;16008;16008;0;16018 /dev/net/tun=0MB;16008;16008;0;16018 /snap=111171MB;119160;119160;0;119170

[Regression Potential]
As this alters how out-of-space checks are handled, the regressions to watch for would involve filesystem checks reporting incorrectly. These tools underlie several different front-ends, so regression bugs may be filed in several places; however, they will tend to show error messages mentioning check_disk, nagios, and one of the newly excluded filesystem types (squashfs, tmpfs, or tracefs).

Note that there are likely other synthetic filesystems beyond tmpfs and tracefs (e.g. udev, usbfs, devtmpfs, fuse.*, ...) which might also cause similar false positives; these should be handled as separate bugs, although they can likely be fixed the same way.

[Fix]
check_disk.c in monitoring-plugins is modified to exclude the unwanted filesystem types (squashfs, tmpfs, and tracefs) by default (see patch).

[Discussion]
Several bug reports have been filed about false positives with different synthetic filesystems (see Dupes), including tracefs, squashfs, and tmpfs. The commonly discussed workaround is to exclude these when running the tools (e.g. by passing check_disk's '-X <fs>' parameter in check_all_disks). Since the underlying tools are typically run through wrappers, a string of -X... parameters can be appended there.

However, a cleaner solution is possible. monitoring-plugins' check_disk.c maintains an internal exclusion list, fs_exclude_list, which already excludes iso9660, and can be modified to add other filesystems to exclude by default.

In other words, check_disk.c is modified thusly:

  np_add_name(&fs_exclude_list, "iso9660");
  np_add_name(&fs_exclude_list, "squashfs");
  np_add_name(&fs_exclude_list, "tmpfs");
  np_add_name(&fs_exclude_list, "tracefs");

This code is added before the command-line parsing logic, so it only sets the default behavior. It does not prevent further adding or removing filesystems via the -X and -N parameters; for instance, anyone who wants to check tmpfs can add it back manually via "-N tmpfs".
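The ordering described above (defaults seeded first, command-line values parsed afterwards into the same list) can be illustrated with a small, self-contained sketch. It is not the real check_disk.c: `name_list`, `add_name`, `find_name`, and `build_exclude_list` are hypothetical stand-ins for monitoring-plugins' `np_add_name`/`np_find_name` helpers, and only the `-X` option is modeled (not `-N`).

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() and getopt() */
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical singly linked list of filesystem type names,
 * loosely modeled on the fs_exclude_list used by check_disk.c. */
struct name_list {
    char *name;
    struct name_list *next;
};

/* Prepend a copy of name to *list (stand-in for np_add_name). */
static void add_name(struct name_list **list, const char *name)
{
    struct name_list *node = malloc(sizeof *node);
    if (node == NULL)
        abort();
    node->name = strdup(name);
    if (node->name == NULL)
        abort();
    node->next = *list;
    *list = node;
}

/* Exact-match lookup (stand-in for np_find_name). */
static bool find_name(const struct name_list *list, const char *name)
{
    for (; list != NULL; list = list->next)
        if (strcmp(list->name, name) == 0)
            return true;
    return false;
}

/* Seed the defaults first, then let each -X <fstype> append another
 * entry. The list is never freed in this short sketch. */
static struct name_list *build_exclude_list(int argc, char **argv)
{
    struct name_list *excl = NULL;
    static const char *const defaults[] = {
        "iso9660", "squashfs", "tmpfs", "tracefs"
    };
    for (size_t i = 0; i < sizeof defaults / sizeof defaults[0]; i++)
        add_name(&excl, defaults[i]);

    int opt;
    optind = 1; /* restart getopt() in case it was used earlier */
    while ((opt = getopt(argc, argv, "X:")) != -1)
        if (opt == 'X')
            add_name(&excl, optarg);
    return excl;
}
```

For example, passing `-X ext4` leaves all four defaults in place and adds ext4 on top, which is why seeding the list before argument parsing changes only the default behavior.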

[Original Report]
When using Nagios to monitor the Nagios host itself, if the host is not a container, the template for checking disk space on the Nagios host does not exclude any snap filesystems. This means we get a Critical report whenever any snap is installed.

This can be changed by adding '-X squashfs' to the check_all_disks command, but that command is defined in the nagios plugins package.

(Or, perhaps '-X tmpfs'? -- bryce)


Andrea Ieri (aieri) wrote :

This has been fixed in nagios-32 by deploying a custom check_all_disks_no_squashfs check.

Changed in nagios-charm:
status: New → Fix Released
Andreas Hasenack (ahasenack) wrote :

This can be overridden in /etc/nagios-plugins/config/disk.cfg, right? By adding the parameter you suggested.

Changed in monitoring-plugins (Ubuntu):
status: New → Triaged
importance: Undecided → Low
Bryce Harrington (bryce)
description: updated
Bryce Harrington (bryce) wrote :

I was able to reproduce this using the command in /etc/nagios-plugins/config/disk.cfg; I've added it as a test case in the report body.

Adding -X squashfs did nothing, but -X tmpfs worked:

# /usr/lib/nagios/plugins/check_disk -w 10 -c 10 -e -X squashfs
DISK CRITICAL - free space: /dev 0 MB (100% inode=99%); /dev/lxd 0 MB (100% inode=99%); [...]

# /usr/lib/nagios/plugins/check_disk -w 10 -c 10 -e -X tmpfs
DISK OK| /=111171MB;119160;119160;0;119170

monitoring-plugins-gu/plugins/check_disk.c includes some filtering code, which skips:
* Remote file systems (me_remote)
* Pseudo file systems (me_dummy)
* Excluded filesystem types (fs_exclude_list)
* Excluded devices and mount points (dp_exclude_list)
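The order of these skip rules can be sketched as a single predicate. In check_disk.c, fs_exclude_list holds filesystem type names (filled by -X), while dp_exclude_list holds device names or mount points (filled by -x). The sketch below is illustrative, not the plugin's code: the `mount_info` struct and its field names (chosen to echo gnulib's `struct mount_entry`), the flags `show_local_only` and `show_all`, and the use of NULL-terminated string arrays instead of the plugin's list type are all assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one mount table entry. */
struct mount_info {
    const char *devname;  /* e.g. "/dev/loop3" */
    const char *mountdir; /* e.g. "/snap/core/7270" */
    const char *type;     /* e.g. "squashfs" */
    bool remote;          /* network filesystem (NFS, CIFS, ...) */
    bool dummy;           /* pseudo filesystem (proc, sysfs, ...) */
};

/* Return true if name appears in a NULL-terminated array of strings. */
static bool in_list(const char *const *list, const char *name)
{
    for (; list != NULL && *list != NULL; list++)
        if (strcmp(*list, name) == 0)
            return true;
    return false;
}

/* Apply the four skip rules listed above, in the same order. */
static bool should_skip(const struct mount_info *me,
                        bool show_local_only, bool show_all,
                        const char *const *fs_exclude, /* fs types */
                        const char *const *dp_exclude) /* devices/paths */
{
    if (me->remote && show_local_only)
        return true;                       /* remote filesystem */
    if (me->dummy && !show_all)
        return true;                       /* pseudo filesystem */
    if (in_list(fs_exclude, me->type))
        return true;                       /* excluded fs type */
    if (in_list(dp_exclude, me->devname) || in_list(dp_exclude, me->mountdir))
        return true;                       /* excluded device or path */
    return false;
}
```

With "squashfs" in the type list, a snap mount such as /snap/core/7270 is skipped by the third rule regardless of its size or fill level, which is the effect the proposed patch relies on.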

Looks like by default the fs_exclude_list has only one fs type added to it:

  np_add_name(&fs_exclude_list, "iso9660");

A possible patch might be to add tmpfs there too (see attached; untested).

tags: added: server-next
Bryce Harrington (bryce)
description: updated
tags: added: patch
Ramon Grullon (rgrullon) wrote :

We are currently experiencing this issue at a customer site, where there is a permission problem: the nagios user can't access this directory. This particular mount point is owned by root, and its permissions are set to 700.

Passing -X tracefs does not help. Would a new check be needed for tracefs, similar to the one for squashfs?

ubuntu@XXXXXXnagios-1:/snap/core/7270$ /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

ubuntu@XXXXXXnagios-1:/snap/core/7270$ sudo ls -ld /sys/kernel/debug/tracing
drwx------ 8 root root 0 May 9 11:22 /sys/kernel/debug/tracing

ubuntu@XXXXXXXXnagios-1:/snap/core/7270$ mount | grep /sys/kernel/debug/tracing
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
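To see which mounts on a given host fall under the type-based exclusions discussed in this bug (such as the tracefs mount shown above), one can walk the mount table. This is a sketch assuming glibc's `<mntent.h>` interface (`setmntent`, `getmntent`, `endmntent`) and `/proc/mounts` as the source; the type list is copied from the four names in the check_disk.c excerpt in the description, and `is_default_excluded`/`print_default_excluded` are hypothetical helper names.

```c
#define _DEFAULT_SOURCE /* glibc: expose the <mntent.h> API */
#include <mntent.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Filesystem types from the check_disk.c excerpt in the description. */
static const char *const default_excluded[] = {
    "iso9660", "squashfs", "tmpfs", "tracefs"
};

static bool is_default_excluded(const char *fstype)
{
    size_t n = sizeof default_excluded / sizeof default_excluded[0];
    for (size_t i = 0; i < n; i++)
        if (strcmp(default_excluded[i], fstype) == 0)
            return true;
    return false;
}

/* Print each mount whose type is in the list above.
 * Returns the number printed, or -1 if the table can't be opened. */
static int print_default_excluded(const char *mtab)
{
    FILE *fp = setmntent(mtab, "r");
    if (fp == NULL) {
        perror(mtab);
        return -1;
    }
    int count = 0;
    struct mntent *ent;
    while ((ent = getmntent(fp)) != NULL) {
        if (is_default_excluded(ent->mnt_type)) {
            printf("%-10s %s\n", ent->mnt_type, ent->mnt_dir);
            count++;
        }
    }
    endmntent(fp);
    return count;
}
```

Calling `print_default_excluded("/proc/mounts")` prints one line per matching mount (type, then mount point) and returns the count; on a host like the one above, the tracefs mount on /sys/kernel/debug/tracing would be among the lines printed.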

Bryce Harrington (bryce) wrote :

Ramon, I'm not yet certain whether this bug and LP #1516451 are precisely the same issue, although they can likely be solved in a similar fashion. Your customer's issue looks like a better match for LP #1516451, since it involves /sys/kernel/debug/tracing. I see you've already commented there with a better test case than the one I had found, so I'll follow up with you there.

Changed in monitoring-plugins (Ubuntu):
assignee: nobody → Bryce Harrington (bryce)
Bryce Harrington (bryce)
description: updated
Bryce Harrington (bryce)
description: updated
Bryce Harrington (bryce) wrote :
Changed in monitoring-plugins (Ubuntu):
status: Triaged → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package monitoring-plugins - 2.2-6ubuntu1

---------------
monitoring-plugins (2.2-6ubuntu1) focal; urgency=medium

  * d/p/exclude-tmpfs-squashfs-tracefs.patch: Ignore artificial filesystems
    that trigger false-positive DISK CRITICAL checks due to reporting as at
    100% capacity.
    (LP: #1827159)

 -- Bryce Harrington <email address hidden> Thu, 31 Oct 2019 00:21:55 +0000

Changed in monitoring-plugins (Ubuntu):
status: In Progress → Fix Released
Bryce Harrington (bryce)
Changed in coreutils (Ubuntu):
status: New → Fix Committed
Bryce Harrington (bryce) wrote :

df no longer shows squashfs mounts by default, as of today's coreutils update.

Changed in coreutils (Ubuntu):
status: Fix Committed → Fix Released
Hua Zhang (zhhuabj) wrote :

We have a customer on bionic who is also affected by this problem. Could someone nominate this bug for bionic as well?

I have uploaded bionic.debdiff and built a test PPA [1]; I verified it, and it works well. Thanks.

[1] https://launchpad.net/~zhhuabj/+archive/ubuntu/case338917

Changed in monitoring-plugins (Ubuntu Bionic):
assignee: nobody → Hua Zhang (zhhuabj)
importance: Undecided → Medium
tags: added: sts sts-sponsor-dgadomski
Changed in coreutils (Ubuntu Bionic):
status: New → Invalid
Hua Zhang (zhhuabj) wrote :

The SRU for bionic will be tracked in LP: #1940916.

no longer affects: coreutils (Ubuntu Bionic)
no longer affects: monitoring-plugins (Ubuntu Bionic)