check_disk plugin broken after upgrade to 15.10

Bug #1516451 reported by Ralf G. R. Bergs on 2015-11-15
nagios-plugins (Ubuntu)

Bug Description

I didn't touch my Nagios config; I just updated my system from 15.04 to 15.10. Suddenly the default localhost/Disk Space check fails with the following output:

DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

This can be reproduced when manually running the underlying command as user "nagios":

$ /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

When I run it as root it works:

# /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK OK| /dev=0MB;1186;1334;0;1483 /run=8MB;239;269;0;299 /=17157MB;57386;64559;0;71733 /dev/shm=0MB;1199;1349;0;1499 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;1199;1349;0;1499 /boot=48MB;181;204;0;227 /run/user/0=0MB;239;269;0;299

It seems the "nagios" user cannot access the directory the plugin tries to read:

# ls -la /sys/kernel/debug/tracing
drwx------ 7 root root 0 Nov 15 19:40 .

# lsb_release -rd
Description: Ubuntu 15.10
Release: 15.10

# apt-cache policy nagios-plugins-basic
  Installed: 1.5-3ubuntu1
  Candidate: 1.5-3ubuntu1
  Version table:
 *** 1.5-3ubuntu1 0
        500 wily/main amd64 Packages
        100 /var/lib/dpkg/status

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nagios-plugins (Ubuntu):
status: New → Confirmed
Ben Coleman (oloryn) wrote :

Also note that while 15.10 does change the permissions on /sys/kernel/debug/tracing (from drwxr-xr-x in 15.04 to drwx------ in 15.10), the permissions on /sys/kernel/debug are drwx------ on both 15.10 and 15.04 - which means that /sys/kernel/debug shouldn't be readable from a non-root account on either release, so this looks like a code change in check_disk.

Brian Morton (rokclimb15) wrote :

strace confirms that check_disk on 12.04 doesn't check /sys/kernel/debug/tracing

I am not having any luck tracking down a code change in the monitoring-plugins GitHub repo. I wonder if this is a change in a dependent library instead.

Here's a workaround:

sudo chown root:root /usr/lib/nagios/plugins/check_disk
sudo chmod u+s /usr/lib/nagios/plugins/check_disk
sudo chmod o+x /usr/lib/nagios/plugins/check_disk

Brian Morton (rokclimb15) wrote :

I suspect there isn't a code change here, but rather a difference in the way Ubuntu is presenting its mount points. The plugin tries to enumerate and check all mounts. A better approach might be to list the actual mount points to be monitored with -p:

/usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e -p / -p /var -p /boot

none on /sys/kernel/debug type debugfs (rw)


debugfs on /sys/kernel/debug type debugfs (rw,relatime)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
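
To compare what two releases expose as separate mounts, a small helper can summarise a mount table by filesystem type. This is only a sketch: list_fs_types is a hypothetical name, and it assumes input in the /proc/self/mounts column layout (device, mount point, type, options, dump, pass), which is not the same as the "mount" output quoted above. The sample lines are rewritten from the debugfs/tracefs entries above into that layout.

```shell
#!/bin/sh
# list_fs_types [FILE...]: print each filesystem type found in a mount table
# together with the number of mount points using it, sorted by type name.
# Assumption: input uses the /proc/self/mounts layout, so the type is field 3.
# With no FILE argument the table is read from standard input.
list_fs_types() {
    awk '{ count[$3]++ } END { for (t in count) print t, count[t] }' "$@" |
        LC_ALL=C sort
}

# Demo on a two-line sample (hypothetical rendering of the entries above).
list_fs_types <<'EOF'
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,relatime 0 0
EOF
```

On a live system the same function could be pointed at /proc/self/mounts; a type such as tracefs showing up on one release but not the other would support the idea that the mount layout changed rather than the plugin.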

Robie Basak (racb) wrote :

Thank you for taking the time to report and investigate this bug and helping to make Ubuntu better.

It sounds to me like check_disk should have a blacklist of filesystem types to ignore. But explicitly specifying which mount points looks like a suitable workaround.

I wonder if this affects monitoring-plugins in Xenial?

Changed in nagios-plugins (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Gabriele Tozzi (gabriele-tozzi) wrote :

You can use the --exclude-type option to work around this bug:

/usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs

The same probably applies to gvfs-fuse filesystems. I recommend:

--exclude-type=tracefs --exclude-type=fuse.gvfsd-fuse
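
If the list of types to skip grows, the options can be generated from a plain list instead of being typed out each time. A minimal sketch, assuming nothing beyond the --exclude-type option shown above; build_exclude_args is a hypothetical helper name, and the final command is only printed, not executed.

```shell
#!/bin/sh
# build_exclude_args TYPE...: print one --exclude-type=TYPE option per
# argument, separated by single spaces, followed by a newline.
build_exclude_args() {
    sep=''
    for fstype in "$@"; do
        printf '%s%s' "$sep" "--exclude-type=$fstype"
        sep=' '
    done
    echo
}

# Print the resulting command line instead of running it, so the sketch
# also works on hosts where the plugin is not installed.
echo "/usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e $(build_exclude_args tracefs fuse.gvfsd-fuse)"
```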

Darragh Grealish (grealish) wrote :

This is also broken in Ubuntu 16.04; however, the workaround mentioned above works:
/usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs

TomaszChmielewski (mangoo-wpkg) wrote :

The workaround is not really great when LXD/LXC is in use:

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs
DISK CRITICAL - /run/lxcfs/controllers is not accessible: Permission denied

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=cgroup
DISK CRITICAL - /run/lxcfs/controllers is not accessible: Permission denied

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=tmpfs
DISK CRITICAL - /run/lxcfs/controllers/blkio is not accessible: Permission denied

So it only works when we exclude all three above, including tmpfs:

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=cgroup --exclude-type=tmpfs

However, tmpfs is very often used for /tmp and /dev/shm, which are also important to monitor, and --exclude-type=tmpfs makes the check skip these mount points.
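
Before excluding a whole filesystem type, it can help to see exactly which mount points and types sit under the failing path. The following is a sketch only: mounts_under is a hypothetical helper, it assumes a table in /proc/self/mounts layout, and the sample entries are made up for illustration (chosen to be consistent with the errors above, not copied from a real LXD host).

```shell
#!/bin/sh
# mounts_under PREFIX [FILE...]: print "mountpoint type" for every entry
# whose mount point is PREFIX itself or lies below PREFIX/.
# Assumption: /proc/self/mounts layout (mount point = field 2, type = field 3).
# With no FILE argument the table is read from standard input.
mounts_under() {
    prefix=$1
    shift
    awk -v p="$prefix" '$2 == p || index($2, p "/") == 1 { print $2, $3 }' "$@"
}

# Demo with illustrative (not real) entries.
mounts_under /run/lxcfs/controllers <<'EOF'
tmpfs /run tmpfs rw 0 0
tmpfs /run/lxcfs/controllers tmpfs rw 0 0
blkio /run/lxcfs/controllers/blkio cgroup rw 0 0
tmpfs /dev/shm tmpfs rw 0 0
EOF
```

If the real table looks like this sample, only the entries under /run/lxcfs/controllers are the problem, and the tmpfs mounts for /run and /dev/shm are unrelated to the failure.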

Nicholas Sherlock (n-sherlock) wrote :

Rather than excluding tmpfs, just exclude /run/lxcfs/controllers. This is the check_all_disks command I'm now using in my /etc/nagios-plugins/config/disk.cfg:

# 'check_all_disks' command definition
define command{
    command_name check_all_disks
    command_line /usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e -A --exclude-type=tracefs --exclude-type=cgroup --exclude_device=/run/lxcfs/controllers
}

Danny Howard (dannyman) wrote :

We normally netboot but we had a few machines that would not PXE, so we installed 14.04 via medium. Afterwards, some of the medium-installed machines were throwing this error in Nagios. Found this line in /etc/mtab on the afflicted hosts:

tracefs /var/lib/ureadahead/debugfs/tracing tracefs rw,relatime 0 0

This appears to be an artifact of the medium-based install process. I removed the above line and ran:

sudo service nagios-nrpe-server restart

Error condition cleared.

Danny Howard (dannyman) wrote :

Possibly related to #499773 which is about install adding spurious entries to mtab.

Marius Gedminas (mgedmin) wrote :

These days /etc/mtab is a symlink to /proc/self/mounts, so you cannot control what is exposed there.
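
Whether a given host still has an editable /etc/mtab can be checked directly. A small sketch; mount_table_source is a hypothetical name, and it relies on the readlink utility being available, which is an assumption about the host.

```shell
#!/bin/sh
# mount_table_source [PATH]: report whether PATH (default /etc/mtab) is a
# symlink, a regular file, or missing. For a symlink, show its target.
mount_table_source() {
    path=${1:-/etc/mtab}
    if [ -L "$path" ]; then
        printf '%s is a symlink to %s\n' "$path" "$(readlink "$path")"
    elif [ -f "$path" ]; then
        printf '%s is a regular file\n' "$path"
    else
        printf '%s does not exist or is not a regular file\n' "$path"
    fi
}

mount_table_source
```

If the path is reported as a symlink into /proc, removing a line as described earlier is not possible, because the kernel generates the contents.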

Gerald Combs (gerald.combs) wrote :

This appears to be fixed upstream via

Gerald Combs (gerald.combs) wrote :

Oops - please disregard comment #14 - it's specific to Icinga.

Ian Gibbs (realflash-uk) wrote :

Since check_all_disks is internally defined in Nagios, you might well see a "duplicate definition" error if you define your own check_all_disks command. I'd recommend

define command{
    command_name check_all_physical_disks
    command_line /usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e -A --exclude-type=tracefs --exclude-type=cgroup --exclude_device=/run/lxcfs/controllers
}

instead, and then call that in your host definition:

define service {
 use generic-service
 hostgroup_name all
 service_description Disk Space
 check_command check_all_physical_disks!6%!4%
}

