Ubuntu
nagios-plugins package

check_disk plugin broken after upgrade to 15.10

Bug #1516451 reported by Ralf G. R. Bergs on 2015-11-15

This bug report is a duplicate of: Bug #1827159: check_all_disks includes squashfs /snap/* which are 100%.

This bug affects 35 people

Affects                   Status    Importance   Assigned to        Milestone
nagios-plugins (Ubuntu)   Triaged   High         Bryce Harrington

Bug Description

I didn't touch my Nagios config; I just updated my system from 15.04 to 15.10. Suddenly the default localhost/Disk Space check fails with the following output:

DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

This can be reproduced when manually running the underlying command as user "nagios":

$ /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

When I run it as root it works:

# /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK OK| /dev=0MB;1186;1334;0;1483 /run=8MB;239;269;0;299 /=17157MB;57386;64559;0;71733 /dev/shm=0MB;1199;1349;0;1499 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;1199;1349;0;1499 /boot=48MB;181;204;0;227 /run/user/0=0MB;239;269;0;299

It seems the "nagios" user can't access the directory the plugin tries to read:

# ls -la /sys/kernel/debug/tracing
drwx------ 7 root root 0 Nov 15 19:40 .

# lsb_release -rd
Description: Ubuntu 15.10
Release: 15.10

# apt-cache policy nagios-plugins-basic
nagios-plugins-basic:
  Installed: 1.5-3ubuntu1
  Candidate: 1.5-3ubuntu1
  Version table:
*** 1.5-3ubuntu1 0
        500 http://de.archive.ubuntu.com/ubuntu/ wily/main amd64 Packages
        100 /var/lib/dpkg/status
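
To narrow down which mount points trigger the "is not accessible" error for a given user, a small loop over the mount table can help. This is only a sketch: list_inaccessible_mounts is a hypothetical helper name, and the existence test merely approximates what check_disk does internally.

```shell
#!/bin/sh
# Sketch: list mount points the current user cannot stat, together with
# their filesystem type. list_inaccessible_mounts is a hypothetical name.
# Note: mount points containing spaces appear as \040 in /proc/self/mounts
# and are not decoded here.
list_inaccessible_mounts() {
    mounts_file="${1:-/proc/self/mounts}"
    # Fields per line: device mountpoint fstype options dump pass
    while read -r _dev mountpoint fstype _rest; do
        # [ -e ] is false whenever stat() fails, including on EACCES
        if [ ! -e "$mountpoint" ]; then
            printf '%s %s\n' "$mountpoint" "$fstype"
        fi
    done < "$mounts_file"
}
```

Saved as a script that calls the function and run as the nagios user, this prints one "mountpoint fstype" pair per inaccessible mount. The second column shows which filesystem types are candidates for exclusion.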

Revision history for this message

Launchpad Janitor (janitor) wrote on 2015-11-19:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nagios-plugins (Ubuntu):
status:	New → Confirmed

Ben Coleman (oloryn) wrote on 2015-11-25:

Also note that while 15.10 does change the permissions on /sys/kernel/debug/tracing (from drwxr-xr-x in 15.04 to drwx------ in 15.10), the permissions on /sys/kernel/debug are drwx------ on both 15.10 and 15.04. That means /sys/kernel/debug shouldn't be readable from a non-root account on either release, so this looks like a code change in check_disk.

Brian Morton (rokclimb15) wrote on 2015-12-16:

strace confirms that check_disk on 12.04 doesn't check /sys/kernel/debug/tracing.

I'm not having any luck tracking down a code change in the monitoring-plugins GitHub repo. I wonder if this is a change in a dependent lib instead.

Here's a workaround that makes check_disk setuid root:

sudo chown root:root /usr/lib/nagios/plugins/check_disk
sudo chmod u+s /usr/lib/nagios/plugins/check_disk
sudo chmod o+x /usr/lib/nagios/plugins/check_disk

Brian Morton (rokclimb15) wrote on 2015-12-16:

I suspect there isn't a code change here, but rather a difference in the way Ubuntu is presenting its mount points. The plugin tries to enumerate and check all mounts. A better approach might be to name the actual mount points to be monitored with -p:

/usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e -p / -p /var -p /boot

12.04:
mount
<snip>
none on /sys/kernel/debug type debugfs (rw)
<snip>

14.04:

debugfs on /sys/kernel/debug type debugfs (rw,relatime)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
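
The -p approach above can be partly automated by deriving the list from the mount table. A minimal sketch follows, assuming an allow-list of on-disk filesystem types. build_p_args and the default type list are illustrative assumptions, not part of the plugin.

```shell
#!/bin/sh
# Sketch: print "-p <mountpoint>" for every mount whose filesystem type is
# on an allow-list. build_p_args and the default types are assumptions.
build_p_args() {
    mounts_file="${1:-/proc/self/mounts}"
    wanted_types="${2:-ext2 ext3 ext4 xfs btrfs vfat}"
    awk -v types="$wanted_types" '
        BEGIN { n = split(types, t, " "); for (i = 1; i <= n; i++) keep[t[i]] = 1 }
        ($3 in keep) { printf "%s-p %s", sep, $2; sep = " " }
        END { if (sep != "") print "" }
    ' "$mounts_file"
}

# Usage (word splitting of the result is intentional):
#   /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e $(build_p_args)
```

Pseudo filesystems such as debugfs and tracefs never reach check_disk this way, at the cost of having to extend the allow-list when a new on-disk filesystem type is introduced.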

Robie Basak (racb) wrote on 2015-12-16:

Thank you for taking the time to report and investigate this bug and helping to make Ubuntu better.

It sounds to me like check_disk should have a blacklist of filesystem types to ignore. But explicitly specifying the mount points to check looks like a suitable workaround.

I wonder if this affects monitoring-plugins in Xenial?

Changed in nagios-plugins (Ubuntu):
importance:	Undecided → Medium
status:	Confirmed → Triaged

Gabriele Tozzi (gabriele-tozzi) wrote on 2016-01-16:

You can use the --exclude-type option to work around this bug:

/usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs

Harald Hannelius (harald-arcada) wrote on 2016-05-20:

The same probably applies to gvfs-fuse filesystems. I recommend:

--exclude-type=tracefs --exclude-type=fuse.gvfsd-fuse

Darragh Grealish (grealish) wrote on 2016-05-20:

This is also broken in Ubuntu 16.04; however, the workaround mentioned above works:
/usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs

TomaszChmielewski (mangoo-wpkg) wrote on 2016-05-21:

The workaround is not really great when LXD/LXC is in use:

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs
DISK CRITICAL - /run/lxcfs/controllers is not accessible: Permission denied

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=cgroup
DISK CRITICAL - /run/lxcfs/controllers is not accessible: Permission denied

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=tmpfs
DISK CRITICAL - /run/lxcfs/controllers/blkio is not accessible: Permission denied

So it only works when we exclude all three above, including tmpfs:

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=cgroup --exclude-type=tmpfs

However, tmpfs is very often used for /tmp, /dev/shm, which are also important to monitor - and --exclude-type=tmpfs makes the check skip these mountpoints.

Nicholas Sherlock (n-sherlock) wrote on 2016-09-02:

#10

Rather than excluding tmpfs, just exclude /run/lxcfs/controllers. This is the check_all_disks command I'm now using in my /etc/nagios-plugins/config/disk.cfg:

# 'check_all_disks' command definition
define command{
command_name check_all_disks
command_line /usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e -A --exclude-type=tracefs --exclude-type=cgroup --exclude_device=/run/lxcfs/controllers
}
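
Before putting a command line like the one above into disk.cfg, it can help to run it by hand and look at the exit status, since Nagios derives the service state from it (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The following is a rough sketch. nagios_state and run_check are hypothetical helper names.

```shell
#!/bin/sh
# Map a plugin exit status to the corresponding Nagios state name.
nagios_state() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        3) echo UNKNOWN ;;
        *) echo "INVALID($1)" ;;   # outside the documented 0-3 range
    esac
}

# Run any plugin command line, report the resulting state on stderr and
# pass the exit status through. run_check is a hypothetical wrapper.
run_check() {
    "$@"
    status=$?
    echo "state: $(nagios_state "$status")" >&2
    return "$status"
}

# Usage:
#   run_check sudo -u nagios /usr/lib/nagios/plugins/check_disk \
#       -w '20%' -c '10%' -e -A --exclude-type=tracefs
```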

Danny Howard (dannyman) wrote on 2016-09-13:

#11

We normally netboot, but we had a few machines that would not PXE boot, so we installed 14.04 from installation media. Afterwards, some of the media-installed machines were throwing this error in Nagios. I found this line in /etc/mtab on the afflicted hosts:

tracefs /var/lib/ureadahead/debugfs/tracing tracefs rw,relatime 0 0

This appears to be an artifact of the media-based install process. I removed the above line and ran:

sudo service nagios-nrpe-server restart

Error condition cleared.

Danny Howard (dannyman) wrote on 2016-09-13:

#12

Possibly related to #499773, which is about the installer adding spurious entries to mtab.

Marius Gedminas (mgedmin) wrote on 2016-12-06:

#13

These days /etc/mtab is a symlink to /proc/self/mounts, so you cannot control what is exposed there.
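
Whether a given host still has an editable /etc/mtab or the symlink described above can be checked quickly. This is a minimal sketch. describe_mtab is a hypothetical helper name.

```shell
#!/bin/sh
# Report whether a path is a symlink (and where it points), a regular file,
# or missing. Defaults to /etc/mtab. describe_mtab is a hypothetical name.
describe_mtab() {
    path="${1:-/etc/mtab}"
    if [ -L "$path" ]; then
        echo "symlink -> $(readlink "$path")"
    elif [ -f "$path" ]; then
        echo "regular file"
    else
        echo "missing"
    fi
}
```

If the result is a symlink into /proc, removing a line as in comment #11 is not possible, because the kernel generates the content.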

Gerald Combs (gerald.combs) wrote on 2017-03-13:

#14

This appears to be fixed upstream via https://github.com/Icinga/icinga2/issues/4184

Gerald Combs (gerald.combs) wrote on 2017-03-13:

#15

Oops - please disregard comment #14 - it's specific to Icinga.

Ian Gibbs (realflash-uk) wrote on 2017-03-22:

#16

Since check_all_disks is internally defined in Nagios, you might well see a "duplicate definition" error if you define your own check_all_disks command. I'd recommend

define command{
command_name check_all_physical_disks
command_line /usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e -A --exclude-type=tracefs --exclude-type=cgroup --exclude_device=/run/lxcfs/controllers
}

instead, and then call that in your host definition:

define service {
use generic-service
hostgroup_name all
service_description Disk Space
check_command check_all_physical_disks!6%!4%
}
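
To spot the "duplicate definition" situation before reloading Nagios, the command names in a config directory can be scanned for repeats. This is a sketch under the assumption that the definitions live under /etc/nagios-plugins/config, as in comment #10. find_duplicate_commands is a hypothetical helper name.

```shell
#!/bin/sh
# Print every command_name that is defined more than once below a directory.
# Commented-out definitions (lines starting with "#") are ignored.
find_duplicate_commands() {
    config_dir="${1:-/etc/nagios-plugins/config}"
    grep -rhE '^[[:space:]]*command_name[[:space:]]+' "$config_dir" 2>/dev/null |
        awk '{ print $2 }' | sort | uniq -d
}
```

An empty result means no command name is defined twice in that directory. Definitions kept elsewhere are not covered by this scan.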

Alvaro Uria (aluria) wrote on 2018-04-13:

#17

This is also affecting a confined xenial LXC environment, and was fixed by adding "--exclude-type=tracefs" to the check_all_disks command definition in /etc/nagios-plugins/config/disk.cfg.

monitoring-plugins-basic should be updated with the above.

tags:

added: canonical-bootstack

Ramon Grullon (rgrullon) wrote on 2019-07-09:

#18

I'm currently experiencing this issue at a customer site. The nagios user can't access this directory: the mount point is owned by root and its permissions are set to 700.
ubuntu@XXXXXXnagios-1:/snap/core/7270$ /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

ubuntu@XXXXXXnagios-1:/snap/core/7270$ sudo ls -ld /sys/kernel/debug/tracing
drwx------ 8 root root 0 May 9 11:22 /sys/kernel/debug/tracing

ubuntu@XXXXXXXXnagios-1:/snap/core/7270$ mount | grep /sys/kernel/debug/tracing
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)

Bryce Harrington (bryce) wrote on 2019-07-11:

#19

This looks similar to https://bugs.launchpad.net/ubuntu/+source/monitoring-plugins/+bug/1827159.

However, installing nagios-plugins in a fresh Xenial LXC container does not appear sufficient to reproduce the bug:

1. There is no /sys/kernel/debug/tracing present on the system. Installing perf-tools-unstable caused the directory to be created.
2. There is not a nagios user on the system. I created this manually, but wonder if there is some third component that should be installed, that would create this?
3. The directory in question is owned by 'nobody':
    root@triage-xenial:~# ls -l /sys/kernel/debug/tracing
    ls: cannot access '/sys/kernel/debug/tracing': Permission denied
    root@triage-xenial:~# ls -l /sys/kernel
    (...)
    drwx------ 36 nobody nogroup 0 Jul 10 23:10 debug
    (...)

It would be quite helpful to have a step-by-step test case that can be invoked in a Xenial lxc container.

Has anyone checked that this same issue affects bionic or newer, or is Xenial-specific?

Changed in nagios-plugins (Ubuntu):
status:	Triaged → Incomplete

Ramon Grullon (rgrullon) wrote on 2019-07-16:

#20

This alert is triggered by running sosreport on this node. Nagios cannot access the directory, which is good, as it is only accessible by root:
# ll /sys/kernel/debug/ | grep trac
drwx------ 8 root root 0 Jul 9 09:06 tracing/

To validate/replicate this behaviour, please open three terminals.

Terminal 1:
mount | grep -i tracing
(no output here, as this directory is generally not listed by mount)

Terminal 2:
mkdir testing; cd testing
while true; do mount | grep -i tracing > mounted-$(date +%s); done

Terminal 1 again:
cd testing
watch ls -lt

Terminal 3:
sudo sosreport -a --all-logs

Watch Terminal 1. At the beginning, you will see files created with a size of 0. When you start sosreport in another window/tmux, the new files get populated, meaning mount can now see the tracing directory.
Why does it become available? sosreport gathers diagnostic information and triggers the mount while doing so. This directory is not visible to mount during normal operation.
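
The polling loop in the test case above writes one file per iteration. A variant that only logs state changes might look like the sketch below. tracefs_mounted and watch_tracefs are hypothetical helper names. The one-second interval is an assumption and could miss a very short-lived mount.

```shell
#!/bin/sh
# Succeed if a mounts-format file (default: /proc/self/mounts) lists a
# tracefs mount.
tracefs_mounted() {
    awk '$3 == "tracefs" { found = 1 } END { exit !found }' "${1:-/proc/self/mounts}"
}

# Poll once per second and print a timestamped line on every state change.
# Runs until interrupted with Ctrl-C.
watch_tracefs() {
    previous=""
    while :; do
        if tracefs_mounted; then current=mounted; else current=absent; fi
        if [ "$current" != "$previous" ]; then
            echo "$(date +%s) tracefs $current"
            previous=$current
        fi
        sleep 1
    done
}
```

Running watch_tracefs in one terminal and sosreport in another should show the moment the tracing mount appears.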

Bryce Harrington (bryce) on 2019-07-17

tags:

added: server-next

Bryce Harrington (bryce) wrote on 2019-07-17:

#21

Attachment: Screenshot from 2019-07-17 14-19-10.png (477.3 KiB, image/png)

Ramon, thank you for the detailed test case. I was able to run through it exactly as you described, both as root (see attached) and as nagios (with sudo set up). I suspect I'm unable to reproduce the issue you're seeing because under lxc the /sys/kernel/debug directory belongs to the host and is thus owned by nobody:nogroup (although I should think that would still produce a permission denied error).

From the host:
# mount | grep tracing
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)

In any case, regarding the bug itself, I am able to detect the permissions error:

# /usr/lib/nagios/plugins/check_disk -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

# ls -la /sys/kernel/debug/tracing
ls: cannot access '/sys/kernel/debug/tracing': Permission denied

# /usr/lib/nagios/plugins/check_disk -e -X tracefs
DISK OK| /=44420MB;;;0;3754403 /dev=0MB;;;0;0 /dev/full=0MB;;;0;16018 /dev/null=0MB;;;0;16018 /dev/random=0MB;;;0;16018 /dev/tty=0MB;;;0;16018 /dev/urandom=0MB;;;0;16018 /dev/zero=0MB;;;0;16018 /dev/fuse=0MB;;;0;16018 /dev/net/tun=0MB;;;0;16018 /dev/lxd=0MB;;;0;0 /dev/.lxd-mounts=0MB;;;0;0 /dev/shm=0MB;;;0;16041 /run=16MB;;;0;16041 /run/lock=0MB;;;0;5 /sys/fs/cgroup=0MB;;;0;16041 /var/lib/lxd/shmounts=0MB;;;0;0 /var/lib/lxd/devlxd=0MB;;;0;0 /run/user/1001=0MB;;;0;3208

The suggestion in comment #16 looks like the best approach for addressing the issue so far. Alternatively, I posted a patch to LP #1827159 for altering check_disk itself; however, as mentioned in comment #9 on this bug, excluding all tmpfs would be too broad.

Ramon, if you can test out the approach outlined in comment #16 and let me know whether it suits your use case, perhaps we should proceed with implementing an SRU for that.
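
When testing an exclusion, it is useful to confirm which mount points are still being checked. The performance data after the "|" in the DISK OK output lists them. The sketch below extracts label and used-space pairs. perfdata_labels is a hypothetical helper name, and labels containing spaces (which plugins quote) are not handled.

```shell
#!/bin/sh
# Print "label value" for each perfdata item in a plugin output line.
# Perfdata items look like: label=value;warn;crit;min;max
perfdata_labels() {
    # Everything after the first "|" is performance data
    printf '%s\n' "$1" | sed -n 's/^[^|]*|//p' | tr ' ' '\n' |
        awk -F= 'NF == 2 { split($2, f, ";"); print $1, f[1] }'
}

# Usage:
#   perfdata_labels "$(/usr/lib/nagios/plugins/check_disk -e -X tracefs)"
```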

Bryce Harrington (bryce) wrote on 2019-07-17:

#22

Meanwhile, I've verified that the issue seems relevant for newer Ubuntu releases too:

### Bionic
# /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

### Eoan
# /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

The issue was reported to Debian, but I don't think any action was taken on it:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910267

Upstream also has a recommendation to exclude tracefs for this issue:
http://www.dailyithelp.com/nagios-disk-critical-syskerneldebugtracing-is-not-accessible-permission/