check_disk plugin broken after upgrade to 15.10

Bug #1516451 reported by Ralf G. R. Bergs
This bug affects 35 people
Affects                  Status   Importance  Assigned to       Milestone
nagios-plugins (Ubuntu)  Triaged  High        Bryce Harrington

Bug Description

I didn't touch my Nagios config; I just updated my system from 15.04 to 15.10. Suddenly the default localhost/Disk Space check fails with the following output:

DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

This can be reproduced when manually running the underlying command as user "nagios":

$ /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

When I run it as root it works:

# /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK OK| /dev=0MB;1186;1334;0;1483 /run=8MB;239;269;0;299 /=17157MB;57386;64559;0;71733 /dev/shm=0MB;1199;1349;0;1499 /run/lock=0MB;4;4;0;5 /sys/fs/cgroup=0MB;1199;1349;0;1499 /boot=48MB;181;204;0;227 /run/user/0=0MB;239;269;0;299

It seems the "nagios" user can't access the directory that check_disk tries to read:

# ls -la /sys/kernel/debug/tracing
drwx------ 7 root root 0 Nov 15 19:40 .

# lsb_release -rd
Description: Ubuntu 15.10
Release: 15.10

# apt-cache policy nagios-plugins-basic
nagios-plugins-basic:
  Installed: 1.5-3ubuntu1
  Candidate: 1.5-3ubuntu1
  Version table:
 *** 1.5-3ubuntu1 0
        500 http://de.archive.ubuntu.com/ubuntu/ wily/main amd64 Packages
        100 /var/lib/dpkg/status

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nagios-plugins (Ubuntu):
status: New → Confirmed
Ben Coleman (oloryn) wrote :

Also note that while 15.10 does change the permissions on /sys/kernel/debug/tracing (from drwxr-xr-x in 15.04 to drwx------ in 15.10), the permissions on /sys/kernel/debug are drwx------ on both 15.10 and 15.04 - which means that /sys/kernel/debug shouldn't be readable from a non-root account on either release, so this looks like a code change in check_disk.
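
Since every directory on the way to a mount point must be traversable, it can help to look at the permissions of each ancestor in turn. The following is a minimal sketch; `path_ancestors` is a hypothetical helper name, not part of nagios-plugins.

```shell
#!/bin/sh
# path_ancestors PATH
# Print every ancestor directory of an absolute PATH, from the first
# component down to PATH itself, one per line.
path_ancestors() {
    rest=${1#/}
    current=
    while [ -n "$rest" ]; do
        current="$current/${rest%%/*}"
        echo "$current"
        case $rest in
            */*) rest=${rest#*/} ;;
            *)   rest= ;;
        esac
    done
}

# Example (run as the nagios user to see where access stops):
#   path_ancestors /sys/kernel/debug/tracing | xargs ls -ld
```

Running the example as root and again as nagios shows which component of the path is the first to deny access.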

Brian Morton (rokclimb15) wrote :

strace confirms that check_disk on 12.04 doesn't check /sys/kernel/debug/tracing

I'm not having any luck tracking down a code change in the monitoring-plugins GitHub repo. I wonder if this is a change in a dependent library instead.

Here's a workaround

sudo chown root:root /usr/lib/nagios/plugins/check_disk
sudo chmod u+s /usr/lib/nagios/plugins/check_disk
sudo chmod o+x /usr/lib/nagios/plugins/check_disk
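
To confirm whether this workaround is in effect on a given host (or has been reverted), the set-user-ID bit can be tested directly. A minimal sketch; `report_setuid` is a hypothetical helper name.

```shell
#!/bin/sh
# report_setuid FILE
# Print whether FILE currently has the set-user-ID bit set.
report_setuid() {
    if [ -u "$1" ]; then
        echo "setuid: $1"
    else
        echo "not setuid: $1"
    fi
}

# Example:
#   report_setuid /usr/lib/nagios/plugins/check_disk
# To revert the workaround:
#   sudo chmod u-s /usr/lib/nagios/plugins/check_disk
```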

Brian Morton (rokclimb15) wrote :

I suspect there isn't a code change here, but rather a difference in the way Ubuntu is presenting its mount points. The plugin tries to enumerate and check all mounts. A better approach might be to pass the actual mount points to be monitored with -p:

/usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e -p / -p /var -p /boot

12.04:
mount
<snip>
none on /sys/kernel/debug type debugfs (rw)
<snip>

14.04:

debugfs on /sys/kernel/debug type debugfs (rw,relatime)
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)
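
To compare releases more systematically than by reading mount output, the filesystem types in the mount table can be counted. A minimal sketch, assuming the table is in /proc/self/mounts format (device, mount point, type, options); `fstype_counts` is a hypothetical helper name.

```shell
#!/bin/sh
# fstype_counts [MOUNTS_FILE]
# Print "COUNT TYPE" for each filesystem type found in a mount table,
# sorted by type name. Defaults to the live table of this process.
fstype_counts() {
    awk '{ n[$3]++ } END { for (t in n) print n[t], t }' \
        "${1:-/proc/self/mounts}" | sort -k2
}

# Example:
#   fstype_counts
```

A type such as tracefs that appears on one release but not the other is a candidate for exclusion.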

Robie Basak (racb) wrote :

Thank you for taking the time to report and investigate this bug and helping to make Ubuntu better.

It sounds to me like check_disk should have a blacklist of filesystem types to ignore. But explicitly specifying which mount points looks like a suitable workaround.

I wonder if this affects monitoring-plugins in Xenial?

Changed in nagios-plugins (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged
Gabriele Tozzi (gabriele-tozzi) wrote :

You can use the --exclude-type option to work around this bug:

/usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs

Harald Hannelius (harald-arcada) wrote :

The same probably applies to gvfs-fuse filesystems. I recommend:

--exclude-type=tracefs --exclude-type=fuse.gvfsd-fuse
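
When the list of types to skip grows, the options can be generated from a plain list instead of being typed out. A minimal sketch; `exclude_type_args` is a hypothetical helper name, and only the `--exclude-type` option already shown in this report is used.

```shell
#!/bin/sh
# exclude_type_args TYPE...
# Print one --exclude-type=TYPE option per line for each argument.
exclude_type_args() {
    for fstype in "$@"; do
        printf '%s\n' "--exclude-type=$fstype"
    done
}

# Example (the command substitution is left unquoted on purpose so that
# each option becomes a separate argument):
#   /usr/lib/nagios/plugins/check_disk -e $(exclude_type_args tracefs fuse.gvfsd-fuse)
```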

Darragh Grealish (grealish) wrote :

This is also broken in Ubuntu 16.04; however, the workaround mentioned above works:
/usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs

TomaszChmielewski (mangoo-wpkg) wrote :

The workaround is not really great when LXD/LXC is in use:

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs
DISK CRITICAL - /run/lxcfs/controllers is not accessible: Permission denied

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=cgroup
DISK CRITICAL - /run/lxcfs/controllers is not accessible: Permission denied

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=tmpfs
DISK CRITICAL - /run/lxcfs/controllers/blkio is not accessible: Permission denied

So it only works when we exclude all three above, including tmpfs:

$ /usr/lib/nagios/plugins/check_disk -e --exclude-type=tracefs --exclude-type=cgroup --exclude-type=tmpfs

However, tmpfs is very often used for /tmp, /dev/shm, which are also important to monitor - and --exclude-type=tmpfs makes the check skip these mountpoints.
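
Before excluding a whole filesystem type, it is worth listing which mount points would disappear from the check. A minimal sketch, assuming a mount table in /proc/self/mounts format; `mounts_of_type` is a hypothetical helper name.

```shell
#!/bin/sh
# mounts_of_type TYPE [MOUNTS_FILE]
# Print the mount point of every entry whose filesystem type is TYPE.
# Note: mount points containing spaces appear escaped (\040) in
# /proc/self/mounts and are printed in that escaped form.
mounts_of_type() {
    awk -v t="$1" '$3 == t { print $2 }' "${2:-/proc/self/mounts}"
}

# Example: everything that --exclude-type=tmpfs would skip
#   mounts_of_type tmpfs
```

If /tmp or /dev/shm shows up in the output, excluding the type will also stop monitoring those mount points.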

Nicholas Sherlock (n-sherlock) wrote :

Rather than excluding tmpfs, just exclude /run/lxcfs/controllers. This is the check_all_disks command I'm now using in my /etc/nagios-plugins/config/disk.cfg:

# 'check_all_disks' command definition
define command{
    command_name check_all_disks
    command_line /usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e -A --exclude-type=tracefs --exclude-type=cgroup --exclude_device=/run/lxcfs/controllers
}

Danny Howard (dannyman) wrote :

We normally netboot, but we had a few machines that would not PXE boot, so we installed 14.04 from installation media. Afterwards, some of the media-installed machines were throwing this error in Nagios. I found this line in /etc/mtab on the afflicted hosts:

tracefs /var/lib/ureadahead/debugfs/tracing tracefs rw,relatime 0 0

This appears to be an artifact of the media-based install process. I removed the above line and ran:

sudo service nagios-nrpe-server restart

Error condition cleared.

Danny Howard (dannyman) wrote :

Possibly related to #499773, which is about the installer adding spurious entries to mtab.

Marius Gedminas (mgedmin) wrote :

These days /etc/mtab is a symlink to /proc/self/mounts, so you cannot control what is exposed there.
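
Whether a particular host still has an editable /etc/mtab can be checked quickly. A minimal sketch; `describe_mtab` is a hypothetical helper name.

```shell
#!/bin/sh
# describe_mtab [PATH]
# Report whether PATH (default /etc/mtab) is a symlink and, if so,
# print its target exactly as stored in the link.
describe_mtab() {
    path=${1:-/etc/mtab}
    if [ -L "$path" ]; then
        echo "$path is a symlink to $(readlink "$path")"
    elif [ -e "$path" ]; then
        echo "$path is not a symlink"
    else
        echo "$path does not exist"
    fi
}

# Example:
#   describe_mtab
```

On a host where the path is a symlink into /proc, editing it as described in the earlier comment is not possible.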

Gerald Combs (gerald.combs) wrote :

This appears to be fixed upstream via https://github.com/Icinga/icinga2/issues/4184

Gerald Combs (gerald.combs) wrote :

Oops - please disregard comment #14 - it's specific to Icinga.

Ian Gibbs (realflash-uk) wrote :

Since check_all_disks is internally defined in Nagios, you might well see a "duplicate definition" error if you define your own check_all_disks command. I'd recommend

define command{
    command_name check_all_physical_disks
    command_line /usr/lib/nagios/plugins/check_disk -w '$ARG1$' -c '$ARG2$' -e -A --exclude-type=tracefs --exclude-type=cgroup --exclude_device=/run/lxcfs/controllers
}

instead, and then call that in your host definition:

define service {
 use generic-service
 hostgroup_name all
 service_description Disk Space
 check_command check_all_physical_disks!6%!4%
}
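
To see what command line results from a check_command such as the one above, the `$ARGn$` substitution can be imitated by hand. This is a simplified sketch of how Nagios fills `$ARG1$`, `$ARG2$`, ... from the `!`-separated values; it does not handle escaped `!` characters or any other macros, and `expand_check_command` is a hypothetical helper name.

```shell
#!/bin/sh
# expand_check_command TEMPLATE CHECK_COMMAND
# TEMPLATE is a command_line containing $ARGn$ macros; CHECK_COMMAND is
# "name!arg1!arg2...". Prints TEMPLATE with each $ARGn$ replaced.
expand_check_command() {
    awk -v tpl="$1" -v spec="$2" 'BEGIN {
        n = split(spec, part, "!")      # part[1] is the command name
        for (i = 2; i <= n; i++) {
            macro = "$ARG" (i - 1) "$"
            # Literal (non-regex) replacement of every occurrence.
            while ((pos = index(tpl, macro)) > 0)
                tpl = substr(tpl, 1, pos - 1) part[i] \
                      substr(tpl, pos + length(macro))
        }
        print tpl
    }'
}

# Example:
#   expand_check_command 'check_disk -w $ARG1$ -c $ARG2$ -e' \
#       'check_all_physical_disks!6%!4%'
```

The expanded line can then be run manually as the nagios user to test the thresholds before reloading Nagios.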

Alvaro Uria (aluria) wrote :

This is also affecting a confined xenial LXC environment, and was fixed by adding "--exclude-type=tracefs" on the check_all_disks command definition at /etc/nagios-plugins/config/disk.cfg

monitoring-plugins-basic should be updated with the above.

tags: added: canonical-bootstack
Ramon Grullon (rgrullon) wrote :

We are currently experiencing this issue at a customer site. The nagios user can't access this directory: the mount point is owned by root and its permissions are set to 700.
ubuntu@XXXXXXnagios-1:/snap/core/7270$ /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

ubuntu@XXXXXXnagios-1:/snap/core/7270$ sudo ls -ld /sys/kernel/debug/tracing
drwx------ 8 root root 0 May 9 11:22 /sys/kernel/debug/tracing

ubuntu@XXXXXXXXnagios-1:/snap/core/7270$ mount | grep /sys/kernel/debug/tracing
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)

Bryce Harrington (bryce) wrote :

This looks similar to https://bugs.launchpad.net/ubuntu/+source/monitoring-plugins/+bug/1827159.

However, installing nagios-plugins in a fresh Xenial LXC container does not appear sufficient to reproduce the bug:

1. There is no /sys/kernel/debug/tracing present on the system. Installing perf-tools-unstable caused the directory to be created.
2. There is not a nagios user on the system. I created this manually, but wonder if there is some third component that should be installed, that would create this?
3. The directory in question is owned by 'nobody':
    root@triage-xenial:~# ls -l /sys/kernel/debug/tracing
    ls: cannot access '/sys/kernel/debug/tracing': Permission denied
    root@triage-xenial:~# ls -l /sys/kernel
    (...)
    drwx------ 36 nobody nogroup 0 Jul 10 23:10 debug
    (...)

It would be quite helpful to have a step-by-step test case that can be invoked in a Xenial lxc container.

Has anyone checked whether this issue also affects Bionic or newer, or whether it is Xenial-specific?

Changed in nagios-plugins (Ubuntu):
status: Triaged → Incomplete
Ramon Grullon (rgrullon) wrote :

This alert is triggered by running sosreport on this node. Nagios cannot access the directory, which is expected, since it is accessible only by root:
# ll /sys/kernel/debug/ | grep trac
drwx------ 8 root root 0 Jul 9 09:06 tracing/

To validate/replicate this behaviour, please open three terminals.

Terminal 1, run
mount | grep -i tracing
- no output here as this directory is generally not presented by mount

Terminal 2:
mkdir testing; cd testing
while true;do mount | grep -i tracing > mounted-$(date +%s); done

Back on Terminal 1, run:
cd testing
watch ls -lt

Terminal 3
sudo sosreport -a --all-logs

Watch Terminal 1. At the beginning, you will see files created with a size of 0. Once you start sosreport in another window/tmux, the new files are no longer empty, meaning mount can now see the tracing mount point.
Why does it become available? Sosreport gathers diagnostic information and, in doing so, causes the directory to be mounted. It is not visible in the mount output during normal operation.
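
The three-terminal procedure above can also be approximated with a single polling loop that prints a timestamped line per sample instead of creating one file per iteration. A minimal sketch, assuming a mount table in /proc/self/mounts format; `has_tracefs` and `poll_tracefs` are hypothetical helper names.

```shell
#!/bin/sh
# has_tracefs
# Succeeds if the mount table on stdin contains an entry of type tracefs.
has_tracefs() {
    awk '$3 == "tracefs" { found = 1 } END { exit !found }'
}

# poll_tracefs COUNT INTERVAL [MOUNTS_FILE]
# Sample the mount table COUNT times, INTERVAL seconds apart, and print
# a timestamped line saying whether a tracefs mount was present.
poll_tracefs() {
    count=$1
    interval=$2
    file=${3:-/proc/self/mounts}
    i=0
    while [ "$i" -lt "$count" ]; do
        if has_tracefs < "$file"; then
            echo "$(date +%s) tracefs mounted"
        else
            echo "$(date +%s) tracefs absent"
        fi
        i=$((i + 1))
        sleep "$interval"
    done
}

# Example: sample once per second for five minutes while sosreport runs
# in another terminal:
#   poll_tracefs 300 1
```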

Bryce Harrington (bryce)
tags: added: server-next
Bryce Harrington (bryce) wrote :

Ramon, thank you for the detailed test case, I was able to run through it exactly as you described, both as root user (see attached) and as nagios (with sudo setup). I suspect I'm unable to reproduce the issue you're seeing since under lxc the /sys/kernel/debug directory belongs to the host and thus is owned by nobody:nogroup, (although I should think that it would produce a permission denied error.)

From the host:
# mount | grep tracing
tracefs on /sys/kernel/debug/tracing type tracefs (rw,relatime)

In any case, regarding the bug itself, I am able to detect the permissions error:

# /usr/lib/nagios/plugins/check_disk -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

# ls -la /sys/kernel/debug/tracing
ls: cannot access '/sys/kernel/debug/tracing': Permission denied

# /usr/lib/nagios/plugins/check_disk -e -X tracefs
DISK OK| /=44420MB;;;0;3754403 /dev=0MB;;;0;0 /dev/full=0MB;;;0;16018 /dev/null=0MB;;;0;16018 /dev/random=0MB;;;0;16018 /dev/tty=0MB;;;0;16018 /dev/urandom=0MB;;;0;16018 /dev/zero=0MB;;;0;16018 /dev/fuse=0MB;;;0;16018 /dev/net/tun=0MB;;;0;16018 /dev/lxd=0MB;;;0;0 /dev/.lxd-mounts=0MB;;;0;0 /dev/shm=0MB;;;0;16041 /run=16MB;;;0;16041 /run/lock=0MB;;;0;5 /sys/fs/cgroup=0MB;;;0;16041 /var/lib/lxd/shmounts=0MB;;;0;0 /var/lib/lxd/devlxd=0MB;;;0;0 /run/user/1001=0MB;;;0;3208

The suggestion in comment #16 looks like the best approach for addressing the issue so far. Alternatively, I posted a patch to LP #1827159 for altering check_disk itself, however as mentioned in comment #9 on this bug, excluding all tmpfs would be too broad.

Ramon, if you can test out the approach outlined in comment #16 and let me know if it seems suitable for your use case, perhaps we should proceed with implementing an SRU for that.

Bryce Harrington (bryce) wrote :

Meanwhile, I've verified that the issue seems relevant for newer Ubuntu releases too:

### Bionic
# /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

### Eoan
# /usr/lib/nagios/plugins/check_disk -w '20%' -c '10%' -e
DISK CRITICAL - /sys/kernel/debug/tracing is not accessible: Permission denied

The issue was reported to Debian, but I don't think any action was taken on it:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=910267

Upstream also has a recommendation to exclude tracefs for this issue:
http://www.dailyithelp.com/nagios-disk-critical-syskerneldebugtracing-is-not-accessible-permission/

Changed in nagios-plugins (Ubuntu):
status: Incomplete → Triaged
importance: Medium → High
Changed in nagios-plugins (Ubuntu):
assignee: nobody → Bryce Harrington (bryce)
tags: removed: server-next
Christian Ehrhardt  (paelzer) wrote :

This also applies to squashfs; I have marked the related bugs as duplicates.

tags: added: server-next