nagios user can't access /var/lib/ceph - check silently broken

Bug #1810749 reported by Andrea Ieri
This bug affects 2 people
Affects                              Status        Importance  Assigned to    Milestone
Ceph OSD Charm                       Fix Released  High        Alex Kavanagh
ceph-osd (Juju Charms Collection)    Invalid       Undecided   Unassigned

Bug Description

The ceph-osd nagios check is currently broken, because /var/lib/ceph has perms set to 750 and `cat /var/lib/ceph/osd/ceph-*/whoami` fails as user nagios.

Furthermore, as already mentioned in LP#1749417, the final `exit 2` makes nagios think that the result of the check is actually ok, so it is only by chance that anyone would notice the check is broken.

A quick fix would be to cat via the mountpoints instead, as either the directories or the devices are listed in the osd-devices option, but perhaps there's a more robust solution.
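
As an illustration of the failure mode described above, here is a minimal sketch of how a check could fail loudly instead of silently when the whoami files cannot be read. It is not the charm's code: the function name `list_osd_ids` is hypothetical, and it relies only on the standard nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/usr/bin/env bash
# Sketch only: list_osd_ids is a hypothetical helper, not part of the charm.

# Print the id of every OSD found under the given directory, one per line.
# If a whoami file is unreadable (or the glob cannot be expanded because a
# parent directory is not accessible), report UNKNOWN and return 3 rather
# than producing empty output.
list_osd_ids() {
    local osd_dir="$1"
    local whoami
    for whoami in "$osd_dir"/ceph-*/whoami; do
        if [ ! -r "$whoami" ]; then
            echo "UNKNOWN: cannot read $whoami as user $(id -un)"
            return 3
        fi
        cat "$whoami"
    done
}

# Example invocation (commented out so that sourcing this file has no effect):
# list_osd_ids /var/lib/ceph/osd || exit $?
```

When the nagios user cannot traverse /var/lib/ceph, the glob stays unexpanded, the readability test fails, and the caller gets an explicit UNKNOWN result rather than an empty one.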

Revision history for this message
Andrea Ieri (aieri) wrote :

Subscribing field-medium: ceph-mon checks will alert if an osd is down, but having the ceph-osd check return OK when it's actually not working is dangerous.

Changed in ceph-osd (Juju Charms Collection):
status: New → Confirmed
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

The directory is owned by ceph:ceph, so probably the logical thing to do is to add nagios to the ceph group so that it can read the directories with ceph group permissions. This will maintain the permissions on the directory.

Changed in ceph-osd (Juju Charms Collection):
assignee: nobody → Alex Kavanagh (ajkavanagh)
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

So the actual charm is fine; it runs as root and so can access the relevant directories. However, the actual /usr/sbin/nrpe code runs as nagios:

# ps aux | grep nrpe
nagios 32690 0.0 0.3 26548 4836 ? Ss 12:46 0:00 /usr/sbin/nrpe -c /etc/nagios/nrpe.cfg -f

Using the useful nrpe-runner (https://raw.githubusercontent.com/deanwilson/sysadmin-scripts/master/nrpe-runner), I've reproduced it on a ceph-osd unit:

root@juju-b3e57d-zaza-d1bb6933de32-3:/etc/nagios/nrpe.d# sudo -u nagios nrpe-runner -a -s -d .
/bin/cat: '/var/lib/ceph/osd/ceph-*/whoami': Permission denied
check_swap_activity => OK - 0 kb swapped out in last 5 seconds | swapout_size=0KB;100;500;
check_conntrack => OK: conntrack table normal | current=107 max=65536 percent=0;80;90
check_mem => OK - 33.7% (513388 kB) used.|TOTAL=1524932KB;;;; USED=513388KB;1296192;1372438;; FREE=1011544KB;;;; CACHES=858384KB;;;; HUGEPAGES=0KB;;;;
check_load => OK - load average: 0.02, 0.04, 0.00|load1=0.020;8.000;16.000;0; load5=0.040;4.000;8.000;0; load15=0.000;2.000;4.000;0;
check_ceph-osd =>
check_disk_root => DISK OK - free space: / 12 GB (85% inode=95%);| /=2GB;10;11;0;14
check_swap => SWAP OK - 100% free (4095 MB out of 4095 MB) |swap=4095MB;1638;1023;0;4095
Ran 7 checks - OK 7. WARN 0, CRIT 0, UNKNOWN 0

Adding nagios to the ceph group:

usermod -a -G ceph nagios

root@juju-b3e57d-zaza-d1bb6933de32-3:/etc/nagios/nrpe.d# sudo -u nagios nrpe-runner -a -s -d .
check_swap_activity => OK - 0 kb swapped out in last 5 seconds | swapout_size=0KB;100;500;
check_conntrack => OK: conntrack table normal | current=107 max=65536 percent=0;80;90
check_mem => OK - 33.7% (514224 kB) used.|TOTAL=1524932KB;;;; USED=514224KB;1296192;1372438;; FREE=1010708KB;;;; CACHES=858608KB;;;; HUGEPAGES=0KB;;;;
check_load => OK - load average: 0.00, 0.01, 0.00|load1=0.000;8.000;16.000;0; load5=0.010;4.000;8.000;0; load15=0.000;2.000;4.000;0;
check_ceph-osd => OK: ceph-osd@0.service is running
OK: ceph-osd@1.service is running
OK: ceph-osd@6.service is running
check_disk_root => DISK OK - free space: / 12 GB (85% inode=95%);| /=2GB;10;11;0;14
check_swap => SWAP OK - 100% free (4095 MB out of 4095 MB) |swap=4095MB;1638;1023;0;4095
Ran 7 checks - OK 7. WARN 0, CRIT 0, UNKNOWN 0

Notice that the ceph-osd check now returns output. One solution would therefore be to add nagios to the ceph group so that it can read these directories. However, I need to discuss whether this would be a security issue and what alternatives may exist.

Changed in ceph-osd (Juju Charms Collection):
importance: Undecided → High
Changed in ceph-osd (Juju Charms Collection):
status: Confirmed → In Progress
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

After some discussion, it's a really bad idea to add nagios to the ceph group. An alternative strategy will be found which uses the ceph user to perform the check, and the nagios user to read the results of the check from a file that is not security-sensitive.
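
The two-user strategy outlined above might look roughly like the following sketch. It is an assumption-laden illustration, not the fix that was committed: the function names, the `<osd-id> <state>` file format, the 600 second staleness limit and the use of `systemctl is-active` as the probe are all hypothetical choices, and `stat -c %Y` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: names, file format and thresholds are assumptions.

# Default probe: print the systemd state of one OSD unit ("active", "inactive", ...).
osd_state() {
    systemctl is-active "ceph-osd@$1.service" 2>/dev/null || true
}

# Privileged half: run as a user that can read the OSD directories, e.g. from cron.
# Writes one "<osd-id> <state>" line per OSD to a world-readable status file.
collect_osd_status() {
    local osd_dir="$1" out_file="$2" probe="${3:-osd_state}"
    local tmp whoami id
    tmp=$(mktemp "${out_file}.XXXXXX") || return 1
    for whoami in "$osd_dir"/ceph-*/whoami; do
        [ -r "$whoami" ] || continue
        id=$(cat "$whoami")
        printf '%s %s\n' "$id" "$("$probe" "$id")" >> "$tmp"
    done
    chmod 644 "$tmp"
    # Rename into place so the reader never sees a partially written file.
    mv "$tmp" "$out_file"
}

# Unprivileged half: run as nagios. Reads only the status file and maps it to
# nagios exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN).
check_status_file() {
    local file="$1" max_age="${2:-600}"
    local now mtime id state bad=0 lines=0
    if [ ! -r "$file" ]; then
        echo "UNKNOWN: cannot read $file"
        return 3
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")
    if [ $((now - mtime)) -gt "$max_age" ]; then
        echo "UNKNOWN: $file has not been updated for more than ${max_age}s"
        return 3
    fi
    while read -r id state; do
        lines=$((lines + 1))
        if [ "$state" = active ]; then
            echo "OK: ceph-osd@$id.service is running"
        else
            echo "CRITICAL: ceph-osd@$id.service is ${state:-unknown}"
            bad=1
        fi
    done < "$file"
    [ "$lines" -gt 0 ] || { echo "UNKNOWN: no OSDs recorded in $file"; return 3; }
    [ "$bad" -eq 0 ] || return 2
    return 0
}
```

The staleness test matters as much as the state test: if the privileged collector stops running, the nagios side reports UNKNOWN instead of repeating an old OK.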

Revision history for this message
Frode Nordahl (fnordahl) wrote :
Changed in ceph-osd (Juju Charms Collection):
status: In Progress → Fix Committed
Changed in charm-ceph-osd:
status: New → Fix Committed
importance: Undecided → High
assignee: nobody → Alex Kavanagh (ajkavanagh)
milestone: none → 19.04
Changed in ceph-osd (Juju Charms Collection):
status: Fix Committed → Invalid
assignee: Alex Kavanagh (ajkavanagh) → nobody
importance: High → Undecided
David Ames (thedac)
Changed in charm-ceph-osd:
status: Fix Committed → Fix Released