NRPE check reports "OK: ceph-osd@xx service is running" when one of the N services is not running
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Ceph OSD Charm | Triaged | Wishlist | Unassigned |
Bug Description
The "Service Status Details" page in Nagios shows only the first line of the NRPE script response. When one of the OSDs has failed (or has been intentionally shut down, as in my case), this can cause confusion: an operator could open Nagios, see 'CRITICAL: OK: ceph-osd@30.service is running', and treat it as a false positive, while in reality the full message looks like this:
Current Status: CRITICAL (for 0d 2h 50m 49s)
Status Information: OK: ceph-osd@30.service is running
OK: ceph-osd@36.service is running
OK: ceph-osd@41.service is running
OK: ceph-osd@47.service is running
OK: ceph-osd@51.service is running
OK: ceph-osd@55.service is running
Failed: check command raised: CRITICAL: ceph-osd@58.service is not running
Failed: check command raised: CRITICAL: ceph-osd@60.service is not running
Can we reorder these messages so that the "Failed" lines come first and are clearly visible in the Nagios error preview?
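To illustrate the requested ordering, here is a minimal Python sketch. It is not the charm's actual check code; the function name `reorder_check_output` is hypothetical, and it assumes that every successful result line starts with "OK:" as in the output pasted above.

```python
def reorder_check_output(lines):
    """Return the check's result lines with failures ahead of OK lines.

    Assumption: a line reports success only if it starts with "OK:";
    anything else (e.g. "Failed: ...") is treated as a failure.
    """
    failed = [ln for ln in lines if not ln.startswith("OK:")]
    ok = [ln for ln in lines if ln.startswith("OK:")]
    # Relative order inside each group is preserved.
    return failed + ok


if __name__ == "__main__":
    sample = [
        "OK: ceph-osd@55.service is running",
        "Failed: check command raised: CRITICAL: ceph-osd@58.service is not running",
    ]
    print("\n".join(reorder_check_output(sample)))
```

With this ordering, the first line that Nagios displays would be a "Failed" line whenever at least one unit is down.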
Changed in charm-ceph-osd:
importance: Undecided → Wishlist
status: New → Triaged
Note that the standard for writing Nagios plugins is that the output consists of a single line, so this check does not comply with what Nagios expects.
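One possible way to get a single-line result is to collapse the per-unit lines into one summary, sketched below in Python. This is an assumption-laden illustration, not the charm's implementation: the function name `summarize_osd_status`, the summary wording, and the unit-name pattern are all hypothetical. The exit codes follow the usual Nagios plugin convention (0 for OK, 2 for CRITICAL).

```python
import re

# Assumption: unit names look like "ceph-osd@<id>.service", as in the report.
UNIT_RE = re.compile(r"ceph-osd@\d+\.service")


def summarize_osd_status(lines):
    """Collapse per-unit result lines into (exit_code, one_line_message)."""
    failed = []
    for ln in lines:
        if ln.startswith("OK:"):
            continue
        match = UNIT_RE.search(ln)
        # Fall back to the raw line if no unit name can be extracted.
        failed.append(match.group(0) if match else ln)
    total = len(lines)
    if failed:
        message = "CRITICAL: {} of {} ceph-osd services not running: {}".format(
            len(failed), total, ", ".join(failed)
        )
        return 2, message
    return 0, "OK: all {} ceph-osd services are running".format(total)
```

For the output quoted in the description, this would yield a single CRITICAL line naming ceph-osd@58.service and ceph-osd@60.service.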