Update NRPE checks to separate "site unreachable" and "nagios not responding correctly" alerts.
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Thruk External Agent Charm | Won't Fix | Wishlist | Unassigned |
Bug Description
We have found that the NRPE checks made against remote thruk agents can produce two or three different kinds of response. Because these are combined into one check, a misbehaving nagios unit can be mistaken for a temporary VPN access issue, which leads operators to misjudge the severity of the alert.
I would suggest having two different checks to better inform operators of the severity and type of alert. One would be a reachability check, and the other would be a content check.
Currently the check tests reachability and applies warning/critical thresholds to the response time. It returns Critical on output such as:
"CRITICAL - Socket timeout after 16 seconds"
or
"No route to host" (return code triggers critical, but no critical text in the check's output)
If reachability is okay, the same check then tests the page content, and can fail with output such as:
HTTP CRITICAL: HTTP/1.1 200 OK - string 'nagios_pid' not found on 'https:/
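To illustrate the distinction, here is a minimal Python sketch that sorts the outputs quoted above into "unreachable" and "content" failures. It is not part of the charm: the function name `classify`, the category names, and the regular expressions are all hypothetical and are derived only from the three example messages in this report, so real check output may need further patterns.

```python
import re

# Hypothetical category labels, used for illustration only.
OK = "ok"
UNREACHABLE = "unreachable"
CONTENT = "content"
UNKNOWN = "unknown"

# Patterns derived from the example outputs quoted in this report.
# Assumption: other failure messages may exist and would need adding.
_UNREACHABLE_PATTERNS = (
    re.compile(r"Socket timeout after \d+ seconds"),
    re.compile(r"No route to host"),
)
_CONTENT_PATTERN = re.compile(r"string '[^']*' not found")


def classify(output: str, return_code: int) -> str:
    """Classify a check result as ok, unreachable, content, or unknown.

    return_code follows the Nagios plugin convention, where 0 means OK
    and any other value is a warning, critical, or unknown state.
    """
    if return_code == 0:
        return OK
    # Network-level failures: the site could not be reached at all.
    if any(p.search(output) for p in _UNREACHABLE_PATTERNS):
        return UNREACHABLE
    # The site answered, but the expected string was missing.
    if _CONTENT_PATTERN.search(output):
        return CONTENT
    return UNKNOWN
```

With a split like this, an "unreachable" result could be routed as a connectivity alert, while a "content" result points at the nagios unit itself.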
Here is a canonical internal reference for thruk status of such checks:
https:/
I'd suggest that we run reachability checks as "check_
Changed in charm-thruk-external-agent: | |
importance: | Undecided → Wishlist |
This charm is no longer being actively maintained.