Update NRPE checks to separate "site unreachable" and "nagios not responding correctly" alerts.

Bug #1908430 reported by Drew Freiberger
This bug affects 1 person
Affects: Thruk External Agent Charm
Status: Won't Fix
Importance: Wishlist
Assigned to: Unassigned

Bug Description

We have found that the NRPE checks made against remote Thruk agents can fail in two or three distinct ways. Because they are all reported through a single check, a misbehaving Nagios unit can be mistaken for a temporary VPN access issue, and operators may underestimate the severity of the alert.

I would suggest having two different checks to better inform operators of the severity and type of alert. One would be a reachability check, and the other would be a content check.

Currently the check tests reachability, with warning/critical thresholds on response time, and returns CRITICAL on results such as:

"CRITICAL - Socket timeout after 16 seconds"
or
"No route to host" (the return code triggers CRITICAL, but the check's output contains no CRITICAL text)
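A minimal sketch of what a stand-alone reachability check could look like, in Python. It only opens a TCP connection, so a socket timeout or "No route to host" is always reported with explicit "site unreachable" text. It follows the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function names and the 5 s / 10 s thresholds are hypothetical; the 16 s timeout simply mirrors the output quoted above.

```python
import socket
import time

# Standard Nagios plugin exit codes, which NRPE passes back to Nagios/Thruk.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_reachability(elapsed, error, warn=5.0, crit=10.0):
    """Map a connection attempt to a (status, message) pair.

    elapsed:   seconds taken to connect, or None if the attempt failed.
    error:     the exception raised, or None on success.
    warn/crit: hypothetical response-time thresholds in seconds.
    """
    if error is not None:
        # Timeouts and "No route to host" both mean the site is unreachable,
        # and the message now says so instead of relying on the exit code alone.
        return CRITICAL, f"CRITICAL - site unreachable: {error}"
    if elapsed >= crit:
        return CRITICAL, f"CRITICAL - connected in {elapsed:.3f}s (>= {crit}s)"
    if elapsed >= warn:
        return WARNING, f"WARNING - connected in {elapsed:.3f}s (>= {warn}s)"
    return OK, f"OK - connected in {elapsed:.3f}s"


def probe(host, port=443, timeout=16.0):
    """Attempt a TCP connection and return (elapsed, error)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError as exc:  # socket timeouts and EHOSTUNREACH are both OSError
        return None, exc
    return time.monotonic() - start, None


def main(host, port=443):
    """Run the check once; the return value is meant to be the exit code."""
    status, message = classify_reachability(*probe(host, port))
    print(message)
    return status
```

NRPE only reads the first output line and the exit code, so a wrapper script would end with `sys.exit(main("remote-site.com"))`. Nothing here looks at the HTTP response; that is left to the content check.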

If the site is reachable, the content check can still fail, for example:

HTTP CRITICAL: HTTP/1.1 200 OK - string 'nagios_pid' not found on 'https://remote-site.com:443/thruk/cgi-bin/remote.cgi' - 419 bytes in 0.408 second response time
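The content check can then assume the host is reachable and judge only what came back. A sketch under the same assumptions (hypothetical function names, Python standard library only): a 200 response without the `nagios_pid` marker is reported as "Nagios not responding correctly" rather than as a connectivity problem. Connection errors are deliberately left uncaught so they surface through the reachability check instead.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes used by this check.
OK, CRITICAL = 0, 2


def classify_content(status_code, body, expected="nagios_pid"):
    """Decide whether a reachable Thruk endpoint returned usable content."""
    if status_code != 200:
        return CRITICAL, f"CRITICAL - Nagios not responding correctly: HTTP {status_code}"
    if expected not in body:
        # The site answered, but the Nagios data behind Thruk is missing.
        return CRITICAL, (
            f"CRITICAL - Nagios not responding correctly: string '{expected}' "
            f"not found ({len(body)} characters)"
        )
    return OK, f"OK - found '{expected}'"


def fetch(url, timeout=16.0):
    """Return (status_code, body_text) for a GET request.

    URLError and socket timeouts are not caught here: they indicate a
    reachability failure, which the other check is responsible for.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:
        # urlopen raises on 4xx/5xx; keep the code so it can be classified.
        return exc.code, exc.read().decode("utf-8", errors="replace")
```

For example, `classify_content(*fetch("https://remote-site.com:443/thruk/cgi-bin/remote.cgi"))` would reproduce the failure quoted above with wording that points at Nagios rather than at the network.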

Here is a Canonical-internal reference showing the Thruk status of such checks:
https://pastebin.canonical.com/p/Qtp6WbqgrT/

I'd suggest running the reachability check as "check_$(agent_name)_reachability" and the content check as "check_$(agent_name)_nagios_content".
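As a sketch, the two proposed checks could be registered in NRPE's `command[...]=` syntax roughly as follows. This assumes the stock `check_tcp` and `check_http` plugins from the monitoring-plugins package, installed under `/usr/lib/nagios/plugins` (the usual Ubuntu path). The helper `nrpe_commands` and its default thresholds are hypothetical and not part of the charm.

```python
import shlex

# Assumed plugin location (Ubuntu's default for monitoring-plugins).
PLUGIN_DIR = "/usr/lib/nagios/plugins"


def nrpe_commands(agent_name, host, port=443,
                  uri="/thruk/cgi-bin/remote.cgi",
                  expect="nagios_pid", warn=5, crit=10, timeout=16):
    """Return nrpe.cfg lines for the reachability and content checks."""
    # Reachability: TCP connect only, with response-time thresholds.
    reach = [
        f"{PLUGIN_DIR}/check_tcp", "-H", host, "-p", str(port),
        "-w", str(warn), "-c", str(crit), "-t", str(timeout),
    ]
    # Content: HTTPS GET that must contain the expected string (-s).
    content = [
        f"{PLUGIN_DIR}/check_http", "-H", host, "-p", str(port), "-S",
        "-u", uri, "-s", expect, "-t", str(timeout),
    ]
    return [
        f"command[check_{agent_name}_reachability]={shlex.join(reach)}",
        f"command[check_{agent_name}_nagios_content]={shlex.join(content)}",
    ]
```

Keeping the response-time thresholds only on the reachability check means a slow or broken VPN raises WARNING/CRITICAL on one check, while a missing `nagios_pid` string raises CRITICAL on the other, so operators can tell the two failure modes apart at a glance.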

Edin S (exsdev)
Changed in charm-thruk-external-agent:
importance: Undecided → Wishlist
Eric Chen (eric-chen) wrote:

This charm is no longer being actively maintained.

Changed in charm-thruk-external-agent:
status: New → Won't Fix