Watchdog feature to monitor internal status of NAV processes

Bug #1062298 reported by Morten Brekkevold
This bug affects 1 person
Affects: Network Administration Visualized
Status: Fix Released
Importance: Wishlist
Assigned to: John-Magne Bredal

Bug Description

NAV knows how to monitor a network, but doesn't do much to monitor itself.

It has been suggested that NAV needs a watchdog system, consisting of a UI dashboard to display the health status of various internal NAV processes/mechanisms, and provisions for generating alerts on suspicious or otherwise aberrant behavior.

Some suggestions for watchdog checks are:

* Flag overdue ipdevpoll jobs
* Flag GW/GSW devices that have no router interfaces
* Flag SW/GSW devices that have no switch ports
* Flag devices with abnormal port counts
* Flag when no new cam or arp entries have appeared in a given period
* NAV db stats such as: number of netboxes, device serial numbers, historic alerts, cam and arp records, and so forth

Morten Brekkevold (mbrekkevold) wrote :

Bug 1062150 has real world implications for the implementation of this feature.

Morten Brekkevold (mbrekkevold) wrote :

Watchdog tests must have a low-level component that returns True/False. Then there must be a higher level interface which can produce details of the check. The former can primarily be used to generate alerts by a background daemon, the latter to display details on a web page.
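
A minimal sketch of this two-level design, in Python. The names (`WatchdogTest`, `is_ok`, `get_details`, `NoSwitchPortsTest`) are hypothetical illustrations, not NAV's actual API, and the input is a plain dict rather than a database query:

```python
class WatchdogTest:
    """Hypothetical base class for a watchdog check (not NAV's real API)."""

    name = "Unnamed test"

    def __init__(self):
        self.errors = []

    def collect_errors(self):
        """Return a list of human-readable problems; subclasses override."""
        raise NotImplementedError

    def is_ok(self):
        """Low-level component: True if healthy, False otherwise."""
        self.errors = list(self.collect_errors())
        return not self.errors

    def get_details(self):
        """Higher-level interface: the problems found by the last is_ok()."""
        return list(self.errors)


class NoSwitchPortsTest(WatchdogTest):
    """Flags switches that report zero switch ports."""

    name = "Switches without switch ports"

    def __init__(self, port_counts):
        super().__init__()
        # Assumed input: mapping of device name -> number of switch ports.
        self.port_counts = port_counts

    def collect_errors(self):
        return [
            f"{box} has no switch ports"
            for box, count in sorted(self.port_counts.items())
            if count == 0
        ]
```

A background daemon would only need `is_ok()` to decide whether to raise an alert, while a web page could call `get_details()` afterwards to list what went wrong.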

Changed in nav:
status: New → In Progress
assignee: nobody → John-Magne Bredal (john-m-bredal)
milestone: none → 4.1.0
John-Magne Bredal (john-m-bredal) wrote :

WatchDog implemented as a series of custom tests based on original bug description.

Two GUI components implemented:
1. Widget that periodically runs the tests and displays the result status, with an option to list the errors behind an error status.
2. Separate page that displays a one-time test result status, plus generic information about the NAV installation as outlined in the bug description.

Problems:
The "Job Status" test flags any job that has failed. There always seems to be at least one failed job, so the test is useless as it stands. A test that inspects more thoroughly whether a failure is really a problem should be implemented. For instance, a job that has failed 3 times in a row is considered a problem, but a single failure is not.
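
One possible shape for such a check, as a sketch only: the function name and the representation of a job's history as a list of booleans (oldest first, `True` meaning success) are assumptions made for illustration, not how NAV stores job results.

```python
def has_failed_repeatedly(results, threshold=3):
    """Return True if the last `threshold` runs of a job all failed.

    `results` is an iterable of booleans, oldest run first, where True
    means the run succeeded. Jobs with fewer than `threshold` recorded
    runs are never flagged.
    """
    recent = list(results)[-threshold:]
    return len(recent) == threshold and not any(recent)
```

With this rule a job whose history ends in success, failure, failure is not flagged, while one ending in three failures is.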

summary: - NAV watchdog
+ Watchdog feature to monitor internal status of NAV processes
Changed in nav:
status: In Progress → Fix Committed
status: Fix Committed → Fix Released