Servicemon silently stops checking services after running for a while
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Network Administration Visualized | Fix Released | Critical | Morten Brekkevold |
Bug Description
While investigating a customer report of issues with service monitor alerts, it was discovered that the servicemon process on their installation was doing nothing. The customer had restarted the servicemon daemon, and it resumed its work in a normal fashion.
The logs (debug level) indicated that all the checkers were being instantiated on each cycle, but none were run. After the restart, logs showed normal behavior for a while, until a traceback was logged and servicemon resumed the errant behavior.
A code review reveals that a refactoring from 2011 introduced this bug. A line of code refers to a renamed variable by its old name, causing an AttributeError exception when recycling old worker threads.
This means the bug is triggered each time a worker thread reaches its maximum number of jobs and is recycled. Once every worker thread has triggered the exception, no worker threads remain available, and servicemon stops monitoring services entirely.
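The failure mode described above can be reproduced in miniature. The following is only a sketch under stated assumptions, not NAV's actual code: the `RunQueue` and `Worker` classes, the `unused_thread_names` attribute, and the `simulate` helper are hypothetical stand-ins. Only the stale name `unusedThreadName` is taken from the traceback in this report.

```python
import queue
import threading


class RunQueue:
    """Toy worker pool; a hypothetical stand-in, not NAV's real _RunQueue."""

    def __init__(self, num_workers, max_jobs_per_worker):
        self.jobs = queue.Queue()
        self.max_jobs_per_worker = max_jobs_per_worker
        self.unused_thread_names = []  # assumed new name after the refactoring
        self.errors = []               # messages from workers that died
        self.completed = 0
        self.lock = threading.Lock()
        self.workers = [Worker(self, "worker%d" % i) for i in range(num_workers)]

    def recycle(self, name):
        # BUG: refers to the attribute by its old name, so this raises
        # AttributeError instead of freeing the name for a replacement worker.
        self.unusedThreadName.append(name)


class Worker(threading.Thread):
    def __init__(self, pool, name):
        super().__init__(name=name)
        self.pool = pool

    def run(self):
        # Process jobs until this worker reaches its job limit.
        for _ in range(self.pool.max_jobs_per_worker):
            job = self.pool.jobs.get()
            job()
            with self.pool.lock:
                self.pool.completed += 1
        try:
            self.pool.recycle(self.name)
        except AttributeError as error:
            # The real daemon left this uncaught; it is recorded here only so
            # the demonstration can inspect it. Either way the thread ends
            # and nothing replaces it.
            with self.pool.lock:
                self.pool.errors.append(str(error))


def simulate(num_workers=3, max_jobs_per_worker=2, num_jobs=10):
    """Return (jobs completed, jobs left queued, error messages).

    num_jobs must be at least num_workers * max_jobs_per_worker, otherwise
    the workers would block waiting for jobs that never arrive.
    """
    pool = RunQueue(num_workers, max_jobs_per_worker)
    for _ in range(num_jobs):
        pool.jobs.put(lambda: None)
    for worker in pool.workers:
        worker.start()
    for worker in pool.workers:
        worker.join()
    return pool.completed, pool.jobs.qsize(), pool.errors
```

With three workers limited to two jobs each, `simulate()` completes six of the ten queued jobs. Each worker then fails in `recycle`, and the remaining four jobs stay in the queue with no thread left to run them, which mirrors the silent stall described in this report.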
The time it takes for this to happen depends on the number of worker threads configured in `servicemon.conf` (default: 20), the value of the `recycle interval` option (default: 50), and, of course, how many service checkers have been configured in SeedDB.
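As a rough back-of-the-envelope estimate, the relationship between these three numbers can be sketched as below. It assumes that `recycle interval` is the maximum number of jobs per worker, that every worker dies at its first recycle and is never replaced, and that each configured checker runs exactly once per cycle. The function name and the checker count of 100 are hypothetical; only the defaults of 20 and 50 come from this report.

```python
import math


def cycles_until_stall(num_workers=20, recycle_interval=50, num_checkers=100):
    """Estimate how many check cycles pass before every worker has died.

    Assumes each worker handles recycle_interval jobs and then dies, so the
    pool as a whole can complete num_workers * recycle_interval jobs.
    """
    jobs_before_stall = num_workers * recycle_interval
    return math.ceil(jobs_before_stall / num_checkers)
```

Under these assumptions the default configuration can complete 20 × 50 = 1000 checks in total, so an installation with 100 checkers would stall after about 10 cycles. The wall-clock time depends on the cycle length, which this report does not state.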
Once this message appears in `servicemon.log`, normal service checking ceases:
Exception in thread worker19:
Traceback (most recent call last):
File "/usr/lib/
self.run()
File "/usr/lib/
self.execute()
File "/usr/lib/
self.
AttributeError: '_RunQueue' object has no attribute 'unusedThreadName'
Changed in nav:
status: Fix Committed → Fix Released
fix here: https://nav.uninett.no/hg/stable/rev/e49842a11c35