Servicemon silently stops checking services after running for a while

Bug #1520119 reported by Morten Brekkevold
Affects: Network Administration Visualized
Status: Fix Released
Importance: Critical
Assigned to: Morten Brekkevold
Milestone: (none listed)

Bug Description

While investigating a customer report of issues with service monitor alerts, it was discovered that the servicemon process on their installation was doing nothing. The customer had restarted the servicemon daemon, and it resumed its work in a normal fashion.

The logs (debug level) indicated that all the checkers were being instantiated on each cycle, but none were run. After the restart, logs showed normal behavior for a while, until a traceback was logged and servicemon resumed the errant behavior.

A code review reveals that a refactoring from 2011 introduced this bug. A line of code refers to a renamed variable by its old name, causing an AttributeError exception when recycling old worker threads.

This means the bug is triggered each time a worker thread reaches its maximum number of jobs and is recycled. Once every worker thread has triggered the exception, no worker threads remain available, and servicemon stops monitoring services entirely.
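The failure mode can be illustrated with a minimal, self-contained sketch. It is not NAV's actual code: only the old attribute name `unusedThreadName` comes from the traceback in this report, while `RunQueueSketch`, `Worker`, `recycle_workers` and the new attribute name `unused_thread_names` are hypothetical names chosen for illustration, and the sketch targets Python 3 rather than the Python 2.7 seen in the traceback. It shows how a worker that refers to the stale name dies with an AttributeError before it can hand its name back to the pool.

```python
import threading

RECYCLE_INTERVAL = 3  # jobs per worker before recycling (kept small for illustration)


class RunQueueSketch:
    """Hypothetical stand-in for the run queue; not NAV's real class."""

    def __init__(self):
        # Assumed post-refactoring attribute name (hypothetical).
        self.unused_thread_names = []


class Worker(threading.Thread):
    def __init__(self, runqueue, name, buggy):
        super().__init__(name=name)
        self._runqueue = runqueue
        self._buggy = buggy

    def run(self):
        for _ in range(RECYCLE_INTERVAL):
            pass  # a real worker would run one service checker here
        # Recycling step: hand the thread name back to the pool.
        if self._buggy:
            # Stale name from before the refactoring: raises AttributeError,
            # so the thread dies and its name is never returned.
            self._runqueue.unusedThreadName.append(self.name)
        else:
            self._runqueue.unused_thread_names.append(self.name)


def recycle_workers(buggy, count=4):
    """Run `count` workers to their recycle point; return the names handed back."""
    runqueue = RunQueueSketch()
    workers = [Worker(runqueue, "worker%d" % i, buggy) for i in range(count)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return sorted(runqueue.unused_thread_names)
```

Calling `recycle_workers(buggy=False)` returns all four worker names, while `recycle_workers(buggy=True)` returns an empty list and prints one traceback per thread to stderr, much like the one quoted in this report.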

How long this takes depends on the number of worker threads configured in `servicemon.conf` (default: 20), the value of the `recycle interval` option (default: 50), and, of course, how many service checkers have been configured in SeedDB.
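As a rough back-of-the-envelope estimate, the relationship between these three numbers can be sketched as follows. This rests on assumptions not confirmed by the report: that the recycle interval counts jobs per worker thread, that every worker dies at exactly that count, and that each check cycle schedules one job per configured checker. The function name is hypothetical.

```python
def cycles_until_all_workers_lost(checkers, worker_threads=20, recycle_interval=50):
    """Estimate how many check cycles pass before every worker has died.

    Assumes each worker handles exactly `recycle_interval` jobs before hitting
    the bug, and each cycle runs one job per configured checker.
    """
    if checkers <= 0:
        raise ValueError("checkers must be a positive integer")
    total_jobs = worker_threads * recycle_interval
    # Ceiling division without floating point.
    return -(-total_jobs // checkers)
```

Under these assumptions and the default settings, the worker pool is exhausted after 20 × 50 = 1000 jobs in total, so an installation with 100 checkers would stop monitoring after about 10 cycles.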

Once this message appears in `servicemon.log`, normal service checking ceases:

Exception in thread worker19:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/dist-packages/nav/statemon/RunQueue.py", line 61, in run
    self.execute()
  File "/usr/lib/python2.7/dist-packages/nav/statemon/RunQueue.py", line 78, in execute
    self._runqueue.unusedThreadName.append(self.getName())
AttributeError: '_RunQueue' object has no attribute 'unusedThreadName'
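
To check how many worker threads an affected installation has already lost, the log can be scanned for this traceback. The following is a minimal sketch, assuming the traceback appears in the log in the unprefixed form shown above and that tracebacks from different threads are not interleaved. `dead_workers` is a hypothetical helper, not part of NAV.

```python
import re

# Matches the first line of the traceback, e.g. "Exception in thread worker19:"
THREAD_RE = re.compile(r"^Exception in thread (\S+):")
# The error text from the traceback quoted in this report.
MARKER = "has no attribute 'unusedThreadName'"


def dead_workers(lines):
    """Return the set of thread names that died with the stale-attribute error."""
    dead = set()
    current = None
    for line in lines:
        match = THREAD_RE.match(line)
        if match:
            current = match.group(1)
        elif MARKER in line and current is not None:
            dead.add(current)
            current = None
    return dead
```

Passing an open file object for `servicemon.log` to `dead_workers` yields the names of the lost workers. If the size of that set equals the configured number of worker threads, service checking has most likely stopped.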

Tags: servicemon
Morten Brekkevold (mbrekkevold) wrote :
Changed in nav:
status: Confirmed → Fix Committed
Changed in nav:
status: Fix Committed → Fix Released