Network Administration Visualized

Servicemon silently stops checking services after running for a while

Bug #1520119 reported by Morten Brekkevold on 2015-11-26

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Network Administration Visualized	Fix Released	Critical	Morten Brekkevold	Network Administration Visualized 4.3.3

Bug Description

While investigating a customer report of issues with service monitor alerts, it was discovered that the servicemon process on their installation was doing nothing. The customer had restarted the servicemon daemon, and it resumed its work in a normal fashion.

The logs (debug level) indicated that all the checkers were being instantiated on each cycle, but none were run. After the restart, logs showed normal behavior for a while, until a traceback was logged and servicemon resumed the errant behavior.

A code review reveals that a refactoring from 2011 introduced this bug. A line of code refers to a renamed variable by its old name, causing an AttributeError exception when recycling old worker threads.

This means the bug is triggered each time a worker thread reaches its maximum number of jobs and is recycled. Once every worker thread has hit the exception, no worker threads remain available, and servicemon stops monitoring services entirely.
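The failure mode can be illustrated with a small single-threaded simulation. This is only a sketch, not NAV's actual code: the class, the function names and the new attribute name `unused_thread_names` are hypothetical; only the old name `unusedThreadName` comes from the traceback in this report.

```python
class RunQueue:
    """Toy worker pool. All names here are hypothetical, not NAV's code."""

    def __init__(self, num_workers):
        # The refactoring renamed this attribute (old name: unusedThreadName).
        self.unused_thread_names = []
        self.live_workers = {"worker%d" % i for i in range(num_workers)}
        self.jobs_done = 0


def run_worker(runqueue, name, max_jobs):
    """Process max_jobs jobs, then hand the thread name back for recycling."""
    for _ in range(max_jobs):
        runqueue.jobs_done += 1  # stand-in for running one service check
    # Bug: the recycling step still uses the pre-refactoring attribute name.
    runqueue.unusedThreadName.append(name)


def simulate(num_workers, max_jobs):
    """Return (jobs completed, workers still alive) after every worker hits its limit."""
    runqueue = RunQueue(num_workers)
    for name in sorted(runqueue.live_workers):
        try:
            run_worker(runqueue, name, max_jobs)
        except AttributeError:
            # In a real thread the exception is uncaught: the thread dies
            # and its name is never returned to the pool.
            runqueue.live_workers.discard(name)
    return runqueue.jobs_done, len(runqueue.live_workers)


print(simulate(num_workers=3, max_jobs=2))  # (6, 0)
```

Every worker completes its jobs normally and only fails at the recycling step, which is why the daemon appears healthy for a while before going quiet.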

The time it takes for this to happen depends on the number of configured worker threads in `servicemon.conf` (default: 20), the value of the `recycle interval` option (default: 50), and, of course, how many service checkers have been configured in SeedDB.
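A rough back-of-the-envelope estimate of that time can be sketched as follows. It rests on assumptions not confirmed by this report: that each worker runs exactly `recycle interval` checks before dying, that no replacement threads are started, and that every checker runs once per check interval. The function name and the `check_interval_s` parameter are hypothetical.

```python
def seconds_until_stall(num_checkers, check_interval_s,
                        worker_threads=20, recycle_interval=50):
    """Estimate how long servicemon keeps working after a restart.

    Assumes (hypothetically) that the pool can run at most
    worker_threads * recycle_interval checks in total, and that
    num_checkers checks are scheduled every check_interval_s seconds.
    """
    total_checks = worker_threads * recycle_interval
    return total_checks * check_interval_s / num_checkers


# Example with made-up numbers: 100 checkers, each run once a minute.
print(seconds_until_stall(num_checkers=100, check_interval_s=60))  # 600.0
```

Under these assumptions the defaults allow about 1000 checks in total, so an installation with many checkers would stall much sooner than one with only a few.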

Once this message appears in `servicemon.log`, normal service checking ceases:

Exception in thread worker19:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 552, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/dist-packages/nav/statemon/RunQueue.py", line 61, in run
    self.execute()
  File "/usr/lib/python2.7/dist-packages/nav/statemon/RunQueue.py", line 78, in execute
    self._runqueue.unusedThreadName.append(self.getName())
AttributeError: '_RunQueue' object has no attribute 'unusedThreadName'
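To check whether an installation is affected, the log can be scanned for this signature. The following is a minimal sketch that assumes the traceback appears in the log in the same form as above; the function name is hypothetical, and where the log lines come from (for example, reading `servicemon.log`) is left to the caller.

```python
import re

# Final line of the traceback shown in this report.
SIGNATURE = "AttributeError: '_RunQueue' object has no attribute 'unusedThreadName'"
THREAD_HEADER = re.compile(r"Exception in thread (\S+):")


def find_dead_workers(log_lines):
    """Return the set of thread names whose traceback ends in SIGNATURE."""
    dead = set()
    current = None
    for line in log_lines:
        header = THREAD_HEADER.search(line)
        if header:
            current = header.group(1)  # start of a new traceback
        elif SIGNATURE in line and current is not None:
            dead.add(current)
            current = None
    return dead


sample = [
    "Exception in thread worker19:",
    "Traceback (most recent call last):",
    "    self._runqueue.unusedThreadName.append(self.getName())",
    SIGNATURE,
]
print(find_dead_workers(sample))  # {'worker19'}
```

If the number of names returned equals the configured number of worker threads, the description above suggests that service checking has stopped completely.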

Morten Brekkevold (mbrekkevold) on 2015-11-26

description:	updated

Morten Brekkevold (mbrekkevold) wrote on 2015-11-26:

fix here: https://nav.uninett.no/hg/stable/rev/e49842a11c35

Changed in nav:
status:	Confirmed → Fix Committed

Morten Brekkevold (mbrekkevold) on 2015-11-26

Changed in nav:
status:	Fix Committed → Fix Released
