Comment 1 for bug 258219

Morten Brekkevold (mbrekkevold) wrote :

This bug is related to getDeviceData's module monitoring
plugins.

Module status is set by two different plugins. One plugin
performs actual module probing, either by querying specific
module status OIDs (such as for HP and 3Com devices), or by
a generic probe of a random ifindex known to exist on a
previously seen module. If a probe succeeds, the module is
marked as up. Another plugin tries to discover which
modules an IP device actually consists of. Any module this
plugin discovers is marked as up. This latter method is
also how modules are initially discovered.

After this probing, the module monitor plugin checks the
list of modules the other plugins marked as up against the
list of previously seen modules on the IP device. Any
previously seen module not in the up-list is then
considered to be down.
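
The comparison described above amounts to a set difference.
The following is only a minimal sketch of that idea, not
getDeviceData's actual code; the function name and the
module names are hypothetical and chosen for illustration.

```python
def find_down_modules(previously_seen, marked_up):
    """Return the previously seen modules that no plugin marked as up."""
    return set(previously_seen) - set(marked_up)


# Hypothetical module names, for illustration only.
previously_seen = {"switch-module-1", "router-module-2", "supervisor-3"}

# Assume only the probe plugin ran, and it could only reach the
# switch module.
marked_up = {"switch-module-1"}

down = find_down_modules(previously_seen, marked_up)
print(sorted(down))
```

With these assumed inputs, the two modules the probe could
not reach end up in the down set, even though nothing has
actually happened to them.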

This bug first appeared when the moduleMon probe OIDs were
rescheduled to be collected at one-hour intervals, instead
of the regular six hours. This was done because one doesn't
want to wait six hours for an alert about a module going
down. The generic ifindex probe only probes ifindexes of
switch ports, not router ports. Also, there is no generic
way to probe a module with no interfaces. Now, this probe
runs every hour, while the full module discovery only runs
every six hours. Since the module probe doesn't cover router
modules or modules without interfaces, these are considered
down whenever the probe runs on its own, without a
simultaneous full discovery.

This is why the symptoms are a pattern of five-hour module
downtimes: a probe and a full discovery first run at the
same time, then the probe runs on its own one hour later,
and the affected modules stay marked as down for the
remaining five hours until a full discovery runs again.
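
The schedule can be replayed with a small simulation. This
is a rough sketch under simplifying assumptions: one
switch module and one router module (both names
hypothetical), both already known to the monitor, a probe
every hour, a full discovery every six hours, and a
monitor check after every run. It is not how the plugins
are actually implemented.

```python
PROBE_INTERVAL = 1       # hours between probe runs
DISCOVERY_INTERVAL = 6   # hours between full discovery runs

# Hypothetical modules; both are assumed to be previously seen.
ALL_MODULES = {"switch-module", "router-module"}
# Assumption: the probe can only reach modules with switch ports.
PROBEABLE = {"switch-module"}


def simulate(hours):
    """Return a list of (hour, modules considered down) pairs."""
    timeline = []
    for hour in range(hours):
        marked_up = set()
        if hour % PROBE_INTERVAL == 0:
            marked_up |= PROBEABLE
        if hour % DISCOVERY_INTERVAL == 0:
            marked_up |= ALL_MODULES
        # The monitor treats every known module missing from the
        # up-list as down.
        timeline.append((hour, sorted(ALL_MODULES - marked_up)))
    return timeline


timeline = simulate(12)
down_hours = [hour for hour, down in timeline if "router-module" in down]
print(down_hours)
```

In this model the router module is reported down at hours
1 through 5 and again at hours 7 through 11, and is only
seen as up at hours 0 and 6 when the full discovery runs.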