ipdevpoll-inventory job fails on certain devices

Bug #690723 reported by Miloslav Izar
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Network Administration Visualized
Invalid
Undecided
Unassigned

Bug Description

Hi I'm just a simple network engineer so bear with me,
I'm not sure what info you will require so I'll do my best to provide it when you ask for it.
We have 2 device types that ipdevpoll fails .

I'm running NAV 3.7.1 on debian from repository

first one is:

Slot-2 XXX # sh switch

SysName: XXX
SysLocation: XXX
SysContact: <email address hidden>, +1 888 257 3000
System MAC: 02:04:96:27:85:FE
System Type: X250e-48t (Stack)

SysHealth check: Enabled (Normal)
Recovery Mode: All
System Watchdog: Enabled

Current Time: Wed Dec 15 17:17:27 2010
Timezone: [Auto DST Enabled] GMT Offset: 60 minutes, name is MET.
                  DST of 60 minutes is currently not in effect, name is MDT.
                  DST begins every last Sunday March at 1:30
                  DST ends every last Sunday October at 1:30

Boot Time: Thu Oct 16 08:12:16 2008
Boot Count: 16
Next Reboot: None scheduled
System UpTime: 790 days 10 hours 5 minutes 10 seconds

Slot: Slot-2 * Slot-3
                  ------------------------ ------------------------
Current State: MASTER BACKUP (In Sync)

Image Selected: secondary secondary
Image Booted: secondary secondary
Primary ver: 12.0.1.11 12.1.1.4
Secondary ver: 12.1.1.4 12.1.1.4

Config Selected: primary.cfg
Config Booted: Factory Default

primary.cfg Created by ExtremeXOS version 12.1.1.4
                  409679 bytes saved on Wed Nov 24 15:00:11 2010

ipdevpoll.log:

2010-12-15 17:18:34,767 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(xxx)] Plugin nav.ipdevpoll.plugins.modules.Modules(u'xxx') reported a timeout
2010-12-15 17:18:34,769 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(xxx)] Job 'inventory' for xxx aborted.

second:

C3750 stack

Switch Ports Model SW Version SW Image
------ ----- ----- ---------- ----------
* 1 52 WS-C3750-48TS 12.2(46)SE C3750-IPBASEK9-M
     2 52 WS-C3750-48TS 12.2(46)SE C3750-IPBASEK9-M
     3 52 WS-C3750-48TS 12.2(46)SE C3750-IPBASEK9-M
     4 52 WS-C3750-48TS 12.2(46)SE C3750-IPBASEK9-M
     5 52 WS-C3750-48TS 12.2(46)SE C3750-IPBASEK9-M

2010-12-15 17:05:02,012 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(yyy)] Job 'inventory' for yyy aborted.
2010-12-15 17:14:40,023 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(yyy)] Plugin nav.ipdevpoll.plugins.bridge.Bridge(u'yyy') reported a timeout

however c3750 stack with different ios ver works ok:
OK:
Switch Ports Model SW Version SW Image
------ ----- ----- ---------- ----------
* 1 52 WS-C3750-48TS 12.2(44)SE2 C3750-IPBASEK9-M
     2 52 WS-C3750-48TS 12.2(44)SE2 C3750-IPBASEK9-M

Revision history for this message
Morten Brekkevold (mbrekkevold) wrote :

Hi,

You are having some sort of SNMP timeout problem. Either the device is being overloaded with SNMP requests, or it doesn't properly respond to ENTITY-MIB requests.

I would like you to add some more debugging info from ipdevpoll by changing the logging settings. Please create the file /etc/nav/logging.conf (or /usr/local/nav/etc/logging.conf, depending on your install location). Add these 3 lines:

[levels]
nav.ipdevpoll = INFO
nav.ipdevpoll.plugins.modules = DEBUG

Then restart ipdevpoll. Please post the relevant DEBUG messages from nav.ipdevpoll.plugins.modules here.

Changed in nav:
status: New → Incomplete
Revision history for this message
Miloslav Izar (izar) wrote :
Download full text (6.8 KiB)

Hi,

You are right, in first case it fails while it collects ENTITY-MIB,however log doesn't show much:

Extreme switch:
2011-01-04 13:12:57,198 [DEBUG] [nav.ipdevpoll.plugins.modules.modules.(st0192okprgac7)] Collecting ENTITY-MIB module data
2011-01-04 13:13:06,103 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(st0192okprgac7)] Plugin nav.ipdevpoll.plugins.modules.Modules(u'st0192okprgac7') reported a timeout
2011-01-04 13:13:06,106 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(st0192okprgac7)] Job 'inventory' for st0192okprgac7 aborted.
2011-01-04 13:13:06,106 [INFO] [nav.ipdevpoll.schedule.netboxscheduler.inventory.(st0192okprgac7)] Rescheduling 'inventory' for st0192okprgac7 in 520 seconds
2011-01-04 13:14:36,136 [INFO] [nav.ipdevpoll.jobs.jobhandler.profiling.(st0192okprgac7)] Job profiling for st0192okprgac7 done.
2011-01-04 13:21:46,107 [INFO] [nav.ipdevpoll.jobs.jobhandler.inventory.(st0192okprgac7)] Starting job 'inventory' for st0192okprgac7
2011-01-04 13:21:46,109 [WARNING] [nav.ipdevpoll.plugins.dnsname.dnsname.(st0192okprgac7)] Box dnsname has changed from u'st0192okprgac7' to 'p333apc3.nt.csa.cz'
2011-01-04 13:22:53,646 [DEBUG] [nav.ipdevpoll.plugins.modules.modules.(st0192okprgac7)] Collecting ENTITY-MIB module data
2011-01-04 13:22:59,655 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(st0192okprgac7)] Plugin nav.ipdevpoll.plugins.modules.Modules(u'st0192okprgac7') reported a timeout
2011-01-04 13:22:59,656 [ERROR] [nav.ipdevpoll.jobs.jobhandler.inventory.(st0192okprgac7)] Job 'inventory' for st0192okprgac7 aborted.
2011-01-04 13:22:59,656 [INFO] [nav.ipdevpoll.schedule.netboxscheduler.inventory.(st0192okprgac7)] Rescheduling 'inventory' for st0192okprgac7 in 579 seconds

Device itself is not very CPU utilized..at least according to monitoring it's not.

I'm just guessing: Could it be caused by too low treshold for timeout in this case(stack with several members)?
When I snmpwalk the device: "time snmpwalk -c <censored> -v 1 st0192okprgac7 -M ENTITY-MIB" it takes more than 7mins:

real 7m13.813s
user 0m1.200s
sys 0m0.760s

Cisco 3750 stack fails at something else:

2011-01-04 13:34:54,667 [INFO] [nav.ipdevpoll.jobs.jobhandler.inventory.(st0100okprgac3.nt.csa.cz)] Starting job 'inventory' for st0100okprgac3.nt.csa.cz
2011-01-04 13:35:08,029 [DEBUG] [nav.ipdevpoll.plugins.modules.modules.(st0100okprgac3.nt.csa.cz)] Collecting ENTITY-MIB module data
2011-01-04 13:35:11,572 [DEBUG] [nav.ipdevpoll.plugins.modules.modules.(st0100okprgac3.nt.csa.cz)] alias mapping: {1024L: [10017], 2049L: [10540], 2050L: [10541], 1027L: [10020], 2052L: [10543], 1029L: [10022], 2054L: [10545], 1031L: [10024], 2056L: [10547], 2057L: [10548], 2058L: [10601], 1035L: [10026], 1036L: [10027], 1037L: [10028], 1038L: [10029], 5037L: [12028], 1040L: [10031], 1041L: [10032], 1042L: [10033], 2051L: [10542], 1044L: [10035], 4014L: [11507], 1046L: [10037], 1047L: [10038], 1048L: [10039], 1049L: [10040], 1050L: [10041], 1051L: [10042], 1052L: [10043], 1053L: [10044], 1054L: [10045], 1055L: [10046], 1056L: [10047], 5040L: [12031], 1058L: [10101], 1023L: [10016], 1030L: [10023], 5041L: [12032], 3053L: [11044], 4018L: [11511], 4019L: [11512], 5...

Read more...

Revision history for this message
Morten Brekkevold (mbrekkevold) wrote :

Yes, if a walk of the entire ENTITY-MIB takes all of 7 minutes, I'd say the the 3750 is pretty slow on SNMP, and will likely time out too fast with ipdevpoll.

Have you patched your TwistedSNMP package according to NAV's release notes? There's a bug in the library that causes it to not increase its waiting time for each timeout. Under normal operations, it would first wait 2 seconds before timing out, then 3 seconds on the second attempt, 4.5 on the third attempt, 6.75 on the fourth, and 10.125 seconds on the final attempt (i.e. four retries, with the timeout period increasing by a factor of 1.5 for each timeout). Without the patch in place, only three attempts are made, each with a 2 second timeout.

Revision history for this message
Miloslav Izar (izar) wrote :

thanks...patching twistedsnmp fixed it

Revision history for this message
Morten Brekkevold (mbrekkevold) wrote :

Thanks for the feedback! Closing this issue as invalid then :-)

Changed in nav:
status: Incomplete → Invalid
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.