getDeviceData produces OutOfMemoryErrors and freaks out

Bug #258239 reported by Morten Brekkevold
Affects: Network Administration Visualized
Status: Fix Released
Importance: High
Assigned to: Morten Brekkevold

Bug Description

Several users report that their getDeviceData process freaks out and makes
strange changes to their NAV database.

Looking at stderr logs reveals that the Java process has run out of heap
memory. In turn, this seems to throw the PostgreSQL JDBC driver out of
whack, producing numerous SQLExceptions about protocol errors.

The symptoms make it look like a memory leak, i.e. something in the code
holds on to references to data that is never used again. There is reason
to believe the problem is triggered by unexpected SNMP responses from
devices that gDD is monitoring.

The problem is present in both versions 3.1.1 and 3.2.1.

[http://sourceforge.net/tracker/index.php?func=detail&aid=1675508&group_id=107608&atid=648170]

Revision history for this message
Morten Brekkevold (mbrekkevold) wrote :

Originator: YES

An initial memory profiling session on the gDD process (using jmp) on one
of the afflicted installations revealed that one or more threads seemed to
get stuck in OID testing, causing the memory usage of gDD to increase even
as the logs indicated gDD was idle.

During a second profiling session, combined with Ethereal sniffing for
SNMP traffic, I found one specific cause of an OutOfMemoryError:
the OID tester tested the c1900Bandwidth (1.3.6.1.4.1.437.1.1.3.7.1.0) OID
against a device, using a GETNEXT request. The device responded with an
ENDOFMIBVIEW response, i.e. there is no object after the requested OID in
the agent's MIB view. gDD then proceeded to re-send the exact same GETNEXT
request, and continued in a seemingly infinite loop, while the memory usage
of the process kept growing until an OutOfMemoryError was thrown,
destabilizing the entire process.

As a result of this, I will be doing some code review; apparently the OID
tester, the SimpleSnmp library, or Drexel's SNMP library fails to interpret
the ENDOFMIBVIEW response, which in turn causes the OID tester to re-test
the same OID indefinitely.

Morten Brekkevold (mbrekkevold) wrote :

Originator: YES

I've pinpointed the error, and it is actually located in the Drexel SNMP
library, not in NAV. I can probably write a workaround for NAV's
SimpleSnmp wrapper library, but a bug report should also be posted to
the Drexel authors.

The problem is the SNMPv1CommunicationInterface method called
retrieveMIBTable, which doesn't perform proper checks on the returned data
as it performs an snmpwalk operation on a table. When attempting to walk
past the end of the MIB tree using SNMPv1, the response packet will have an
error status set. The method properly breaks out of its walking loop when
it sees this.

But SNMPv2 does not return an error status in this case; instead, it
returns the same OID as was used in the GETNEXT request, with a special
value (ENDOFMIBVIEW) indicating that the end of the MIB tree has been
reached. The method just appends this variable binding to a list and loops
forever, sending a GETNEXT request for the same OID and growing the list of
responses without bound.
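
To illustrate the failure mode described above, here is a minimal,
self-contained sketch of a walk loop that terminates on an end-of-MIB-view
marker and also refuses to continue when the returned OID does not advance.
It is not the Drexel library's code, the patch sent upstream, or the NAV
workaround: WalkSketch, VarBind, Agent, fakeV2Agent and walk are hypothetical
names invented for this example, and the agent is simulated in memory rather
than queried over SNMP.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; no name here comes from the Drexel library or NAV. */
final class WalkSketch {

    /** Hypothetical variable binding: an OID, a value, and an end-of-MIB-view flag. */
    static final class VarBind {
        final int[] oid;
        final String value;
        final boolean endOfMibView;

        VarBind(int[] oid, String value, boolean endOfMibView) {
            this.oid = oid;
            this.value = value;
            this.endOfMibView = endOfMibView;
        }
    }

    /** Hypothetical stand-in for one GETNEXT round trip; null models an SNMPv1 error status. */
    interface Agent {
        VarBind getNext(int[] oid);
    }

    /** Lexicographic OID comparison, the ordering GETNEXT is defined in terms of. */
    static int compareOids(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return Integer.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length);
    }

    /** True if oid lies strictly below base in the OID tree. */
    static boolean isUnder(int[] base, int[] oid) {
        if (oid.length <= base.length) return false;
        for (int i = 0; i < base.length; i++) {
            if (base[i] != oid[i]) return false;
        }
        return true;
    }

    /** Walks the subtree under base, stopping on every end condition. */
    static List<VarBind> walk(Agent agent, int[] base) {
        List<VarBind> rows = new ArrayList<>();
        int[] current = base;
        while (true) {
            VarBind vb = agent.getNext(current);
            if (vb == null) break;                          // SNMPv1-style error status
            if (vb.endOfMibView) break;                     // SNMPv2-style end marker
            if (compareOids(vb.oid, current) <= 0) break;   // no progress: never re-ask
            if (!isUnder(base, vb.oid)) break;              // walked out of the subtree
            rows.add(vb);
            current = vb.oid;
        }
        return rows;
    }

    /** Simulated SNMPv2 agent: sortedOids must be in ascending lexicographic order. */
    static Agent fakeV2Agent(int[][] sortedOids) {
        return requested -> {
            for (int[] oid : sortedOids) {
                if (compareOids(oid, requested) > 0) return new VarBind(oid, "value", false);
            }
            // Past the last object: echo the requested OID with the end marker set.
            return new VarBind(requested, null, true);
        };
    }

    public static void main(String[] args) {
        int[][] mib = {
            {1, 3, 6, 1, 2, 1, 1, 1, 0},
            {1, 3, 6, 1, 2, 1, 2, 2, 1, 1, 1},
            {1, 3, 6, 1, 2, 1, 2, 2, 1, 1, 2},
        };
        Agent agent = fakeV2Agent(mib);
        // Two rows exist under this column, then the agent reports end of MIB view.
        System.out.println(walk(agent, new int[] {1, 3, 6, 1, 2, 1, 2, 2, 1, 1}).size());
        // The OID from this report lies past everything the simulated agent knows.
        System.out.println(walk(agent, new int[] {1, 3, 6, 1, 4, 1, 437, 1, 1, 3, 7, 1, 0}).size());
    }
}
```

Without the end-marker check and the no-progress check, the second call would
keep requesting the same OID and appending to the list indefinitely, which
matches the memory growth seen in gDD.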

Morten Brekkevold (mbrekkevold) wrote :

Originator: YES

A patch for Drexel's Java SNMP library has been written and sent to the
upstream authors.

Still need to write a temporary workaround for NAV.

Morten Brekkevold (mbrekkevold) wrote :

Originator: YES

Attaching the simple version of the patch I submitted upstream, in case
anyone else wants to apply it.

File Added: drexelsnmp-endofmibview.simpler.patch

Morten Brekkevold (mbrekkevold) wrote :

Originator: YES

NAV workaround added in r3942.
