getDeviceData produces OutOfMemoryErrors and freaks out
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Network Administration Visualized | Fix Released | High | Morten Brekkevold |
Bug Description
Several users report that their getDeviceData process freaks out and makes
strange changes to their NAV database.
Looking at stderr logs reveals that the Java process has run out of heap
memory. In turn, this seems to throw the PostgreSQL JDBC driver out of
whack, producing numerous SQLExceptions about protocol errors.
The symptoms make it look like a memory leak, i.e. something in the code
holds on to data references that are never used again. There is cause to
believe the problem arises because of unexpected SNMP responses from
devices that gDD is monitoring.
The problem is present in both versions 3.1.1 and 3.2.1.
[http://
Originator: YES
An initial memory profiling session of the gDD process (using jmp) on one
of the afflicted installations revealed that one or more threads seemed to
get stuck in OID testing, causing the memory usage of gDD to increase even
as the logs indicated gDD was idle.
During a second profiling session, combined with Ethereal sniffing for
SNMP traffic, I've found one specific cause of an OutOfMemoryError:
The OID tester tested the c1900Bandwidth (1.3.6.1.4.1.437.1.1.3.7.1.0) OID
against a device, using a GETNEXT request. The device responded with an
ENDOFMIBVIEW response, i.e. the requested OID is outside the MIB tree of
the agent. gDD then proceeded to re-send the exact same GETNEXT-request,
and continued in a seemingly infinite loop, while the memory usage of the
process just kept expanding, until an OutOfMemoryError was thrown,
destabilizing the entire process.
As a result of this, I will be doing some code review; apparently the OID
tester, the SimpleSnmp library, or Drexel's SNMP library fails to interpret
the ENDOFMIBVIEW response, which in turn causes the OID tester to re-test
the same OID indefinitely.
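To illustrate the kind of guard the code review would be looking for, here is a minimal sketch of a GETNEXT loop that stops when the agent answers ENDOFMIBVIEW or fails to return a lexicographically larger OID. This is not the actual gDD, SimpleSnmp or Drexel code: the GetNextWalk class, its Response type, the walk and compareOids methods, and the maxRequests cap are all hypothetical names made up for the example, and the agent is simulated by a plain function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical names throughout; this is not the NAV or Drexel SNMP API.
class GetNextWalk {

    /** Minimal stand-in for one GETNEXT response. */
    static final class Response {
        final int[] oid;             // OID the agent returned
        final boolean endOfMibView;  // true if the agent signalled endOfMibView

        Response(int[] oid, boolean endOfMibView) {
            this.oid = oid;
            this.endOfMibView = endOfMibView;
        }
    }

    /** Lexicographic OID comparison: negative if a < b, zero if equal, positive if a > b. */
    static int compareOids(int[] a, int[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i], b[i]);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Sends GETNEXT requests via getNext until the agent stops advancing. */
    static List<int[]> walk(int[] start, Function<int[], Response> getNext, int maxRequests) {
        List<int[]> results = new ArrayList<>();
        int[] current = start;
        // Hard cap on requests, so a misbehaving agent cannot keep the loop alive forever.
        for (int sent = 0; sent < maxRequests; sent++) {
            Response r = getNext.apply(current);
            if (r == null || r.endOfMibView) {
                break; // nothing follows the requested OID; re-sending cannot help
            }
            if (compareOids(r.oid, current) <= 0) {
                break; // agent did not advance; stop instead of re-testing the same OID
            }
            results.add(r.oid);
            current = r.oid;
        }
        return results;
    }

    public static void main(String[] args) {
        int[] c1900Bandwidth = {1, 3, 6, 1, 4, 1, 437, 1, 1, 3, 7, 1, 0};
        // Simulated agent that always answers endOfMibView, like the device in this report.
        Function<int[], Response> agent = oid -> new Response(oid, true);
        List<int[]> found = walk(c1900Bandwidth, agent, 1000);
        System.out.println("OIDs returned: " + found.size());
    }
}
```

With the simulated agent above, the loop sends a single request and returns an empty result instead of re-sending the same GETNEXT. The check for a non-increasing OID is a second line of defence for agents that echo the requested OID without flagging ENDOFMIBVIEW.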