Network Administration Visualized

pping crashes when re-creating RRD files

Bug #733115 reported by macrom on 2011-03-11

This bug affects 1 person

Affects: Network Administration Visualized
Status: Fix Released
Importance: Medium
Assigned to: Morten Brekkevold
Milestone: Network Administration Visualized 3.8.3

Bug Description

NAV : 3.8.2
OS : RHEL 6.0

I'm trying to upgrade from nav-3.5.6 running on RHEL 5.x (32-bit) to nav-3.8.2 running on RHEL 6 (64-bit).
The NAV installation is fresh, and the database was cloned and upgraded from 3.5.6 to 3.8.2.

When pping starts, these errors are printed to pping.log:
[2011-03-11 09:00:10] db.py:execute:146 [Critical] Throwing away update...
[2011-03-11 09:00:10] rrd.py:create:61 [Notice] Created rrd file kat-sw3.foo.tld.rrd
[2011-03-11 09:00:10] db.py:cursor:97 [Critical] Could not get cursor. Trying to reconnect...
[2011-03-11 09:00:10] db.py:connect:75 [Notice] Successfully (re)connected to NAVdb
[2011-03-11 09:00:10] db.py:execute:137 [Notice] Executing: INSERT INTO rrd_file
(rrd_fileid, path, filename, step, netboxid, subsystem) VALUES
(97812,'/usr/local/nav/var/rrd','kat-sw3.foo.tld.rrd',300,1080,'pping')
[2011-03-11 09:00:10] db.py:execute:145 [Critical] duplicate key value violates unique constraint "rrd_file_path_filename_key"

[2011-03-11 09:00:10] db.py:execute:146 [Critical] Throwing away update...
[2011-03-11 09:00:10] db.py:cursor:97 [Critical] Could not get cursor. Trying to reconnect...
[2011-03-11 09:00:10] db.py:connect:79 [Critical] Couldn't connect to db.
[2011-03-11 09:00:10] db.py:connect:80 [Critical] FATAL: connection limit exceeded for non-superusers

[2011-03-11 09:00:10] db.py:execute:148 [Critical] Could not execute statement: INSERT INTO rrd_datasource
        (rrd_fileid, name, descr, dstype, units) VALUES
        (97812, 'RESPONSETIME', 'Roundtrip time', 'GAUGE', 's')
[2011-03-11 09:00:10] db.py:execute:149 [Notice] 'NoneType' object has no attribute 'cursor'
Traceback (most recent call last):
  File "/usr/local/nav/bin/pping.py", line 271, in <module>
    start(nofork)
  File "/usr/local/nav/bin/pping.py", line 239, in start
    myPinger.main()
  File "/usr/local/nav/bin/pping.py", line 172, in main
    self.generateEvents()
  File "/usr/local/nav/bin/pping.py", line 114, in generateEvents
    rrd.update(netbox.netboxid, netbox.sysname, 'N', 'UP', rtt)
  File "/usr/local/nav/lib/python/nav/statemon/rrd.py", line 123, in update
    create(filename, netboxid, serviceid, handler)
  File "/usr/local/nav/lib/python/nav/statemon/rrd.py", line 62, in create
    register_rrd(filename, netboxid, serviceid, handler)
  File "/usr/local/nav/lib/python/nav/statemon/rrd.py", line 83, in register_rrd
    responsedescr, "GAUGE", "s")
  File "/usr/local/nav/lib/python/nav/statemon/db.py", line 325, in registerDS
    self.execute(statement)
  File "/usr/local/nav/lib/python/nav/statemon/db.py", line 151, in execute
    self.db.rollback()
AttributeError: 'NoneType' object has no attribute 'rollback'

The RRD files were not migrated over from the production server, so pping fails when it tries to create an RRD file that is already registered in the database.

Morten Brekkevold (mbrekkevold) wrote on 2011-03-15:

pping registers an RRD file in NAV's PostgreSQL database as soon as it is created by pping. If the RRD file once existed in the same location but has since been removed, it may still be referenced in the db. pping doesn't check to see if it's already in the db, and fails on an integrity error.

I guess pping should either check the db before inserting a new row, or handle the integrity error more gracefully.
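
The two options mentioned above could look roughly like the following. This is only a minimal sketch, not NAV's actual code: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, a simplified rrd_file table, and hypothetical helper names (make_demo_db, register_rrd_file).

```python
import sqlite3


def make_demo_db():
    """Hypothetical, simplified stand-in for NAV's rrd_file table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE rrd_file (
               rrd_fileid INTEGER PRIMARY KEY,
               path TEXT, filename TEXT, step INTEGER,
               netboxid INTEGER, subsystem TEXT,
               UNIQUE (path, filename))"""
    )
    return conn


def find_rrd_file(conn, path, filename):
    """Return the rrd_fileid of an already registered file, or None."""
    row = conn.execute(
        "SELECT rrd_fileid FROM rrd_file WHERE path = ? AND filename = ?",
        (path, filename),
    ).fetchone()
    return row[0] if row else None


def register_rrd_file(conn, path, filename, step, netboxid, subsystem):
    """Register an RRD file, reusing the existing row if there is one."""
    # Option 1: check the db before inserting a new row.
    existing = find_rrd_file(conn, path, filename)
    if existing is not None:
        return existing
    try:
        cur = conn.execute(
            "INSERT INTO rrd_file (path, filename, step, netboxid, subsystem)"
            " VALUES (?, ?, ?, ?, ?)",
            (path, filename, step, netboxid, subsystem),
        )
    except sqlite3.IntegrityError:
        # Option 2: another writer registered the file in the meantime.
        # Roll back the failed statement and reuse the existing row.
        conn.rollback()
        return find_rrd_file(conn, path, filename)
    conn.commit()
    return cur.lastrowid
```

With this approach, registering the same path and filename twice returns the same id instead of raising an error.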

summary:	pping crashes in nav-3.8.2 → pping crashes when re-creating RRD files
Changed in nav:
assignee:	nobody → Morten Brekkevold (mbrekkevold)
importance:	Undecided → Low

Morten Brekkevold (mbrekkevold) wrote on 2011-03-15:

Further analysis shows that the underlying problem is that pping interprets a database error as a lost database connection (!). Each time this happens, it creates a new connection. The old one is discarded, but not closed.

In your case, you likely had hundreds of RRD files created that were already registered in your migrated database, which caused pping to quickly use up the available non-superuser connections to PostgreSQL. When it could no longer open a new connection, the daemon crashed. This was confirmed on my dev server.

This latter bug should probably be filed separately...
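
To illustrate the connection handling described in this comment, here is a rough sketch of a wrapper that keeps its connection when a statement fails and closes the old connection before opening a new one. It is not the committed fix: the class name StatementRunner, the connect factory argument and the connections_opened counter are made up for this example, and sqlite3 stands in for the PostgreSQL driver. The exception classes are the standard DB-API ones, so a PostgreSQL driver such as psycopg2 exposes classes with the same names.

```python
import sqlite3


class StatementRunner:
    """Hypothetical wrapper; `connect` is any zero-argument connection factory."""

    def __init__(self, connect):
        self._connect = connect
        self.db = connect()
        self.connections_opened = 1  # only here to make the behaviour visible

    def _reconnect(self):
        # Close the old connection explicitly instead of just dropping it,
        # so it cannot keep occupying a slot on the server.
        if self.db is not None:
            try:
                self.db.close()
            except sqlite3.Error:
                pass
            self.db = None
        self.db = self._connect()
        self.connections_opened += 1

    def execute(self, statement, params=()):
        """Return True if the statement was committed, False if discarded."""
        if self.db is None:
            self._reconnect()
        try:
            self.db.execute(statement, params)
            self.db.commit()
            return True
        except sqlite3.IntegrityError:
            # The connection is fine; only this statement was rejected.
            # Throw away the update and keep using the same connection.
            self.db.rollback()
            return False
        except sqlite3.OperationalError:
            # Assumed here to signal a connection-level problem, which is
            # the only case where a new connection is opened.
            self._reconnect()
            return False
```

In this sketch, a few hundred duplicate inserts in a row leave connections_opened at 1, whereas reconnecting on every error would open one connection per failed statement.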

Changed in nav:
importance:	Low → Medium

Morten Brekkevold (mbrekkevold) wrote on 2011-03-15:

Fix here: http://metanav.uninett.no/hg/series/3.8.x/rev/146a6b56068e

Changed in nav:
milestone:	none → 3.8.3
status:	New → Confirmed
status:	Confirmed → Fix Committed

Morten Brekkevold (mbrekkevold) on 2011-03-15

Changed in nav:
status:	Fix Committed → Fix Released
