lxml

Memory leaks updating Element.attrib dictionary

Bug #439462 reported by PatrickCD on 2009-09-30

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	lxml	Fix Released	Medium	scoder	lxml 2.3

Bug Description

My code needs to parse, modify, then serialise medium-sized XML files (10-700MB). I began using cElementTree but found that lxml serialises much faster. However, I'm getting what looks like a memory leak from lxml which does not occur with cElementTree. I can't easily reproduce the bug using a smaller XML snippet as string input, but it is very consistent with the same input file, which does not leak memory with cElementTree.

I'm using lxml 2.2.2 and libxml2 2.6.32 on Ubuntu.

from lxml.etree import ElementTree

filename = "blah.xml"  # 600MB
new_values = dict(section="s23423", title="New Title", weight="33")

# Running this loop from the shell grows the process memory by about 5MB
for i in xrange(100):
    elTree = ElementTree(file=filename)
    el = elTree.getroot().getchildren()[1].getchildren()[0]
    for k, v in new_values.iteritems():
        el.set(k, v)

# This loop irreversibly increases process memory by 365MB
for i in xrange(100):
    elTree = ElementTree(file=filename)
    el = elTree.getroot().getchildren()[1].getchildren()[0]
    el.attrib.update(new_values)




scoder (scoder) wrote on 2009-09-30:

With "leak", do you mean it isn't given back to the system? How do you measure the memory usage? Note that the Python interpreter does not necessarily free memory that it has allocated immediately when it is no longer used, so the size of the interpreter process is not necessarily a good measure.

Also note that using "element.attrib" creates cross-referenced objects that need garbage collection. You may want to run "gc.collect()" within or after the last loop to see if the memory is really permanently "leaking" or just temporarily allocated.
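As a rough illustration of the point about cross-referenced objects, the following sketch uses a made-up `Node` class (not part of lxml) to build reference cycles while the automatic collector is paused. It relies on `gc.collect()` returning the number of unreachable objects it found, so a non-zero result means the memory was waiting for the garbage collector rather than permanently lost.

```python
import gc


class Node(object):
    """Hypothetical stand-in for an object that takes part in a reference cycle."""

    def __init__(self):
        self.partner = None


def make_cycle():
    # Two objects that point at each other: reference counting alone
    # can never bring their counts down to zero.
    a, b = Node(), Node()
    a.partner = b
    b.partner = a


def count_cyclic_garbage(iterations):
    """Create `iterations` cycles with the automatic GC paused, then collect."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        gc.collect()  # start from a clean slate
        for _ in range(iterations):
            make_cycle()
        # gc.collect() returns the number of unreachable objects it found.
        return gc.collect()
    finally:
        if was_enabled:
            gc.enable()
```

Calling `count_cyclic_garbage(10)` should report at least the 20 `Node` objects; the exact figure depends on the interpreter version.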


PatrickCD (patrick-dobbs) wrote on 2009-09-30:

Hey, thanks for your quick response.

I'm measuring memory usage with pmap. I've attached the output of the following loop. I added the sleep to allow some garbage collection, but this didn't help:

for i in xrange(50):
    elTree = ElementTree(file="blah.xml")
    el = elTree.getroot().getchildren()[1].getchildren()[0]
    el.attrib.update(new_values)
    time.sleep(0.5)

In such a case the python process will continue to consume memory until the OS crashes (e.g. when making a cup of tea).

I tried your suggestion of adding a call to gc.collect(). This completely works - memory usage stays constant, but only if called inside the loop. I'm hoping to use lxml throughout a server side application. It seems a bit dodgy to need to clean up memory explicitly.

I'm also still not clear why attrib.update() causes a problem when element.set(key,value) doesn't. The lxml docs seem to advocate using attrib directly.


PatrickCD (patrick-dobbs) wrote on 2009-09-30:

output of bash pmap for a loop not using gc.collect() (6.5 KiB, text/plain)

PatrickCD (patrick-dobbs) on 2009-09-30

description:

updated

scoder (scoder) on 2009-09-30

security vulnerability:	yes → no
visibility:	private → public


scoder (scoder) wrote on 2009-10-02:

First of all, calling "el.attrib" creates an intermediate dict-like object, so calling "el.set()" is a lot more efficient. However, people tend to use el.attrib rather carelessly and most of them do not read the docs at all. To mitigate the overhead of creating the attrib object, the Element keeps a reference to it once it's created. This leads to a cyclic reference that requires GC resolution, so the Element object will (or may) not be discarded immediately when going out of scope.
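The effect can be modelled in plain Python. This is only a toy sketch: `CachingElement` and `AttribProxy` are hypothetical classes written for illustration, not lxml's actual implementation. It assumes CPython's reference counting and uses a weak reference to check whether the element is still alive once its last name is dropped while the cyclic GC is paused.

```python
import gc
import weakref


class AttribProxy(object):
    """Hypothetical dict-like view that holds a reference to its element."""

    def __init__(self, element):
        self._element = element

    def update(self, values):
        for key, value in values.items():
            self._element.set(key, value)


class CachingElement(object):
    """Toy model of an element that keeps its attrib proxy alive."""

    def __init__(self):
        self._attrs = {}
        self._attrib = None

    def set(self, key, value):
        self._attrs[key] = value

    @property
    def attrib(self):
        if self._attrib is None:
            # element -> proxy -> element: a reference cycle
            self._attrib = AttribProxy(self)
        return self._attrib


def survives_without_gc(use_attrib):
    """Return True if the element outlives its last name while the GC is paused."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        el = CachingElement()
        if use_attrib:
            el.attrib.update({"title": "New Title"})
        else:
            el.set("title", "New Title")
        ref = weakref.ref(el)
        del el
        alive = ref() is not None
    finally:
        if was_enabled:
            gc.enable()
    gc.collect()  # clean up any cycle left behind
    return alive
```

In this model, `survives_without_gc(True)` reports that the element lingers after touching `.attrib`, while `survives_without_gc(False)` reports that it is freed at once when only `set()` is used.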

However, the GC will certainly find and clean up the cyclic reference on its next run, so I don't see why this shouldn't happen on your system. You wrote that it's running in a server environment. Maybe the GC uses a special configuration there?
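One way to check whether the collector is configured unusually is to ask the standard `gc` module directly. The helper name `describe_gc` below is made up for this sketch; the three `gc` calls it wraps are part of the standard library.

```python
import gc


def describe_gc():
    """Collect the interpreter's current garbage collector settings."""
    return {
        # False means cycles are never collected automatically.
        "enabled": gc.isenabled(),
        # Allocation thresholds that trigger a collection per generation.
        "thresholds": gc.get_threshold(),
        # Current allocation counts per generation.
        "counts": gc.get_count(),
    }


print(describe_gc())
```

If `enabled` is false, or the first threshold is zero or very large, cyclic garbage can pile up until `gc.collect()` is called explicitly.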


scoder (scoder) wrote on 2009-10-17:

I disabled the ref-cycle for lxml 2.3. It turns out that creating a new dict-like object on each .attrib access is only slightly slower than reusing one for the lifetime of an Element. Even using a weak reference is slower than creating a new object each time. The advantage of avoiding reference cycles thus clearly outweighs the tiny performance improvement of the keep-alive .attrib reference.
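A minimal sketch of the cycle-free approach, again using hypothetical toy classes (`Element` and `AttribView` here are illustrations, not lxml's code): the property builds a fresh view on every access and never stores it on the element, so under CPython's reference counting the element can be freed without the cyclic GC.

```python
import gc
import weakref


class AttribView(object):
    """Hypothetical short-lived dict-like view onto an element's attributes."""

    def __init__(self, element):
        self._element = element

    def update(self, values):
        for key, value in values.items():
            self._element.set(key, value)


class Element(object):
    """Toy element that hands out a new view on each .attrib access."""

    def __init__(self):
        self._attrs = {}

    def set(self, key, value):
        self._attrs[key] = value

    @property
    def attrib(self):
        # Not cached on the element, so no element -> view -> element cycle.
        return AttribView(self)


def freed_without_gc():
    """Return True if the element is released as soon as its last name is dropped."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        el = Element()
        el.attrib.update({"title": "New Title"})  # temporary view dies here
        ref = weakref.ref(el)
        del el
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()
```

The trade-off in this model is that `el.attrib is el.attrib` is false, since each access allocates a new view object.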

https://codespeak.net/viewvc/?view=rev&revision=68567

Changed in lxml:
assignee:	nobody → Stefan Behnel (scoder)
importance:	Undecided → Medium
milestone:	none → 2.3
status:	New → Fix Committed