Memory leaks updating Element.attrib dictionary

Bug #439462 reported by PatrickCD
This bug affects 2 people
Affects: lxml
Status: Fix Released
Importance: Medium
Assigned to: scoder

Bug Description

My code needs to parse, modify, and then serialise medium-sized XML files (10-700MB). I began using cElementTree but found that lxml serialises much faster. However, I'm getting what looks like a memory leak from lxml that does not occur with cElementTree. I can't easily reproduce the bug using a smaller XML snippet passed in as a string, but it is very consistent with the same input file, which does not leak memory with cElementTree.

I'm using lxml 2.2.2 and libxml2 2.6.32 on Ubuntu.

from lxml.etree import ElementTree

filename = "blah.xml"  # about 600MB
new_values = dict(section="s23423", title="New Title", weight="33")

# Running this loop from the shell grows the process memory by about 5MB.
for i in xrange(100):
    elTree = ElementTree(file=filename)
    el = elTree.getroot().getchildren()[1].getchildren()[0]
    for k, v in new_values.iteritems():
        el.set(k, v)

# This loop irreversibly increases process memory by 365MB.
for i in xrange(100):
    elTree = ElementTree(file=filename)
    el = elTree.getroot().getchildren()[1].getchildren()[0]
    el.attrib.update(new_values)

Tags: memory
scoder (scoder) wrote :

With "leak", do you mean it isn't given back to the system? How do you measure the memory usage? Note that the Python interpreter does not necessarily free memory that it has allocated immediately when it is no longer used, so the size of the interpreter process is not necessarily a good measure.

Also note that using "element.attrib" creates cross-referenced objects that need garbage collection. You may want to run "gc.collect()" within or after the last loop to see if the memory is really permanently "leaking" or just temporarily allocated.

PatrickCD (patrick-dobbs) wrote :

Hey, thanks for your quick response.

I'm measuring memory usage with pmap. I've attached the output of the following loop. I added the sleep to allow some garbage collection, but this didn't help:

for i in xrange(50):
    elTree = ElementTree(file="blah.xml")
    el = elTree.getroot().getchildren()[1].getchildren()[0]
    el.attrib.update(new_values)
    time.sleep(0.5)

In such a case the Python process will continue to consume memory until the OS crashes (e.g. while I'm off making a cup of tea).

I tried your suggestion of adding a call to gc.collect(). This works completely: memory usage stays constant, but only if it is called inside the loop. I'm hoping to use lxml throughout a server-side application, and it seems a bit dodgy to need to clean up memory explicitly.

I'm also still not clear why attrib.update() causes a problem when element.set(key,value) doesn't. The lxml docs seem to advocate using attrib directly.
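For reference, the set()-based variant that did not show the growth in this report can be wrapped in a small helper. This is only a sketch: the helper name set_attributes is hypothetical, and the standard library's xml.etree.ElementTree is used as a stand-in so the snippet runs without lxml installed. The el.set(key, value) call itself is the same one used with lxml earlier in this report.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree in this sketch


def set_attributes(el, values):
    """Hypothetical helper: set attributes one by one via Element.set(),
    without ever touching el.attrib."""
    for key, value in values.items():
        el.set(key, value)


root = ET.fromstring('<doc><item section="old"/></doc>')
item = root[0]
set_attributes(item, {"section": "s23423", "title": "New Title", "weight": "33"})
```

Whether this avoids the memory growth in a given setup still needs to be measured there; the report above only shows that it did for this input file.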

PatrickCD (patrick-dobbs) wrote :
description: updated
scoder (scoder)
security vulnerability: yes → no
visibility: private → public
scoder (scoder) wrote :

First of all, calling "el.attrib" creates an intermediate dict-like object, so calling "el.set()" is a lot more efficient. However, people tend to use el.attrib rather carelessly and most of them do not read the docs at all. To mitigate the overhead of creating the attrib object, the Element keeps a reference to it once it's created. This leads to a cyclic reference that requires GC resolution, so the Element object will (or may) not be discarded immediately when going out of scope.

However, the GC will certainly find and clean up the cyclic reference on its next run, so I don't see why this shouldn't happen on your system. You wrote that it's running in a server environment. Maybe the GC uses a special configuration there?
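The cycle described above can be illustrated in plain Python. This is not lxml's actual implementation (which is written in Cython); the classes CachingElement and AttribProxy are hypothetical stand-ins that only mimic the "element keeps its attrib proxy, proxy points back at the element" arrangement. The sketch assumes CPython, where objects outside a cycle are freed by reference counting.

```python
import gc
import weakref


class AttribProxy:
    """Hypothetical dict-like view that points back at its element."""
    def __init__(self, element):
        self._element = element


class CachingElement:
    """Hypothetical element that caches its proxy after the first access."""
    def __init__(self):
        self._attrib = None

    @property
    def attrib(self):
        if self._attrib is None:
            # element -> proxy -> element: a reference cycle
            self._attrib = AttribProxy(self)
        return self._attrib


def survives_until_collected():
    was_enabled = gc.isenabled()
    gc.disable()  # keep the automatic collector out of the demonstration
    try:
        el = CachingElement()
        el.attrib                      # creates and caches the proxy
        ref = weakref.ref(el)
        del el                         # last external reference is gone
        alive_before = ref() is not None
        gc.collect()                   # cycle collector breaks the cycle
        alive_after = ref() is not None
        return alive_before, alive_after
    finally:
        if was_enabled:
            gc.enable()


alive_before, alive_after = survives_until_collected()
```

Here alive_before is true and alive_after is false: the element outlives its last reference until a collection pass runs, which matches the behaviour seen when gc.collect() is called inside the loop.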

scoder (scoder) wrote :

I disabled the ref-cycle for lxml 2.3. It turns out that creating a new dict-like object on each .attrib access is only slightly slower than reusing one for the lifetime of an Element. Even using a weak reference is slower than creating a new object each time. The advantage of avoiding reference cycles thus clearly outweighs the tiny performance improvement of the keep-alive .attrib reference.

https://codespeak.net/viewvc/?view=rev&revision=68567
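The idea behind the change, creating a new dict-like object on each .attrib access instead of keeping one alive, can be sketched the same way. Again, this is a plain-Python illustration with hypothetical class names, not the code from the linked revision, and it assumes CPython's reference counting.

```python
import gc
import weakref


class AttribProxy:
    """Hypothetical dict-like view that points back at its element."""
    def __init__(self, element):
        self._element = element


class FreshProxyElement:
    """Hypothetical element that builds a new proxy on every access."""
    @property
    def attrib(self):
        # Nothing is stored on the element, so no cycle is formed.
        return AttribProxy(self)


def freed_without_gc():
    was_enabled = gc.isenabled()
    gc.disable()  # show that no collector pass is needed
    try:
        el = FreshProxyElement()
        el.attrib                # temporary proxy, discarded right away
        ref = weakref.ref(el)
        del el                   # refcount drops to zero immediately
        return ref() is None
    finally:
        if was_enabled:
            gc.enable()


freed_immediately = freed_without_gc()
```

Because the element never holds a reference to its proxy, it is released as soon as the last outside reference goes away, with no need for gc.collect().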

Changed in lxml:
assignee: nobody → Stefan Behnel (scoder)
importance: Undecided → Medium
milestone: none → 2.3
status: New → Fix Committed
scoder (scoder) wrote :

Fixed in lxml 2.3alpha1.

Changed in lxml:
status: Fix Committed → Fix Released