alternative interface for iterative parsing that does not build a complete tree

Bug #1688805 reported by Mantas Zimnickas on 2017-05-06
This bug affects 1 person
Affects: lxml
Status: Triaged
Importance: Wishlist
Assigned to: Unassigned

Bug Description

When using iterparse, lxml builds the whole tree in memory instead of releasing each element after it has been iterated over.

I know this is not really a memory leak, but rather a feature. But what is the point of iterative parsing if, in the end, you still have the whole tree in memory?

There is documentation [1] explaining how to work around the memory consumption, but maybe it would be better to have an option for iterparse to not build the whole tree in memory at all?
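For context, the workaround usually looks something like the sketch below: each element is cleared once it has been handled, and its already-processed siblings are deleted from the parent so the root does not keep them alive. The helper name `iter_items` and the sample XML are made up for illustration; only `iterparse`, `clear()`, `getprevious()` and `getparent()` are lxml API.

```python
import io

from lxml import etree


def iter_items(source, tag):
    """Yield each finished <tag> element, then release it.

    Hypothetical helper for illustration, not part of lxml.
    """
    for _event, elem in etree.iterparse(source, events=("end",), tag=tag):
        yield elem
        # Drop the element's own children, text and attributes.
        elem.clear()
        # Delete earlier siblings, which the parent would otherwise
        # keep referencing until parsing is finished.
        while elem.getprevious() is not None:
            del elem.getparent()[0]


xml = b"<root><item>1</item><item>2</item><item>3</item></root>"
values = [elem.text for elem in iter_items(io.BytesIO(xml), "item")]
```

The caller has to read what it needs from each element before asking for the next one, because the element is emptied as soon as the generator resumes. Having to know and repeat this pattern is exactly what this report is about.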

It could be something similar to xmltodict's streaming mode [2], where you can specify an element depth. All elements at a smaller depth are simply ignored. For example, `depth=1` would mean that the root element is completely ignored.
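To make the idea concrete, here is a rough sketch of how such a depth option might behave, written as a wrapper around the existing iterparse. The name `iter_at_depth` and its `depth` parameter are assumptions for illustration, not an existing lxml interface. It only uses the ElementTree-compatible part of the API, so it falls back to the standard library if lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library this report is about
except ImportError:
    # Only ElementTree-compatible calls are used below.
    import xml.etree.ElementTree as etree


def iter_at_depth(source, depth):
    """Yield finished elements found exactly `depth` levels below the root.

    Hypothetical helper: depth=0 is the root itself, depth=1 its
    direct children, and so on. Shallower elements are never yielded.
    """
    level = -1
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            level += 1
            continue
        if level == depth:
            yield elem
            # Free the subtree; an empty stub stays attached to the parent.
            elem.clear()
        level -= 1


xml = (
    b"<root>"
    b"<item id='a'><name>x</name></item>"
    b"<item id='b'><name>y</name></item>"
    b"</root>"
)
names = [item.findtext("name") for item in iter_at_depth(io.BytesIO(xml), depth=1)]
```

This still builds the shallow part of the tree and leaves one empty stub per yielded element, so it only approximates the requested feature. A real implementation inside the parser could avoid creating those nodes at all.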

[1] http://lxml.de/parsing.html#modifying-the-tree
[2] https://github.com/martinblech/xmltodict#streaming-mode

I'm writing this bug report because I found a lot of content on the internet where the same issue is addressed over and over again:

https://www.ibm.com/developerworks/xml/library/x-hiperfparse/
http://stackoverflow.com/a/7171543/475477
https://codereview.stackexchange.com/q/2449
http://stackoverflow.com/a/9814580/475477

scoder (scoder) on 2017-05-09
summary: - iterparse memory leak
+ alternative interface for iterative parsing that does not build a
+ complete tree
Changed in lxml:
importance: Undecided → Wishlist
scoder (scoder) on 2017-11-04
Changed in lxml:
status: New → Triaged