XPathEvalError when calling .xpath() on a huge xml file
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| lxml | Triaged | Low | Unassigned | |
Bug Description
Tested on both Ubuntu 18.04 and Windows 10.

OS: Ubuntu 18.04
```
Python           : sys.version_
lxml.etree       : (4, 4, 2, 0)
libxml used      : (2, 9, 10)
libxml compiled  : (2, 9, 10)
libxslt used     : (1, 1, 33)
libxslt compiled : (1, 1, 33)
```

OS: Windows 10
```
Python           : sys.version_
lxml.etree       : (4, 4, 1, 0)
libxml used      : (2, 9, 9)
libxml compiled  : (2, 9, 9)
libxslt used     : (1, 1, 33)
libxslt compiled : (1, 1, 32)
```
Test script:
```Python
from lxml import etree

## generate vasprun_large.xml from vasprun_small.xml
z0 = etree.parse('vasprun_small.xml')
for x in z0.xpath(...):  # XPath expression truncated in the original report
    x.getparent()  # rest of the statement truncated in the original report
with open('vasprun_large.xml', 'wb') as fid:
    fid.write(etree.tostring(z0))

## test
def test_parse(filename):
    z0 = etree.parse(filename)
    _ = z0.xpath('//array')  # always passes
    _ = z0.xpath(...)  # expression truncated; this is the call that fails
    # _ = [x for x in z0.xpath('//array') if x.get(...)]  # truncated

test_parse('vasprun_small.xml')
test_parse('vasprun_large.xml')
test_parse(...)  # third argument truncated in the original report
```
The files `vasprun_small.xml` and `vasprun_large.xml` can be obtained from the Google Drive link below. vasprun.xml files are generated by VASP, a software package widely used in the condensed-matter physics community.
https:/
1. vasprun_small.xml: generated by VASP, about 120 MB; passes the test code above
2. vasprun_large.xml: generated by VASP, about 240 MB; fails the test code above with XPathEvalError
3. vasprun_
Sorry for keeping this unanswered.
Note that XPath always creates a complete list of all matches, which can require a considerable amount of memory. Also, it's often slower than the .find*() methods, especially for large documents.
I recommend using .iterfind() or even just .iter(). They are usually faster and by far more memory friendly since they return matches incrementally.
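A minimal sketch of the difference, using a small made-up document in place of vasprun.xml (the `<array>` elements and the count of 1000 are illustrative assumptions, not taken from the actual files): `.xpath()` materializes the entire node-set as one Python list before returning, while `.iterfind()` hands back an iterator that yields matches one at a time.

```python
from lxml import etree

# Hypothetical small document standing in for vasprun.xml.
xml = b"<root>" + b"<array name='x'/>" * 1000 + b"</root>"
root = etree.fromstring(xml)

# .xpath() builds the complete result list in memory at once.
matches = root.xpath("//array")
assert len(matches) == 1000

# .iterfind() returns an iterator; matches arrive incrementally,
# so peak memory stays flat even for huge documents.
count = sum(1 for _ in root.iterfind(".//array"))
assert count == 1000
```

Note that `.iterfind()` takes an ElementPath expression (`.//array`), not a full XPath expression, which covers simple queries like the ones in the report.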
My guess is that the problem here is either the memory needed to build the complete node set result, or the size of the node set (which might be limited by the C int size, i.e. usually <2**31).
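For files in the hundreds-of-megabytes range, `iterparse()` sidesteps both problems, since neither the whole tree nor a giant node-set is ever held in memory. A hedged sketch, assuming the goal is simply to visit every `<array>` element (the `count_arrays` helper is hypothetical, not part of the report):

```python
from lxml import etree

def count_arrays(path):
    # Stream the file; lxml fires an "end" event for each <array>.
    n = 0
    for event, elem in etree.iterparse(path, tag="array"):
        n += 1
        elem.clear()                 # free the subtree just visited
        while elem.getprevious() is not None:
            del elem.getparent()[0]  # drop already-processed siblings
    return n
```

Clearing elements as they are consumed keeps peak memory roughly bounded by a single `<array>` subtree, rather than the whole document.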