XPathEvalError when calling .xpath() on a huge XML file

Bug #1860067 reported by Chao Zhang
Affects: lxml
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Tested on both Ubuntu and Windows 10.

OS: Ubuntu 18.04
Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 4, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)

OS: Windows 10
Python : sys.version_info(major=3, minor=6, micro=7, releaselevel='final', serial=0)
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 32)

Test script:

```Python
from lxml import etree

## generate vasprun_large_shrink.xml
z0 = etree.parse('vasprun_large.xml', parser=etree.XMLParser(huge_tree=True))
for x in z0.xpath("//calculation"):  # "//calculation" elements account for 99.9% of the file size
    x.getparent().remove(x)
with open('vasprun_large_shrink.xml', 'w') as fid:
    fid.write(etree.tostring(z0, pretty_print=True, xml_declaration=True).decode('utf-8'))

## test
def test_parse(filepath, huge_tree=True):
    z0 = etree.parse(filepath, parser=etree.XMLParser(huge_tree=huge_tree))
    _ = z0.xpath('//array')                 # always passes
    _ = z0.xpath('//array[@name="atoms"]')  # fails on the large file
    # _ = [x for x in z0.xpath('//array') if x.get('name') == 'atoms']  # passes

test_parse('vasprun_small.xml')         # passes
test_parse('vasprun_large.xml')         # fails with XPathEvalError
test_parse('vasprun_large_shrink.xml')  # passes
```
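As an aside, the shrink step above first builds the full 240 MB tree in memory before removing anything. A memory-friendlier variant (a sketch, not part of the original report; the function name `shrink` is mine) can use `etree.iterparse`, which streams the file and lets each `<calculation>` subtree be discarded as soon as it has been parsed:

```python
from lxml import etree

def shrink(src, dst, drop_tag='calculation'):
    """Copy src to dst with every <calculation> subtree removed."""
    # huge_tree=True lifts libxml2's hardcoded limits for very large inputs;
    # tag=drop_tag makes iterparse report only the elements we want to drop.
    context = etree.iterparse(src, events=('end',), tag=drop_tag, huge_tree=True)
    for _event, elem in context:
        parent = elem.getparent()
        if parent is not None:
            parent.remove(elem)  # free the subtree as soon as it is complete
    # After the loop, the (now pruned) document root is available on the context.
    etree.ElementTree(context.root).write(
        dst, pretty_print=True, xml_declaration=True, encoding='utf-8')
```

This writes the output with `ElementTree.write()` directly instead of `tostring(...).decode()`, which also avoids holding the whole serialized document in a Python string.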

The files `vasprun_small.xml` and `vasprun_large.xml` can be obtained via the Google Drive link below. vasprun.xml files are generated by VASP, a software package widely used in the condensed-matter physics community.

https://drive.google.com/drive/folders/1ftz5ty5uCQNe5YzESpEiXepE6GuG2DGE?usp=sharing

1. vasprun_small.xml: generated by VASP, about 120 MB; passes the test code above
2. vasprun_large.xml: generated by VASP, about 240 MB; fails the test code above with an XPathEvalError
3. vasprun_large_shrink.xml: generated by the code above by removing the "calculation" elements from vasprun_large.xml, about 37 KB; passes the test code above, which indicates that the cause of the XPathEvalError is the huge file size

description: updated
Revision history for this message
scoder (scoder) wrote :

Sorry for keeping this unanswered.

Note that XPath always creates a complete list of all matches, which can require a considerable amount of memory. Also, it's often slower than the .find*() methods, especially for large documents.

I recommend using .iterfind() or even just .iter(). They are usually faster and far more memory-friendly, since they return matches incrementally.

My guess is that the problem here is either the memory needed to build the complete node set result, or the size of the node set (which might be limited by the C int size, i.e. usually <2**31).
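A sketch of the incremental approach recommended above, applied to the failing query from the report (the function names are mine, for illustration):

```python
from lxml import etree

def find_atom_arrays(filepath):
    """Yield <array name="atoms"> elements one at a time via .iterfind()."""
    z0 = etree.parse(filepath, parser=etree.XMLParser(huge_tree=True))
    # .iterfind() supports the limited ElementPath syntax, which includes
    # [@attr='value'] predicates, so the query can be kept almost as-is:
    for elem in z0.iterfind('.//array[@name="atoms"]'):
        yield elem

def find_atom_arrays_iter(filepath):
    """Same result via .iter(), which filters by tag only."""
    z0 = etree.parse(filepath, parser=etree.XMLParser(huge_tree=True))
    for elem in z0.getroot().iter('array'):
        if elem.get('name') == 'atoms':  # attribute check done in Python
            yield elem
```

Neither variant materializes a complete node-set list, so they sidestep both the memory cost and any node-set size limit of the XPath evaluation.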

Changed in lxml:
importance: Undecided → Low
status: New → Triaged