XPathEvalError when calling .xpath() on a huge XML file

Bug #1860067 reported by Chao Zhang
Affects: lxml
Status: Triaged
Importance: Low
Assigned to: Unassigned

Bug Description

Tested on both Ubuntu and Windows 10.

OS: Ubuntu 18.04
Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 4, 2, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)

OS: Windows 10
Python : sys.version_info(major=3, minor=6, micro=7, releaselevel='final', serial=0)
lxml.etree : (4, 4, 1, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 32)

Test script:

```Python
from lxml import etree

## generate vasprun_large_shrink.xml
z0 = etree.parse('vasprun_large.xml', parser=etree.XMLParser(huge_tree=True))
for x in z0.xpath("//calculation"):  # "//calculation" elements account for 99.9% of the file size
    x.getparent().remove(x)
with open('vasprun_large_shrink.xml', 'w') as fid:
    fid.write(etree.tostring(z0, pretty_print=True, xml_declaration=True).decode('utf-8'))

## test
def test_parse(filepath, huge_tree=True):
    z0 = etree.parse(filepath, parser=etree.XMLParser(huge_tree=huge_tree))
    _ = z0.xpath('//array')                 # always passes
    _ = z0.xpath('//array[@name="atoms"]')  # fails on the large file
    # _ = [x for x in z0.xpath('//array') if x.get('name') == 'atoms']  # passes

test_parse('vasprun_small.xml')         # passes
test_parse('vasprun_large.xml')         # fails with XPathEvalError
test_parse('vasprun_large_shrink.xml')  # passes
```
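As an aside, the shrink step above first builds the full 240 MB tree in memory before removing anything. A memory-friendlier variant (a sketch, not part of the original report; the function name `shrink` is mine) can use `etree.iterparse`, which streams the file and lets each `<calculation>` subtree be discarded as soon as it has been parsed:

```python
from lxml import etree

def shrink(src, dst, drop_tag='calculation'):
    """Copy src to dst with every <calculation> subtree removed."""
    # huge_tree=True lifts libxml2's hardcoded limits for very large inputs;
    # tag=drop_tag makes iterparse report only the elements we want to drop.
    context = etree.iterparse(src, events=('end',), tag=drop_tag, huge_tree=True)
    for _event, elem in context:
        parent = elem.getparent()
        if parent is not None:
            parent.remove(elem)  # free the subtree as soon as it is complete
    # After the loop, the (now pruned) document root is available on the context.
    etree.ElementTree(context.root).write(
        dst, pretty_print=True, xml_declaration=True, encoding='utf-8')
```

This writes the output with `ElementTree.write()` directly instead of `tostring(...).decode()`, which also avoids holding the whole serialized document in a Python string.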

The files `vasprun_small.xml` and `vasprun_large.xml` can be obtained via the Google Drive link below. vasprun.xml files are generated by VASP, a software package widely used in the condensed-matter physics community.

https://drive.google.com/drive/folders/1ftz5ty5uCQNe5YzESpEiXepE6GuG2DGE?usp=sharing

1. vasprun_small.xml: generated by VASP, about 120 MB; passes the test code above
2. vasprun_large.xml: generated by VASP, about 240 MB; fails the test code above with an XPathEvalError
3. vasprun_large_shrink.xml: generated by the code above by removing the "calculation" elements from vasprun_large.xml, about 37 KB; passes the test code above, which indicates that the cause of the XPathEvalError is the huge file size

description: updated
Revision history for this message
scoder (scoder) wrote :

Sorry for keeping this unanswered.

Note that XPath always creates a complete list of all matches, which can require a considerable amount of memory. Also, it's often slower than the .find*() methods, especially for large documents.

I recommend using .iterfind() or even just .iter(). They are usually faster and far more memory-friendly, since they return matches incrementally.

My guess is that the problem here is either the memory needed to build the complete node set result, or the size of the node set (which might be limited by the C int size, i.e. usually <2**31).
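A sketch of the incremental approach recommended above, applied to the failing query from the report (the function names are mine, for illustration):

```python
from lxml import etree

def find_atom_arrays(filepath):
    """Yield <array name="atoms"> elements one at a time via .iterfind()."""
    z0 = etree.parse(filepath, parser=etree.XMLParser(huge_tree=True))
    # .iterfind() supports the limited ElementPath syntax, which includes
    # [@attr='value'] predicates, so the query can be kept almost as-is:
    for elem in z0.iterfind('.//array[@name="atoms"]'):
        yield elem

def find_atom_arrays_iter(filepath):
    """Same result via .iter(), which filters by tag only."""
    z0 = etree.parse(filepath, parser=etree.XMLParser(huge_tree=True))
    for elem in z0.getroot().iter('array'):
        if elem.get('name') == 'atoms':  # attribute check done in Python
            yield elem
```

Neither variant materializes a complete node-set list, so they sidestep both the memory cost and any node-set size limit of the XPath evaluation.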

Changed in lxml:
importance: Undecided → Low
status: New → Triaged