Fails parsing 28MB+ files

Bug #1943487 reported by Kis Gabot
This bug affects 1 person
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

Save these files locally (download them from a browser):
https://www.sec.gov/Archives/edgar/data/0001308606/000119312520047353/d838367d10k_htm.xml
https://www.sec.gov/Archives/edgar/data/0001538990/000155837020001148/stor-20191231x10k_htm.xml

Then try to parse them:

import lxml.html
from lxml import etree as ET
tree = ET.parse('d838367d10k_htm.xml')

This gives:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src\lxml\etree.pyx", line 3521, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1859, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1885, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1789, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "file:/C:/tmp/d838367d10k_htm.xml", line 135192
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 135192, column 362

and, for the second file:
  File "file:/C:/tmp/stor-20191231x10k_htm.xml", line 36534
lxml.etree.XMLSyntaxError: xmlSAX2Characters: huge text node, line 36534, column 11465481

Both files parse without problems with the standard library's built-in etree (we could fall back to that if there is no other solution).
The machine has 8 GB of RAM.

Python : sys.version_info(major=3, minor=6, micro=8, releaselevel='final', serial=0)
lxml.etree : (4, 6, 3, 0)
libxml used : (2, 9, 5)
libxml compiled : (2, 9, 5)
libxslt used : (1, 1, 30)
libxslt compiled : (1, 1, 30)
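
For the "huge text node" error above, here is a minimal sketch of one possible workaround. It assumes the failure comes from libxml2's default size limits on text nodes, which lxml lets you lift by passing huge_tree=True to XMLParser. The helper name parse_large_xml is hypothetical, and the stdlib fallback mirrors the one mentioned in the description. Since huge_tree disables a safety limit, it is probably best reserved for trusted input.

```python
import xml.etree.ElementTree as StdET

try:
    from lxml import etree as LxmlET
except ImportError:
    # lxml is not installed; only the stdlib parser is available
    LxmlET = None


def parse_large_xml(path):
    """Parse an XML file that may contain very large text nodes.

    Hypothetical helper: prefers lxml with huge_tree=True and falls
    back to the standard library parser when lxml is missing.
    """
    if LxmlET is not None:
        # huge_tree=True lifts libxml2's default size limits
        # (assumed here to be the cause of "huge text node")
        parser = LxmlET.XMLParser(huge_tree=True)
        return LxmlET.parse(path, parser)
    return StdET.parse(path)
```

With the files from the description this would be called as tree = parse_large_xml('d838367d10k_htm.xml'). Whether it resolves this particular report has not been confirmed here.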

scoder (scoder) wrote :
Changed in lxml:
status: New → Invalid