lxml xml parser won't parse large documents
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
This code should run forever:
    xml = "<root>%s</root>"
    from bs4 import BeautifulSoup
    recreated_
    contents = ""
    while recreated_
        contents += "<a/>"
        soup = BeautifulSoup(xml % contents, ["lxml", "xml"])
        recreated_
        print len(contents), "->", recreated_
Instead, it fails once the document gets to about 512-1024 bytes, depending on the machine. It's about 540 for me and 1092 for this user:
http://
On a large document, lxml just does not call any of the tree builder's callback methods. For whatever reason, the problem does not happen with HTMLParser, even though both classes inherit feed() from the same superclass.
This is a problem with XMLParser.feed() or with the way I'm using feed(). Judging from the lxml source code, any buffer size smaller than INT_MAX (usually about 2GB) is supported, and anything bigger than that is truncated. But in practice, it looks like feed() expects to be given "chunk"-sized input, where a "chunk" is about 512 bytes (the size varies between machines). If feed() gets something bigger than a "chunk", it just ignores it.
The docstring for feed() (http://
There are alternate ways of invoking the parser, such as "etree.
The simplest way to fix the bug is to make feed() turn its input into a StringIO() and pass it to the parser chunk by chunk. Hopefully 512 is a chunk size that will work for everyone.
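The chunk-by-chunk idea could be sketched roughly as follows. This is only an illustration, written in Python 3 rather than the Python 2 used elsewhere in this report. The helper name `feed_in_chunks` and its `chunk_size` parameter are hypothetical, and the standard library's `xml.etree.ElementTree.XMLParser` stands in for lxml's parser purely so the example runs without lxml installed. Whether 512 is a safe size for lxml on every machine is an assumption, not something verified here.

```python
import io
from xml.etree.ElementTree import XMLParser, TreeBuilder


def feed_in_chunks(parser, data, chunk_size=512):
    """Pass `data` to parser.feed() in pieces of at most `chunk_size`.

    `parser` is any object with a feed() method. The default of 512 is
    the value guessed at in the report, not a verified limit.
    """
    # Wrap the input in an in-memory stream of the matching type.
    if isinstance(data, bytes):
        stream = io.BytesIO(data)
    else:
        stream = io.StringIO(data)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)


# Usage with the stdlib parser as a stand-in for lxml's XMLParser.
document = "<root>" + "<a/>" * 1000 + "</root>"
parser = XMLParser(target=TreeBuilder())
feed_in_chunks(parser, document)
root = parser.close()
print(root.tag, len(root))
```

Reading from a stream keeps the helper indifferent to whether the caller passes bytes or text, which matters here because the failure seems to depend on the input type.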
I will try to duplicate this in a self-contained environment (e.g. without Beautiful Soup). lxml is a mature product, but this might be a bug.
This non-Beautiful Soup test code exposes the problem. It only occurs when parsing Unicode data longer than 512 bytes with XMLParser. HTMLParser works fine. Bizarrely, a document 4096 bytes long passes the test, even though 2048-byte and 8192-byte documents fail.
---
    from lxml.etree import XMLParser, HTMLParser

    class TestTarget:
        """This target considers the test a success if it receives a start tag."""
        def __init__(self):
            self.success = False
        def start(self, tag, attrib):
            self.success = True

    def document_of_size(size, as_unicode=False):
        """Return an XML document of the given size, either as string or Unicode."""
        size -= len("<root></root>")
        doc = "<root>%s</root>" % ("0" * size)
        if as_unicode:
            doc = unicode(doc)
        return doc

    def test_with_document_of_size(size, parser, as_unicode=False):
        """Create a document of the given size and see if feed() handles it."""
        target = TestTarget()
        parser = parser(target=target)
        try:
            parser.feed(document_of_size(size, as_unicode))
            parser.close()
        except Exception, e:
            return "Exception: %s" % e
        if target.success:
            return "Success"
        else:
            return "Failure"

    for parser in (XMLParser, HTMLParser):
        for as_unicode in (True, False):
            for power in range(5,16):
                size = 2**power
                result = test_with_document_of_size(size, parser, as_unicode)
                if as_unicode:
                    label = "u"
                else:
                    label = "s"
                print "%.5d %s %s: %s" % (size, label, parser.__name__, result)