lxml xml parser won't parse large documents

Bug #963880 reported by Leonard Richardson
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

This code should run forever:

xml = "<root>%s</root>"
from bs4 import BeautifulSoup
recreated_markup_length = 999
contents = ""
while recreated_markup_length > 40:
    contents += "<a/>"
    soup = BeautifulSoup(xml % contents, ["lxml", "xml"])
    recreated_markup_length = len(str(soup))
    print len(contents), "->", recreated_markup_length

Instead, it fails once the document reaches roughly 512-1024 bytes, depending on the machine. The threshold is about 540 bytes for me and 1092 bytes for this user:

http://groups.google.com/group/beautifulsoup/msg/f7c6d31a2f12cefe

On a large document, lxml just does not call any of the tree builder's callback methods. For whatever reason, the problem does not happen with HTMLParser, even though both classes inherit feed() from the same superclass.

This is a problem with XMLParser.feed() or with the way I'm using feed(). Judging from the lxml source code, any buffer size smaller than INT_MAX (usually about 2GB) is supported, and anything bigger than that is truncated. In practice, though, feed() seems to expect "chunk"-sized input, where a "chunk" is about 512 bytes (the size varies between machines). If feed() gets something bigger than a "chunk", it just ignores it.

The docstring for feed() (http://lxml.de/api/lxml.etree._FeedParser-class.html#feed) says "The argument should be an 8-bit string buffer containing encoded data, although Unicode is supported as long as both string types are not mixed." But passing in a string buffer gives a TypeError--it really wants a 'str' or 'unicode' object. Passing in a file-like object.

There are alternate ways of invoking the parser, such as "etree.fromstring(markup, self.parser)", but they give the ValueError "Unicode strings with encoding declaration are not supported." To use them I would have to strip out the encoding declaration after converting to Unicode.
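A rough sketch of what stripping the encoding declaration could look like, assuming the markup has already been converted to Unicode. The helper name strip_encoding_declaration and the regular expression are illustrative assumptions, not code from Beautiful Soup or lxml, and the sketch is written for Python 3:

```python
import re

# Matches an XML declaration at the very start of the document and captures
# the text before and after its encoding pseudo-attribute.
_ENCODING_DECL = re.compile(
    r"""^(\s*<\?xml\b[^>]*?)\s+encoding\s*=\s*(["'])[^"']*\2([^>]*\?>)"""
)


def strip_encoding_declaration(markup):
    """Return markup with the encoding pseudo-attribute removed.

    Hypothetical helper: the rest of the XML declaration (version,
    standalone) is kept, and markup without a declaration is returned
    unchanged.
    """
    return _ENCODING_DECL.sub(r"\1\3", markup, count=1)
```

With a helper along these lines, a call such as strip_encoding_declaration(markup) would run just before etree.fromstring(markup, self.parser). Whether that is preferable to chunked feeding is a judgment call; it adds a rewriting step to every document.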

The simplest way to fix the bug is to make feed() wrap its input in a StringIO() and pass it to the parser chunk by chunk. Hopefully 512 is a chunk size that will work for everyone.
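The chunked approach might look something like the following minimal sketch. It is not the actual Beautiful Soup workaround: feed_in_chunks, CHUNK_SIZE, and RecordingParser are hypothetical names, the code is written for Python 3, and it assumes only that the parser object exposes feed() and close().

```python
from io import StringIO

CHUNK_SIZE = 512  # assumed value, taken from the chunk size guessed at above


def feed_in_chunks(parser, markup, chunk_size=CHUNK_SIZE):
    """Pass markup to parser.feed() in pieces of at most chunk_size characters.

    Returns whatever parser.close() returns.
    """
    stream = StringIO(markup)
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # read() returns an empty string once the input is exhausted.
            break
        parser.feed(chunk)
    return parser.close()


class RecordingParser:
    """Hypothetical stand-in for a feed()/close() parser.

    It records the size of each chunk it receives so the chunking can be
    inspected without lxml installed.
    """

    def __init__(self):
        self.chunk_sizes = []

    def feed(self, data):
        self.chunk_sizes.append(len(data))

    def close(self):
        return self.chunk_sizes
```

Feeding a 1213-character document to RecordingParser this way yields chunks of 512, 512, and 189 characters. In real use the stand-in would be replaced by the lxml parser object.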

I will try to duplicate this in a self-contained environment (e.g. without Beautiful Soup). lxml is a mature product, but this might be a bug.

Leonard Richardson (leonardr) wrote :

This non-Beautiful Soup test code exposes the problem. It only occurs when parsing Unicode data longer than 512 bytes with XMLParser. HTMLParser works fine. Bizarrely, a document 4096 bytes long passes the test, even though 2048-byte and 8192-byte documents fail.

---

from lxml.etree import XMLParser, HTMLParser

class TestTarget:
    """This target considers the test a success if it receives a start tag."""

    def __init__(self):
        self.success = False

    def start(self, tag, attrib):
        self.success = True

def document_of_size(size, as_unicode=False):
    """Return an XML document of the given size, either as string or Unicode."""
    size -= len("<root></root>")
    doc = "<root>%s</root>" % ("0" * size)
    if as_unicode:
        doc = unicode(doc)
    return doc

def test_with_document_of_size(size, parser, as_unicode=False):
    """Create a document of the given size and see if feed() handles it."""
    target = TestTarget()
    parser = parser(target=target)
    try:
        parser.feed(document_of_size(size, as_unicode))
        parser.close()
    except Exception, e:
        return "Exception: %s" % e
    if target.success:
        return "Success"
    else:
        return "Failure"

for parser in (XMLParser, HTMLParser):
    for as_unicode in (True, False):
        for power in range(5,16):
            size = 2**power
            result = test_with_document_of_size(size, parser, as_unicode)
            if as_unicode:
                label = "u"
            else:
                label = "s"
            print "%.5d %s %s: %s" % (size, label, parser.__name__, result)

Leonard Richardson (leonardr) wrote :

I've filed bug 963936 in lxml for this. We'll see if it's actually a bug.

Leonard Richardson (leonardr) wrote :

The workaround is in Beautiful Soup 4.0.2.

Changed in beautifulsoup:
status: New → Fix Released