bs4.4.1 seems to build corrupt tree on this file

Bug #1505351 reported by Steve DeRose
This bug affects 2 people
Affects          Status         Importance   Assigned to   Milestone
Beautiful Soup   Fix Released   Undecided    Unassigned

Bug Description

This HTML file was a notification email from a widely used social media site. The feature of interest is that if I parse it (Python 2.7.10, bs4 4.4.1) with either the default parser or html5lib, some bs4 calls hang, apparently forever. I can prettify() the tree, and I can do find('table'). But find_all('table') and find('some_nonexistent_element') fail (perhaps anything that hits some critical corrupted location does?). In similar contexts, I've also seen x.thead fail.

I've trimmed a lot of excess material out of the file, but if I trim too much the problem disappears. Here's a small driver that shows the effect (the HTML is attached):

    #!/usr/bin/python
    #
    import sys, os, re
    import codecs

    import bs4
    from bs4 import BeautifulSoup as BS
    sys.stderr.write("\n*** bs4 is at version %s, from %s\n" % (bs4.__version__, bs4.__file__))

    verbose = 0
    def vMsg(lvl, msg):
        if (verbose>=lvl): print(msg+"\n")

    if (len(sys.argv) == 1):
        path = os.environ['PWD'] + '/../Botmail3/extracted_botmail/botmail_html/11-4565.html'
    else:
        path = sys.argv[1]

    vMsg(0, "\nStarting '%s'..." % (path))

    fh = codecs.open(path, mode='r', encoding='utf-8')
    parserName = "html5lib"
    vMsg(0, "Using parser '%s' on %s." % (parserName, path))
    try:
        tree = BS(fh, parserName)
    except IOError as e:
        vMsg(0, "Can't parse '%s': %s" % (path, e))
    fh.close()
    vMsg(0, " Loaded.")

    print(tree.prettify())

    vMsg(0, " Scanning for a table.")
    t = tree.find('table')

    vMsg(0, " Scanning for a td.")
    t = tree.find('td')

    vMsg(0, " Scanning for an xyzzy (non-existent).")
    t = tree.find('xyzzy')

    vMsg(0, " Scanning for tables.")
    tables = tree.find_all('table')
    vMsg(0, " Found %d tables." % (len(tables)))
    for i, t in enumerate(tables):
        if (i<0): continue
        print(t.prettify())
        print("*** %d ***" % (i))
        print(" %s" % (t.thead))
    vMsg(0, " Prettified %d tables." % (len(tables)))

Results (omitting the initial prettify() output, which looks pretty normal except that there are 2 tbody elements in the initial table):
    ...

     Scanning for a table.

     Scanning for a td.

     Scanning for an xyzzy (non-existent).

    ^CTraceback (most recent call last):
      File "./bs4_hang.py", line 40, in <module>
        t = tree.find('xyzzy')
      File "/Library/Python/2.7/site-packages/bs4/element.py", line 1238, in find
        l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
      File "/Library/Python/2.7/site-packages/bs4/element.py", line 1259, in find_all
        return self._find_all(name, attrs, text, limit, generator, **kwargs)
      File "/Library/Python/2.7/site-packages/bs4/element.py", line 537, in _find_all
        found = strainer.search(i)
      File "/Library/Python/2.7/site-packages/bs4/element.py", line 1652, in search
        elif isinstance(markup, Tag):
    KeyboardInterrupt

Leonard Richardson (leonardr) wrote :

Thanks for the detailed bug report. I've identified this (including whitespace) as the minimal markup that reproduces the error:

<table> <tbody><tbody><ims></tbody> </table>

Similar problems with html5lib have happened before; the unusual thing in this case is the combination of an incorrectly-nested table with the invalid <ims> tag.

Leonard Richardson (leonardr) wrote :

The problem happens when the second whitespace node is reparented into the <table> tag. object_was_parsed() looks through the new parent's .contents to find where exactly the new node ended up, but it uses .index(), which does a comparison based on the == operator. In this case there are two identical whitespace nodes, and object_was_parsed() chooses the wrong one, putting the tree in an inconsistent state.
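
The pitfall can be illustrated without Beautiful Soup at all. The sketch below uses a hypothetical Node class (a str subclass that compares by text value, as text nodes tend to do); it is not bs4's code, only a small demonstration of how list.index() returns the first element that is equal to the target rather than the target object itself.

```python
class Node(str):
    """Hypothetical stand-in for a text node; compares by value like str."""


# Two distinct objects holding identical whitespace.
first = Node(" ")
second = Node(" ")
contents = [first, Node("x"), second]

# list.index() compares with ==, so it stops at the first equal element.
by_equality = contents.index(second)

# An identity search locates the object that was actually asked for.
by_identity = next(i for i, n in enumerate(contents) if n is second)

print(by_equality, by_identity)
```

Here the two lookups disagree about where `second` lives, which is the kind of mismatch that can leave sibling links pointing at the wrong node.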

This is fixed in revision 409. I changed object_was_parsed() to start from the right side of the list (where it's more likely the reparented node will show up) and to do a comparison based on the is operator rather than ==.
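
A minimal sketch of the approach described above, assuming a plain Python list of child nodes. The helper name index_by_identity_from_right and the Node class are hypothetical and are not taken from revision 409; the sketch only shows a right-to-left scan that compares with `is`.

```python
def index_by_identity_from_right(items, target):
    """Return the index of the object `target` in `items`.

    Scans from the end of the list and compares by identity (`is`),
    so equal-but-distinct objects are never confused.
    Hypothetical helper; raises ValueError if the object is absent.
    """
    for i in range(len(items) - 1, -1, -1):
        if items[i] is target:
            return i
    raise ValueError("object is not in the list")


class Node(str):
    """Hypothetical text node that compares by value, like str."""


a, b = Node(" "), Node(" ")
children = [a, Node("x"), b]

idx_a = index_by_identity_from_right(children, a)
idx_b = index_by_identity_from_right(children, b)
print(idx_a, idx_b)
```

Scanning from the right is only an optimization for the common case where the reparented node was appended last; the identity comparison is what keeps the result correct.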

Changed in beautifulsoup:
status: New → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released