bs4.4.1 seems to build corrupt tree on this file
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
This HTML file was a notification mail from a widely-used social media site. The feature of interest is that if I parse it (python 2.7.10, bs4 4.4.1), with either the default parser or html5lib, some bs4 requests hang, apparently forever. I can prettify() the tree, and I can do find('table'). But find_all('table') and find('some_
I've trimmed out a lot of the excess file, but if I trim too much the problem disappears. Here's a small driver that shows the effect (the HTML is attached):
#!/
#
import sys, os, re
import codecs
import bs4
from bs4 import BeautifulSoup as BS
sys.
verbose = 0
def vMsg(lvl, msg):
    if (verbose>=lvl): print(msg+"\n")
if (len(sys.argv) == 1):
    path = os.environ['PWD'] + '/../Botmail3/
else:
    path = sys.argv[1]
vMsg(0, "\nStarting '%s'..." % (path))
fh = codecs.open(path, mode='r', encoding='utf-8')
parserName = "html5lib"
vMsg(0, "Using parser '%s' on %s." % (parserName, path))
try:
    tree = BS(fh, parserName)
except IOError as e:
    vMsg(0, "Can't parse '%s': %s" % (path, e))
fh.close()
vMsg(0, " Loaded.")
print(
vMsg(0, " Scanning for a table.")
t = tree.find('table')
vMsg(0, " Scanning for a td.")
t = tree.find('td')
vMsg(0, " Scanning for an xyzzy (non-existent).")
t = tree.find('xyzzy')
vMsg(0, " Scanning for tables.")
tables = tree.find_
vMsg(0, " Found %d tables." % (len(tables)))
for i, t in enumerate(tables):
    if (i<0): continue
    print("*** %d ***" % (i))
    print(" %s" % (t.thead))
vMsg(0, " Prettified %d tables." % (len(tables)))
Results (omitting the initial prettify() output, which looks fairly normal except that there are 2 tbody elements in the initial table):
...
Scanning for a table.
Scanning for a td.
Scanning for an xyzzy (non-existent).
^CTraceback (most recent call last):
  File "./bs4_hang.py", line 40, in <module>
    t = tree.find('xyzzy')
  File "/Library/
    l = self.find_all(name, attrs, recursive, text, 1, **kwargs)
  File "/Library/
    return self._find_
  File "/Library/
    found = strainer.search(i)
  File "/Library/
    elif isinstance(markup, Tag):
KeyboardInt
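When a call may never return, as with the searches above, it can help to wrap each one in a time limit so the driver reports which call stalled instead of needing Ctrl-C. The following is only a sketch using the standard library; `call_with_timeout` is a hypothetical helper, not part of bs4, and it cannot actually stop a runaway call, only stop waiting for it.

```python
import threading


def call_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func(*args, **kwargs) in a daemon thread.

    Returns (finished, result). If the call is still running after
    timeout_s seconds, returns (False, None). The worker thread is not
    killed; being a daemon, it just does not block interpreter exit.
    If func raises, finished is True and result is None.
    """
    box = {}

    def worker():
        box["result"] = func(*args, **kwargs)

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        return False, None
    return True, box.get("result")
```

In the driver above, something like `call_with_timeout(tree.find_all, 5.0, 'table')` would then give `(False, None)` for a search that is still running after five seconds, assuming the hang behaves as described in this report.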
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Thanks for the detailed bug report. I've identified this (including whitespace) as the minimal markup that reproduces the error:
<table> <tbody> <tbody> <ims></ tbody> </table>
Similar problems with html5lib have happened before; the unusual thing in this case is the combination of an incorrectly-nested table with the invalid <ims> tag.
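One way a traversal like find_all() can run forever is if the chain of `next_element` links in the tree loops back on itself. That is an assumption about this failure, not something established in this report. Under that assumption, a check along these lines could confirm whether a given tree is affected. `Node` is a hypothetical stand-in so the sketch runs without bs4; the walker relies only on a `next_element` attribute, which bs4 elements also have.

```python
class Node:
    """Hypothetical stand-in for a parse-tree element (not a bs4 class)."""

    def __init__(self, name):
        self.name = name
        self.next_element = None


def first_repeated_element(start):
    """Follow next_element links from start.

    Returns the first element reached a second time (so the chain has a
    cycle), or None if the chain ends normally. Elements are tracked by
    identity, not equality.
    """
    seen = set()
    node = start
    while node is not None:
        if id(node) in seen:
            return node
        seen.add(id(node))
        node = getattr(node, "next_element", None)
    return None
```

Passing the root of a parsed tree as `start` should return None for a healthy tree; a non-None result would point at the element where the chain closes on itself.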