Many unclosed tags result in RuntimeError: maximum recursion depth exceeded while calling a Python object

Bug #1471755 reported by tgwizard on 2015-07-06
This bug affects 3 people
Affects: Beautiful Soup
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

When I do this:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(''.join(['<br>' for x in range(1000)]))

An exception is raised:

  .....
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1191, in decode_contents
    formatter))
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1191, in decode_contents
    formatter))
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1191, in decode_contents
    formatter))
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1185, in decode_contents
    for c in self:
RuntimeError: maximum recursion depth exceeded while calling a Python object

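This seems to be because Beautiful Soup recurses once per nesting level when it converts the tree to a string (the alternating decode / decode_contents frames in the traceback), so a document nested more deeply than the interpreter's recursion limit fails. Beautiful Soup also seems to treat `<br>` as a tag that must be closed or self-closed, apparently nesting each unclosed `<br>` inside the previous one, but HTML5 does not require `<br>` to be closed. I see the same issue with `<img>` and unclosed `<a>` tags, and I assume with other tags as well.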
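To illustrate the failure mode described above, here is a minimal sketch. The `Node` class and the `build_nested`, `render_recursive` and `fails_when_too_deep` helpers are hypothetical names invented for this example; none of this is Beautiful Soup's code. It only mimics the one-call-per-nesting-level pattern visible in the traceback. On Python 3 the error surfaces as RecursionError, which is a subclass of RuntimeError, so catching RuntimeError covers both Python 2.7 and Python 3.

```python
import sys


class Node(object):
    """Toy stand-in for a parsed tag. NOT Beautiful Soup's Tag class."""

    def __init__(self, name):
        self.name = name
        self.children = []


def build_nested(name, depth):
    """Build `depth` nodes, each one the only child of the previous one."""
    root = current = Node(name)
    for _ in range(depth - 1):
        child = Node(name)
        current.children.append(child)
        current = child
    return root


def render_recursive(node):
    """Serialize the tree by calling itself once per nesting level."""
    inner = "".join(render_recursive(child) for child in node.children)
    return "<%s>%s</%s>" % (node.name, inner, node.name)


def fails_when_too_deep(depth):
    """Return True if rendering a chain of `depth` nodes hits the recursion limit."""
    try:
        render_recursive(build_nested("a", depth))
    except RuntimeError:  # RecursionError on Python 3 is a RuntimeError subclass
        return True
    return False


if __name__ == "__main__":
    # A chain nested deeper than the recursion limit cannot be rendered this way.
    print(fails_when_too_deep(sys.getrecursionlimit() + 100))
```

Shallow trees render without trouble; only the nesting depth matters, not the total number of nodes.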

Tags: bug
tgwizard (tgwizard) wrote :

Oh, also

> pip freeze
beautifulsoup4==4.4.0
wheel==0.24.0

Daniel (rigid-launchpad) wrote :

Confirmed; this also happens with "real world" HTML.

On my system, sys.getrecursionlimit() is 1000 (Python 2.7), which is quite conservative (for a reason, I suppose).
Since Python does not optimize deep recursion the way many functional languages do, recursion should be avoided for this kind of task and an explicit stack should be used instead.
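The stack-based approach suggested here could look roughly like the sketch below. It uses a toy `Node` class and a hypothetical `render_iterative` function, both invented for illustration; it is not a patch against bs4/element.py and does not reflect how Beautiful Soup is actually structured. The point is only that an explicit list used as a stack keeps the Python call depth constant, whatever the nesting depth.

```python
class Node(object):
    """Toy stand-in for a parsed tag. NOT Beautiful Soup's Tag class."""

    def __init__(self, name, children=None):
        self.name = name
        self.children = children if children is not None else []


def render_iterative(root):
    """Serialize a tree depth-first without recursive function calls."""
    out = []
    # Each stack entry is either a Node still to be opened,
    # or a closing-tag string waiting to be emitted.
    stack = [root]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            out.append(item)
            continue
        out.append("<%s>" % item.name)
        # The closing tag is pushed first, so it is popped only after
        # every descendant has been emitted.
        stack.append("</%s>" % item.name)
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(item.children))
    return "".join(out)


def build_nested(name, depth):
    """Build `depth` nodes, each one the only child of the previous one."""
    root = current = Node(name)
    for _ in range(depth - 1):
        child = Node(name)
        current.children.append(child)
        current = child
    return root


if __name__ == "__main__":
    tree = Node("a", [Node("b"), Node("c")])
    print(render_iterative(tree))
    # 3000 levels of nesting is well past the default limit of 1000.
    print(len(render_iterative(build_nested("a", 3000))))
```

The trade-off is that the traversal state lives in a heap-allocated list rather than on the interpreter's call stack, so the depth is bounded by available memory instead of by sys.getrecursionlimit().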

Connor Cook (cojoco) wrote :

This does not happen for me, using Python 2.7.6 and bs4 version 4.4.1.

sys.getrecursionlimit() was 1000 for me as well, but even when I joined 5000 '<br>' tags it ran with no problems. I got lots of '<br/>' tags, so it looks like they were all turned into self-closing tags.

It doesn't look like this was explicitly fixed in 4.4.1, so it's possible there's just something strange about my machine or that some other change fixed this.

Emil Stenström (em-u) wrote :

@cojoco: This still happens for me. It seems I have to print the result now to get the error, so maybe bs4 is lazy now. This reproduces the problem for me:

from bs4 import BeautifulSoup
html = ''.join(['<br>' for x in range(1000)])
print BeautifulSoup(html, "html.parser")

(I've also specified the parser to avoid differences in environments)

$ pip freeze
beautifulsoup4==4.4.1
six==1.10.0
wheel==0.24.0

$ python --version
Python 2.7.10

Running Mac OSX 10.11.5 (15F34).
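As a possible stopgap while the serializer still recurses, the interpreter's limit can be raised temporarily around the call that converts the tree to a string. Nobody in this thread proposed this; it is an assumption on my part that it helps here, and `recursion_limit` is a hypothetical helper name. Only `sys.getrecursionlimit()` and `sys.setrecursionlimit()` are real standard-library calls. Setting the limit too high can crash the interpreter with a C stack overflow instead of raising a Python exception, so treat this as a workaround, not a fix.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, restoring it afterwards.

    Hypothetical helper for illustration. It never lowers the limit
    below the value that was already in effect.
    """
    previous = sys.getrecursionlimit()
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the original limit even if the body raised.
        sys.setrecursionlimit(previous)
```

With the reproduction above, usage would be along the lines of `with recursion_limit(10000): text = str(BeautifulSoup(html, "html.parser"))`, where 10000 is an arbitrary value that would need tuning to the document's nesting depth.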

Changed in beautifulsoup:
status: New → Confirmed
tags: added: bug
Changed in beautifulsoup:
importance: Undecided → Medium