Converting a tag to a string can exceed maximum recursion depth

Bug #1471755 reported by tgwizard
This bug affects 4 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Medium
Assigned to: Unassigned

Bug Description

When I do this:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup(''.join(['<br>' for x in range(1000)]))

An exception is raised:

  .....
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1191, in decode_contents
    formatter))
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1191, in decode_contents
    formatter))
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1191, in decode_contents
    formatter))
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "/Users/adam/code/tictail/asdfasdf/lib/python2.7/site-packages/bs4/element.py", line 1185, in decode_contents
    for c in self:
RuntimeError: maximum recursion depth exceeded while calling a Python object

This seems to be because BeautifulSoup uses recursion to walk child elements. Also, BeautifulSoup seems to treat `<br>` as a tag that should be closed or self-closed, which is not required in HTML5, so each unclosed `<br>` ends up nested inside the previous one and the tree becomes very deep. The same issue occurs with `<img>` and unclosed `<a>` tags, and I assume with other tags as well.
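A minimal sketch of the nesting problem, independent of how any particular parser handles `<br>` (assuming only that bs4 is installed; `<div>` is used here because it nests unambiguously): building and walking a very deep tree works fine, but converting it to a string runs into the recursion limit on affected versions.

from bs4 import BeautifulSoup

# Build a document nested roughly 1000 levels deep.
soup = BeautifulSoup('<div>' * 1000, 'html.parser')

# Walking the tree iteratively is fine...
depth = max(sum(1 for _ in el.parents) for el in soup.descendants)
print(depth)  # ~1000, right around the default sys.getrecursionlimit()

# ...but rendering it recurses once per nesting level on versions before the fix:
try:
    str(soup)
except RecursionError:  # RuntimeError on Python 2
    print("maximum recursion depth exceeded")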

Tags: bug
Revision history for this message
tgwizard (tgwizard) wrote :

Oh, also

> pip freeze
beautifulsoup4==4.4.0
wheel==0.24.0

Revision history for this message
Daniel (rigid-launchpad) wrote :

Confirmed, this also happens with "real world" HTML.

On my system, sys.getrecursionlimit() is 1000 (Python 2.7), which is quite conservative (for a reason, I suppose).
Since Python does not optimize recursive calls away (there is no tail-call elimination), recursion should be avoided for this kind of task and an explicit stack should be used.
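For illustration, a minimal generic sketch of such a rewrite (a hypothetical Node class, not bs4's internals): the recursive walk uses one interpreter stack frame per nesting level, while the explicit-stack version keeps its bookkeeping on the heap.

class Node:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

def depth_recursive(node):
    # One interpreter stack frame per level; fails past sys.getrecursionlimit().
    return 1 + max((depth_recursive(c) for c in node.children), default=0)

def depth_iterative(node):
    # Explicit stack on the heap; depth is limited only by available memory.
    best, stack = 0, [(node, 1)]
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        stack.extend((child, depth + 1) for child in node.children)
    return best

# A chain 10,000 nodes deep: depth_recursive(root) raises RecursionError,
# while depth_iterative(root) returns 10000.
root = Node('root')
tip = root
for i in range(9999):
    tip.children.append(Node(str(i)))
    tip = tip.children[0]
print(depth_iterative(root))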

Revision history for this message
Connor Cook (cojoco) wrote :

This does not happen for me, using Python 2.7.6 and bs4 version 4.4.1.

sys.getrecursionlimit() was 1000 for me as well, but even when I joined 5000 '<br>' tags it ran with no problems. I got lots of '<br/>' tags, so it looks like they were all turned into self-closing tags.

It doesn't look like this was explicitly fixed in 4.4.1, so it's possible there's just something strange about my machine or that some other change fixed this.

Revision history for this message
Emil Stenström (em-u) wrote :

@cojoco: This still happens for me. It seems I have to print the result now to get the error, so maybe bs4 is lazy now. This reproduces the problem for me:

from bs4 import BeautifulSoup
html = ''.join(['<br>' for x in range(1000)])
print BeautifulSoup(html, "html.parser")

(I've also specified the parser to avoid differences in environments)

$ pip freeze
beautifulsoup4==4.4.1
six==1.10.0
wheel==0.24.0

$ python --version
Python 2.7.10

Running Mac OSX 10.11.5 (15F34).

Changed in beautifulsoup:
status: New → Confirmed
tags: added: bug
Changed in beautifulsoup:
importance: Undecided → Medium
Revision history for this message
John Sinteur (sinteur) wrote :

Same problem in the code I use; I narrowed it down to this prettify() call on the HTML fetched from porkbun.com:

from bs4 import BeautifulSoup

def strip_irrelevant_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    for tag in soup.select(','.join([
            f'{t}[nonce]' for t in ('script', 'style')])):
        del tag['nonce']
    return soup.prettify()

Revision history for this message
John Sinteur (sinteur) wrote :

(and that's running on 4.9.3)

summary: - Many unclosed tags result in RuntimeError: maximum recursion depth
- exceeded while calling a Python object
+ Converting a tag to a string can exceed maximum recursion depth
Changed in beautifulsoup:
status: Confirmed → In Progress
Revision history for this message
Leonard Richardson (leonardr) wrote :

After taking several stabs at addressing this problem over the years, I have a solution I think will work. The problem is confined to PageElement.decode_contents(): the issue happens when turning a very deeply nested data structure into a string (not when parsing or searching), and the recursive call is specifically in the formatting code.

I'm working on a solution based on a nonrecursive generator that yields SAX-like events that can be used to create the output string. (Just using .descendants isn't enough because .descendants doesn't announce when tags are closed, only when they're opened.)
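A rough, hypothetical sketch of that idea (illustrative only, not Beautiful Soup's actual implementation; iter_events is an invented name): walk the tree with an explicit stack and emit both open and close events, so the output string can be assembled without recursive calls.

from bs4 import BeautifulSoup, Tag

def iter_events(root):
    # Explicit stack instead of recursion; a 'close' event is scheduled for
    # every Tag, since .descendants alone never says when a tag ends.
    stack = [('open', root)]
    while stack:
        event, node = stack.pop()
        yield event, node
        if event == 'open' and isinstance(node, Tag):
            stack.append(('close', node))
            # Children are pushed in reverse so they pop in document order.
            for child in reversed(node.contents):
                stack.append(('open', child))

soup = BeautifulSoup('<div><p>hi</p></div>', 'html.parser')
for event, node in iter_events(soup):
    print(event, node.name if isinstance(node, Tag) else repr(node))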

In terms of the existing unit tests, the only difference is that pretty-printed strings will now end with a newline, which I think is an improvement over the old code. I've run into one mysterious bug around the code for deactivating pretty-printing for tags like <pre> and <textarea>. So hopefully I can get this ready for testing soon.

Revision history for this message
Leonard Richardson (leonardr) wrote :

While setting up a test suite for the new code (to verify that the new algorithm gives the same output on a wide variety of real-world data) I discovered that the nonrecursive string encoding algorithm is incredibly slow. So this needs some performance work before being released.

Revision history for this message
Leonard Richardson (leonardr) wrote :

I merged this branch as revision 8944fe70574914cabfc9e6fb6eb048d71be39fb1.

I tested this on a sample of about 1200 web pages. I parsed each document and wrote to disk the results of BeautifulSoup.encode() and BeautifulSoup.prettify(). I did this for both the current release of Beautiful Soup and my new code. I then automatically compared each "old" document to the corresponding "new" document.

The only difference is that the current release of Beautiful Soup doesn't generally end a prettified string with a newline, and my new implementation consistently does this. I think this is an improvement and that _not_ ending a prettified string with a newline represents an oversight that no one ever pointed out.
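For reference, that difference is easy to check (assuming Beautiful Soup 4.12.1 or later is installed):

from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>hi</p>', 'html.parser')
# With the new encoder the prettified string ends with a newline;
# earlier releases generally omitted it.
print(repr(soup.prettify()))
print(soup.prettify().endswith('\n'))  # True on 4.12.1+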

Changed in beautifulsoup:
status: In Progress → Fix Committed
Revision history for this message
Isaac Muse (facelessuser) wrote :

How did performance turn out in the end?

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit):

I fixed the big performance problem. The new algorithm is very slightly slower than recursive function calls, but not enough to worry me. I can also use the same underlying method to fix the stack overflow problem for deepcopying one of these documents, and possibly pickling as well. So it's a net win.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Fix released in 4.12.1

Changed in beautifulsoup:
status: Fix Committed → Fix Released