str(some_tag) - infinite recursion

Bug #1967610 reported by vishvAs vAsuki
This bug affects 1 person
Affects: Beautiful Soup
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

Problem code (use section_html given in the attachment):

from bs4 import BeautifulSoup

soup = BeautifulSoup(section_html, features="html")
message_tag = soup.select_one("div[role='region']")
str(message_tag)  # raises RecursionError

I was trying to parse https://groups.google.com/g/bvparishat/c/vkOvpkrL97o

vishvAs vAsuki (vvasuki) wrote :

Trace:

  File "/home/vvasuki/sanskrit-coders/doc_curation/doc_curation/mail_stream/google_groups.py", line 75, in get_thread_messages_selenium
    content = md.get_md_with_pandoc(content_in=str(message_tag), source_format="html")
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1576, in __unicode__
    return self.decode()
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1683, in decode
    contents = self.decode_contents(
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1777, in decode_contents
    s.append(c.decode(indent_level, eventual_encoding,
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1683, in decode
    contents = self.decode_contents(
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1777, in decode_contents
    s.append(c.decode(indent_level, eventual_encoding,
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1683, in decode
    contents = self.decode_contents(
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1777, in decode_contents
    s.append(c.decode(indent_level, eventual_encoding,
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1683, in decode
    contents = self.decode_contents(
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1777, in decode_contents

...
  File "/usr/lib/python3.10/site-packages/bs4/element.py", line 1640, in decode
    attributes = formatter.attributes(self)
  File "/usr/lib/python3.10/site-packages/bs4/formatter.py", line 123, in attributes
    return sorted(
RecursionError: maximum recursion depth exceeded

Isaac Muse (facelessuser) wrote :

It appears to specifically be an issue with the printing of the tag or the HTML object in general. IIRC, BeautifulSoup uses recursion to print an element and all of its children. The document you linked has an extreme amount of nesting, requiring a lot of recursions. You can work around this by increasing the recursion limit:

```
import sys
sys.setrecursionlimit(10**6)
```

I've set it arbitrarily high in the above example, but you can set it to something more suitable.
This is probably the best approach unless BeautifulSoup were to rewrite its printing to greatly reduce recursion.
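
For readers applying this workaround, here is a minimal sketch of raising the limit only around the call that needs it, so the rest of the program keeps the default. The helper name `recursion_limit` is hypothetical and not part of Beautiful Soup or the standard library. Note that a very high limit can crash the interpreter if the underlying C stack overflows, so a smaller value is generally safer than `10**6`.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit (hypothetical helper)."""
    previous = sys.getrecursionlimit()
    # Never lower the limit below what is already configured.
    sys.setrecursionlimit(max(previous, limit))
    try:
        yield
    finally:
        # Restore the earlier limit once the deep call has returned.
        sys.setrecursionlimit(previous)
```

With the report's code, usage would look like `with recursion_limit(10_000): html = str(message_tag)`; the value 10_000 is only an illustrative guess and may need tuning for a given document.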

Leonard Richardson (leonardr) wrote :

This seems of a piece with https://bugs.launchpad.net/beautifulsoup/+bug/1709837 and https://bugs.launchpad.net/beautifulsoup/+bug/1471755; they're probably all the same issue but I'm not 100% sure.

Isaac is right that the only way to fix it is to eliminate recursive function calls from the tree traversal methods. I'd do this by keeping track of the traversal using a Python data structure instead of Python's call stack. At that point the limit would be system memory rather than the size of the stack.

This is a lot of work and a likely source of subtle bugs, so I haven't prioritized it given the existence of a pretty simple workaround.
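
To illustrate the idea of tracking the traversal in a Python data structure rather than the call stack, here is a rough sketch on a toy tree. The `Node` class and `render` function are hypothetical stand-ins, not Beautiful Soup's classes or its actual implementation, and the sketch ignores attributes, escaping, and indentation.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a parsed tag; not a bs4 class."""
    name: str
    children: list = field(default_factory=list)  # Node or str items


def render(root):
    """Serialize a tree without recursion, using an explicit stack."""
    out = []
    # The stack holds Nodes still to be opened and plain strings
    # (text children or pending closing tags) to be emitted verbatim.
    stack = [root]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            out.append(item)
            continue
        out.append(f"<{item.name}>")
        # The closing tag sits below the children, so it is emitted
        # only after every descendant has been written.
        stack.append(f"</{item.name}>")
        # Reverse so that the first child is popped first.
        stack.extend(reversed(item.children))
    return "".join(out)
```

Because depth is tracked in a list, nesting well beyond the default recursion limit does not raise `RecursionError`; memory becomes the bound instead.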
