BeautifulSoup strings don't implement __deepcopy__

Bug #682685 reported by Konrad on 2010-11-29
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Beautiful Soup
Undecided
Unassigned

Bug Description

This is a reason of some cryptic bugs (as one would assume BSoup strings act just like regular ones). It happened to me when generating Django query.

Poll.objects.all(question__startswith=string_extracted_from_bsoup)

Leonard Richardson (leonardr) wrote :

Can you give me some standalone code that reproduces the bug? It's true that NavigableString doesn't implement __deepcopy__, but neither does string or unicode, and I can successfully call deepcopy() on a NavigableString.

Hi Leonard,

I tried to find some time to reproduce this bug, but it kept falling
down in my priorities list. Feel free to resolve this ticket as
invalid.

sorry,
Konrad

PS. And obviously: kudos for creating BeautifulSoup!

On 28 February 2011 04:32, Leonard Richardson
<email address hidden> wrote:
> Can you give me some standalone code that reproduces the bug? It's true
> that NavigableString doesn't implement __deepcopy__, but neither does
> string or unicode, and I can successfully call deepcopy() on a
> NavigableString.
>
> --
> You received this bug notification because you are a direct subscriber
> of the bug.
> https://bugs.launchpad.net/bugs/682685
>
> Title:
>  BeautifulSoup strings don't implement __deepcopy__
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/beautifulsoup/+bug/682685/+subscribe
>

Changed in beautifulsoup:
status: New → Invalid
Lev Maximov (lmaximov) wrote :

I also ran into the issue.

The workaround is

Poll.objects.all(question__startswith=unicode(string_extracted_from_bsoup))

The error arises when Django tries to deepcopy NavigableString. Under some circumstances it results in endless recursion.

Attached is an html file. It is really huge, but removing even one tag ceases reproduction of the bug.

The line that ignites the issue is:

deepcopy(BeautifulSoup(open('a.html')).find('table', 'data_ordinary').find_all('td')[1].string)

RuntimeError: maximum recursion depth exceeded in cmp

Lev Maximov (lmaximov) wrote :
Changed in beautifulsoup:
status: Invalid → Confirmed
Lev Maximov (lmaximov) wrote :
Lev Maximov (lmaximov) wrote :

The problem is that deepcopying a NavigableString results in well deep copying the entire tags tree in a such a not-so-efficient manner (via the default python deepcopy method) that it results in exceeding max recursion depth.

Apparently it should be safe to make NavigableString immutable since no one would anticipate a full tree copy from deepcopying an object that the user might not even know is not a string.

Leonard Richardson (leonardr) wrote :

I still can't duplicate this problem, but if it's a memory-related problem that would explain the variability between systems.

Your patch is very tempting, but the __deepcopy__ implementation you have won't work, because a NavigableString's components are not immutable. Consider this code:

---
from bs4 import BeautifulSoup
import copy

soup = BeautifulSoup("<a>foo</a><b>bar</b>")
soup_2 = copy.deepcopy(soup)

foo = soup.find(text="foo")
foo_2 = soup_2.find(text="foo")
foo.insert_before(soup.find(text="bar"))

print soup
print soup_2

print foo.parent
print foo_2.parent
---

Superficially, soup_2 looks fine, but it's in an internally inconsistent state. That's because foo and foo_2 are the same object. They always have the same parent, and it is always an object in soup, not in soup_2.
When insert_before() changes the parent of foo, it updates the <a> tag and the <b> tag in soup, because those are foo's old and new parents. But the <a> tag and the <b> tag found in soup_2 don't get the message.

I can believe that the default deepcopy algorithm would be inefficient enough to exceed the maximum recursion depth on a densely interconnected data structure like a Beautiful Soup parse tree, but I don't understand the algorithm well enough to fix it in Beautiful Soup, and disabling the algorithm will cause subtle problems that can't be fixed. Whereas the current problem is a blatant problem that can usually be fixed--by extracting data from the parse tree before using it outside of Beautiful Soup.

I don't see a problem with your __copy__ implementation, because the string part of a NavigableString _is_ immutable. I'll be adding that to Beautiful Soup. Hopefully that will help the default deepcopy algorithm run better.

Changed in beautifulsoup:
status: Confirmed → Won't Fix
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Bug attachments