self.extract() broken for NavigableString objects, in 3.0.7 and 3.1

Bug #397997 reported by Anthony Theocharis
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Beautiful Soup
Fix Released
Undecided
Unassigned

Bug Description

The extract() method is broken when trying to remove a NavigableString object from its parent PageElement.

If the element.contents list contains two NavigableString objects with the same character string (eg, they both say u"click here"), and you try to extract() the second NavigableString object, you will instead break your tree.

This is because the extract() method removes the object from its parent's contents list by calling self.parent.contents.remove(self).

The remove() method, however, uses the NS.__eq__() method, to determine which object to remove from the list. Because NavigableString inherits __eq__() from the unicode object, this method will return true for any NS or unicode object with the same character string.

My solution is to add an __eq__() method to the NavigableString class that reads:

def __eq__(self, other):
    if isinstance(other, NavigableString):
        return other is self
    else:
        return unicode.__eq__(self, other)

Revision history for this message
Mikael Lepistö (elhigu) wrote :

Worked for me

from BeautifulSoup import *
doc = '<div>A<div>B</div>A</div>'
d = BeautifulSoup(doc)
d.first().next.next.next.next.extract()
>> u'A'
d
>> <div><div>B</div>A</div>

def eq_fix(self, other):
    if isinstance(other, NavigableString):
        return other is self
    else:
        return unicode.__eq__(self, other)

NavigableString.__eq__ = eq_fix
d = BeautifulSoup(doc)
d.first().next.next.next.next.extract()
>> u'A'
d
>> <div>A<div>B</div></div>

Revision history for this message
Anthony Theocharis (anthony-theocharis) wrote : Re: [Bug 397997] Re: self.extract() broken for NavigableString objects, in 3.0.7 and 3.1

Ah. Sorry for not being specific enough.

IIRC the bug manifests when you have a reference (call it t2) to the
second 'A' text node object, then you call d.extract(t2)

This will remove the first 'A' textnode object (call it t1) from d's
list of children, but remove d from t2's parent reference.

Leaving t1 thinking its parent is d, and t2 thinking it has no parent.
At the same time, leaving d thinking its children include t2, but not
t1.

-- Anthony

On 2010-05-20, at 7:03 AM, elhigu <email address hidden> wrote:
> Worked for me
>
> from BeautifulSoup import *
> doc = '<div>A<div>B</div>A</div>'
> d = BeautifulSoup(doc)
> d.first().next.next.next.next.extract()
>>> u'A'
> d
>>> <div><div>B</div>A</div>
>
> def eq_fix(self, other):
> if isinstance(other, NavigableString):
> return other is self
> else:
> return unicode.__eq__(self, other)
>
> NavigableString.__eq__ = eq_fix
> d = BeautifulSoup(doc)
> d.first().next.next.next.next.extract()
>>> u'A'
> d
>>> <div>A<div>B</div></div>
>
> --
> self.extract() broken for NavigableString objects, in 3.0.7 and 3.1
> https://bugs.launchpad.net/bugs/397997
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Beautiful Soup: New
>
> Bug description:
> The extract() method is broken when trying to remove a
> NavigableString object from its parent PageElement.
>
> If the element.contents list contains two NavigableString objects
> with the same character string (eg, they both say u"click here"),
> and you try to extract() the second NavigableString object, you will
> instead break your tree.
>
> This is because the extract() method removes the object from its
> parent's contents list by calling self.parent.contents.remove(self).
>
> The remove() method, however, uses the NS.__eq__() method, to
> determine which object to remove from the list. Because
> NavigableString inherits __eq__() from the unicode object, this
> method will return true for any NS or unicode object with the same
> character string.
>
> My solution is to add an __eq__() method to the NavigableString
> class that reads:
>
> def __eq__(self, other):
> if isinstance(other, NavigableString):
> return other is self
> else:
> return unicode.__eq__(self, other)
>
> To unsubscribe from this bug, go to:
> https://bugs.launchpad.net/beautifulsoup/+bug/397997/+subscribe

Revision history for this message
Leonard Richardson (leonardr) wrote :

This is fixed in Beautiful Soup 4 : we use "del self.parent.contents[self.parent.index(self)]" instead.

Not that it matters anymore, but I don't like the __eq__ solution, because Beautiful Soup considers two PageElements equal if they correspond to the same markup. Where they live in the tree doesn't matter.

Changed in beautifulsoup:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.