self.extract() broken for NavigableString objects, in 3.0.7 and 3.1
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
The extract() method is broken when trying to remove a NavigableString object from its parent PageElement.
If the element.contents list contains two NavigableString objects with the same character string (e.g., they both say u"click here") and you try to extract() the second NavigableString object, the first one is removed from the list instead, which breaks your tree.
This is because the extract() method removes the object from its parent's contents list by calling that list's remove() method.
The remove() method, however, uses NavigableString.__eq__() to determine which object to remove from the list. Because NavigableString inherits __eq__() from the unicode object, this method returns true for any NavigableString or unicode object with the same character string, so remove() deletes the first equal item rather than the object it was given.
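The mechanism described above can be reproduced without Beautiful Soup. The following is only a minimal sketch, assuming Python 3, where str stands in for unicode; FakeString is a hypothetical stand-in for NavigableString and is not part of the library.

```python
class FakeString(str):
    """Hypothetical stand-in for NavigableString: inherits __eq__ from str."""


first = FakeString("click here")
second = FakeString("click here")
contents = [first, "other text", second]

# list.remove() deletes the first item that compares equal to its argument,
# so asking it to remove `second` removes `first` instead.
contents.remove(second)

first_was_removed = all(item is not first for item in contents)
second_still_present = any(item is second for item in contents)
```

After the call, the list still holds the object that was meant to be removed, and the earlier equal string is gone.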
My solution is to add an __eq__() method to the NavigableString class that reads:
def __eq__(self, other):
    if isinstance(other, NavigableString):
        return other is self
    else:
        return unicode.__eq__(self, other)
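As a rough illustration of how an identity-based __eq__() changes what list.remove() does, here is a sketch in Python 3. IdentityString is a hypothetical class used in place of NavigableString, and str is used in place of unicode. One assumption specific to Python 3: a class that defines __eq__() loses its inherited __hash__, so the sketch restores it explicitly.

```python
class IdentityString(str):
    """Hypothetical stand-in for a NavigableString with the proposed __eq__."""

    def __eq__(self, other):
        # Two IdentityString objects are equal only if they are the same object.
        if isinstance(other, IdentityString):
            return other is self
        # Comparisons against plain strings keep the normal string behaviour.
        return str.__eq__(self, other)

    # Defining __eq__ sets __hash__ to None in Python 3; keep str hashing.
    __hash__ = str.__hash__


first = IdentityString("click here")
second = IdentityString("click here")
contents = [first, second]

# With identity-based equality, remove() skips `first` and deletes `second`.
contents.remove(second)
```

The trade-off of this design is that two such objects with identical text no longer compare equal to each other, although each still compares equal to a plain string with the same text.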
Worked for me
from BeautifulSoup import *
doc = '<div>A<div>B</div>A</div>'
d = BeautifulSoup(doc)
d.first().next.next.next.next.extract()
>> u'A'
d
>> <div><div>B</div>A</div>
def eq_fix(self, other):
    if isinstance(other, NavigableString):
        return other is self
    else:
        return unicode.__eq__(self, other)
NavigableString.__eq__ = eq_fix
d = BeautifulSoup(doc)
d.first().next.next.next.next.extract()
>> u'A'
d
>> <div>A<div>B</div></div>