Passing a Tag into Tag.extend() affects only half of the original tag's children.
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Beautiful Soup | Fix Released | Undecided | Unassigned | |
Bug Description
Python 3.8
Beautiful Soup 4.9.1
Code to reproduce:
```
>>> from bs4 import BeautifulSoup, Tag
>>> soup = BeautifulSoup(
>>> soup
<html><body><p id="0"></p><p id="1"></p><p id="2"></p><p id="3"></p><p id="4"></p><p id="5"></p><p id="6"></p><p id="7"></p><p id="8"></p><p id="9">
>>> fakebody = Tag(name='body')
>>> fakebody.
>>> fakebody
<body><p id="0"></p><p id="2"></p><p id="4"></p><p id="6"></p><p id="8"></p><p id="1"></p><p id="5"></p><p id="9"></p><p id="0"></p><p id="2"></p><p id="4"></p><p id="6"></p><p id="8"></p></body>
>>> soup.body
<body><p id="1"></p><p id="3"></p><p id="5"></p><p id="7"></p><p id="9"></p></body>
```
The documentation says (https:/
Starting in Beautiful Soup 4.7.0, Tag also supports a method called .extend(), which works just like calling .extend() on a Python list:
But a Python list doesn't work this way:
```
>>> a = [1, 2, 3, 4]
>>> b = []
>>> b.extend(a)
>>> b
[1, 2, 3, 4]
>>> a
[1, 2, 3, 4]
```
Maybe copying the elements should be non-destructive?
Ideas for a fix in bs4/element.py:
    # Use of this source code is governed by the MIT license.
    __license__ = "MIT"
    from copy import deepcopy
    ...
    def extend(self, tags):
        """Appends the given PageElements to this one's contents.
        :param tags: A list of PageElements.
        """
        for tag in tags:
            ...
Or maybe Tag.insert should be patched instead?
Changed in beautifulsoup: status: Fix Committed → Fix Released
This is fixed in revision 587.
Tag.extend() iterates over the given list and calls Tag.append() on each element of the list. When the 'list' it was given is another Tag (let's call it t1), this means iterating over the PageElement objects found in the list "t1.contents".
The problem you found stems from the fact that if Tag.append() is given a PageElement already associated with a Beautiful Soup parse tree, it will uproot that PageElement and move it to a new location. This changes t1.contents, causing the iterator to skip the next item. That's why, in your example, only half of the <p> tags are re-homed.
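The skipping described above can be reproduced without Beautiful Soup at all. The sketch below uses a hypothetical `Node` class (not part of bs4, and not its actual implementation) whose only assumption is that a child can have one parent at a time, so `append()` uproots it from its old parent first:

```python
class Node:
    """Hypothetical stand-in for a tree element that can have only one parent."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.contents = []

    def append(self, child):
        # A child that already has a parent is removed from that parent's
        # contents before being attached here (a move, not a copy).
        if child.parent is not None:
            child.parent.contents.remove(child)
        child.parent = self
        self.contents.append(child)


def make_parent(count):
    """Build a parent node with `count` children named "0", "1", ..."""
    parent = Node("parent")
    for i in range(count):
        parent.append(Node(str(i)))
    return parent


# Problem pattern: iterate over the live list while append() shrinks it.
src = make_parent(6)
dst = Node("dst")
for child in src.contents:
    dst.append(child)
moved_live = [n.name for n in dst.contents]
left_live = [n.name for n in src.contents]

# Safer pattern: iterate over a snapshot of the list instead.
src2 = make_parent(6)
dst2 = Node("dst")
for child in list(src2.contents):
    dst2.append(child)
moved_snapshot = [n.name for n in dst2.contents]

print(moved_live)      # ['0', '2', '4']
print(left_live)       # ['1', '3', '5']
print(moved_snapshot)  # ['0', '1', '2', '3', '4', '5']
```

Each removal shifts the remaining items one position to the left while the iterator's index still advances, so every second child is skipped. This matches the even/odd split in the report.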
You're right to point out that this behavior differs from Python's list.extend(), which works by copying references. Tag objects can only exist in one place in a tree, so copying references doesn't work for them. I've changed the documentation to spell out what is happening rather than making a comparison to list.extend().
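For anyone who needs list-like, non-destructive behavior, or who is on a release without the fix, here is a minimal sketch of two caller-side workarounds. It assumes `copy.copy()` on a Tag yields a detached copy, which Tag supports in recent Beautiful Soup 4 releases. The markup and variable names are made up for illustration.

```python
import copy

from bs4 import BeautifulSoup

markup = '<div id="src"><p id="0"></p><p id="1"></p><p id="2"></p><p id="3"></p></div>'
soup = BeautifulSoup(markup, "html.parser")
source = soup.find(id="src")

# Non-destructive: append detached copies, leaving the source tree untouched.
copied_into = soup.new_tag("section")
for child in source.contents:
    copied_into.append(copy.copy(child))
remaining_after_copy = len(source.contents)

# Destructive but complete: snapshot the children first, then move them all.
moved_into = soup.new_tag("section")
for child in list(source.contents):
    moved_into.append(child)
remaining_after_move = len(source.contents)

copied_ids = [child["id"] for child in copied_into.contents]
moved_ids = [child["id"] for child in moved_into.contents]
```

Copying costs extra memory for large subtrees, while the snapshot approach keeps the move semantics of Tag.append() and only avoids the skipped elements.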