Beautiful Soup

Provide a public clone() method for elements

Bug #2065120 reported by Chris Papademetrious on 2024-05-08

This bug affects 1 person

Affects         Status  Importance  Assigned to  Milestone
Beautiful Soup  New     Undecided   Unassigned

Bug Description

I needed a way to create a copy of a bare Tag without its contents, and the hidden _clone() method worked perfectly!

My use case: I needed to extract an arbitrary element from an HTML document, then replicate its enclosing hierarchy all the way to the top. With _clone(), I could create a bare copy of each parent element and insert the lower element into each successively higher one.
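
For readers who want the same effect without a private method, here is a minimal sketch of the hierarchy-replication use case using only the public new_tag(), extract(), and append() calls. The helper names bare_copy and extract_with_ancestors, and the factory argument, are hypothetical and appear nowhere in this report. Unlike _clone(), this sketch does not carry over namespace, prefix, or source-position information.

```python
import bs4

def bare_copy(tag, factory):
    """Childless copy of `tag`, built with the public new_tag() API.

    `bare_copy` and `factory` are hypothetical names for this sketch.
    List-valued attributes (such as class) are copied so that the two
    tags do not share the same list object.
    """
    attrs = {
        key: list(value) if isinstance(value, list) else value
        for key, value in tag.attrs.items()
    }
    return factory.new_tag(tag.name, attrs=attrs)

def extract_with_ancestors(tag):
    """Extract `tag` and wrap it in bare copies of its former ancestors."""
    factory = bs4.BeautifulSoup("", "html.parser")
    # Collect ancestors before extracting; skip the BeautifulSoup object itself.
    ancestors = [p for p in tag.parents if not isinstance(p, bs4.BeautifulSoup)]
    node = tag.extract()
    for ancestor in ancestors:  # innermost ancestor first
        wrapper = bare_copy(ancestor, factory)
        wrapper.append(node)
        node = wrapper
    return node
```

Calling extract_with_ancestors(soup.find("b")) on a parsed document removes the b element from the original tree and returns a new top-level element containing one bare copy of each former ancestor, nested in the original order.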

I'm doing various other types of slicing and dicing of HTML content, and _clone() has been useful for that too.

Would you consider creating a public version of the _clone() method? It could be named clone() or something else you prefer. I'd be happy to take a shot at a merge request (including documentation) if you want.

Chris Papademetrious (chrispitude) wrote on 2024-05-10:

I ran into an issue using _clone() in my own code. For whatever bizarre reason, code running inside a "pytest" test does not see the _clone() method.

For example, consider the following "test_clone.py" file:

====
#!/usr/bin/env python
import bs4

# my own copy of _clone()
def _myclone(self):
    clone = type(self)(
        None, None, self.name, self.namespace,
        self.prefix, self.attrs, is_xml=self._is_xml,
        sourceline=self.sourceline, sourcepos=self.sourcepos,
        can_be_empty_element=self.can_be_empty_element,
        cdata_list_attributes=self.cdata_list_attributes,
        preserve_whitespace_tags=self.preserve_whitespace_tags,
        interesting_string_types=self.interesting_string_types
    )
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(clone, attr, getattr(self, attr))
    return clone

def test_foo():
    body = bs4.BeautifulSoup('<body foo="bar"/>', 'lxml').find("body").extract()
    print(f'1: {body}')
    print(f'2: {_myclone(body)}')
    print(f"3: {type(body._clone)}")
    print(f'4: {body._clone()}')

test_foo()
====

If I run this script manually, it works as expected:

====
$ test_clone.py
1: <body foo="bar"></body>
2: <body foo="bar"></body>
3: <class 'method'>
4: <body foo="bar"></body>
====

But if I run it via pytest, the _clone() method is undefined:

====
============== ERRORS ==============
__ ERROR collecting test_clone.py __
test_clone.py:26: in <module>
test_foo()
test_clone.py:24: in test_foo
print(f'4: {body._clone()}')
E TypeError: 'NoneType' object is not callable
--------- Captured stdout ----------
1: <body foo="bar"></body>
2: <body foo="bar"></body>
3: <class 'NoneType'>
===== short test summary info ======
ERROR test_clone.py - TypeError: 'NoneType' object is not callable
====

It took me a while to figure this out, and I still don't understand why it happens.
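
One possible explanation, not confirmed anywhere in this thread: pytest may be running under a different interpreter or environment whose bs4 does not define _clone(). Tag treats an unknown attribute name (other than double-underscore names) as a child-tag lookup, so a missing method shows up as None rather than raising AttributeError, which would match the 'NoneType' output above. The sketch below only reports what the current interpreter imports; describe_bs4 is a hypothetical helper name.

```python
import sys
import bs4

def describe_bs4():
    """Report which Beautiful Soup this interpreter actually imports."""
    return {
        "interpreter": sys.executable,
        "bs4_version": bs4.__version__,
        "bs4_file": bs4.__file__,
        # True only if this bs4 defines the private method on Tag itself.
        "tag_defines_clone": "_clone" in vars(bs4.Tag),
    }

print(describe_bs4())

# An undefined single-underscore name falls through to a child-tag lookup.
p = bs4.BeautifulSoup("<p></p>", "html.parser").p
print(p._no_such_method)  # None, not AttributeError
```

Running this once directly and once from inside a pytest test would show whether the two runs pick up the same bs4 installation.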

Leonard Richardson (leonardr) wrote on 2024-05-27:

I'm not sure why pytest is behaving the way it is but I wouldn't be surprised if it makes "private" Python methods in bs4 unavailable to tests of another package.

After thinking about this for a while I'm OK with exposing this method publicly, but "clone" is a bad name for it as a public method. Most people will think of "clone" as doing something closer to what copy.deepcopy does. And other terms like "copy" and "shallow copy" are taken by Python already. I'd be interested in any naming ideas you have.

Chris Papademetrious (chrispitude) wrote on 2024-05-28:

Speaking of Python's native shallow/deep copy methods, is a copy.copy() of a BeautifulSoup tag considered "shallow" or "deep"?

It seems to always be deep, but I just did a quick test and I think I found a bug.

In this code, I save a copy.copy() of soup1 as soup2, then modify soup1 in various ways:

====
import bs4
import copy

html_doc = '<p class="foo" style="orig">text</p>'
soup1 = bs4.BeautifulSoup(html_doc, "lxml")
soup2 = copy.copy(soup1)

p = soup1.find("p")
p.attrs["data-added"] = "TRUE" # soup1
p.attrs["style"] = "NEW" # soup1
p.append("-HELLO") # soup1
p.attrs["class"].append("BAR") # soup1 and soup2 <--

print(f"soup1: {soup1}")
print(f"soup2: {soup2}")
====

Notice that the addition of "BAR" to soup1.p's class attribute also affects soup2. Perhaps the lists will need to be deep-copied with [:]. I tried copy.deepcopy() and got the same behavior. Do you want me to file a separate bug for this?

Chris Papademetrious (chrispitude) wrote on 2024-05-28:

The multi-value cloning bug can be reproduced more simply with this:

====
import bs4
import copy

html_doc = '<p class="foo"/>'
soup = bs4.BeautifulSoup(html_doc, "lxml")
p1 = soup.find("p")
p2 = p1._clone()
p1.attrs["class"].append("BAR") # <-- also affects p2

print(f"p1: {p1}")
print(f"p2: {p2}")
====
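
Until the shared-list behavior is addressed in the library, one workaround is to give the copy its own list for every multi-valued attribute. This is a minimal sketch; detach_attribute_lists is a hypothetical helper name, and it uses copy.copy() and the html.parser backend so that it does not depend on _clone() or lxml being available.

```python
import copy
import bs4

def detach_attribute_lists(tag):
    """Give `tag` its own copy of every list-valued attribute (hypothetical helper)."""
    for key, value in tag.attrs.items():
        if isinstance(value, list):
            # Rebinding an existing key is safe while iterating over items().
            tag.attrs[key] = list(value)
    return tag

soup = bs4.BeautifulSoup('<p class="foo"></p>', "html.parser")
p1 = soup.find("p")
p2 = detach_attribute_lists(copy.copy(p1))
p1["class"].append("BAR")  # p2 keeps its own list, so it is unaffected

print(f"p1: {p1}")
print(f"p2: {p2}")
```

The detach step must run before the original is modified; it cannot undo a change that has already reached a shared list.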

Chris Papademetrious (chrispitude) wrote on 2024-05-28:

I decided this deserved its own issue:

====
2067412: When a Tag is copied, multi-valued attribute value lists are not deeply copied
https://bugs.launchpad.net/beautifulsoup/+bug/2067412
====

Putting this bug aside, is my understanding correct that for BeautifulSoup tag objects, copy.copy() and copy.deepcopy() both always result in a deep copy?

Leonard Richardson (leonardr) wrote on 2024-06-14 (last edit on 2024-06-14):

Yes, copy and deepcopy both create deep copies. There are no shallow copies in Beautiful Soup because a given PageElement can only be in one tree at a time; otherwise we'd get behavior so subtle as to be indistinguishable from a bug.

Maybe 'twin' would work as a method name here. It doesn't imply a detailed recreation of the entire data structure, the way 'clone' does, because it ties into the metaphor of the DOM as a family tree, and we don't expect twins to have the same children.
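
The deep-copy behavior described in this comment can be checked with a short script. This is only an illustration with made-up markup, not a test from the Beautiful Soup suite.

```python
import copy
import bs4

soup = bs4.BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")
p = soup.find("p")
dup = copy.copy(p)

print(dup)                 # same markup as p, children included
print(dup.parent is None)  # the copy is not attached to any tree
print(dup.b is p.b)        # the children are separate objects too

dup.b.string = "changed"   # editing the copy...
print(p)                   # ...leaves the original tree untouched
```

Because the copy arrives detached and with its own children, it can be inserted elsewhere without moving anything out of the original tree.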

Chris Papademetrious (chrispitude) wrote on 2024-06-15:

'twin' doesn't feel right to me - I wouldn't have an intuitive sense of it when listed as an available method in VS Code, and the word 'twin' doesn't suggest anything about the contents being copied or not.

What about having two new methods on Tag objects?

mytag.copy() - alias for copy.copy()
mytag.copy_self() - alias for _clone()

Providing a copy() method alias for copy.copy() yields a UI that is more consistent with other methods:

====
my_body = soup.find("body").extract()
my_body = soup.find("body").copy()
my_body = soup.find("body").copy_self()
====

Plus, seeing both methods listed alongside each other in the VS Code method list tooltip would encourage an understanding of both ("if one is self-only, the other must be deep").
