Provide a public clone() method for elements

Bug #2065120 reported by Chris Papademetrious
This bug affects 1 person
Affects: Beautiful Soup
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

I needed a way to create a copy of a bare Tag without its contents, and the hidden _clone() method worked perfectly!

My use case was that I need to extract an arbitrary element from an HTML document, then replicate its enclosing hierarchy all the way to the top. The _clone() method was perfect for creating the chain of parent elements and inserting the lower element into each successive higher element.
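For readers who want to try this use case without the private method, here is a minimal sketch that uses only the public new_tag() API. The helper names bare_copy() and extract_with_ancestors() are hypothetical (they are not part of Beautiful Soup), and the sketch assumes that copying the tag name and attributes is enough; it does not carry over namespaces, source positions, or the other internals that _clone() preserves.

```python
import bs4

def bare_copy(soup, tag):
    """Hypothetical helper: new empty tag with the same name and copied attributes."""
    # Copy list values (e.g. "class") so the new tag does not share them with the original.
    attrs = {k: (list(v) if isinstance(v, list) else v) for k, v in tag.attrs.items()}
    return soup.new_tag(tag.name, attrs=attrs)

def extract_with_ancestors(soup, element):
    """Hypothetical helper: extract element and rebuild its chain of bare parents."""
    # Collect the ancestors before extracting; skip the BeautifulSoup document object.
    parents = [p for p in element.parents if not isinstance(p, bs4.BeautifulSoup)]
    node = element.extract()
    for parent in parents:
        shell = bare_copy(soup, parent)
        shell.append(node)   # wrap the current node in a bare copy of the next parent up
        node = shell
    return node

html = '<html><body class="a b"><div id="x"><p>hi</p><p>bye</p></div></body></html>'
soup = bs4.BeautifulSoup(html, "html.parser")
result = extract_with_ancestors(soup, soup.find("p"))
print(result)
```

The original document keeps its own parents; only the extracted element moves into the rebuilt chain.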

I'm doing various other types of slicing and dicing of HTML content, and _clone() has been useful for that too.

Would you consider creating a public version of the _clone() method? It could be named clone() or something else you prefer. I'd be happy to take a shot at a merge request (including documentation) if you want.

Chris Papademetrious (chrispitude) wrote :

I ran into an issue using _clone() in my own code. For whatever bizarre reason, code running inside a "pytest" test does not see the _clone() method.

For example, consider the following "test_clone.py" file:

====
#!/usr/bin/env python
import bs4

# my own copy of _clone()
def _myclone(self):
    clone = type(self)(
        None, None, self.name, self.namespace,
        self.prefix, self.attrs, is_xml=self._is_xml,
        sourceline=self.sourceline, sourcepos=self.sourcepos,
        can_be_empty_element=self.can_be_empty_element,
        cdata_list_attributes=self.cdata_list_attributes,
        preserve_whitespace_tags=self.preserve_whitespace_tags,
        interesting_string_types=self.interesting_string_types
    )
    for attr in ('can_be_empty_element', 'hidden'):
        setattr(clone, attr, getattr(self, attr))
    return clone

def test_foo():
    body = bs4.BeautifulSoup('<body foo="bar"/>', 'lxml').find("body").extract()
    print(f'1: {body}')
    print(f'2: {_myclone(body)}')
    print(f"3: {type(body._clone)}")
    print(f'4: {body._clone()}')

test_foo()
====

If I run this script manually, it works as expected:

====
$ test_clone.py
1: <body foo="bar"></body>
2: <body foo="bar"></body>
3: <class 'method'>
4: <body foo="bar"></body>
====

But if I run it via pytest, the _clone() method is undefined:

====
============== ERRORS ==============
__ ERROR collecting test_clone.py __
test_clone.py:26: in <module>
    test_foo()
test_clone.py:24: in test_foo
    print(f'4: {body._clone()}')
E TypeError: 'NoneType' object is not callable
--------- Captured stdout ----------
1: <body foo="bar"></body>
2: <body foo="bar"></body>
3: <class 'NoneType'>
===== short test summary info ======
ERROR test_clone.py - TypeError: 'NoneType' object is not callable
====

It took me a while to figure this out... and I still don't understand the why behind it...
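One possible explanation, not confirmed anywhere in this thread: a Tag treats an unknown attribute name as a child-tag lookup, so a name that is not defined on the class evaluates to None instead of raising AttributeError. If pytest happened to run under a different interpreter or environment with a Beautiful Soup release that lacks _clone(), the output above is what one would expect. The sketch below shows the lookup behavior and prints which bs4 installation is in use; the attribute name _no_such_method is made up for illustration.

```python
import bs4

soup = bs4.BeautifulSoup('<body foo="bar"><p>text</p></body>', "html.parser")
body = soup.find("body")

# An attribute name that matches a child tag returns that child...
first_p = body.p

# ...and any other unknown name goes through the same lookup, finds nothing,
# and yields None. (_no_such_method is a made-up name, not defined on Tag.)
missing = body._no_such_method
print(type(missing))

# Compare these values between a direct run and a pytest run.
print(bs4.__version__, bs4.__file__)
```

If the version or path differs between the two runs, the two commands are not using the same Beautiful Soup installation.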

Leonard Richardson (leonardr) wrote :

I'm not sure why pytest is behaving the way it is but I wouldn't be surprised if it makes "private" Python methods in bs4 unavailable to tests of another package.

After thinking about this for a while I'm OK with exposing this method publicly, but "clone" is a bad name for it as a public method. Most people will think of "clone" as doing something closer to what copy.deepcopy does. And other terms like "copy" and "shallow copy" are taken by Python already. I'd be interested in any naming ideas you have.

Chris Papademetrious (chrispitude) wrote :

Speaking of Python's native shallow/deep copy methods, is a copy.copy() of a BeautifulSoup tag considered "shallow" or "deep"?

It seems to always be deep, but I just did a quick test and I think I found a bug.

In this code, I save a copy.copy() of soup1 as soup2, then modify soup1 in various ways:

====
import bs4
import copy

html_doc = '<p class="foo" style="orig">text</p>'
soup1 = bs4.BeautifulSoup(html_doc, "lxml")
soup2 = copy.copy(soup1)

p = soup1.find("p")
p.attrs["data-added"] = "TRUE" # soup1
p.attrs["style"] = "NEW" # soup1
p.append("-HELLO") # soup1
p.attrs["class"].append("BAR") # soup1 and soup2 <--

print(f"soup1: {soup1}")
print(f"soup2: {soup2}")
====

Notice that the addition of "BAR" to soup1.p's class attribute also affects soup2. Perhaps the attribute value lists need to be copied individually, for example with [:]. I tried copy.deepcopy() and got the same behavior. Do you want me to file a separate bug for this?
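The aliasing described here can be illustrated without Beautiful Soup at all. In this sketch a plain dict stands in for a tag's attrs (an assumption for illustration, not the library's actual copy code): copying the dict alone shares any list values, while copying each list with [:] keeps the two independent.

```python
# A plain dict standing in for Tag.attrs (illustration only).
attrs = {"class": ["foo"], "style": "orig"}

# Copying only the dict: both dicts point at the same "class" list.
shared = dict(attrs)

# Copying each list value as well: [:] is enough because the items are strings.
independent = {k: (v[:] if isinstance(v, list) else v) for k, v in attrs.items()}

attrs["class"].append("BAR")   # mutates the list in place
attrs["style"] = "NEW"         # rebinds the key in the original dict only

print(shared)        # "class" shows the appended value, "style" does not change
print(independent)   # neither change is visible
```

This matches the pattern in the test above: rebinding a key or adding a new one stays local to the original, while an in-place append to a shared list shows up in both.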

Chris Papademetrious (chrispitude) wrote :

The multi-value cloning bug can be reproduced more simply with this:

====
import bs4
import copy

html_doc = '<p class="foo"/>'
soup = bs4.BeautifulSoup(html_doc, "lxml")
p1 = soup.find("p")
p2 = p1._clone()
p1.attrs["class"].append("BAR") # <-- also affects p2

print(f"p1: {p1}")
print(f"p2: {p2}")
====

Chris Papademetrious (chrispitude) wrote :

I decided this deserved its own issue:

====
2067412: When a Tag is copied, multi-valued attribute value lists are not deeply copied
https://bugs.launchpad.net/beautifulsoup/+bug/2067412
====

Putting this bug aside, is my understanding correct that for BeautifulSoup tag objects, copy.copy() and copy.deepcopy() both always result in a deep copy?
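A quick way to check this question empirically is sketched below. It assumes a reasonably recent Beautiful Soup 4 release and only verifies observable behavior (same markup, distinct child objects); it says nothing about what the maintainer intends the two functions to guarantee.

```python
import copy
import bs4

soup = bs4.BeautifulSoup('<p class="foo">text <b>bold</b></p>', "html.parser")
p = soup.find("p")

via_copy = copy.copy(p)
via_deepcopy = copy.deepcopy(p)

for label, dup in (("copy.copy", via_copy), ("copy.deepcopy", via_deepcopy)):
    same_markup = str(dup) == str(p)        # the children were carried along
    distinct_child = dup.b is not p.b       # and they are new objects, not shared ones
    print(f"{label}: same markup={same_markup}, distinct child={distinct_child}")
```

If both lines report True for both checks, the two functions behave the same way for the tree structure on that installation.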
