Beautiful Soup

Provide a method to wrap some/all children of an element

Bug #2044284 reported by Chris Papademetrious on 2023-11-22

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Triaged  Wishlist    Unassigned

Bug Description

This is a wishlist item.

Beautiful Soup has a wrap() method that wraps a single element in a tag. Super!

There are various Beautiful Soup requests for wrapping all elements contained *inside* a parent element (wrapping the inside instead of the outside):

https://stackoverflow.com/questions/20789798/how-to-use-beautifulsoup-to-wrap-body-contents-with-div-container

https://stackoverflow.com/questions/22632355/wrap-the-contents-of-a-tag-with-beautifulsoup

https://stackoverflow.com/questions/26448605/how-to-wrap-multiple-tags-under-a-new-tag-in-beautifulsoup

There are even more requests to wrap sequences of elements that match a given criterion in a parent element:

https://stackoverflow.com/questions/17605801/wrap-all-next-elements-in-beautifulsoup

https://stackoverflow.com/questions/73902333/wrap-groupings-of-tags-with-python-beautifulsoup

https://stackoverflow.com/questions/73913938/how-to-wrap-a-new-tag-around-multiple-tags-with-beautifulsoup

https://stackoverflow.com/questions/32274222/wrap-multiple-tags-with-beautifulsoup

https://stackoverflow.com/questions/59033884/wrap-multiple-list-items-in-a-new-tag-ul-ol-using-beautiful-soup

https://stackoverflow.com/questions/45009059/how-to-wrap-with-adjacent-tag-with-beautiful-soup

Most of the latter requests are about rebuilding hierarchical structure from flat HTML content using heading (<h1> through <h6>) elements:

####
html_doc = """
<body>
  <h1>ABC Topic</h1>
  <p/>
  <h2>AB Subtopic</h2>
  <p/>
  <h3>AB Subsubtopic</h3>
  <p/>
  <h2>C Subtopic</h2>
  <p/>
  <h1>XYZ Topic</h1>
  <p/>
  <h2>XY Subtopic</h2>
  <p/>
  <h2>Z Subtopic</h2>
</body>
"""
####

It would be great if Beautiful Soup had some kind of clever wrap_children() method to wrap sequences of elements meeting some kind of criteria.

To wrap all contents, the child element criteria would simply be True.

For more complex cases, the criteria could be a tag list or a function -- the usual Soupy ways. With this, you could build structured HTML from flat HTML using a simple bottom-up loop:

####
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

# h6 sections start at h6, stop at not(h1-h6)
# h5 sections start at h5, stop at not(h1-h5)
# h4 sections start at h4, stop at not(h1-h4)
# ...etc...
for h in reversed(range(1, 6+1)):
    soup.body.wrap_children(***MAGIC***, 'article')

print(soup.prettify())
####

In addition to any user-specified arguments, the function would also somehow need (1) the current candidate object and (2) the current set of accumulated objects (if any), so that the proper decisions could be made. These could be passed to the function using a documented **kwargs convention ("candidate", "accumulated").
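As a rough illustration of what the bottom-up loop could do, here is a minimal sketch built only on existing Beautiful Soup calls (`new_tag`, `insert_before`, `append`). The helper name `wrap_sections`, its parameters, and the sample markup are assumptions made for this example; this is not the proposed `wrap_children()` API, and it hard-codes the heading rule instead of taking a criteria function.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"]

def wrap_sections(soup, parent, level, wrapper_name="article"):
    """Hypothetical helper: wrap each h<level> child of `parent`, together
    with the siblings that follow it, up to the next h1..h<level>."""
    start_name = HEADINGS[level - 1]
    stop_names = set(HEADINGS[:level])
    wrapper = None
    for child in list(parent.contents):  # snapshot: the tree changes as we go
        if isinstance(child, Tag) and child.name in stop_names:
            wrapper = None  # a same-or-higher heading closes the open section
            if child.name == start_name:
                wrapper = soup.new_tag(wrapper_name)
                child.insert_before(wrapper)
        if wrapper is not None:
            wrapper.append(child)  # append() moves the node out of `parent`

# Sample markup (an assumption for this sketch), written without whitespace
# so that no stray text nodes end up between the elements.
sample = ("<body><h1>A</h1><p>a</p><h2>A1</h2><p>a1</p>"
          "<h2>A2</h2><h1>B</h1><p>b</p></body>")
soup = BeautifulSoup(sample, "html.parser")

# Bottom-up: wrap the deepest heading level first, then work outward.
for level in reversed(range(1, 6 + 1)):
    wrap_sections(soup, soup.body, level)

print(soup.prettify())
```

Because the already-built `<article>` wrappers are not headings, each outer pass simply absorbs them into the section being accumulated, which is what produces the nesting.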


Chris Papademetrious (chrispitude) on 2023-11-22

description:

updated


Leonard Richardson (leonardr) wrote on 2024-01-22:

I'm trying to... wrap my head around the request here.

The basic wrap_children() idea seems simple enough. You're inserting a tag in between a parent and its children.

My main question is whether this functionality is part of jQuery, because if so I want to reuse the name. The term "wrap" itself comes from jQuery, and it looks like the jQuery equivalent of this is wrapInner:

https://api.jquery.com/wrapInner/

So I'd probably call the method wrap_inner, although wrap_children sounds more "Beautiful Soup"-ish.
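For reference, the simple "wrap everything inside" case can be approximated today with existing calls. The sketch below is only an illustration: `wrap_inner` here is a hypothetical free function, not a Beautiful Soup method, and the sample markup is made up.

```python
from bs4 import BeautifulSoup

def wrap_inner(parent, wrapper):
    """Hypothetical helper: move all children of `parent` into `wrapper`,
    then make `wrapper` the only child of `parent`."""
    for child in list(parent.contents):  # snapshot, since append() moves nodes
        wrapper.append(child)
    parent.append(wrapper)
    return wrapper

soup = BeautifulSoup("<body><p>one</p><p>two</p></body>", "html.parser")
wrap_inner(soup.body, soup.new_tag("div"))
print(soup)
```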

Anyway, where I start to lose the plot is the idea of doing this selectively. That seems like a new level of complexity being added to the core Beautiful Soup methods. I'm mainly looking at https://stackoverflow.com/questions/73902333/wrap-groupings-of-tags-with-python-beautifulsoup since that expresses the problem clearly for me.

We want to go from this:

<h1>Heading for Sec 1</h1>
<p>some text sec 1</p>
<p>some text sec 1</p>

<h1>Heading for Sec 2</h1>
<p>some text sec 2</p>
<p>some text sec 2</p>

To this:

<div>
<h1>Heading for Sec 1</h1>
<p>some text sec 1</p>
<p>some text sec 1</p>
</div>

<div>
<h1>Heading for Sec 2</h1>
<p>some text sec 2</p>
<p>some text sec 2</p>
</div>

Assuming there's a <div> or <body> that encompasses all the markup, there could be a method on that tag which does that. And the arguments to that method would be some way of telling Beautiful Soup how to group the tags together. But this wouldn't be like anything else in Beautiful Soup, because we're dividing the children of a tag into groups and then operating on each group, inside the method call.

When I think about accomplishing this task, I envision selecting some text in a text editor and then right-clicking on the selection to wrap it. In programming terms, I'd create an object that represents a contiguous selection and then call a method on that object. Applying this to Beautiful Soup, I'd want to keep any iterative logic (such as "do this to each group") outside of the method calls.

Let's hypothesize a method which works like find() but which returns the thing you were searching for, *plus* a ResultSet of everything that the iterator found up to that point. Then you could write code like this:

next_h1 = body.find('h1')
while next_h1:
    selection, next_h1 = next_h1.until_next_sibling("h1")
    selection.wrap(soup.new_tag("div"))

The sleight-of-hand here is, what does it mean to call wrap() on a ResultSet? I think it means:

* Reparent every item in the ResultSet to the new tag, effectively making the ResultSet that tag's .contents.
* Place the new tag at the same position in the tree where the _first_ item in the ResultSet was originally found.

This would work even if the ResultSet didn't represent a contiguous selection, though the most likely usages of it would be operating on a contiguous selection.
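To make the two rules above concrete, here is a minimal sketch using plain functions in place of the hypothesized methods. Both `until_next_sibling` and `wrap_all` are hypothetical names for this example; neither exists in Beautiful Soup, and a plain list stands in for the ResultSet.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def until_next_sibling(start, name):
    """Hypothetical: return (selection, next_match), where selection is
    `start` plus every following sibling before the next <name> tag."""
    selection = [start]
    for sibling in start.next_siblings:
        if isinstance(sibling, Tag) and sibling.name == name:
            return selection, sibling
        selection.append(sibling)
    return selection, None

def wrap_all(elements, wrapper):
    """Hypothetical: put `wrapper` where the first element was found,
    then reparent every element into it, in order."""
    elements = list(elements)
    if elements:
        elements[0].insert_before(wrapper)
        for element in elements:
            wrapper.append(element)
    return wrapper

soup = BeautifulSoup(
    "<body><h1>Sec 1</h1><p>a</p><p>b</p><h1>Sec 2</h1><p>c</p></body>",
    "html.parser",
)
next_h1 = soup.body.find("h1")
while next_h1:
    selection, next_h1 = until_next_sibling(next_h1, "h1")
    wrap_all(selection, soup.new_tag("div"))

print(soup)
```

The iteration stays outside the helpers, as described above: each call handles exactly one selection.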

What do you think of this? To put it more concretely, can you sketch out the ***MAGIC*** that you had in your example code? Because that's the core of the issue, I think.


Leonard Richardson (leonardr) wrote on 2024-01-22:

Here's another possibility, using Python's string.split and re.split as an analogy:

for selection in body.split("h1"):
    selection.wrap(soup.new_tag("div"))

Leonard Richardson (leonardr) on 2024-02-13

Changed in beautifulsoup:
importance:	Undecided → Wishlist

Leonard Richardson (leonardr) on 2024-05-27

Changed in beautifulsoup:
status:	New → Triaged


Chris Papademetrious (chrispitude) wrote on 2024-06-08:

Hi Leonard,

I got excited about the string.split() analogy until I remembered that the separators are discarded. For tags, the desired behavior could vary - keep the separating tag at the end of the previous sequence, the beginning of the next sequence, or discard.
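The three separator behaviors can be sketched as a single option. `split_children` and its `keep` parameter are hypothetical names invented for this illustration, not a proposed signature. Unlike `str.split`, this sketch drops empty groups.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

def split_children(parent, name, keep="start"):
    """Hypothetical: split the children of `parent` at each <name> tag.
    keep="start" puts the separator at the start of the next group,
    keep="end" puts it at the end of the previous group,
    any other value discards it."""
    groups, current = [], []
    for child in parent.contents:
        is_separator = isinstance(child, Tag) and child.name == name
        if not is_separator:
            current.append(child)
        elif keep == "start":
            if current:
                groups.append(current)
            current = [child]
        elif keep == "end":
            current.append(child)
            groups.append(current)
            current = []
        else:  # discard the separator
            if current:
                groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

soup = BeautifulSoup(
    "<body><p>intro</p><h1>A</h1><p>a</p><h1>B</h1><p>b</p></body>",
    "html.parser",
)

def names(groups):
    return [[node.name for node in group] for group in groups]

start_groups = names(split_children(soup.body, "h1", keep="start"))
end_groups = names(split_children(soup.body, "h1", keep="end"))
discard_groups = names(split_children(soup.body, "h1", keep="discard"))
print(start_groups, end_groups, discard_groups, sep="\n")
```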

That got me to thinking about your comment about keeping the iterative logic outside the method, and having done more Beautiful Soup coding since filing this original request, I agree with that.

Getting back to building blocks... a common pattern is to accumulate some set of children objects. How about something like this?

  Tag.find_next_siblings_while(...)
    or
  Tag.find_next_siblings(..., contiguous=True)

Sketching it out a bit:

====
next_h2 = body.find('h2')
while next_h2:
    selection = next_h2.find_next_siblings(re.compile(r"^(?!h[1-2])"), contiguous=True)
    div = selection.wrap(soup.new_tag("div"))
    next_h2 = div.find_next_sibling(True) # must be <h1>, <h2>, or None
====

where the tag match is a regular expression using negative lookahead to NOT match heading element levels between h1 and the current grouping level.
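A contiguous sibling search like this can be approximated today with `itertools.takewhile` over `.next_siblings`. The sketch below rests on several assumptions: `next_siblings_while` is a hypothetical helper, text nodes are treated as part of the run, the heading itself is added to the selection, the next `<h2>` is looked up by name, and the pattern is anchored with `$` so that only the exact names `h1` and `h2` end a run.

```python
import re
from itertools import takewhile

from bs4 import BeautifulSoup
from bs4.element import Tag

def next_siblings_while(start, pattern):
    """Hypothetical: collect following siblings until the first tag whose
    name does not match `pattern`. Text nodes never end the run."""
    def matches(node):
        return not isinstance(node, Tag) or pattern.match(node.name) is not None
    return list(takewhile(matches, start.next_siblings))

# Negative lookahead: matches any tag name except exactly "h1" or "h2".
not_h1_h2 = re.compile(r"^(?!h[12]$)")

soup = BeautifulSoup(
    "<body><h1>T</h1><h2>A</h2><p>a</p><h3>A1</h3><p>a1</p>"
    "<h2>B</h2><p>b</p></body>",
    "html.parser",
)

next_h2 = soup.body.find("h2")
while next_h2:
    # Materialize the selection before moving anything in the tree.
    selection = [next_h2] + next_siblings_while(next_h2, not_h1_h2)
    div = soup.new_tag("div")
    next_h2.insert_before(div)
    for node in selection:
        div.append(node)
    next_h2 = div.find_next_sibling("h2")

print(soup)
```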
