Provide self and parent-or-self search functionality

Bug #2052936 reported by Chris Papademetrious
This bug affects 1 person
Affects: Beautiful Soup
Status: Fix Committed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

This is an enhancement request.

Coming from XML/XSLT and XML::Twig in Perl, I have two enhancement requests for element-matching functionality in Beautiful Soup.

1. Implement a matches() method that tests the element itself using the standard name/attribute/string filtering UI:

====
if mynode.matches(...filter-y stuff...):
    # ...
====

The name/attribute/string filtering tests can be written by hand in various ways, but it would be preferable to have consistent code for search-matching and self-matching.

If the element matches, it should return itself; otherwise it should return None.

This is analogous to a direct node filtering predicate in XSLT:

====
$mynode[...filter-y stuff...]
====
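
For reference, a minimal sketch of how such a test can be emulated today with a hypothetical helper built on SoupStrainer (assuming SoupStrainer.search(), which in existing releases returns the element on a match):

====
from bs4 import SoupStrainer

def matches(tag, *args, **kwargs):
    # Hypothetical helper, not part of bs4: build a find_all()-style filter
    # and test this single element against it.
    strainer = SoupStrainer(*args, **kwargs)
    return tag if strainer.search(tag) else None

# usage
if matches(mynode, 'div', class_='prolog'):
    ...
====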

2. Implement variants of find_parent() and find_parents() that *also* consider the element itself. For example,

====
mynode.find_parent_or_self(...filter-y stuff...)
mynode.find_parents_or_self(...filter-y stuff...)

# or

mynode.find_parent(...filter-y stuff..., include_self=True)
mynode.find_parents(...filter-y stuff..., include_self=True)
====

This is analogous to the ancestor-or-self axis in XSLT:

====
$mynode/ancestor-or-self::*[...filter-y stuff...]
====

Again, this can be implemented explicitly in code, but it would be cleaner and clearer to use UI-consistent methods in content processing algorithms.
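
A minimal sketch of that explicit emulation, using the existing .parents generator and a hypothetical helper name:

====
import itertools
from bs4 import SoupStrainer

def find_parents_or_self(node, *args, **kwargs):
    # Hypothetical helper, not part of bs4: test the node itself plus all of
    # its ancestors against a find_all()-style filter.
    strainer = SoupStrainer(*args, **kwargs)
    return [el for el in itertools.chain([node], node.parents)
            if strainer.search(el)]
====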

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit ):

First, let's take inventory of what we have in the 4.13 branch that's in the vicinity of this wishlist item:

ElementFilter.match(PageElement) -> bool
PageElement.match(Iterator[PageElement], ElementFilter) -> ResultSet

I'm OK with adding an Iterator that is the equivalent of ancestor-or-self. The name would probably be PageElement.self_and_parents.

    @property
    def self_and_parents(self) -> Iterator[PageElement]:
        yield self
        for i in self.parents:
            yield i

I don't want to add more find_* methods, but since we already have find_parents, I'm OK with adding an include_self argument to it.

A PageElement.matches() method that takes the find_* arguments and returns a boolean is a little troubling for two reasons. First, it's another method with a very complicated signature. But more importantly, you're probably not calling PageElement.matches() just once. You're probably iterating over elements somehow and calling matches() each time. If you do that you're creating an identical ElementFilter for every element in the iterator, which is incredibly inefficient. It'd be much better to build the ElementFilter once. And there is a solution that works that way: either pass the ElementFilter into one of the find_* methods, or use PageElement.match(). So I'm not really sold on PageElement.matches().
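
For example, the build-the-filter-once pattern would look something like this (a sketch, assuming the 4.13 behavior listed above, where SoupStrainer is an ElementFilter with a boolean match()):

====
from bs4 import BeautifulSoup, SoupStrainer

soup = BeautifulSoup('<div class="prolog">x</div><div>y</div>', 'html.parser')

# Build the filter once...
prolog_divs = SoupStrainer('div', class_='prolog')

# ...then reuse it against every element in the iteration.
for el in soup.descendants:
    if prolog_divs.match(el):
        print(el)
====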

However, we probably want to rename PageElement.match() before releasing 4.13, since the name makes it look like it returns a boolean, like ElementFilter.match() does. Maybe it should be called PageElement.filter().

Changed in beautifulsoup:
importance: Undecided → Wishlist
Revision history for this message
Chris Papademetrious (chrispitude) wrote :

The PageElement.self_and_parents Iterator and an include_self argument for find_parents() would be great, if you're able to squeeze them in!

For PageElement.matches(), my primary use would be code that decides how to restructure content based on what exists. This involves ad-hoc control logic more than looping. (I am careful to use loops where appropriate, as runtime matters for the amount of content I process.)

I'm really just looking for a variant of find_*() that works on that element itself. It can even return the element itself on a positive match. For example, I might do something like this:

====
if (this_tag.matches('div', class_='prolog')
        and this_tag.next_sibling
        and this_tag.next_sibling.matches('div', class_='abstract')):
    # do some stuff
    pass
====

Actually, XML::Twig has variants of matches() like next_sibling_matches() which elegantly handles the case where the next thing doesn't exist at all, but I have a feeling I'm wearing out my welcome by mentioning it. :) Maybe I should consider extending bs4 and adding all these convenience methods that I tend to use.
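
For illustration, such a convenience helper could be a few lines (purely hypothetical; not part of bs4 or a port of XML::Twig):

====
from bs4 import SoupStrainer, Tag

def next_sibling_matches(tag, *args, **kwargs):
    # Hypothetical helper: return the immediate next sibling if it is a Tag
    # and matches the find_all()-style filter; otherwise return None, so a
    # missing sibling needs no separate existence check.
    sib = tag.next_sibling
    if not isinstance(sib, Tag):
        return None
    return sib if SoupStrainer(*args, **kwargs).search(sib) else None
====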

Revision history for this message
Leonard Richardson (leonardr) wrote :

What about a single iterator PageElement.self_and(other_iterator)?

e.g.

tag.self_and(tag.next_siblings)
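
The generator itself would be tiny (a sketch of the hypothetical method):

====
def self_and(self, other_iterator):
    # Yield this element first, then everything from the other generator.
    yield self
    yield from other_iterator
====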

Changed in beautifulsoup:
status: New → Fix Committed
Revision history for this message
Chris Papademetrious (chrispitude) wrote :

For the self-and-* matching functionality, given the following command:

====
tag.find_parent(True, {'data-base-uri': True}, include_self=True)
====

what would the tag.self_and() version of this command be?

For the self-only matching functionality, I really hope you consider implementing the matches() method. It would be extremely useful for our processing code. I understand your concerns about runtime, but runtime is not a factor for our application, whereas code clarity and maintainability are.

For example, we have code that processes HTML hierarchically like this:

====
for tag in reversed(soup.find_all(True)):
  if tag.matches(True, href=re.compile(...)):
    # do some stuff
  elif tag.matches(['div', 'body'], class_=['abstract', 'summary']):
    # do some stuff
  elif tag.matches(...):
    # ...
  elif ...
====

and the clarity and ease of modifying the filtering tests is paramount. The actual implementation of this code is much more complicated and harder to follow than the code above.

On a side note, the reason I iterate through the find_all() tags in reverse is that it allows me to modify the contents of a tag as much as I want without inadvertently breaking the "for" iteration (because I never traverse downward into the stuff I just modified). This code pattern works very well for hierarchical processing!

Revision history for this message
Chris Papademetrious (chrispitude) wrote :

Following up on this, are you agreeable to implementing these options (along with their relevant underlying generators)?

====
find (..., include_self=True)
find_parent (..., include_self=True)
find_parents (..., include_self=True)
find_next_sibling (..., include_self=True)
find_next_siblings (..., include_self=True)
find_previous_sibling (..., include_self=True)
find_previous_siblings(..., include_self=True)
find_next (..., include_self=True)
find_all_next (..., include_self=True)
find_previous (..., include_self=True)
find_all_previous (..., include_self=True)
====

If so, do you want me to attempt a pull request for this?
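
One possible internal approach, sketched with hypothetical names, would be to prepend the element to whatever generator already backs each find_* method:

====
import itertools

def _maybe_include_self(element, generator, include_self=False):
    # Hypothetical internal helper: chain the element itself onto the
    # generator that backs the find_* method being called.
    return itertools.chain([element], generator) if include_self else generator

# e.g. find_next_siblings(..., include_self=True) would iterate over
# _maybe_include_self(tag, tag.next_siblings, include_self=True)
====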

Getting back to the Tag.matches() proposal, I wanted to see how much overhead there was in SoupStrainer construction. I wrote some code to create one million <p class="foo"> tags as follows:

====
import bs4
import time

# make a document with one million <p class="foo"> tags:
# <html>
# <body>
# <div><p class="foo">0</p></div>
# <div><p class="foo">1</p></div>
# ...
# <div><p class="foo">999999</p></div>
# </body>
# </html>
body = bs4.BeautifulSoup("<html><body/></html>", "lxml").find("body")
for i in range(1_000_000):
    div = bs4.Tag(name="div")
    body.append(div)
    p = bs4.Tag(name="p", attrs={"class": "foo"})
    div.append(p)
    p.append(str(i))
====

Finding all <p class="foo"> tags with find_all() took 3.5 seconds:

====
# find <p class="foo"> elements in batch
start_time = time.time()
result = body.find_all("p", attrs={"class": "foo"})
end_time = time.time()

print(f"{len(result)} tags bulk-matched with find_all() in {end_time - start_time} seconds.")
====

Finding all <p class="foo"> tags one at a time with find() took 6.1 seconds:

====
# find <p class="foo"> elements individually
result = []
divs = body.find_all("div")

start_time = time.time()
for div in divs:
    result.append(div.find("p", attrs={"class": "foo"}, recursive=False))
end_time = time.time()

print(f"{len(result)} tags individually matched with find() in {end_time - start_time} seconds.")
====

So while there is overhead in SoupStrainer construction, it would not be burdensome for discrete tag tests in control logic (such as analyzing the topic/subtopic structure of a document).

The type signature of this method I'm requesting is:

====
bs4.Tag.matches(...) -> bs4.Tag | None
====

where matches() would return the tag itself if it matches, else None.

Why implement Tag.matches() when we already have Tag.css.match()? The reason is that I prefer to use a consistent selection convention in the code, and intermixing the BS4 and CSS conventions can reduce code clarity.
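
For comparison, the CSS-convention test that exists today looks like this (assuming the soupsieve-backed Tag.css interface in recent releases); it works, but mixes selector syntax into otherwise find_all()-style code:

====
if this_tag.css.match('div.prolog'):
    # same test as the earlier matches('div', class_='prolog') example,
    # but expressed as a CSS selector
    pass
====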

Revision history for this message
Chris Papademetrious (chrispitude) wrote :

Here is a real-world example of how find_*(include_self=True) would be useful.

In our document processor, <article> elements typically contain each topic, but the <body> element can also serve as the top-level topic container:

====
import bs4

html_doc = """
<html>
  <body> <!-- this is also a topic -->
    <h1>Topic 1</h1>
    <p>text 1</p>
    <article> <!-- this is a topic -->
      <h2>Topic 1-1</h2>
      <p>text 1-1</p>
    </article>
  </body>
</html>
"""
soup = bs4.BeautifulSoup(html_doc, "lxml")
====

I have a function that processes all topics hierarchically. I can call this function with the top-level BeautifulSoup object or with any Tag within it. The function must find all topic tags, *including* the Tag itself if it qualifies.

I want to be able to do something like this:

====
# all <body> or <article> elements within soup (inclusive)
print(
    "Should be 2:",
    len(soup.find_all(["body", "article"], include_self=True)),
    # ^^^^ ^^^^^^^^^^^^^^^^^
)

# all <body> or <article> elements within <body> (inclusive)
print(
    "Should be 2:",
    len(soup.find("body").find_all(["body", "article"], include_self=True)),
    # ^^^^^^^^^^^^^^^^^
)

# all <body> or <article> elements within <article> (inclusive)
print(
    "Should be 1:",
    len(soup.find("article").find_all(["body", "article"], include_self=True)),
    # ^^^^^^^^^^^^^^^^^^^^
)
====

With include_self, I could write this:

====
def process_topics(thing: bs4.PageElement):
    for topic in thing.find_all(["body", "article"], include_self=True):
        pass
====

Currently, I write this:

====
def process_topics(thing: bs4.PageElement):
    all_topics = []
    if isinstance(thing, bs4.Tag) and thing.name in ["body", "article"]:
        all_topics.append(thing)
    all_topics.extend(thing.find_all(["body", "article"]))

    for topic in all_topics:
        pass
====

Revision history for this message
Leonard Richardson (leonardr) wrote :

Coming back to this ticket to add a few comments because I remember there was some unfinished business here.

> For the self-and-* matching functionality, for the following command:
>
> ====
> tag.find_parent(True, {'data-base-uri': True}, include_self=True)
> ====
>
> what would the tag.self_and() version of this command be?

The new PageElement.filter() method (already in 4.13) is the key here. You pass it an iterator and a SoupStrainer, and it performs the iteration and yields only the elements that match. So with self_and() you could write it like this:

strainer = SoupStrainer(True, {'data-base-uri': True})
for match in tag.filter(tag.self_and(tag.parents), strainer):
   ...

filter() plus self_and() (which, again, is still hypothetical) would also let you implement process_topics() easily:

def process_topics(thing: bs4.PageElement):
    article_or_body = SoupStrainer(["body", "article"])
    for topic in thing.filter(thing.self_and(thing.descendants), article_or_body):
        pass

For your Tag.matches() example, you can flip the code around and use the new SoupStrainer.match() method:

> ====
> for tag in reversed(soup.find_all(True)):
>   if tag.matches(True, href=re.compile(...)):
>     # do some stuff
>   elif tag.matches(['div', 'body'], class_=['abstract', 'summary']):
>     # do some stuff
>   elif tag.matches(...):
>     # ...
>   elif ...
> ====

regex_href = SoupStrainer(True, href=re.compile(...))
abstract_or_summary = SoupStrainer(['div', 'body'], class_=['abstract', 'summary'])
for tag in reversed(soup.find_all(True)):
  if regex_href.match(tag):
    # do some stuff
  elif abstract_or_summary.match(tag):
    # do some stuff
  elif ...

Revision history for this message
Leonard Richardson (leonardr) wrote :

I filed bug 2067634, but I noticed something while going over your examples. I think the only bits of self_and() generator functionality you need are PageElement.self_and_parents and PageElement.self_and_descendants, which already exist. Do you have a positive need for new generators like self_and_next_elements, or can we hold off on adding any more stuff to the API?

Revision history for this message
Leonard Richardson (leonardr) wrote (last edit ):

So to rewrite your example code using only bits of the API that already exist in 4.13:

strainer = SoupStrainer(True, {'data-base-uri': True})
for match in soup.filter(tag.self_and_parents, strainer):
   ...

def process_topics(thing: bs4.PageElement):
    article_or_body = SoupStrainer(["body", "article"])
    for topic in soup.filter(thing.self_and_descendants, article_or_body):
        pass

This is also making me realize that filter() doesn't need to be an instance method and/or should actually live in ElementFilter.
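
A sketch of that idea (hypothetical; not how 4.13 is currently organized):

====
class ElementFilter:
    def match(self, element) -> bool:
        ...  # the existing 4.13 per-element test

    def filter(self, elements):
        # Yield only the elements this filter matches, regardless of where
        # the iterator came from.
        for element in elements:
            if self.match(element):
                yield element

# usage: for topic in article_or_body.filter(thing.self_and_descendants): ...
====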
