Beautiful Soup

Bug #2047713
Comment #4

Leonard Richardson (leonardr) wrote on 2024-01-18: Re: enhance find*() methods to filter through all object types

In your merge request you mention that you find the solution inelegant. The inelegance is architectural, I think. The Beautiful Soup API (as opposed to the class inheritance) has always considered tag elements and text elements to be very different.

After looking this over, I can see a very elegant solution, but it doesn't look like the current system of find_* methods. It looks more like XPath, where the strategy for traversing the tree is decoupled from the strategy for matching PageElements.

Beautiful Soup already has both of these components. The tree traversal strategy is encapsulated in the generators, and the match strategy is encapsulated in SoupStrainer. But we don't have one method that takes both of those encapsulated things. The closest is PageElement._find_all, which takes a generator + lots of other arguments. It uses those arguments to build a SoupStrainer and then runs the SoupStrainer against the generator.
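To make the decoupling concrete, here is a minimal standard-library sketch of the idea, not Beautiful Soup code. The `Tag` class, the `children`/`descendants` generators, and the `find_all(nodes, matches)` function are all hypothetical stand-ins: the generators play the role of the traversal strategy, and a plain predicate plays the role of the match strategy.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional, Union


@dataclass
class Tag:
    """Hypothetical stand-in for a tag element; not the bs4 Tag class."""
    name: str
    children: List["Node"] = field(default_factory=list)


Node = Union[Tag, str]  # a plain str stands in for a text element


def children(tag: Tag) -> Iterator[Node]:
    """Traversal strategy: direct children only."""
    yield from tag.children


def descendants(tag: Tag) -> Iterator[Node]:
    """Traversal strategy: every node below `tag`, in document order."""
    for child in tag.children:
        yield child
        if isinstance(child, Tag):
            yield from descendants(child)


def find_all(nodes: Iterable[Node],
             matches: Callable[[Node], bool],
             limit: Optional[int] = None) -> List[Node]:
    """Run one match strategy against one traversal strategy."""
    results: List[Node] = []
    for node in nodes:
        if matches(node):
            results.append(node)
            if limit is not None and len(results) >= limit:
                break
    return results


def is_non_whitespace(node: Node) -> bool:
    """Match strategy: reject text nodes that are only whitespace."""
    return not (isinstance(node, str) and node.isspace())


doc = Tag("p", ["\n ", Tag("b", ["bold"]), "\n and\n ", Tag("u", ["underline"]), "\n"])

direct = find_all(children(doc), is_non_whitespace)
deep = find_all(descendants(doc), is_non_whitespace)
first = find_all(descendants(doc), is_non_whitespace, limit=1)
```

Swapping `children` for `descendants` changes what gets visited without touching the predicate, and swapping the predicate changes what gets matched without touching the traversal.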

SoupStrainer.search() is basically the method you're looking for: it takes a PageElement and returns the PageElement (if there's a match) or None. My suggestion is that we make a subclass/superclass/alternate implementation of SoupStrainer which delegates the go/no go decision to a function passed into the constructor, rather than trying to fit it in as another option in the SoupStrainer class.
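As a rough illustration of that suggestion, the sketch below shows the shape such a class might take. It uses only the standard library; `FunctionMatcher` and `match_function` are hypothetical names, not part of Beautiful Soup, and the only contract borrowed from the description above is that `search()` returns the element on a match and None otherwise.

```python
from typing import Any, Callable, Optional


class FunctionMatcher:
    """Hypothetical matcher that delegates the go/no-go decision to a function."""

    def __init__(self, match_function: Callable[[Any], bool]) -> None:
        # The caller supplies the whole matching rule as a single callable.
        self.match_function = match_function

    def search(self, element: Any) -> Optional[Any]:
        """Return the element itself if it matches, otherwise None."""
        return element if self.match_function(element) else None


# Reject strings that consist only of whitespace; accept everything else.
non_blank = FunctionMatcher(lambda el: not (isinstance(el, str) and el.isspace()))

elements = ["\n  ", "bold", " ", "and"]
# Compare against None explicitly so a matching but falsy element is not dropped.
kept = [el for el in elements if non_blank.search(el) is not None]
```

Keeping the rule in a function passed to the constructor means a new kind of match needs no new constructor option, only a new callable.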

Beyond that we have a couple of options. We can create a new public API method based on _find_all which accepts a generator and a SoupStrainer. Very elegant.

But, it's not necessary to go that far, because we can also take advantage of an existing feature that is barely documented:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.SoupStrainer

"You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn’t terribly useful, but I thought I’d mention it."

Basically, if you pass a SoupStrainer object in as `name`, all of the other arguments are ignored and we match the SoupStrainer instead. So it's possible to implement the code you want without any changes to Beautiful Soup itself, by subclassing and overriding SoupStrainer.search:

#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString, SoupStrainer

class MatchNonWhitespace(SoupStrainer):
    def search(self, element):
        if isinstance(element, NavigableString) and element.text.isspace():
            return None
        return element

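# Sample markup (assumed structure: a <p> holding <b>, <i>, and <u> children)
html_doc = """
<p>
 <b>bold</b>
 <i>italic</i>
 and
 <u>underline</u>
 
</p>
"""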
soup = BeautifulSoup(html_doc, 'lxml')

is_non_whitespace = MatchNonWhitespace()
# get the first non-whitespace thing in <p>
this_thing = soup.find('p').find(is_non_whitespace, recursive=False)

# print all following non-whitespace sibling elements in <p>
while this_thing:
    next_thing = this_thing.find_next_sibling(is_non_whitespace)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
