Beautiful Soup

Create an easy way to apply a filter to any kind of PageElement

Bug #2047713 reported by Chris Papademetrious on 2023-12-29

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	In Progress	Undecided	Unassigned

Bug Description

Beautiful Soup and XSLT/XQuery object types correlate as follows:

* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstruction is like processing-instruction() (PI nodes)

In XSLT, a node() object type matches *any* object type that can be contained in the document.

For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:

====
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]
following-sibling::node()[not(mine:is-whitespace-text(.))][1]
====

where is-whitespace-text() is a function that returns true() for whitespace text() nodes.

I want to similarly filter through arbitrary object types in Beautiful Soup. But if I define a custom filter function:

====
def is_whitespace_text(tag) -> bool:
return isinstance(tag, NavigableString) and tag.text.isspace()

def is_not_whitespace_text(tag) -> bool:
return not is_whitespace_text(tag)
====

then I would need some kind of "node" argument that considers all object types via my filter function:

====
prev_thing = this_thing.find_previous_sibling(node=is_not_whitespace)
next_thing = this_thing.find_next_sibling(node=is_not_whitespace)
====

The Beautiful Soup find*() methods support simultaneous specification of Tag and NavigableString filters, but that is different (they are an AND condition, plus the string filters also apply an inheritance behavior).

This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstruction, and so on. Possible argument names for this filter type could be:

node=
object=

I think this new argument should accept only the following:

  Callable - return matching objects
  True - return all objects
  False - return no objects
  None - (??? not sure what makes sense here ???)

Here is an example testcase:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString

html_doc = """

 bold
 italic
 and
 underline
 

"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
return not (isinstance(thing, NavigableString) and thing.text.isspace())

# this is workaround function #1
def workaround_find_next_sibling_non_whitespace(thing) -> bool:
    for next_thing in thing.next_siblings:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# this is workaround function #2
def workaround_find_first_child_non_whitespace(thing) -> bool:
    for next_thing in thing.contents:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# get the first non-whitespace thing in 
#this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)
this_thing = workaround_find_first_child_non_whitespace(soup.find('p'))

# print all following non-whitespace sibling elements in 
while this_thing:
 #next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
 next_thing = workaround_find_next_sibling_non_whitespace(this_thing)
 print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
 this_thing = next_thing
====

See original description

Chris Papademetrious (chrispitude) on 2023-12-29

description:

updated

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2023-12-29:

The Beautiful Soup architecture quite elegantly centralizes all the searching logic into the SoupStrainer class, so adding this functionality to all relevant methods that could benefit should be straightforward.

However, I don't yet understand the SoupStrainer class well enough to propose the change. If anyone has suggestions on where to start, I'd be happy to hear it!

Chris Papademetrious (chrispitude) on 2023-12-30

description:

updated

Revision history for this message

Leonard Richardson (leonardr) wrote on 2023-12-30:

Check out the 4.13 development branch. This is a new feature so it would go in the next feature release anyway, and as part of my 4.13 work I fixed a number of inconsistencies exposed by the addition of type hints. In particular, the SoupStrainer API was rationalized. It's been a while so I don't remember the details, but I believe you should be able to put all of the logic for this into the _matches() method.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-01-11:

I submitted the following merge request for consideration:

https://code.launchpad.net/~chrispitude/beautifulsoup/+git/beautifulsoup/+merge/457783

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-01-18:

In your merge request you mention that you find the solution inelegant. The inelegance is architectural, I think. The Beautiful Soup API (as opposed to the class inheritance) has always considered tag elements and text elements to be very different.

After looking this over, I can see a very elegant solution, but it doesn't look like the current system of find_* methods. It looks more like XPath, where the strategy for traversing the tree is decoupled from the strategy for matching PageElements.

Beautiful Soup already has both of these components. The tree traversal strategy is encapsulated in the generators, and the match strategy is encapsulated in SoupStrainer. But we don't have one method that takes both of those encapsulated things. The closest is PageElement._find_all, which takes a generator + lots of other arguments. It uses those arguments to build a SoupStrainer and then runs the SoupStrainer against the generator.

SoupStrainer.search() is basically the method you're looking for: it takes a PageElement and returns the PageElement (if there's a match) or None. My suggestion is that we make a subclass/superclass/alternate implementation of SoupStrainer which delegates the go/no go decision to a function passed into the constructor. Rather than trying to fit it in as another option in the SoupStrainer class.

Beyond that we have a couple of options. We can create a new public API method based on _find_all which accepts a generator and a SoupStainer. Very elegant.

But, it's not necessary to go that far, because we can also take advantage of an existing feature that is barely documented:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.SoupStrainer

"You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn’t terribly useful, but I thought I’d mention it."

Basically, if you pass a SoupStrainer object in as `name`, all of the other arguments are ignored and we match the SoupStrainer instead. So it's possible to implement the code you want without any changes to Beautiful Soup itself, by subclassing and overriding SoupStrainer.search:

#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString, SoupStrainer

class MatchNonWhitespace(SoupStrainer):
    def search(self, element):
        if isinstance(element, NavigableString) and element.text.isspace():
            return None
        return element

html_doc = """

 bold
 italic
 and
 underline
 

"""
soup = BeautifulSoup(html_doc, 'lxml')

is_non_whitespace = MatchNonWhitespace()
# get the first non-whitespace thing in 
this_thing = soup.find('p').find(is_non_whitespace, recursive=False)

# print all following non-whitespace sibling elements in 
while this_thing:
 next_thing = this_thing.find_next_sibling(is_non_whitespace)
 print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
 this_thing = next_thing

Beyond that we have a couple of options. We can create a new public API method based on _find_all which accepts a generator and a SoupStainer. Very elegant.

But, it's not necessary to go that far, because we can also take advantage of an existing feature that is barely documented:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.SoupStrainer

"You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn’t terribly useful, but I thought I’d mention it."

#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString, SoupStrainer

class MatchNonWhitespace(SoupStrainer):
 def search(self, element):
 if isinstance(element, NavigableString) and element.text.isspace():
 return None
 return element
 
html_doc = """

 bold
 italic
 and
 underline
 

"""
soup = BeautifulSoup(html_doc, 'lxml')

is_non_whitespace = MatchNonWhitespace()
# get the first non-whitespace thing in 
this_thing = soup.find('p').find(is_non_whitespace, recursive=False)

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-01-18:

To summarize my actual suggestion:

My suggestion is that we make a subclass/superclass/alternate implementation of SoupStrainer which delegates the go/no go decision to a function passed into the constructor. Rather than trying to fit it in as another option in the SoupStrainer class.

Then let's document the interaction between SoupStrainer and the find* methods in more detail, rather than leaving it as an afterthought.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-01-19:

Take a look at https://code.launchpad.net/~leonardr/beautifulsoup/+git/beautifulsoup/+merge/459082. I'd want to play around with terminology, and make the base class capable of being passed into the BeautifulSoup constructor as parse_only. But I'm pretty happy with this overall. It would let you write code that looked like this:

from bs4 import BeautifulSoup, NavigableString
from bs4.strainer import ElementMatcher

def non_whitespace(element):
return not (isinstance(element, NavigableString) and element.text.isspace())

match = ElementMatcher(non_whitespace)

html_doc = """

 bold
 italic
 and
 underline
 

"""
soup = BeautifulSoup(html_doc, 'lxml')

# get the first non-whitespace thing in 
this_thing = soup.find('p').find(match, recursive=False)

# print all following non-whitespace sibling elements in 
while this_thing:
 next_thing = this_thing.find_next_sibling(match)
 print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
 this_thing = next_thing

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-01-19:

I pulled your branch and played with this. Wow, this is super cool!!

And for simple tests (like the whitespace one), you can put an anonymous lambda function right in the ElementClass() call:

====
match_non_whitespace = ElementMatcher(lambda element:
not (isinstance(element, NavigableString) and element.text.isspace())
)
====

I like the ElementMatcher name. It intuitively describes its purpose.

This is just awesome. Please let me know how I can help.

Leonard Richardson (leonardr) on 2024-01-22

summary:

- enhance find*() methods to filter through all object types
+ Create an easy way to apply a filter to any kind of PageElement

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-01-22 (last edit on 2024-01-22):

I've written some tests and done some updates and there are three places where I'd like your help.

1. I added an "Advanced search techniques" section to the documentation but it's missing examples. If you can come up with a simple, useful example of what to do with this functionality I'd appreciate it.

2. I'm uncertain about having this technique use hook functions passed into the constructor, instead of requiring that the user subclass the base class. Basically because there's no other functionality in the class. This especially applies to allow_tag_creation_function and allow_string_creation_function, but also applies to match_function to some extent. If I get rid of the other two match functions then it definitely applies to match_function as well, for consistency reasons.

How would you feel if using this functionality required subclassing instead of passing a (possibly anonymous) function into a constructor?

3. I renamed ElementMatcher to ElementSelector, since that's a term used elsewhere, but I'm not so sure about that idea for a couple of reasons:

* The term "selector" seems associated mainly with CSS selectors in peoples' minds, so it might be too technology-specific.
* Specifically, there's a method called Tag.select which has nothing to do with ElementSelector, and instead uses CSS selectors. So ElementSelector might be _misleadingly_ technology-specific.

I'm still not wild about ElementMatcher but I think I might go back to it. What do you think?

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-01-31:

Thank you, this is exciting progress!

For #1, I will pull your latest updates and come up with an example that flows well.

For #2, either method works fine for me as long as the functionality is there. I'm still a Python novice; is this effectively the same as the constructor/function approach but with a bit more wrapper code to subclass them?

For #3, how about ElementFilter? I agree that "selector" has a CSS connotation that could blur the mental model and cause confusion. And, we already use the term "filter" in the BS4 documentation for this functional area.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-02-01:

#10

#2: Yeah, basically you can't pass in an anonymous function to something, you have to subclass. We can always add that sort of functionality later if this is used enough to justify it.

#3: ElementFilter is a great name. I've adopted that in the code base, so be careful if you pull again.

Leonard Richardson (leonardr) on 2024-02-02

Changed in beautifulsoup:
status:	New → Fix Committed
status:	Fix Committed → In Progress

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-05-30:

#11

While working on bug #2067634 I realized it was more elegant for the filter(iterator) method to be on ElementFilter, rather than PageElement. That change is in revision 007049a on the 4.13 branch. I haven't changed/updated the docs yet.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.