In your merge request you mention that you find the solution inelegant. The inelegance is architectural, I think. The Beautiful Soup API (as opposed to the class inheritance) has always considered tag elements and text elements to be very different.
After looking this over, I can see a very elegant solution, but it doesn't look like the current system of find_* methods. It looks more like XPath, where the strategy for traversing the tree is decoupled from the strategy for matching PageElements.
Beautiful Soup already has both of these components. The tree traversal strategy is encapsulated in the generators, and the match strategy is encapsulated in SoupStrainer. But we don't have one method that takes both of those encapsulated things. The closest is PageElement._find_all, which takes a generator + lots of other arguments. It uses those arguments to build a SoupStrainer and then runs the SoupStrainer against the generator.
SoupStrainer.search() is basically the method you're looking for: it takes a PageElement and returns the PageElement (if there's a match) or None. My suggestion is that we make a subclass/superclass/alternate implementation of SoupStrainer which delegates the go/no-go decision to a function passed into the constructor, rather than trying to fit it in as another option in the SoupStrainer class.
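To make that suggestion concrete, here is a minimal sketch of a SoupStrainer subclass that hands the decision to a function passed into its constructor. The class name FunctionStrainer and the match_function parameter are made up for illustration and are not part of Beautiful Soup. The sketch relies only on search() taking an element and returning it or None, as described above, and it calls search() directly instead of going through the find_* methods.

```python
from bs4 import BeautifulSoup, NavigableString, SoupStrainer
from bs4.element import Tag


class FunctionStrainer(SoupStrainer):
    """Hypothetical SoupStrainer variant: a caller-supplied function decides."""

    def __init__(self, match_function):
        super().__init__()  # no name/attrs/string criteria of its own
        self.match_function = match_function

    def search(self, element):
        # Return the element on a match and None otherwise.
        return element if self.match_function(element) else None


def is_non_whitespace(element):
    # Tags always pass; strings pass unless they are whitespace only.
    return not (isinstance(element, NavigableString) and element.isspace())


soup = BeautifulSoup("<p>\n<b>bold</b>\nand\n<br />\n</p>", "html.parser")
strainer = FunctionStrainer(is_non_whitespace)

# Run the strainer against one of the existing traversal generators.
kept = [el for el in soup.p.children if strainer.search(el) is not None]
print([el.name if isinstance(el, Tag) else str(el) for el in kept])
```

The point of the design is that the strainer no longer needs to know anything about names, attributes or strings: whatever the function accepts is a match, whether it is a tag or a piece of text.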
Beyond that we have a couple of options. We can create a new public API method based on _find_all which accepts a generator and a SoupStrainer. Very elegant.
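As a rough sketch of what such a method might look like, the helper below takes a traversal generator and a matcher and returns the elements that match. find_all_matching is a hypothetical name, not an existing Beautiful Soup method, and a plain predicate function stands in for the SoupStrainer so the example does not depend on SoupStrainer internals. The .children, .descendants and .next_siblings generators are the existing ones.

```python
from bs4 import BeautifulSoup, NavigableString


def find_all_matching(generator, matches, limit=None):
    """Hypothetical helper: collect elements from `generator` that `matches` accepts."""
    results = []
    for element in generator:
        if matches(element):
            results.append(element)
            if limit is not None and len(results) >= limit:
                break
    return results


def is_text(element):
    # Matches any string that is not whitespace only.
    return isinstance(element, NavigableString) and not element.isspace()


soup = BeautifulSoup("<p>\n<b>bold</b>\nand\n<u>underline</u>\n</p>", "html.parser")
p = soup.p

# Same matcher, three different traversal strategies.
direct = find_all_matching(p.children, is_text)
nested = find_all_matching(p.descendants, is_text)
after_b = find_all_matching(p.b.next_siblings, is_text, limit=1)
print(direct, nested, after_b)
```

Swapping the generator changes where we look, swapping the predicate changes what counts as a match, and neither has to know about the other.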
But, it's not necessary to go that far, because we can also take advantage of an existing feature that is barely documented: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#bs4.SoupStrainer
"You can also pass a SoupStrainer into any of the methods covered in Searching the tree. This probably isn’t terribly useful, but I thought I’d mention it."
Basically, if you pass a SoupStrainer object in as `name`, all of the other arguments are ignored and we match the SoupStrainer instead. So it's possible to implement the code you want without any changes to Beautiful Soup itself, by subclassing and overriding SoupStrainer.search:
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString, SoupStrainer

class MatchNonWhitespace(SoupStrainer):
    def search(self, element):
        if isinstance(element, NavigableString) and element.text.isspace():
            return None
        return element

html_doc = """
<p>
<b>bold</b>
<i>italic</i>
and
<u>underline</u>
<br />
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

is_non_whitespace = MatchNonWhitespace()

# get the first non-whitespace thing in <p>
this_thing = soup.find('p').find(is_non_whitespace, recursive=False)

# print all following non-whitespace sibling elements in <p>
while this_thing:
    next_thing = this_thing.find_next_sibling(is_non_whitespace)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing