Beautiful Soup

Bug #2047713
Comment #6


Leonard Richardson (leonardr) wrote on 2024-01-19: Re: enhance find*() methods to filter through all object types

Take a look at https://code.launchpad.net/~leonardr/beautifulsoup/+git/beautifulsoup/+merge/459082. I'd want to play around with terminology, and make the base class capable of being passed into the BeautifulSoup constructor as parse_only. But I'm pretty happy with this overall. It would let you write code that looked like this:

from bs4 import BeautifulSoup, NavigableString
from bs4.strainer import ElementMatcher

def non_whitespace(element):
    return not (isinstance(element, NavigableString) and element.text.isspace())

match = ElementMatcher(non_whitespace)

html_doc = """
<p>
 <b>bold</b>
 <i>italic</i>
 and
 <u>underline</u>
 
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# get the first non-whitespace thing in <p>
this_thing = soup.find('p').find(match, recursive=False)

# print all following non-whitespace sibling elements in <p>
while this_thing:
    next_thing = this_thing.find_next_sibling(match)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
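
For comparison, roughly the same traversal can be sketched without the proposed ElementMatcher, using only the long-standing .children and .next_siblings generators. This is an illustration under assumptions, not part of the merge proposal: first_match is a hypothetical helper, the markup is made up, and html.parser is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup, NavigableString


def non_whitespace(element):
    """Accept tags, and strings that are not whitespace-only."""
    return not (isinstance(element, NavigableString) and element.isspace())


def first_match(elements, predicate):
    """Hypothetical helper: first element accepted by predicate, else None."""
    return next((el for el in elements if predicate(el)), None)


# Made-up sample markup, for illustration only.
html_doc = "<p>\n <b>bold</b>\n <i>italic</i>\n and\n <u>underline</u>\n</p>"
soup = BeautifulSoup(html_doc, "html.parser")

# First non-whitespace direct child of <p>.
this_thing = first_match(soup.find("p").children, non_whitespace)

# Walk the remaining non-whitespace siblings, recording each pair.
pairs = []
while this_thing is not None:
    next_thing = first_match(this_thing.next_siblings, non_whitespace)
    pairs.append((this_thing, next_thing))
    print(f"{this_thing!r} is followed by {next_thing!r}")
    this_thing = next_thing
```

The predicate is applied by hand here, which is the bookkeeping that passing a matcher object straight to find() and find_next_sibling() would remove.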