Beautiful Soup

Bug #2047713
Comment #0

Chris Papademetrious (chrispitude) wrote on 2023-12-29: enhance find*() methods to filter through all object types

Beautiful Soup and XSLT/XQuery object types correlate as follows:

* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstruction is like processing-instruction() (PI nodes)

In XSLT, a node() object type matches *any* object type that can be contained in the document.
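
As a rough illustration of the correlation above, the sketch below walks the direct children of a tag and records the class of each one. The sample markup and the choice of the built-in html.parser are assumptions made for this example; they are not taken from the report.

```python
from bs4 import BeautifulSoup

# Sample markup is an assumption, chosen to contain three node types.
soup = BeautifulSoup("<p>text<b>bold</b><!-- note --></p>", "html.parser")

# .contents holds every direct child regardless of type,
# much like child::node() in XPath.
node_types = [type(node).__name__ for node in soup.p.contents]
print(node_types)
```

Note that Comment derives from NavigableString, so an isinstance() check against NavigableString also matches comments.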

For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:

====
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]
following-sibling::node()[not(mine:is-whitespace-text(.))][1]
====

where is-whitespace-text() is a function that returns true() for whitespace text() nodes.

I want to similarly filter through arbitrary object types in Beautiful Soup too. But if I define a custom filter function:

====
from bs4 import NavigableString

def is_whitespace_text(tag) -> bool:
    # True only for text nodes that consist entirely of whitespace
    return isinstance(tag, NavigableString) and tag.text.isspace()

def is_not_whitespace_text(tag) -> bool:
    return not is_whitespace_text(tag)
====

there is no "node"-style argument that considers all object types and accepts my filter function. What I would like to write is:

====
prev_thing = this_thing.find_previous_sibling(node=is_not_whitespace_text)
next_thing = this_thing.find_next_sibling(node=is_not_whitespace_text)
====

The Beautiful Soup find*() methods support simultaneous specification of Tag and NavigableString filters, but that is different: the two filters are combined as an AND condition, and the string filter is matched against each tag's .string, which a tag can inherit from its only child.
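
A small sketch of that existing behavior, as I understand it from the Beautiful Soup documentation. The markup and variable names are assumptions made up for this example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<div><b>bold</b></div><p><i>italic</i> and more</p>", "html.parser"
)

# String filter alone: returns the matching NavigableString objects.
strings_only = soup.find_all(string="bold")

# Tag filter plus string filter: returns tags that satisfy BOTH conditions.
tag_and_string = soup.find_all("b", string="bold")

# Same string, different tag name: the AND condition fails.
mismatch = soup.find_all("i", string="bold")

# <div> has a single child <b>, so div.string resolves to "bold" as well.
inherited = soup.find_all("div", string="bold")

print(strings_only, tag_and_string, mismatch, inherited)
```

Neither form lets a single callable look at tags, strings, and comments alike, which is what this request is about.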

This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstruction, and so on. Possible argument names for this filter type could be:

node=
object=

I think this new argument should accept only the following:

  Callable - return matching objects
  True - return all objects
  False - return no objects
  None - (??? not sure what makes sense here ???)
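
To make the proposed semantics concrete, here is a minimal sketch written as a standalone helper rather than as a change to Beautiful Soup itself. The function name find_next_sibling_node and its node parameter are hypothetical. Since the meaning of None is still open, this sketch simply rejects it.

```python
from bs4 import BeautifulSoup, NavigableString

def find_next_sibling_node(start, node=True):
    """Hypothetical helper: first following sibling of any type accepted by `node`."""
    if node is False:
        return None  # False: return no objects
    if node is not True and not callable(node):
        # None and other values are undecided in the request, so reject them here.
        raise TypeError("node must be a callable, True, or False")
    for sibling in start.next_siblings:
        # True: accept everything; callable: accept what it approves.
        if node is True or node(sibling):
            return sibling
    return None

def is_non_whitespace(thing) -> bool:
    # NavigableString subclasses str, so isspace() can be called on it directly.
    return not (isinstance(thing, NavigableString) and thing.isspace())

soup = BeautifulSoup("<p><b>bold</b> <i>italic</i></p>", "html.parser")
first = soup.b

any_node = find_next_sibling_node(first)                      # whitespace string
real_node = find_next_sibling_node(first, is_non_whitespace)  # the <i> tag
no_node = find_next_sibling_node(first, False)                # nothing
print(repr(any_node), repr(real_node), repr(no_node))
```

The same loop applies to previous_siblings, children, and the other navigation generators, so a built-in filter of this kind could plausibly serve every find*() method.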

Here is an example testcase:

====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString

html_doc = """
<p>
 <b>bold</b>
 <i>italic</i>
 and
 <u>underline</u>
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')

# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# this is workaround function #1
def workaround_find_next_sibling_non_whitespace(thing):
    for next_thing in thing.next_siblings:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# this is workaround function #2
def workaround_find_first_child_non_whitespace(thing):
    for next_thing in thing.contents:
        if is_non_whitespace(next_thing):
            return next_thing
    return None

# get the first non-whitespace thing in <p>
#this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)
this_thing = workaround_find_first_child_non_whitespace(soup.find('p'))

# print all following non-whitespace siblings in <p>
while this_thing:
    #next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    next_thing = workaround_find_next_sibling_non_whitespace(this_thing)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====