Beautiful Soup and XSLT/XQuery object types correlate as follows:
* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstruction is like processing-instruction() (PI nodes)
In XSLT, a node() object type matches *any* object type that can be contained in the document.
For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:
====
preceding-sibling::node()[not(mine:is-whitespace-text(.))][1]
following-sibling::node()[not(mine:is-whitespace-text(.))][1]
====
where is-whitespace-text() is a function that returns true() for whitespace text() nodes.
I want to similarly filter through arbitrary object types in Beautiful Soup too. But if I define a custom filter function:
====
def is_whitespace_text(tag) -> bool:
    return isinstance(tag, NavigableString) and tag.text.isspace()
def is_not_whitespace_text(tag) -> bool:
    return not is_whitespace_text(tag)
====
there is not a "node" argument that considers all object types that I can pass my filter function to:
====
prev_thing = this_thing.find_previous_sibling(node=is_not_whitespace_text)
next_thing = this_thing.find_next_sibling(node=is_not_whitespace_text)
====
The Beautiful Soup find*() methods support simultaneous specification of Tag and NavigableString filters, but that is different (they are an AND condition, plus the string filters also apply an inheritance behavior).
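To illustrate the AND behavior with the existing API, here is a minimal sketch. The sample markup is made up for illustration, and html.parser is used only so the snippet does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, for illustration only.
soup = BeautifulSoup("<p><b>bold</b> <i>bold</i> <b>other</b></p>", "html.parser")

# A name filter and a string filter together: only Tags that satisfy BOTH
# are returned, and the results are Tags, never the strings themselves.
both = soup.find_all("b", string="bold")

# A string filter alone returns the matching NavigableString objects.
strings_only = soup.find_all(string="bold")

print(both)
print(strings_only)
```

Neither call can express "give me the next object of any type that passes my test", which is what the requested filter is meant to cover.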
This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstruction, and so on. Possible argument names for this filter type could be:
node=
object=
I think this new argument should accept only the following:
Callable - return matching objects
True - return all objects
False - return no objects
None - (??? not sure what makes sense here ???)
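The accepted values listed above can be approximated in user code. The following is a rough sketch, not Beautiful Soup API: `first_matching` is a hypothetical helper name, the sample markup is made up, and rejecting None with a TypeError is only a placeholder, since that case is still open.

```python
from bs4 import BeautifulSoup, NavigableString

def first_matching(candidates, node):
    """Hypothetical helper (not part of bs4): return the first object in
    `candidates` accepted by `node`, or None if nothing matches."""
    if node is False:
        # False: no object matches
        return None
    if node is not True and not callable(node):
        # None and anything else: placeholder choice, the request leaves it open
        raise TypeError("node must be a callable, True, or False")
    for thing in candidates:
        # True: every object matches; callable: the caller decides
        if node is True or node(thing):
            return thing
    return None

def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())

# Made-up sample markup, for illustration only.
soup = BeautifulSoup("<p> <b>bold</b> <!-- note --> and </p>", "html.parser")
b = soup.find("b")

# Any object type can be returned: here the next non-whitespace sibling
# of <b> is a Comment, not a Tag.
print(repr(first_matching(b.next_siblings, is_non_whitespace)))
print(repr(first_matching(b.previous_siblings, is_non_whitespace)))
```

Passing a different iterator (`next_siblings`, `previous_siblings`, `contents`, `descendants`) selects the axis, much like the axis part of the XPath expressions.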
Here is an example testcase:
====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<p>
<b>bold</b>
<i>italic</i>
and
<u>underline</u>
</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
# this is the filter I want to use
def is_non_whitespace(thing) -> bool:
    return not (isinstance(thing, NavigableString) and thing.text.isspace())
# this is workaround function #1
# (returns the matching object or None, so the "-> bool" annotation is dropped)
def workaround_find_next_sibling_non_whitespace(thing):
    for next_thing in thing.next_siblings:
        if is_non_whitespace(next_thing):
            return next_thing
    return None
# this is workaround function #2
# (returns the matching object or None, so the "-> bool" annotation is dropped)
def workaround_find_first_child_non_whitespace(thing):
    for next_thing in thing.contents:
        if is_non_whitespace(next_thing):
            return next_thing
    return None
# get the first non-whitespace thing in <p>
#this_thing = soup.find('p').find(node=is_non_whitespace, recursive=False)
this_thing = workaround_find_first_child_non_whitespace(soup.find('p'))
# print all following non-whitespace sibling elements in <p>
while this_thing:
    #next_thing = this_thing.find_next_sibling(node=is_non_whitespace)
    next_thing = workaround_find_next_sibling_non_whitespace(this_thing)
    print(f"{repr(this_thing)} is followed by {repr(next_thing)}")
    this_thing = next_thing
====