Create an easy way to apply a filter to any kind of PageElement
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | In Progress | Undecided | Unassigned |
Bug Description
Beautiful Soup and XSLT/XQuery object types correlate as follows:
* Tag is like * (element nodes)
* NavigableString is like text() (text nodes)
* Comment is like comment() (comment nodes)
* ProcessingInstr
In XSLT, a node() object type matches *any* object type that can be contained in the document.
For example, to get the previous or following object (be it an element, string, comment, PI, etc.) of a given object while skipping over whitespace-only text() nodes, I can do:
====
preceding-
following-
====
where is-whitespace-
I want to similarly filter through arbitrary object types in Beautiful Soup. But if I define a custom filter function:
====
def is_whitespace_
return isinstance(tag, NavigableString) and tag.text.isspace()
def is_not_
return not is_whitespace_
====
then I would need some kind of "node" argument that considers all object types via my filter function:
====
prev_thing = this_thing.
next_thing = this_thing.
====
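The Beautiful Soup find*() methods do accept a Tag filter and a NavigableString filter at the same time, but that is a different feature: the two filters are combined as an AND condition, and the string filter also applies an inheritance behavior (it is tested against the .string value that a tag inherits from its contents, not against each string object on its own).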
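The AND behavior can be seen in a small sketch. The markup is made up for illustration; `find_all()` and its `string` argument are existing Beautiful Soup API (the `string` spelling assumes Beautiful Soup 4.4 or later).

```python
from bs4 import BeautifulSoup, NavigableString

# Illustrative markup, not taken from the report.
soup = BeautifulSoup("<p><b>bold</b><i>bold</i>plain</p>", "html.parser")

# String filter alone: the results are the string objects themselves.
strings = soup.find_all(string="bold")

# Name filter plus string filter: both conditions must hold, and the
# results are tags (named "b", with .string equal to "bold"), not strings.
tags = soup.find_all("b", string="bold")

print(strings)
print(tags)
```

Neither call offers a way to say "any object, of whatever type, that passes this one predicate", which is the gap this request describes.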
This enhancement request is to add a new filter type that considers all possible objects that could be in a document - Tag, NavigableString, Comment, ProcessingInstr
node=
object=
I think this new argument should accept only the following:
* Callable - return matching objects
* True - return all objects
* False - return no objects
* None - (??? not sure what makes sense here ???)
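The accepted values listed above could behave roughly as in the following sketch. Both `select_nodes` and its `node` parameter are hypothetical names used only for illustration; nothing here is existing Beautiful Soup API apart from `BeautifulSoup`, `NavigableString`, and `.descendants`. Since the meaning of None is still open, the sketch simply rejects it as a placeholder.

```python
from bs4 import BeautifulSoup, NavigableString


def select_nodes(nodes, node):
    """Hypothetical helper: apply the proposed argument to an iterable of objects."""
    if node is True:
        return list(nodes)          # True: every object, whatever its type
    if node is False:
        return []                   # False: no objects
    if callable(node):
        return [n for n in nodes if node(n)]  # Callable: objects it accepts
    # None (and anything else) is rejected here; the report leaves it undecided.
    raise TypeError("node must be a callable, True, or False")


# Illustrative markup, not taken from the report.
soup = BeautifulSoup("<p> <b>bold</b> and <!--c--></p>", "html.parser")

everything = select_nodes(soup.p.descendants, True)
nothing = select_nodes(soup.p.descendants, False)
strings_only = select_nodes(
    soup.p.descendants, lambda n: isinstance(n, NavigableString)
)
```

Note that Comment is a subclass of NavigableString, so the last call also returns the comment.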
Here is an example testcase:
====
#!/usr/bin/env python
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<p>
<b>bold</b>
<i>italic</i>
and
<u>underline</u>
<br />
</p>
"""
soup = BeautifulSoup(
# this is the filter I want to use
def is_non_
return not (isinstance(thing, NavigableString) and thing.text.
# this is workaround function #1
def workaround_
for next_thing in thing.next_
if is_non_
return next_thing
return None
# this is workaround function #2
def workaround_
for next_thing in thing.contents:
if is_non_
return next_thing
return None
# get the first non-whitespace thing in <p>
#this_thing = soup.find(
this_thing = workaround_
# print all following non-whitespace sibling elements in <p>
while this_thing:
#next_thing = this_thing.
next_thing = workaround_
print(
this_thing = next_thing
====
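A compact, self-contained sketch of the same sibling-walking workaround follows. The helper names `is_non_whitespace` and `first_match` are hypothetical, and the markup is a shortened variant of the test document with a comment added; `.contents` and `.next_siblings` are existing Beautiful Soup attributes.

```python
from bs4 import BeautifulSoup, Comment, NavigableString


def is_non_whitespace(node):
    """Hypothetical filter: reject only whitespace-only strings."""
    return not (isinstance(node, NavigableString) and str(node).isspace())


def first_match(nodes, predicate):
    """Return the first object accepted by predicate, or None."""
    return next((n for n in nodes if predicate(n)), None)


soup = BeautifulSoup(
    "<p>\n<b>bold</b>\n<!-- note -->\nand\n<br/>\n</p>", "html.parser"
)

# First non-whitespace object in <p>, then each following sibling in turn.
found = []
this_thing = first_match(soup.p.contents, is_non_whitespace)
while this_thing is not None:
    found.append(this_thing)
    this_thing = first_match(this_thing.next_siblings, is_non_whitespace)

for thing in found:
    print(type(thing).__name__, repr(thing))
```

The loop tests `is not None` instead of plain truthiness so that it does not depend on how individual objects evaluate as booleans.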
description: | updated |
summary: |
- enhance find*() methods to filter through all object types
+ Create an easy way to apply a filter to any kind of PageElement |
Changed in beautifulsoup: | |
status: | New → Fix Committed |
status: | Fix Committed → In Progress |
The Beautiful Soup architecture quite elegantly centralizes all the searching logic into the SoupStrainer class, so adding this functionality to all relevant methods that could benefit should be straightforward.
However, I don't yet understand the SoupStrainer class well enough to propose the change. If anyone has suggestions on where to start, I'd be happy to hear them!