I might have a solution to this. The idea is to keep accumulating NavigableString fragments into a "current string item" as long as we're inside the same lowest-level containing block element. If we move into a new block element, then we start a new string item and accumulate into that.
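The behavior is controlled by a "block_elements" argument that sets the granularity of block-context inference: True separates every string, a list of tag names groups strings by their nearest enclosing listed element, and a falsy value (None, False, or an empty list) returns one big string.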
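To make the accumulation rule concrete without involving a parser, here is a minimal sketch of just the grouping step. It is not the proposed patch: "group_fragments" and the "p1"/"p2" labels are hypothetical stand-ins for the nearest enclosing block element, and itertools.groupby compares keys by equality, whereas a real implementation would more likely compare Tag objects by identity.

```python
from itertools import groupby

def group_fragments(pairs):
    """Join consecutive fragments that share the same container.

    `pairs` is an iterable of (container, fragment) tuples, where
    `container` stands in for the nearest enclosing block element
    (None when there is none). Hypothetical helper for illustration.
    """
    return [
        "".join(text for _, text in run)
        for _, run in groupby(pairs, key=lambda pair: pair[0])
    ]

pairs = [
    ("p1", "Hello W"),
    ("p1", "orl"),   # inside an inline <b>, but the nearest block is still p1
    ("p1", "d!"),
    ("p2", "Test"),  # new block element, so a new string item starts
]
print(group_fragments(pairs))  # ['Hello World!', 'Test']
```

Fragments are merged only while the container stays the same from one fragment to the next, so returning to an earlier container later in the document would still start a new string item.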
If I have the following input document:
====
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<body>
<p>sentence one.</p><p>sentence two.</p>
<p>Hello W<b>orl</b>d!</p><p>Test</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
====
and I evaluate the following function:
====
def my_all_strings(soup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:
        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if block_elements is True:
                # separate *every* string (current behavior)
                new_container = True
            elif block_elements:
                # must be a list; use block-element semantics
                this_block_container = element.find_parent(block_elements)
                new_container = (this_block_container is not last_block_container)
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False
            if new_container or not strings:
                # start a new string
                strings.append("")
            # append this fragment to the current string item
            strings[-1] += element
    return strings
block_elements = ['address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video']
print(f"{'default: ':>32s} {repr(my_all_strings(soup))}")
print(f"{'block_elements = True:':>32s} {repr(my_all_strings(soup, block_elements=True))}")
print(f"{'block_elements = <HTML blocks>:':>32s} {repr(my_all_strings(soup, block_elements=block_elements))}")
print(f"{'block_elements = []:':>32s} {repr(my_all_strings(soup, block_elements=[]))}")
print(f"{'block_elements = False:':>32s} {repr(my_all_strings(soup, block_elements=False))}")
print(f"{'block_elements = None:':>32s} {repr(my_all_strings(soup, block_elements=None))}")
====
I get this:
====
                       default:  ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
          block_elements = True: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
 block_elements = <HTML blocks>: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']
            block_elements = []: ['\nsentence one.sentence two.\nHello World!Test\n\n']
         block_elements = False: ['\nsentence one.sentence two.\nHello World!Test\n\n']
          block_elements = None: ['\nsentence one.sentence two.\nHello World!Test\n\n']
====
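As a rough illustration of why the block-level grouping is useful downstream, the "<HTML blocks>" result above can be turned into one line of text per block by dropping the whitespace-only items. This is only a sketch: "to_lines" is a hypothetical helper, not part of Beautiful Soup, and the input list is copied from the output above rather than recomputed.

```python
# Result copied from the "<HTML blocks>" row of the output above.
strings = ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']

def to_lines(strings):
    """Drop whitespace-only items and strip the rest (hypothetical helper)."""
    return [s.strip() for s in strings if s.strip()]

text = "\n".join(to_lines(strings))
print(text)
```

With the default per-string splitting, the same join would break "Hello World!" across three lines, because the inline <b> element splits it into three separate fragments.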
My first version was more compact (~6 lines) but the logic was obfuscated by ternary operators and sneaky short-circuits. This version is more friendly to the human and should execute just as fast.
block_elements can default to True, which matches the current behavior.
If you're agreeable to the approach, I could try to submit a merge request that uses it in the _all_strings method for Tag objects.