Comment 18 for bug 1768330

Chris Papademetrious (chrispitude) wrote:

I might have a solution to this. The idea is to keep accumulating NavigableString fragments into a "current string item" as long as we're inside the same lowest-level containing block element. If we move into a new block element, then we start a new string item and accumulate into that.

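The behavior is controlled by a "block_elements" argument that sets the granularity of the grouping: True keeps every string separate (the current behavior), a list of tag names merges consecutive strings that share the same nearest enclosing element from that list, and a falsy value (False, None, or an empty list) merges everything into one big string.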
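The accumulation rule on its own can be sketched without Beautiful Soup. This is only an illustration, not the proposed patch: merge_by_container is a hypothetical helper, and the labels "p1" through "p4" are made-up stand-ins for the identity of each fragment's enclosing block element (None meaning no listed ancestor).

```python
from itertools import groupby


def merge_by_container(pairs):
    """Merge consecutive (container, fragment) pairs that share a container."""
    merged = []
    # groupby only groups *adjacent* items with equal keys, which is exactly
    # the "same container as the previous fragment" rule.
    for _, group in groupby(pairs, key=lambda pair: pair[0]):
        merged.append("".join(fragment for _, fragment in group))
    return merged


# Hypothetical container labels; None means "no enclosing block element".
pairs = [
    (None, "\n"),
    ("p1", "sentence one."),
    ("p2", "sentence two."),
    (None, "\n"),
    ("p3", "Hello W"),
    ("p3", "orl"),
    ("p3", "d!"),
    ("p4", "Test"),
    (None, "\n"),
    (None, "\n"),
]

print(merge_by_container(pairs))
# ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']
```

With real Tag objects the key would need to be based on identity (for example id() of the container) rather than equality, since two distinct tags with the same markup compare equal.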

If I have the following input document:

====
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<body>
 <p>sentence one.</p><p>sentence two.</p>
 <p>Hello W<b>orl</b>d!</p><p>Test</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
====

and I evaluate the following function:

====
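def my_all_strings(soup, block_elements=True):
    """Return the strings under soup, grouped according to block_elements.

    True: every NavigableString is its own item (current behavior).
    List of tag names: consecutive strings that share the same nearest
    listed ancestor are merged into one item.
    Falsy (False, None, []): everything is merged into one big string.
    """
    strings = []
    last_block_container = None
    for element in soup.descendants:

        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if block_elements is True:
                # separate *every* string (current behavior)
                new_container = True
            elif block_elements:
                # must be a list; use block-element semantics
                this_block_container = element.find_parent(block_elements)
                # compare by identity: distinct tags with equal markup
                # must still count as different containers
                new_container = (this_block_container is not last_block_container)
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

            if new_container or not strings:
                # start a new string
                strings.append("")

            strings[-1] += element.text
    return strings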

block_elements = [
    'address', 'article', 'aside', 'blockquote', 'canvas', 'dd', 'div',
    'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form',
    'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main',
    'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot',
    'ul', 'video',
]

print(f"{'default:':>32s} {repr(my_all_strings(soup))}")
print(f"{'block_elements = True:':>32s} {repr(my_all_strings(soup, block_elements=True))}")
print(f"{'block_elements = <HTML blocks>:':>32s} {repr(my_all_strings(soup, block_elements=block_elements))}")
print(f"{'block_elements = []:':>32s} {repr(my_all_strings(soup, block_elements=[]))}")
print(f"{'block_elements = False:':>32s} {repr(my_all_strings(soup, block_elements=False))}")
print(f"{'block_elements = None:':>32s} {repr(my_all_strings(soup, block_elements=None))}")
====

I get this:

====
                        default: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
          block_elements = True: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
 block_elements = <HTML blocks>: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']
            block_elements = []: ['\nsentence one.sentence two.\nHello World!Test\n\n']
         block_elements = False: ['\nsentence one.sentence two.\nHello World!Test\n\n']
          block_elements = None: ['\nsentence one.sentence two.\nHello World!Test\n\n']
====
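To illustrate why the grouping matters, here is a small sketch of what a caller might do with the two kinds of result. join_blocks is a hypothetical helper, not part of Beautiful Soup; the two input lists are copied from the output above.

```python
def join_blocks(strings, separator=" "):
    """Strip each item, drop the empty ones, and join with a separator."""
    stripped = (s.strip() for s in strings)
    return separator.join(s for s in stripped if s)


# Result with block_elements = <HTML blocks> (copied from the output above).
blocks = ['\n', 'sentence one.', 'sentence two.', '\n',
          'Hello World!', 'Test', '\n\n']

# Result with block_elements = True (copied from the output above).
fragments = ['\n', 'sentence one.', 'sentence two.', '\n',
             'Hello W', 'orl', 'd!', 'Test', '\n', '\n']

print(join_blocks(blocks))     # sentence one. sentence two. Hello World! Test
print(join_blocks(fragments))  # sentence one. sentence two. Hello W orl d! Test
```

With per-fragment strings the separator lands inside "World"; with block-level strings it only lands between blocks.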

My first version was more compact (~6 lines), but the logic was obscured by ternary operators and sneaky short-circuits. This version is easier for a human to read and should execute just as fast.

block_elements can default to True, which matches the current behavior.

If you're agreeable to the approach, I could try to submit a merge request that uses it in the _all_strings method for Tag objects.