Bug #1768330 “get_text() doesn't retain semantic separation of t...” : Bugs : Beautiful Soup

Revision history for this message

ellie (et1234567) wrote on 2018-05-01:

#1

Sorry, I forgot to specify the version: python-beautifulsoup4-4.6.0-2.fc27.1

Revision history for this message

ellie (et1234567) wrote on 2018-05-01:

#2

I just stumbled upon separator=' ', but sadly that option is also useless / semantically nonsense:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("Hello World!", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d!'
>>> soup = BeautifulSoup("Hello World!Test", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d! Test'
>>>

Of course, here the expected result would be: "Hello World! Test"

Isn't there any way to have BeautifulSoup apply a proper understanding of whitespace like a web browser? (That is, text contained in completely separate BLOCK tags like "p" is always separated with whitespace, while separation by INLINE tags like "b" won't cause spurious, incorrect whitespace.)

I know CSS can break all of this, but only on bad sites that don't use proper semantic HTML. But as BeautifulSoup works now, I find no good option to even parse the *proper* instances of semantic HTML in a correct way to text, which seems quite limiting.

Or is there some hidden module / extension that handles this correctly?

By the way, this bug looks like the same problem, just another instance: https://bugs.launchpad.net/beautifulsoup/+bug/1767999 (failure of BeautifulSoup to understand what is semantically - per default, without CSS changes - an inline and not a block tag, where you can't just slap in whitespace with the same visual result)

Revision history for this message

Leonard Richardson (leonardr) wrote on 2018-07-16:

#3

Thanks for your thoughtful bug report. I would say you've run into the limitations of what Beautiful Soup is designed to do. The library has just enough understanding of the semantics of HTML that parsing a valid document and writing it back out won't alter the document's semantics. This means Beautiful Soup needs to know which HTML elements are void elements, which elements give significance to the whitespace inside them, and so on.

Adding an understanding of inline tags versus block tags would make Beautiful Soup more like a web browser. It's a reasonable thing to ask for, but I try to keep my maintenance work on this project to a couple of weekends a year, so I'm probably not going to write it. Based on my experience with the CSS selector subsystem, I'd be reluctant to even accept such a contribution (though it would depend on how big it really was--and I realize you're not offering to write it).

get_text() is designed as a quick-and-dirty way to rip all the text out of a document, when you'd rather do text processing than HTML processing. I'm going to put this issue in a "confirmed" state and think about how much work it would be to implement the feature you're requesting. I could make a list of the block tags easily enough, and make a method like get_blocks() which tried to group the strings appropriately, but I believe it--like the CSS selectors--would fail in a thousand tiny edge cases, and I don't have time to investigate them.

Changed in beautifulsoup:
status:	New → Confirmed

Leonard Richardson (leonardr) on 2018-07-19

tags:	added: featire
tags:	added: feature removed: featire

Leonard Richardson (leonardr) on 2018-07-21

Changed in beautifulsoup:
importance:	Undecided → Wishlist

Revision history for this message

Leonard Richardson (leonardr) wrote on 2018-08-02:

#4

I find this problem interesting so I spent a little time investigating it. I put a list of the HTML block elements into HTMLTreeBuilder and wrote some code like this:

def contains_any_blocks(tag):
    for i in tag.descendants:
        if i.name in HTMLTreeBuilder.block_elements:
            return True
        return False
used = set()
for block in soup.find_all(HTMLTreeBuilder.block_elements):
    if contains_any_blocks(block):
        continue
    if any(x in used for x in block.parents):
        continue
    used.add(block)
    print block.name, block.get_text(separator=' ').encode("utf-8")
    print "-" * 80

The idea is to find the largest set of non-overlapping block tags. This would minimize the risk that get_text() will return too much or too little text.

This is a decent start, but the code is very slow and sometimes misses obvious cases (e.g. it sometimes treats everything in a <ul> tag as one block, even though <li> is also a block tag).

Overall this has confirmed my belief that I don't want Beautiful Soup to be in this business, but it's very tempting to think that a clever, simple (but probably still slow) solution is nearby.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2019-05-08:

#5

Another way of doing this would be to use currently unused operators for this purpose. Then you could get syntax like:

soup % 'a' / 'href'

This avoids most of the problems I mentioned, but most of the currently unused operators are math operators. There's no intuitive connection between the meaning of the operator and what the operator does to a Tag or a ResultSet. It could just as easily look like this:

soup / 'a' % 'href'

So the resulting system would be hard to learn and remember. The dot operator (generally used to move from a Python object to one of its attributes) and the square-brackets operator (generally used to index a Python array or dictionary) don't have this problem. Their Beautiful Soup uses are similar to their normal Python uses.

Overall I think list comprehensions are the right tool for this sort of thing -- that's the syntax the Python devs came up with and even if I could do slightly better, the fact that it's different from normal Python would itself be a negative.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2019-05-08:

#6

Ignore the last comment -- I meant to post it on bug #1828188

Revision history for this message

Leonard Richardson (leonardr) wrote on 2019-07-22:

#7

I came back to this following the 4.8 release and I think I have an efficient algorithm that groups text blocks together. The catch is there's no way to get the text blocks in a nice list, because block elements can contain other block elements. That's why my original plan fell apart when I started looking at nested lists. I was trying to turn a nested data structure into a list, and there's no general way to do that. Any given strategy will look good on some pages (or parts of pages) and bad on others.

My algorithm focuses on removing 'junk' and presenting the text nodes in a way that reflects the structure of the original tree.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2020-09-26:

#8

I've marked bug 1882067 as a duplicate of this issue, although they're not directly related, because I think they come from the same place: a desire to use Beautiful Soup as a text preprocessor that can strip away "useless" markup.

In this case the concern is that some of the "useless" markup isn't so useless -- it conveys conceptual separations that are lost when you just extract all the text. In the case of 1882067, the concern is that some of the *text* is useless -- it's just whitespace and newlines that won't render in a web browser and ought to be collapsed for reading.

The challenge in both cases is distinguishing the "useless" stuff from the "useful" stuff.

Revision history for this message

Tofu (turfurken) wrote on 2023-06-30:

#9

Just want to chime in with a couple of observations.

1. when using automatic splitting, i.e. .get_text('\n') or .stripped_strings(), it splits by every element. e.g. gets split out instead of kept in-line with the rest of the line

2. when using .get_text(), line breaks just follow the source code rather than elements. e.g. sometext is returned as one line, but
some
text
is returned as two lines.
in other words, using .get_text() can produce two different outputs depending on whether the source has been prettified or not.

Revision history for this message

Tofu (turfurken) wrote on 2023-06-30:

#10

more info:

this also affects .prettify()

e.g.

>>> markup = r'lorem <a href="#">ipsum</a> dolor sit amet'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> print(soup.prettify())

lorem
<a href="#">
ipsum
</a>
dolor

sit

amet

both <a> and are split into their own lines

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2023-11-21 (last edit on 2023-11-21):

#11

Long-time Perl/XML::Twig user, new to Python/Beautiful Soup. @leonardr, this is quite an impressive piece of work you've put together!

<li> and <entry> elements are particularly messy because they can contain mixtures of block and inline elements, such as

<li>Here is plaintext adjacent to a nested list:
 <ul>
 <li>
 Item 1
 </li>
 ...

I had to solve similar block/inline issues here:

https://blog.oxygenxml.com/topics/refactoring_inserting_reformatting.html

Perhaps you could just prepend/append a "block-separation" space around every HTML5 block element (I did not remove tags that .get_text() ignores):

['address', 'article', 'aside','blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video']

then

* Collapse multiple "block-separation" spaces into a single "block-separation" space.
* Strip any "block-separation" spaces at the beginning and end away.

This would also handle directly-adjacent block elements such as

word1word2

If control is desired, this could be implemented as a block_separator=' ' parameter (separation being the default). I personally do not see the need for control; block elements are intrinsically textually separated.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2023-11-22 (last edit on 2023-11-22):

#12

For space-normalized text scraping, an example workaround is space-separation around block elements, followed by space normalization using split/join:

####
from bs4 import BeautifulSoup

html = """\
<html lang="en">
<body>
<h1>H1</h1>A<ol><li>1</li></ol>BC
</body>
</html>
"""

soup = BeautifulSoup(html, features='lxml')

for b in soup.find_all(['article', 'blockquote', 'dd', 'div', 'dl', 'dt', 'figcaption', 'figure', 'footer', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'li', 'main', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul']):
b.insert_before(' ')
b.insert_after(' ')

print(" ".join(soup.get_text().split()))
####

The split/join approach collapses all forms of whitespace (spaces, tabs, newlines, even those funny Unicode non-breaking/narrow/etc. spaces).

However, the space-insertion workaround alters the original document and I'd rather see a native solution inside get_text() that leaves the document as-is.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2023-11-24 (last edit on 2023-11-24):

#13

I guess we would also need a block_elements argument to know which elements are block versus inline:

p.get_text(block_separator=' ', block_elements = ['article', 'blockquote', ..., 'ul'])

And a cleaner solution than the dedicated block_separator argument I previously suggested is just to simply apply the default separator to block elements only:

p.get_text(' ', block_elements = ['article', 'blockquote', ..., 'ul'])

Also, the prettify() method could also support block_elements, which would only indent elements in that list. The default for the block_elements would be True, to maintain compatibility. That would be pretty cool.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-01-06:

#18

Download full text (3.5 KiB)

I might have a solution to this. The idea is to keep accumulating NavigableString fragments into a "current string item" as long as we're inside the same lowest-level containing block element. If we move into a new block element, then we start a new string item and accumulate into that.

The behavior can be controlled by considering a "block_elements" argument that specifies the granularity of block context inference.

If I have the following input document:

====
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<body>
sentence one.sentence two.
Hello World!Test
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
====

and I evaluate the following function:

====
def my_all_strings (soup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:

        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if (block_elements is True):
                # separate *every* string (current behavior)
                new_container = True
            elif (block_elements):
                # must be a list; use block-element semantics
                this_block_container = element.find_parent(block_elements)
                new_container = (this_block_container is not last_block_container)
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

            if new_container or not strings:
                # start a new string
                strings.append("")

strings[-1] += element.text
return strings

block_elements = ['address', 'article', 'aside','blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video']

print(f"{'default:':>32s} {repr(my_all_strings(soup))}")
print(f"{'block_elements = True:':>32s} {repr(my_all_strings(soup, block_elements=True))}")
print(f"{'block_elements = <HTML blocks>:':>32s} {repr(my_all_strings(soup, block_elements=block_elements))}")
print(f"{'block_elements = []:':>32s} {repr(my_all_strings(soup, block_elements=[]))}")
print(f"{'block_elements = False:':>32s} {repr(my_all_strings(soup, block_elements=False))}")
print(f"{'block_elements = None:':>32s} {repr(my_all_strings(soup, block_elements=None))}")
====

I get this:

====
 default: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
 block_elements = True: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
block_elements = <HTML blocks>: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']
 block_elements = []: ['\nsentence one.sentence two.\nHello World!Test\n\n']
 block_elements = False: ['\nsentence one.sentence two.\nHello World!Test\n\n']
 block_elements = None: ['\nsentence one.sentence two.\nHello World!Test\n\n']
====

My first ver...

I might have a solution to this. The idea is to keep accumulating NavigableString fragments into a "current string item" as long as we're inside the same lowest-level containing block element. If we move into a new block element, then we start a new string item and accumulate into that.

The behavior can be controlled by considering a "block_elements" argument that specifies the granularity of block context inference.

If I have the following input document:

====
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<body>
 sentence one.sentence two.
 Hello World!Test
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
====

and I evaluate the following function:

====
def my_all_strings (soup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:

# determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if (block_elements is True):
                # separate *every* string (current behavior)
                new_container = True
            elif (block_elements):
                # must be a list; use block-element semantics
                this_block_container = element.find_parent(block_elements)
                new_container = (this_block_container is not last_block_container)
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

if new_container or not strings:
                # start a new string
                strings.append("")

strings[-1] += element.text
    return strings

block_elements = ['address', 'article', 'aside','blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video']

print(f"{'default:':>32s} {repr(my_all_strings(soup))}")
print(f"{'block_elements = True:':>32s} {repr(my_all_strings(soup, block_elements=True))}")
print(f"{'block_elements = <HTML blocks>:':>32s} {repr(my_all_strings(soup, block_elements=block_elements))}")
print(f"{'block_elements = []:':>32s} {repr(my_all_strings(soup, block_elements=[]))}")
print(f"{'block_elements = False:':>32s} {repr(my_all_strings(soup, block_elements=False))}")
print(f"{'block_elements = None:':>32s} {repr(my_all_strings(soup, block_elements=None))}")
====

I get this:

====
 default: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
 block_elements = True: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
 block_elements = <HTML blocks>: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']
 block_elements = []: ['\nsentence one.sentence two.\nHello World!Test\n\n']
 block_elements = False: ['\nsentence one.sentence two.\nHello World!Test\n\n']
 block_elements = None: ['\nsentence one.sentence two.\nHello World!Test\n\n']
====

My first version was more compact (~6 lines) but the logic was obfuscated by ternary operators and sneaky short-circuits. This version is more friendly to the human and should execute just as fast.

block_elements can default to True, which matches the current behavior today.

If you're agreeable to the approach, I could try to submit a merge request that uses it in the _all_strings method for Tag objects.

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-01-06:

#19

This is promising. I will want to play around with the API for maximum forwards compatibility, but the output of this algorithm definitely looks good. I have two questions before you do any more work, and I think you can answer both of them at the same time.

First, can you try this on some real web pages and see if it gives you the results you want? A simple case that's also very common would be extracting the "meat" of a web page's content: the product information or the news article on a page that also contains a lot of peripheral stuff.

Second, I'm a bit concerned about any code that looks like this:

for element in soup.descendants:
...
this_block_container = element.find_parent(block_elements)

Because you're calling a tree navigation method inside another tree navigation method, which is very bad for performance. However, find_parent is the least-bad tree navigation method to call in this situation, so it might not be that bad. Basically, if you try this on real web pages, also gather some timing information so I can compare it to the current implementation of get_text().

Revision history for this message

Tofu (turfurken) wrote on 2024-01-06:

#20

two comments about block elements:

1. i wonder if there's any nicer way of determining if an element is a block element other than just maintaining a static list which the package maintainers will have to keep updated manually

2. how is, for example, going to be treated? it is not a block element but is meant to introduce a line break

i think that depends on which of the two general approaches you want to take:

a. try to extract the plaintext as they would appear to a user in a browser (make a new line)

b. try to extract the plaintext as they logically fit in the markup (ignore the )

i believe that even with the (a) approach, it would still be out of scope to try to render the document with CSS and all, but maybe it's not too much to conform to the default html markup behaviour

Revision history for this message

Leonard Richardson (leonardr) wrote on 2024-01-07:

#21

That list of HTML block elements is taken from HTMLTreeBuilder, where there's a comment saying it comes from the HTML spec. But that list must be pretty old because that language is not in the HTML spec anymore. It looks like the concept of elements being intrinsically "block" or "inline" has been replaced by a CSS concept called "formatting context" that I don't currently understand. (https://www.w3.org/TR/CSS2/visuren.html#normal-flow)

So in the worst case, as you say, rendering the content in a CSS-aware way would seem to be necessary to see which text nodes are relevant. That's definitely off the table. As an approximation, something that uses notions from the current HTML spec such as "flow content" and "phrasing content" might work in most situations. (https://html.spec.whatwg.org/#kinds-of-content)

However it works, for the text extraction algorithm I would implement a kind of strategy pattern, similar to the pattern used to choose the markup parser. I am very interested in providing a way for people to plug in their own text extraction algorithms, and not very interested in supporting any particular algorithm indefinitely.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-01-10 (last edit on 2024-01-10):

#22

Our current algorithm to solve this problem was to use copy.copy() to make a copy of the soup, iterate through all block elements and insert special NavigableString separator strings ("<<BLOCK>>") before and after each one, call soup.text, then search-and-replace any sequence of one or more separator strings with a single space. I didn't like this approach because I had to copy and destructively modify the soup.

On a test set of about 30k HTML files, this new algorithm returns 100% identical results to our current algorithm, and no copy/modification was needed.

A note on how I arrived at this algorithm... Originally I thought about trying to iterate through Tag descendants and keep track of what I entered and left to maintain a "current" block context. Knowing when we cross start tags was fine - that's precisely what the iterator is - but knowing when we cross end tags was difficult. I could derive it by comparing this start tag to the last start tag, but to determine if the next Tag was *inside* or *after* the previous start tag, I had to query the ancestry of containing elements. And if I'm going to do that each time, I might as well skip the Tags and just check the block context of each NavigableString object.

Would inlining a hardcoded (and simplified!) version of find_parent() in this algorithm resolve your concerns? Is it the algorithmic inclusion of an inner loop within an outer loop, or is it some technical aspect of nesting iterable things that have to maintain context as the interpreter jumps between the execution locations? (And fortunately, there are typically very few (if any) levels separating NavigableStrings from their containing block element.)

I am glad you are looking to provide a generalized solution that can be configured. For example, if I am working with DITA XML content:

https://www.oxygenxml.com/dita/1.3/specs/index.html

then the lists of block and inline elements will be different:

https://blog.oxygenxml.com/topics/refactoring_inserting_reformatting.html

@turfurken - with this algorithm, "ABC DEF" would return "ABCDEF". The algorithm would need an "elif" branch to push a single space on the strings list for elements. As @leonardr said, we don't want to hardcode format-specific heuristics, so hopefully whatever handler configurability magic he comes up with can support this type of thing.

Our current algorithm to solve this problem was to use copy.copy() to make a copy of the soup, iterate through all block elements and insert special NavigableString separator strings ("<<BLOCK>>") before and after each one, call soup.text, then search-and-replace any sequence of one or more separator strings with a single space. I didn't like this approach because I had to copy and destructively modify the soup.

On a test set of about 30k HTML files, this new algorithm returns 100% identical results to our current algorithm, and no copy/modification was needed.

A note on how I arrived at this algorithm... Originally I thought about trying to iterate through Tag descendants and keep track of what I entered and left to maintain a "current" block context. Knowing when we cross start tags was fine - that's precisely what the iterator is - but knowing when we cross end tags was difficult. I could derive it by comparing this start tag to the last start tag, but to determine if the next Tag was *inside* or *after* the previous start tag, I had to query the ancestry of containing elements. And if I'm going to do that each time, I might as well skip the Tags and just check the block context of each NavigableString object.

Would inlining a hardcoded (and simplified!) version of find_parent() in this algorithm resolve your concerns? Is it the algorithmic inclusion of an inner loop within an outer loop, or is it some technical aspect of nesting iterable things that have to maintain context as the interpreter jumps between the execution locations? (And fortunately, there are typically very few (if any) levels separating NavigableStrings from their containing block element.)

I am glad you are looking to provide a generalized solution that can be configured. For example, if I am working with DITA XML content:

https://www.oxygenxml.com/dita/1.3/specs/index.html

then the lists of block and inline elements will be different:

https://blog.oxygenxml.com/topics/refactoring_inserting_reformatting.html

@turfurken - with this algorithm, "ABC DEF" would return "ABCDEF". The algorithm would need an "elif" branch to push a single space on the strings list for elements. As @leonardr said, we don't want to hardcode format-specific heuristics, so hopefully whatever handler configurability magic he comes up with can support this type of thing.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-03-19:

#24

The call to find_parent() could be rewritten using next() as a more lightweight way to find the closest (lowest-level) enclosing block element (the rest of the code is unchanged):

====
def my_all_strings (soup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:

        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if (block_elements is True):
                # separate *every* string (current behavior)
                new_container = True
            elif (block_elements):
                # must be a list; use block-element semantics
                try:
                    this_block_container = next(parent for parent in element.parents if parent.name in block_elements)
                except StopIteration:
                    this_block_container = None
                new_container = (this_block_container is not last_block_container)
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

            if new_container or not strings:
                # start a new string
                strings.append("")

strings[-1] += element.text
return strings
====

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-04-21:

#25

There is one more refinement needed for my suggestion.

The code above gets the individual block strings well enough, but the problem remains of how to concatenate them into a single result string. If I directly concatenate them with

====
strings = my_all_strings(blcok_elements=block_elements)
text = "".join(strings)
print(f"###{my_text(soup, block_elements=block_elements)}###")
====

then I get both missing separation (between Hello World! and Text) and extra newlines that are not needed:

====
###
sentence one.sentence two.
Hello World!Test

###
====

To resolve this, we can concatenate block strings with newlines, then remove leading/trailing and internally-redundant newlines in the string. Here is a my_get_text() that puts all the pieces together:

====
from bs4 import BeautifulSoup, NavigableString
import re

def my_get_text(soup: BeautifulSoup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:
        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if block_elements is True:
                # separate *every* string (current behavior)
                new_container = True
            elif block_elements:
                # must be a list; use block-element semantics
                try:
                    this_block_container = next(
                        parent
                        for parent in element.parents
                        if parent.name in block_elements
                    )
                except StopIteration:
                    this_block_container = None
                new_container = this_block_container is not last_block_container
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

            if new_container or not strings:
                # start a new string
                strings.append("")

strings[-1] += element.text

    text = "\n".join(strings)
    text = re.sub(r"^\n+|\n(?=\n)|\n+$", "", text)
    return text
====

If I run the following code:

====
html_doc = """
<body>
sentence one.sentence two.
Hello World!Test
</body>
"""
soup = BeautifulSoup(html_doc, "lxml")
print(f"###{my_get_text(soup, block_elements=block_elements)}###")
====

I get:

====
###sentence one.
sentence two.
Hello World!
Test###
====

which looks quite reasonable to me.

There is one more refinement needed for my suggestion.

The code above gets the individual block strings well enough, but the problem remains of how to concatenate them into a single result string. If I directly concatenate them with

====
strings = my_all_strings(blcok_elements=block_elements)
text = "".join(strings)
print(f"###{my_text(soup, block_elements=block_elements)}###")
====

then I get both missing separation (between Hello World! and Text) and extra newlines that are not needed:

====
###
sentence one.sentence two.
Hello World!Test

###
====

To resolve this, we can concatenate block strings with newlines, then remove leading/trailing and internally-redundant newlines in the string. Here is a my_get_text() that puts all the pieces together:

====
from bs4 import BeautifulSoup, NavigableString
import re

def my_get_text(soup: BeautifulSoup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:
        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if block_elements is True:
                # separate *every* string (current behavior)
                new_container = True
            elif block_elements:
                # must be a list; use block-element semantics
                try:
                    this_block_container = next(
                        parent
                        for parent in element.parents
                        if parent.name in block_elements
                    )
                except StopIteration:
                    this_block_container = None
                new_container = this_block_container is not last_block_container
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

if new_container or not strings:
                # start a new string
                strings.append("")

strings[-1] += element.text

text = "\n".join(strings)
    text = re.sub(r"^\n+|\n(?=\n)|\n+$", "", text)
    return text
====

If I run the following code:

====
html_doc = """
<body>
 sentence one.sentence two.
 Hello World!Test
</body>
"""
soup = BeautifulSoup(html_doc, "lxml")
print(f"###{my_get_text(soup, block_elements=block_elements)}###")
====

I get:

====
###sentence one.
sentence two.
Hello World!
Test###
====

which looks quite reasonable to me.

Revision history for this message

Chris Papademetrious (chrispitude) wrote on 2024-04-21:

#26

Eep, I forgot about preserving newlines in <pre> blocks:

====
html_doc = """
<body>
line 1
<pre>line 2

line 3

line 4</pre>
line 5
</body>
"""

soup = BeautifulSoup(html_doc, "lxml")
print(f"###{my_get_text(soup, block_elements=block_elements)}###")
====

====
###line 1
line 2
line 3
line 4
line 5###
====

so if we want to preserve newlines inside block elements, we'll need to write a manual concatenation loop that considers the end of the previous string and the beginning of the next string. It's a solvable problem, we just need to decide what the desired behavior is, then implement it. My guess is to insert a newline between any two block strings where non-newline characters would come together.

Beautiful Soup

get_text() doesn't retain semantic separation of two sentences in separate paragraphs

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches