get_text() doesn't retain semantic separation of two sentences in separate paragraphs

Bug #1768330 reported by ellie
This bug affects 3 people
Affects: Beautiful Soup
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Milestone: (none)

Bug Description

get_text() doesn't retain semantic separation of two sentences in separate paragraphs.

Here is an example that shows the issue with both Python's built-in html.parser and html5lib:

>>> from bs4 import BeautifulSoup
>>> result = BeautifulSoup("<p>this is sentence one.</p><p>this is sentence two.</p>", "html.parser")
>>> result.get_text()
'this is sentence one.this is sentence two.'
>>> result = BeautifulSoup("<p>this is sentence one.</p><p>this is sentence two.</p>", "html5lib")
>>> result.get_text()
'this is sentence one.this is sentence two.'

The expected result would be: 'this is sentence one. this is sentence two.'

With the current behavior, I would argue get_text() isn't really useful as a generic function to extract the semantic text of a document.

Tags: feature
Revision history for this message
ellie (et1234567) wrote :

Sorry, I forgot to specify the version: python-beautifulsoup4-4.6.0-2.fc27.1

Revision history for this message
ellie (et1234567) wrote :

I just stumbled upon separator=' ', but sadly that option is also useless, since it produces semantically nonsensical output:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p>", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d!'
>>> soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p><p>Test</p>", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d! Test'
>>>

Of course, here the expected result would be: "Hello World! Test"

Isn't there any way to have BeautifulSoup apply a proper understanding of whitespace like a web browser? (That is, text contained in completely separate BLOCK tags like "p" is always separated with whitespace, while separation by INLINE tags like "b" won't cause spurious, incorrect whitespace.)

I know CSS can break all of this, but only on bad sites that don't use proper semantic HTML. But as BeautifulSoup works now, I find no good option to correctly extract text from even the *proper* instances of semantic HTML, which seems quite limiting.

Or is there some hidden module / extension that handles this correctly?
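
(One crude workaround, if you already know which tags act as blocks in your documents, is to extract the text per block element and join the pieces yourself:)

>>> soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p><p>Test</p>", "html.parser")
>>> " ".join(p.get_text() for p in soup.find_all("p"))
'Hello World! Test'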

By the way, this bug looks like the same problem, just another instance: https://bugs.launchpad.net/beautifulsoup/+bug/1767999 (failure of BeautifulSoup to understand what is semantically - per default, without CSS changes - an inline and not a block tag, where you can't just slap in whitespace with the same visual result)

Revision history for this message
Leonard Richardson (leonardr) wrote :

Thanks for your thoughtful bug report. I would say you've run into the limitations of what Beautiful Soup is designed to do. The library has just enough understanding of the semantics of HTML that parsing a valid document and writing it back out won't alter the document's semantics. This means Beautiful Soup needs to know which HTML elements are void elements, which elements give significance to the whitespace inside them, and so on.

Adding an understanding of inline tags versus block tags would make Beautiful Soup more like a web browser. It's a reasonable thing to ask for, but I try to keep my maintenance work on this project to a couple of weekends a year, so I'm probably not going to write it. Based on my experience with the CSS selector subsystem, I'd be reluctant to even accept such a contribution (though it would depend on how big it really was--and I realize you're not offering to write it).

get_text() is designed as a quick-and-dirty way to rip all the text out of a document, when you'd rather do text processing than HTML processing. I'm going to put this issue in a "confirmed" state and think about how much work it would be to implement the feature you're requesting. I could make a list of the block tags easily enough, and make a method like get_blocks() which tried to group the strings appropriately, but I believe it--like the CSS selectors--would fail in a thousand tiny edge cases, and I don't have time to investigate them.

Changed in beautifulsoup:
status: New → Confirmed
tags: added: featire
tags: added: feature
removed: featire
Changed in beautifulsoup:
importance: Undecided → Wishlist
Revision history for this message
Leonard Richardson (leonardr) wrote :

I find this problem interesting so I spent a little time investigating it. I put a list of the HTML block elements into HTMLTreeBuilder and wrote some code like this:

def contains_any_blocks(tag):
    # True if any descendant of this tag is itself a block element
    for i in tag.descendants:
        if i.name in HTMLTreeBuilder.block_elements:
            return True
    return False

used = set()
for block in soup.find_all(HTMLTreeBuilder.block_elements):
    if contains_any_blocks(block):    # skip blocks that contain other blocks
        continue
    if any(x in used for x in block.parents):   # skip blocks inside used ones
        continue
    used.add(block)
    print(block.name, block.get_text(separator=' '))
    print("-" * 80)

The idea is to find the largest set of non-overlapping block tags. This would minimize the risk that get_text() will return too much or too little text.

This is a decent start, but the code is very slow and sometimes misses obvious cases (e.g. it sometimes treats everything in a <ul> tag as one block, even though <li> is also a block tag).

Overall this has confirmed my belief that I don't want Beautiful Soup to be in this business, but it's very tempting to think that a clever, simple (but probably still slow) solution is nearby.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Another way of doing this would be to use currently unused operators for this purpose. Then you could get syntax like:

soup % 'a' / 'href'

This avoids most of the problems I mentioned, but most of the currently unused operators are math operators. There's no intuitive connection between the meaning of the operator and what the operator does to a Tag or a ResultSet. It could just as easily look like this:

soup / 'a' % 'href'

So the resulting system would be hard to learn and remember. The dot operator (generally used to move from a Python object to one of its attributes) and the square-brackets operator (generally used to index a Python array or dictionary) don't have this problem. Their Beautiful Soup uses are similar to their normal Python uses.

Overall I think list comprehensions are the right tool for this sort of thing -- that's the syntax the Python devs came up with and even if I could do slightly better, the fact that it's different from normal Python would itself be a negative.

Revision history for this message
Leonard Richardson (leonardr) wrote :

Ignore the last comment -- I meant to post it on bug #1828188

Revision history for this message
Leonard Richardson (leonardr) wrote :

I came back to this following the 4.8 release and I think I have an efficient algorithm that groups text blocks together. The catch is there's no way to get the text blocks in a nice list, because block elements can contain other block elements. That's why my original plan fell apart when I started looking at nested lists. I was trying to turn a nested data structure into a list, and there's no general way to do that. Any given strategy will look good on some pages (or parts of pages) and bad on others.

My algorithm focuses on removing 'junk' and presenting the text nodes in a way that reflects the structure of the original tree.
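
As an illustration only (this is not the actual implementation), a recursive walk that presents the text nodes while keeping the tree's nesting could look like:

from bs4 import BeautifulSoup, NavigableString, Tag

def text_tree(tag):
    pieces = []
    for child in tag.children:
        if isinstance(child, NavigableString):
            text = child.strip()
            if text:                       # drop whitespace-only 'junk' nodes
                pieces.append(text)
        elif isinstance(child, Tag):
            subtree = text_tree(child)     # recurse, keeping nesting intact
            if subtree:
                pieces.append(subtree)
    return pieces

soup = BeautifulSoup("<div><p>one</p><ul><li>a</li><li>b</li></ul></div>",
                     "html.parser")
print(text_tree(soup))   # [[['one'], [['a'], ['b']]]]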

Revision history for this message
Leonard Richardson (leonardr) wrote :

I've marked bug 1882067 as a duplicate of this issue, although they're not directly related, because I think they come from the same place: a desire to use Beautiful Soup as a text preprocessor that can strip away "useless" markup.

In this case the concern is that some of the "useless" markup isn't so useless -- it conveys conceptual separations that are lost when you just extract all the text. In the case of 1882067, the concern is that some of the *text* is useless -- it's just whitespace and newlines that won't render in a web browser and ought to be collapsed for reading.

The challenge in both cases is distinguishing the "useless" stuff from the "useful" stuff.

Revision history for this message
Tofu (turfurken) wrote :

Just want to chime in with a couple of observations.

1. When using automatic splitting, i.e. .get_text('\n') or .stripped_strings, the text is split at every element. E.g. a <span> gets split out instead of kept in-line with the rest of the line.

2. When using .get_text(), line breaks just follow the source code rather than the elements. E.g. <p>some</p><p>text</p> is returned as one line, but
<p>some
text</p>
is returned as two lines.
In other words, .get_text() can produce two different outputs depending on whether the source has been prettified or not.
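
Both behaviours can be reproduced in a few lines (html.parser shown; no parser was specified above):

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>one <span>two</span> three</p>", "html.parser").get_text("\n")
'one \ntwo\n three'
>>> BeautifulSoup("<p>some</p><p>text</p>", "html.parser").get_text()
'sometext'
>>> BeautifulSoup("<p>some\ntext</p>", "html.parser").get_text()
'some\ntext'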

Revision history for this message
Tofu (turfurken) wrote :

More info:

This also affects .prettify().

e.g.

>>> markup = r'<p>lorem <a href="#">ipsum</a> dolor <span>sit</span> amet</p>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> print(soup.prettify())
<p>
 lorem
 <a href="#">
  ipsum
 </a>
 dolor
 <span>
  sit
 </span>
 amet
</p>

Both <a> and <span> are split onto their own lines.

Revision history for this message
Chris Papademetrious (chrispitude) wrote (last edit ):

Long-time Perl/XML::Twig user, new to Python/Beautiful Soup. @leonardr, this is quite an impressive piece of work you've put together!

<li> and <entry> elements are particularly messy because they can contain mixtures of block and inline elements, such as

<li>Here is plaintext adjacent to a nested list:
  <ul>
    <li>
      <p>Item 1</p>
    </li>
    ...

I had to solve similar block/inline issues here:

https://blog.oxygenxml.com/topics/refactoring_inserting_reformatting.html

Perhaps you could just prepend/append a "block-separation" space around every HTML5 block element (I did not remove tags that .get_text() ignores):

['address', 'article', 'aside','blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video']

then

* Collapse multiple "block-separation" spaces into a single "block-separation" space.
* Strip any "block-separation" spaces at the beginning and end away.

This would also handle directly-adjacent block elements such as

<p>word1</p><p>word2</p>

If control is desired, this could be implemented as a block_separator=' ' parameter (separation being the default). I personally do not see the need for control; block elements are intrinsically textually separated.

Revision history for this message
Chris Papademetrious (chrispitude) wrote (last edit ):

For space-normalized text scraping, an example workaround is space-separation around block elements, followed by space normalization using split/join:

####
from bs4 import BeautifulSoup

html = """\
<html lang="en">
 <body>
  <h1>H1</h1><p>A</p><ol><li>1</li></ol><p>B</p><p>C</p>
 </body>
</html>
"""

soup = BeautifulSoup(html, features='lxml')

for b in soup.find_all(['article', 'blockquote', 'dd', 'div', 'dl', 'dt', 'figcaption', 'figure', 'footer', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'li', 'main', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul']):
    b.insert_before(' ')
    b.insert_after(' ')

print(" ".join(soup.get_text().split()))
####
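
(For the sample document above, this prints "H1 A 1 B C".)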

The split/join approach collapses all forms of whitespace (spaces, tabs, newlines, even those funny Unicode non-breaking/narrow/etc. spaces).
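
For example:

>>> " ".join("a \t b\u00a0\u2009c\n".split())
'a b c'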

However, the space-insertion workaround alters the original document and I'd rather see a native solution inside get_text() that leaves the document as-is.

Revision history for this message
Chris Papademetrious (chrispitude) wrote (last edit ):

I guess we would also need a block_elements argument to know which elements are block versus inline:

p.get_text(block_separator=' ', block_elements = ['article', 'blockquote', ..., 'ul'])

And a cleaner solution than the dedicated block_separator argument I previously suggested is simply to apply the default separator to block elements only:

p.get_text(' ', block_elements = ['article', 'blockquote', ..., 'ul'])

The prettify() method could also support block_elements, indenting only the elements in that list. The default for block_elements would be True, to maintain compatibility. That would be pretty cool.
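
Hypothetically (this is not a real Beautiful Soup API today), usage might look like:

p.prettify(block_elements=['article', 'blockquote', ..., 'ul'])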

Revision history for this message
Chris Papademetrious (chrispitude) wrote :

I might have a solution to this. The idea is to keep accumulating NavigableString fragments into a "current string item" as long as we're inside the same lowest-level containing block element. If we move into a new block element, then we start a new string item and accumulate into that.

The behavior can be controlled by a "block_elements" argument that specifies the granularity of block-context inference.

If I have the following input document:

====
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<body>
 <p>sentence one.</p><p>sentence two.</p>
 <p>Hello W<b>orl</b>d!</p><p>Test</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
====

and I evaluate the following function:

====
def my_all_strings (soup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:

        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if (block_elements is True):
                # separate *every* string (current behavior)
                new_container = True
            elif (block_elements):
                # must be a list; use block-element semantics
                this_block_container = element.find_parent(block_elements)
                new_container = (this_block_container is not last_block_container)
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

            if new_container or not strings:
                # start a new string
                strings.append("")

            strings[-1] += element.text
    return strings

block_elements = ['address', 'article', 'aside','blockquote', 'canvas', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hr', 'li', 'main', 'nav', 'noscript', 'ol', 'p', 'pre', 'section', 'table', 'tfoot', 'ul', 'video']

print(f"{'default:':>32s} {repr(my_all_strings(soup))}")
print(f"{'block_elements = True:':>32s} {repr(my_all_strings(soup, block_elements=True))}")
print(f"{'block_elements = <HTML blocks>:':>32s} {repr(my_all_strings(soup, block_elements=block_elements))}")
print(f"{'block_elements = []:':>32s} {repr(my_all_strings(soup, block_elements=[]))}")
print(f"{'block_elements = False:':>32s} {repr(my_all_strings(soup, block_elements=False))}")
print(f"{'block_elements = None:':>32s} {repr(my_all_strings(soup, block_elements=None))}")
====

I get this:

====
                        default: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
          block_elements = True: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello W', 'orl', 'd!', 'Test', '\n', '\n']
 block_elements = <HTML blocks>: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']
            block_elements = []: ['\nsentence one.sentence two.\nHello World!Test\n\n']
         block_elements = False: ['\nsentence one.sentence two.\nHello World!Test\n\n']
          block_elements = None: ['\nsentence one.sentence two.\nHello World!Test\n\n']
====

My first ver...


Revision history for this message
Leonard Richardson (leonardr) wrote :

This is promising. I will want to play around with the API for maximum forwards compatibility, but the output of this algorithm definitely looks good. I have two questions before you do any more work, and I think you can answer both of them at the same time.

First, can you try this on some real web pages and see if it gives you the results you want? A simple case that's also very common would be extracting the "meat" of a web page's content: the product information or the news article on a page that also contains a lot of peripheral stuff.

Second, I'm a bit concerned about any code that looks like this:

for element in soup.descendants:
   ...
   this_block_container = element.find_parent(block_elements)

You're calling a tree navigation method inside another tree navigation method, which is very bad for performance. However, find_parent is the least-bad tree navigation method to call in this situation, so it might not be that bad. Basically, if you try this on real web pages, also gather some timing information so I can compare it to the current implementation of get_text().
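
The kind of harness I mean might look like this (a sketch; "page.html" stands in for any large real-world page, and my_all_strings and block_elements come from your earlier comment):

import time
from bs4 import BeautifulSoup

with open("page.html") as f:       # placeholder: any large real-world page
    soup = BeautifulSoup(f, "html.parser")

t0 = time.perf_counter()
soup.get_text()
print("get_text():      ", time.perf_counter() - t0)

t0 = time.perf_counter()
my_all_strings(soup, block_elements=block_elements)
print("my_all_strings():", time.perf_counter() - t0)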

Revision history for this message
Tofu (turfurken) wrote :

Two comments about block elements:

1. I wonder if there's a nicer way of determining whether an element is a block element than maintaining a static list that the package maintainers have to keep updated manually.

2. How is, for example, <br/> going to be treated? It is not a block element, but it is meant to introduce a line break.

I think that depends on which of the two general approaches you want to take:

a. Try to extract the plaintext as it would appear to a user in a browser (insert a new line).

b. Try to extract the plaintext as it logically fits in the markup (ignore the <br/>).

I believe that even with approach (a), it would still be out of scope to try to render the document with CSS and all, but maybe it's not too much to conform to the default HTML markup behaviour.

Revision history for this message
Leonard Richardson (leonardr) wrote :

That list of HTML block elements is taken from HTMLTreeBuilder, where there's a comment saying it comes from the HTML spec. But that list must be pretty old because that language is not in the HTML spec anymore. It looks like the concept of elements being intrinsically "block" or "inline" has been replaced by a CSS concept called "formatting context" that I don't currently understand. (https://www.w3.org/TR/CSS2/visuren.html#normal-flow)

So in the worst case, as you say, rendering the content in a CSS-aware way would seem to be necessary to see which text nodes are relevant. That's definitely off the table. As an approximation, something that uses notions from the current HTML spec such as "flow content" and "phrasing content" might work in most situations. (https://html.spec.whatwg.org/#kinds-of-content)

However it works, for the text extraction algorithm I would implement a kind of strategy pattern, similar to the pattern used to choose the markup parser. I am very interested in providing a way for people to plug in their own text extraction algorithms, and not very interested in supporting any particular algorithm indefinitely.
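
A very rough sketch of the kind of pluggable hook I have in mind (hypothetical API -- nothing like this exists in Beautiful Soup today):

from bs4 import BeautifulSoup

class TextExtractionStrategy:
    """Base strategy: subclasses decide how strings are grouped."""
    def extract(self, soup):
        raise NotImplementedError

class PhrasingAware(TextExtractionStrategy):
    """Approximation using the HTML spec's 'phrasing content' category:
    anything that is not phrasing content starts a new block of text."""
    PHRASING = {"a", "b", "i", "em", "strong", "span", "code",
                "small", "sub", "sup", "abbr", "cite", "q"}

    def extract(self, soup):
        blocks, last_block = [], object()
        for s in soup.strings:
            # walk up past phrasing (inline) ancestors to the block context
            block = s.parent
            while block is not None and block.name in self.PHRASING:
                block = block.parent
            if block is not last_block:
                blocks.append("")
                last_block = block
            blocks[-1] += s
        return " ".join(b.strip() for b in blocks if b.strip())

soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p><p>Test</p>", "html.parser")
print(PhrasingAware().extract(soup))   # Hello World! Test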

Revision history for this message
Chris Papademetrious (chrispitude) wrote (last edit ):

Our current algorithm for this problem uses copy.copy() to make a copy of the soup, iterates through all block elements and inserts special NavigableString separator strings ("<<BLOCK>>") before and after each one, calls soup.text, then replaces any sequence of one or more separator strings with a single space. I didn't like this approach because I had to copy and destructively modify the soup.

On a test set of about 30k HTML files, this new algorithm returns 100% identical results to our current algorithm, and no copy/modification was needed.

A note on how I arrived at this algorithm... Originally I thought about trying to iterate through Tag descendants and keep track of what I entered and left to maintain a "current" block context. Knowing when we cross start tags was fine - that's precisely what the iterator is - but knowing when we cross end tags was difficult. I could derive it by comparing this start tag to the last start tag, but to determine if the next Tag was *inside* or *after* the previous start tag, I had to query the ancestry of containing elements. And if I'm going to do that each time, I might as well skip the Tags and just check the block context of each NavigableString object.

Would inlining a hardcoded (and simplified!) version of find_parent() in this algorithm resolve your concerns? Is the concern the algorithmic cost of an inner loop within an outer loop, or some technical aspect of nested iterators having to maintain context as the interpreter jumps between execution locations? (And fortunately, there are typically very few levels, if any, separating NavigableStrings from their containing block element.)

I am glad you are looking to provide a generalized solution that can be configured. For example, if I am working with DITA XML content:

https://www.oxygenxml.com/dita/1.3/specs/index.html

then the lists of block and inline elements will be different:

https://blog.oxygenxml.com/topics/refactoring_inserting_reformatting.html

@turfurken - with this algorithm, "<p>ABC<br/>DEF</p>" would return "ABCDEF". The algorithm would need an "elif" branch to push a single space on the strings list for <br/> elements. As @leonardr said, we don't want to hardcode format-specific heuristics, so hopefully whatever handler configurability magic he comes up with can support this type of thing.
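
A sketch of that elif branch, spliced into the descendants loop of my_all_strings just before the NavigableString check (Tag must be imported from bs4 alongside NavigableString):

====
# force a split at <br/> so "<p>ABC<br/>DEF</p>" yields ['ABC', 'DEF'];
# the caller can then decide what separator to place between the items
if isinstance(element, Tag) and element.name == 'br':
    strings.append("")
    continue
====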

Revision history for this message
Chris Papademetrious (chrispitude) wrote :

The call to find_parent() could be rewritten using next() as a more lightweight way to find the closest (lowest-level) enclosing block element (the rest of the code is unchanged):

====
def my_all_strings (soup, block_elements=True):
    strings = []
    last_block_container = None
    for element in soup.descendants:

        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if (block_elements is True):
                # separate *every* string (current behavior)
                new_container = True
            elif (block_elements):
                # must be a list; use block-element semantics
                try:
                    this_block_container = next(parent for parent in element.parents if parent.name in block_elements)
                except StopIteration:
                    this_block_container = None
                new_container = (this_block_container is not last_block_container)
                last_block_container = this_block_container
            else:
                # return one big string
                new_container = False

            if new_container or not strings:
                # start a new string
                strings.append("")

            strings[-1] += element.text
    return strings
====
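
A further simplification: next() accepts a default value, which removes the need for the try/except:

====
this_block_container = next(
    (parent for parent in element.parents if parent.name in block_elements),
    None)
====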
