get_text() doesn't retain semantic separation of two sentences in separate paragraphs

Bug #1768330 reported by ellie on 2018-05-01
This bug affects 3 people
Affects: Beautiful Soup
Importance: Wishlist
Assigned to: Unassigned

Bug Description

get_text() doesn't retain semantic separation of two sentences in separate paragraphs.

Here is an example that shows the issue with both Python's built-in html.parser and html5lib:

>>> from bs4 import BeautifulSoup
>>> result = BeautifulSoup("<p>this is sentence one.</p><p>this is sentence two.</p>", "html.parser")
>>> result.get_text()
'this is sentence one.this is sentence two.'
>>> result = BeautifulSoup("<p>this is sentence one.</p><p>this is sentence two.</p>", "html5lib")
>>> result.get_text()
'this is sentence one.this is sentence two.'

The expected result would be: 'this is sentence one. this is sentence two.'

With the current behavior, I would argue get_text() isn't really useful as a generic function to extract the semantic text of a document.
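A workaround sketch (assuming bs4 is installed and the block-level tags of interest are known in advance, here just `<p>`): collect the text of each block tag yourself and join with a separator, instead of calling get_text() on the whole soup.

```python
# Join per-paragraph text manually so paragraph boundaries become spaces.
from bs4 import BeautifulSoup

html = "<p>this is sentence one.</p><p>this is sentence two.</p>"
soup = BeautifulSoup(html, "html.parser")
text = " ".join(p.get_text() for p in soup.find_all("p"))
print(text)  # this is sentence one. this is sentence two.
```

This only helps when the document's block structure is simple and flat; nested block tags would be double-counted by a naive find_all.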

ellie (et1234567) wrote :

Sorry, I forgot to specify the version: python-beautifulsoup4-4.6.0-2.fc27.1

ellie (et1234567) wrote :

I just stumbled upon separator=' ', but sadly that option is also semantically nonsensical:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p>", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d!'
>>> soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p><p>Test</p>", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d! Test'
>>>

Of course, here the expected result would be: "Hello World! Test"

Isn't there any way to have BeautifulSoup apply a proper understanding of whitespace like a web browser? (That is, text contained in completely separate BLOCK tags like "p" is always separated with whitespace, while separation by INLINE tags like "b" won't cause spurious, incorrect whitespace.)

I know CSS can break all of this, but only on bad sites that don't use proper semantic HTML. But as BeautifulSoup works now, I find no good option to correctly convert even the *proper* instances of semantic HTML to text, which seems quite limiting.

Or is there some hidden module / extension that handles this correctly?

By the way, this bug looks like the same problem, just another instance: https://bugs.launchpad.net/beautifulsoup/+bug/1767999 (failure of BeautifulSoup to understand what is semantically - per default, without CSS changes - an inline and not a block tag, where you can't just slap in whitespace with the same visual result)
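The block-versus-inline behavior asked about above can be sketched with only the standard library: emit a separator when closing a block-level tag, but not an inline one. This is an illustrative sketch, not part of Beautiful Soup; the BLOCK set below is a hand-picked subset, not the full HTML list.

```python
# Minimal browser-like whitespace handling: block tags separate text,
# inline tags (like <b>) do not.
from html.parser import HTMLParser

BLOCK = {"p", "div", "li", "ul", "ol", "h1", "h2", "h3", "table", "tr"}

class BlockAwareText(HTMLParser):
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

    def handle_endtag(self, tag):
        # Only closing a block-level tag introduces a separator.
        if tag in BLOCK:
            self.parts.append(" ")

    def text(self):
        return "".join(self.parts).strip()

parser = BlockAwareText()
parser.feed("<p>Hello W<b>orl</b>d!</p><p>Test</p>")
print(parser.text())  # Hello World! Test
```

Closing `</b>` adds nothing, so "W", "orl", and "d!" stay joined, while each `</p>` contributes a space.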

Leonard Richardson (leonardr) wrote :

Thanks for your thoughtful bug report. I would say you've run into the limitations of what Beautiful Soup is designed to do. The library has just enough understanding of the semantics of HTML that parsing a valid document and writing it back out won't alter the document's semantics. This means Beautiful Soup needs to know which HTML elements are void elements, which elements give significance to the whitespace inside them, and so on.

Adding an understanding of inline tags versus block tags would make Beautiful Soup more like a web browser. It's a reasonable thing to ask for, but I try to keep my maintenance work on this project to a couple of weekends a year, so I'm probably not going to write it. Based on my experience with the CSS selector subsystem, I'd be reluctant to even accept such a contribution (though it would depend on how big it really was--and I realize you're not offering to write it).

get_text() is designed as a quick-and-dirty way to rip all the text out of a document, when you'd rather do text processing than HTML processing. I'm going to put this issue in a "confirmed" state and think about how much work it would be to implement the feature you're requesting. I could make a list of the block tags easily enough, and make a method like get_blocks() which tried to group the strings appropriately, but I believe it--like the CSS selectors--would fail in a thousand tiny edge cases, and I don't have time to investigate them.

Changed in beautifulsoup:
status: New → Confirmed
tags: added: feature
Changed in beautifulsoup:
importance: Undecided → Wishlist
Leonard Richardson (leonardr) wrote :

I find this problem interesting so I spent a little time investigating it. I put a list of the HTML block elements into HTMLTreeBuilder and wrote some code like this:

def contains_any_blocks(tag):
    for i in tag.descendants:
        if i.name in HTMLTreeBuilder.block_elements:
            return True
    return False

used = set()
for block in soup.find_all(HTMLTreeBuilder.block_elements):
    if contains_any_blocks(block):
        continue
    if any(x in used for x in block.parents):
        continue
    used.add(block)
    print(block.name, block.get_text(separator=' '))
    print("-" * 80)

The idea is to find the largest set of non-overlapping block tags. This would minimize the risk that get_text() will return too much or too little text.

This is a decent start, but the code is very slow and sometimes misses obvious cases (e.g. it sometimes treats everything in a <ul> tag as one block, even though <li> is also a block tag).
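The `<ul>`/`<li>` case above can be reproduced with a self-contained sketch (assumes bs4 is installed; the BLOCK set is our own, since the `HTMLTreeBuilder.block_elements` list was a local experiment, not a shipped attribute):

```python
# Demonstrate the nested-block pitfall: a <ul> is a block tag, but it
# contains <li> blocks, so it must not be treated as one leaf block.
from bs4 import BeautifulSoup

BLOCK = {"p", "div", "ul", "ol", "li"}

def contains_any_blocks(tag):
    # Check every descendant, not just the first one.
    return any(child.name in BLOCK for child in tag.descendants)

soup = BeautifulSoup("<ul><li>one</li><li>two</li></ul>", "html.parser")
has_blocks = contains_any_blocks(soup.ul)
items = [li.get_text() for li in soup.ul.find_all("li")]
print(has_blocks)  # True: the <ul> contains <li> blocks
print(items)       # ['one', 'two']
```

An early `return False` inside the descendant loop (as in the originally posted code) inspects only the first descendant, which is one way the grouping can silently treat a whole `<ul>` as a single block.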

Overall this has confirmed my belief that I don't want Beautiful Soup to be in this business, but it's very tempting to think that a clever, simple (but probably still slow) solution is nearby.

Leonard Richardson (leonardr) wrote :

Another way of doing this would be to use currently unused operators for this purpose. Then you could get syntax like:

soup % 'a' / 'href'

This avoids most of the problems I mentioned, but most of the currently unused operators are math operators. There's no intuitive connection between the meaning of the operator and what the operator does to a Tag or a ResultSet. It could just as easily look like this:

soup / 'a' % 'href'

So the resulting system would be hard to learn and remember. The dot operator (generally used to move from a Python object to one of its attributes) and the square-brackets operator (generally used to index a Python array or dictionary) don't have this problem. Their Beautiful Soup uses are similar to their normal Python uses.

Overall I think list comprehensions are the right tool for this sort of thing -- that's the syntax the Python devs came up with and even if I could do slightly better, the fact that it's different from normal Python would itself be a negative.
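For comparison, the list-comprehension idiom referred to above, on a made-up two-link snippet (assumes bs4 is installed):

```python
# Extract every link target with plain Python syntax, no operator overloading.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/a">one</a><a href="/b">two</a>', "html.parser")
hrefs = [a["href"] for a in soup.find_all("a")]
print(hrefs)  # ['/a', '/b']
```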

Leonard Richardson (leonardr) wrote :

Ignore the last comment -- I meant to post it on bug #1828188

Leonard Richardson (leonardr) wrote :

I came back to this following the 4.8 release and I think I have an efficient algorithm that groups text blocks together. The catch is there's no way to get the text blocks in a nice list, because block elements can contain other block elements. That's why my original plan fell apart when I started looking at nested lists. I was trying to turn a nested data structure into a list, and there's no general way to do that. Any given strategy will look good on some pages (or parts of pages) and bad on others.

My algorithm focuses on removing 'junk' and presenting the text nodes in a way that reflects the structure of the original tree.

Leonard Richardson (leonardr) wrote :

I've marked bug 1882067 as a duplicate of this issue, although they're not directly related, because I think they come from the same place: a desire to use Beautiful Soup as a text preprocessor that can strip away "useless" markup.

In this case the concern is that some of the "useless" markup isn't so useless -- it conveys conceptual separations that are lost when you just extract all the text. In the case of 1882067, the concern is that some of the *text* is useless -- it's just whitespace and newlines that won't render in a web browser and ought to be collapsed for reading.

The challenge in both cases is distinguishing the "useless" stuff from the "useful" stuff.
