Comment 2 for bug 1768330

ellie (et1234567) wrote :

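I just stumbled upon separator=' ', but sadly that option is also useless, or at least semantically nonsensical, because the separator is inserted at every tag boundary, even in the middle of a word: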

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p>", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d!'
>>> soup = BeautifulSoup("<p>Hello W<b>orl</b>d!</p><p>Test</p>", "html.parser")
>>> soup.get_text(separator=' ')
'Hello W orl d! Test'
>>>

Of course, here the expected result would be: "Hello World! Test"

Isn't there any way to have BeautifulSoup apply a proper understanding of whitespace like a web browser? (That is, text contained in completely separate BLOCK tags like "p" is always separated with whitespace, while separation by INLINE tags like "b" won't cause spurious, incorrect whitespace.)
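A minimal sketch of the behaviour described above, for illustration only. It does not use any BeautifulSoup feature; it relies on the standard library's html.parser instead. The names BLOCK_TAGS, BlockAwareText and block_aware_text are hypothetical, and the tag set is an assumed, incomplete subset of the elements browsers render as blocks by default. CSS is ignored entirely.

```python
from html.parser import HTMLParser

# Assumption: a hand-picked, incomplete subset of elements that browsers
# treat as block-level (or line-breaking) by default.
BLOCK_TAGS = {
    "p", "div", "br", "li", "ul", "ol", "table", "tr", "blockquote",
    "pre", "section", "article", "h1", "h2", "h3", "h4", "h5", "h6",
}


class BlockAwareText(HTMLParser):
    """Collects text, adding whitespace only around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Inline tags such as <b> add nothing, so words stay intact.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def block_aware_text(html):
    """Return the text of `html` with block boundaries turned into spaces."""
    parser = BlockAwareText()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    # Collapse runs of whitespace, roughly as a browser does for normal flow.
    return " ".join("".join(parser.parts).split())


print(block_aware_text("<p>Hello W<b>orl</b>d!</p><p>Test</p>"))
```

Note that collapsing every run of whitespace is a simplification: it would also flatten the contents of "pre" elements, which a browser preserves.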

I know CSS can break all of this, but only on bad sites that don't use proper semantic HTML. As BeautifulSoup works now, though, I can't find a good way to convert even *proper* semantic HTML to text correctly, which seems quite limiting.

Or is there some hidden module / extension that handles this correctly?

By the way, this bug looks like another instance of the same problem: https://bugs.launchpad.net/beautifulsoup/+bug/1767999 (BeautifulSoup fails to understand which tags are semantically inline rather than block by default, that is, without CSS changes, so you can't just slap in whitespace around them and get the same visual result)