get_text() doesn't retain semantic separation of two sentences in separate paragraphs
Affects: Beautiful Soup
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned
Bug Description
get_text() doesn't retain semantic separation of two sentences in separate paragraphs.
Here is an example that shows the issue with both Python's built-in HTML parser and the html5lib parser:
>>> from bs4 import BeautifulSoup
>>> result = BeautifulSoup('<p>this is sentence one.</p><p>this is sentence two.</p>', 'html.parser')
>>> result.get_text()
'this is sentence one.this is sentence two.'
>>> result = BeautifulSoup('<p>this is sentence one.</p><p>this is sentence two.</p>', 'html5lib')
>>> result.get_text()
'this is sentence one.this is sentence two.'
The expected result would be: 'this is sentence one. this is sentence two.'
With the current behavior, I would argue get_text() isn't really useful as a generic function to extract the semantic text of a document.
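For comparison, the expected separation can be recovered manually today by extracting text per paragraph and joining the pieces (a workaround sketch, not part of Beautiful Soup; html.parser assumed):

```python
# Hypothetical workaround: get the text of each <p> separately, then join
# with a space so paragraph boundaries survive.
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>this is sentence one.</p><p>this is sentence two.</p>',
                     'html.parser')
text = ' '.join(p.get_text() for p in soup.find_all('p'))
print(text)  # this is sentence one. this is sentence two.
```

Of course, this only works because the example document has nothing but sibling <p> tags; it is not a general solution.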
ellie (et1234567) wrote: #1
ellie (et1234567) wrote: #2
I just stumbled upon separator=' ', but sadly that option also produces semantically nonsensical results:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<p>Hello W<b>orl</b>d!</p>', 'html.parser')
>>> soup.get_text(separator=' ')
'Hello W orl d!'
>>> soup = BeautifulSoup('<p>Hello W<b>orl</b>d!</p><p>Test</p>', 'html.parser')
>>> soup.get_text(separator=' ')
'Hello W orl d! Test'
>>>
Of course, here the expected result would be: "Hello World! Test"
Isn't there any way to have BeautifulSoup apply a proper understanding of whitespace like a web browser? (That is, text contained in completely separate BLOCK tags like "p" is always separated with whitespace, while separation by INLINE tags like "b" won't cause spurious, incorrect whitespace.)
I know CSS can break all of this, but only on bad sites that don't use proper semantic HTML. But as BeautifulSoup works now, I find no good option to even parse the *proper* instances of semantic HTML in a correct way to text, which seems quite limiting.
Or is there some hidden module / extension that handles this correctly?
By the way, this bug looks like the same problem, just another instance: https:/
Leonard Richardson (leonardr) wrote: #3
Thanks for your thoughtful bug report. I would say you've run into the limitations of what Beautiful Soup is designed to do. The library has just enough understanding of the semantics of HTML that parsing a valid document and writing it back out won't alter the document's semantics. This means Beautiful Soup needs to know which HTML elements are void elements, which elements give significance to the whitespace inside them, and so on.
Adding an understanding of inline tags versus block tags would make Beautiful Soup more like a web browser. It's a reasonable thing to ask for, but I try to keep my maintenance work on this project to a couple of weekends a year, so I'm probably not going to write it. Based on my experience with the CSS selector subsystem, I'd be reluctant to even accept such a contribution (though it would depend on how big it really was--and I realize you're not offering to write it).
get_text() is designed as a quick-and-dirty way to rip all the text out of a document, when you'd rather do text processing than HTML processing. I'm going to put this issue in a "confirmed" state and think about how much work it would be to implement the feature you're requesting. I could make a list of the block tags easily enough, and make a method like get_blocks() which tried to group the strings appropriately, but I believe it--like the CSS selectors--would fail in a thousand tiny edge cases, and I don't have time to investigate them.
Changed in beautifulsoup:
status: New → Confirmed
tags: added: featire
tags: added: feature removed: featire
Changed in beautifulsoup:
importance: Undecided → Wishlist
Leonard Richardson (leonardr) wrote: #4
I find this problem interesting so I spent a little time investigating it. I put a list of the HTML block elements into HTMLTreeBuilder and wrote some code like this:
def contains_block(tag):
    for i in tag.descendants:
        if i.name in HTMLTreeBuilder.block_elements:
            return True
    return False

used = set()
for block in soup.find_all(HTMLTreeBuilder.block_elements):
    if contains_block(block):
        continue
    if any(x in used for x in block.parents):
        continue
    used.add(block)
    print block.name, block.get_text()
    print "-" * 80
The idea is to find the largest set of non-overlapping block tags. This would minimize the risk that get_text() will return too much or too little text.
This is a decent start, but the code is very slow and sometimes misses obvious cases (e.g. it sometimes treats everything in a <ul> tag as one block, even though <li> is also a block tag).
Overall this has confirmed my belief that I don't want Beautiful Soup to be in this business, but it's very tempting to think that a clever, simple (but probably still slow) solution is nearby.
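As an illustration of the <ul>/<li> case mentioned above, one simple refinement (a sketch, not the maintainer's code, using an abbreviated block list) is to keep only "leaf" block tags, i.e. block tags that contain no other block tags:

```python
from bs4 import BeautifulSoup

# Abbreviated block-element list for illustration only.
BLOCKS = ['ul', 'li', 'p', 'div']

soup = BeautifulSoup('<ul><li>Item 1</li><li>Item 2</li></ul>', 'html.parser')
# A block tag with a block descendant (here, <ul> containing <li>) is skipped,
# so each <li> becomes its own text block instead of the whole list.
leaves = [t for t in soup.find_all(BLOCKS) if not t.find(BLOCKS)]
print([t.get_text() for t in leaves])  # ['Item 1', 'Item 2']
```

This still inherits the performance problem of scanning descendants for every candidate block.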
Leonard Richardson (leonardr) wrote: #5
Another way of doing this would be to use currently unused operators for this purpose. Then you could get syntax like:
soup % 'a' / 'href'
This avoids most of the problems I mentioned, but most of the currently unused operators are math operators. There's no intuitive connection between the meaning of the operator and what the operator does to a Tag or a ResultSet. It could just as easily look like this:
soup / 'a' % 'href'
So the resulting system would be hard to learn and remember. The dot operator (generally used to move from a Python object to one of its attributes) and the square-brackets operator (generally used to index a Python array or dictionary) don't have this problem. Their Beautiful Soup uses are similar to their normal Python uses.
Overall I think list comprehensions are the right tool for this sort of thing -- that's the syntax the Python devs came up with and even if I could do slightly better, the fact that it's different from normal Python would itself be a negative.
Leonard Richardson (leonardr) wrote: #6
Ignore the last comment -- I meant to post it on bug #1828188
Leonard Richardson (leonardr) wrote: #7
I came back to this following the 4.8 release and I think I have an efficient algorithm that groups text blocks together. The catch is there's no way to get the text blocks in a nice list, because block elements can contain other block elements. That's why my original plan fell apart when I started looking at nested lists. I was trying to turn a nested data structure into a list, and there's no general way to do that. Any given strategy will look good on some pages (or parts of pages) and bad on others.
My algorithm focuses on removing 'junk' and presenting the text nodes in a way that reflects the structure of the original tree.
Leonard Richardson (leonardr) wrote: #8
I've marked bug 1882067 as a duplicate of this issue, although they're not directly related, because I think they come from the same place: a desire to use Beautiful Soup as a text preprocessor that can strip away "useless" markup.
In this case the concern is that some of the "useless" markup isn't so useless -- it conveys conceptual separations that are lost when you just extract all the text. In the case of 1882067, the concern is that some of the *text* is useless -- it's just whitespace and newlines that won't render in a web browser and ought to be collapsed for reading.
The challenge in both cases is distinguishing the "useless" stuff from the "useful" stuff.
Tofu (turfurken) wrote: #9
Just want to chime in with a couple of observations.
1. when using automatic splitting, i.e. .get_text('\n') or .stripped_strings, a split happens at every tag boundary, including inline tags like <b>, so single words can get broken into pieces.
2. when using .get_text(), line breaks just follow the source code rather than elements. e.g. <p>some text</p> is returned as one line, while
<p>some
text</p>
is returned as two lines.
in other words, using .get_text() can produce two different outputs depending on whether the source has been prettified or not.
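The prettified-versus-compact difference is easy to reproduce (minimal demonstration, html.parser assumed):

```python
from bs4 import BeautifulSoup

# Same semantic content, different source formatting.
compact = BeautifulSoup('<p>some text</p>', 'html.parser')
pretty = BeautifulSoup('<p>some\ntext</p>', 'html.parser')

print(repr(compact.get_text()))  # 'some text'
print(repr(pretty.get_text()))   # 'some\ntext'
```

The newline in the second result comes straight from the markup's whitespace, not from any element semantics.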
Tofu (turfurken) wrote: #10
more info:
this also affects .prettify()
e.g.
>>> markup = r'<p>lorem <a href="#">ipsum</a> dolor <span>sit</span> amet</p>'
>>> soup = BeautifulSoup(markup, 'html.parser')
>>> print(soup.prettify())
<p>
lorem
<a href="#">
ipsum
</a>
dolor
<span>
sit
</span>
amet
</p>
both <a> and <span> are split into their own lines
Chris Papademetrious (chrispitude) wrote (last edit): #11
Long-time Perl/XML::Twig user, new to Python/Beautiful Soup. @leonardr, this is quite an impressive piece of work you've put together!
<li> and <entry> elements are particularly messy because they can contain mixtures of block and inline elements, such as
<li>Here is plaintext adjacent to a nested list:
<ul>
<li>
<p>Item 1</p>
</li>
...
I had to solve similar block/inline issues here:
https:/
Perhaps you could just prepend/append a "block-separation" space around every HTML5 block element (I did not remove tags that .get_text() ignores):
['address', 'article', 'aside', 'blockquote', 'details', 'dialog', 'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure', 'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header', 'hgroup', 'hr', 'li', 'main', 'nav', 'ol', 'p', 'pre', 'section', 'table', 'ul']
then
* Collapse multiple "block-separation" spaces into a single "block-separation" space.
* Strip any "block-separation" spaces at the beginning and end away.
This would also handle directly-adjacent block elements such as
<p>word1</p><p>word2</p>
If control is desired, this could be implemented as a block_separator=' ' parameter (separation being the default). I personally do not see the need for control; block elements are intrinsically textually separated.
Chris Papademetrious (chrispitude) wrote (last edit): #12
For space-normalized text scraping, an example workaround is space-separation around block elements, followed by space normalization using split/join:
####
from bs4 import BeautifulSoup

block_elements = ['address', 'article', 'aside', 'blockquote', 'details', 'dialog',
                  'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure',
                  'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header',
                  'hgroup', 'hr', 'li', 'main', 'nav', 'ol', 'p', 'pre', 'section',
                  'table', 'ul']

html = """\
<html lang="en">
<body>
<h1>H1</h1>
</body>
</html>
"""

soup = BeautifulSoup(html, features='lxml')
for b in soup.find_all(block_elements):
    b.insert_before(" ")
    b.insert_after(" ")
print(" ".join(soup.get_text().split()))
####
The split/join approach collapses all forms of whitespace (spaces, tabs, newlines, even those funny Unicode non-breaking variants) into single spaces.
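That normalization behavior is plain Python, independent of Beautiful Soup, since str.split() with no arguments splits on any Unicode whitespace:

```python
# Mixed whitespace: leading/trailing spaces, a non-breaking space (U+00A0),
# newlines, and a tab, all collapsed by the split/join idiom.
raw = "  Hello\u00a0World!\n\n\tTest  "
print(" ".join(raw.split()))  # Hello World! Test
```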
However, the space-insertion workaround alters the original document and I'd rather see a native solution inside get_text() that leaves the document as-is.
Chris Papademetrious (chrispitude) wrote (last edit): #13
I guess we would also need a block_elements argument to know which elements are block versus inline:
p.get_text(block_separator=' ', block_elements=['article', 'blockquote', ..., 'ul'])
And a cleaner solution than the dedicated block_separator argument I previously suggested is just to simply apply the default separator to block elements only:
p.get_text(' ', block_elements = ['article', 'blockquote', ..., 'ul'])
Also, the prettify() method could support block_elements, which would indent only the elements in that list. The default for block_elements would be True, to maintain compatibility. That would be pretty cool.
Chris Papademetrious (chrispitude) wrote: #18
I might have a solution to this. The idea is to keep accumulating NavigableString fragments into a "current string item" as long as we're inside the same lowest-level containing block element. If we move into a new block element, then we start a new string item and accumulate into that.
The behavior can be controlled by a "block_elements" argument that specifies the granularity of block-context inference.
If I have the following input document:
====
from bs4 import BeautifulSoup, NavigableString
html_doc = """
<body>
<p>sentence one.</p><p>sentence two.</p>
<p>Hello W<b>orl</b>d!</p><p>Test</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
====
and I evaluate the following function:
====
def my_all_strings(soup, block_elements=True):
    strings = []
    last_container = None
    for element in soup.descendants:
        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if block_elements is True:
                # separate *every* string (current behavior)
                new_container = True
            elif block_elements:
                # must be a list; use block-element semantics
                this_container = element.find_parent(block_elements)
                new_container = (this_container is not last_container)
                last_container = this_container
            else:
                # return one big string
                new_container = False
            if new_container or not strings:
                # start a new string
                strings.append(str(element))
            else:
                # accumulate onto the current string
                strings[-1] += str(element)
    return strings

block_elements = ['address', 'article', 'aside', 'blockquote', 'details', 'dialog',
                  'dd', 'div', 'dl', 'dt', 'fieldset', 'figcaption', 'figure',
                  'footer', 'form', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'header',
                  'hgroup', 'hr', 'li', 'main', 'nav', 'ol', 'p', 'pre', 'section',
                  'table', 'ul']

print(f"block_elements = True: {my_all_strings(soup, True)}")
print(f"block_elements = <HTML blocks>: {my_all_strings(soup, block_elements)}")
print(f"block_elements = False: {my_all_strings(soup, False)}")
====
I get this:
====
block_elements = <HTML blocks>: ['\n', 'sentence one.', 'sentence two.', '\n', 'Hello World!', 'Test', '\n\n']
====
My first ver...
Leonard Richardson (leonardr) wrote: #19
This is promising. I will want to play around with the API for maximum forwards compatibility, but the output of this algorithm definitely looks good. I have two questions before you do any more work, and I think you can answer both of them at the same time.
First, can you try this on some real web pages and see if it gives you the results you want? A simple case that's also very common would be extracting the "meat" of a web page's content: the product information or the news article on a page that also contains a lot of peripheral stuff.
Second, I'm a bit concerned about any code that looks like this:
for element in soup.descendants:
...
this_container = element.find_parent(block_elements)
Because you're calling a tree navigation method inside another tree navigation method, which is very bad for performance. However, find_parent is the least-bad tree navigation method to call in this situation, so it might not be that bad. Basically, if you try this on real web pages, also gather some timing information so I can compare it to the current implementation of get_text().
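The performance concern can be measured with a rough micro-benchmark sketch (synthetic document and element names chosen here for illustration; absolute numbers are machine-dependent):

```python
import timeit
from bs4 import BeautifulSoup, NavigableString

# A flat synthetic document with many small text nodes.
markup = "<div>" + "<p>word</p>" * 1000 + "</div>"
soup = BeautifulSoup(markup, "html.parser")

def walk_with_find_parent():
    # A descendants traversal that calls a tree navigation method
    # (find_parent) once per NavigableString, as in the proposal.
    for element in soup.descendants:
        if isinstance(element, NavigableString):
            element.find_parent(["p", "div"])

baseline = timeit.timeit(soup.get_text, number=5)
candidate = timeit.timeit(walk_with_find_parent, number=5)
print(f"get_text: {baseline:.3f}s, descendants + find_parent: {candidate:.3f}s")
```

Since every string here sits directly inside its block parent, find_parent terminates after one hop, which is close to the best case Leonard describes.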
Tofu (turfurken) wrote: #20
two comments about block elements:
1. i wonder if there's any nicer way of determining if an element is a block element other than just maintaining a static list which the package maintainers will have to keep updated manually
2. how is, for example, <br/> going to be treated? it is not a block element but is meant to introduce a line break
i think that depends on which of the two general approaches you want to take:
a. try to extract the plaintext as they would appear to a user in a browser (make a new line)
b. try to extract the plaintext as they logically fit in the markup (ignore the <br/>)
i believe that even with the (a) approach, it would still be out of scope to try to render the document with CSS and all, but maybe it's not too much to conform to the default html markup behaviour
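For reference on the <br/> question, the current get_text() ignores it entirely, since <br/> contributes no text node of its own (minimal demo):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>line one<br/>line two</p>", "html.parser")
print(repr(soup.get_text()))  # 'line oneline two'
```

So under approach (b) nothing changes, while approach (a) would have to treat <br/> as a line-break marker even though it is not a block element.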
Leonard Richardson (leonardr) wrote: #21
That list of HTML block elements is taken from HTMLTreeBuilder, where there's a comment saying it comes from the HTML spec. But that list must be pretty old because that language is not in the HTML spec anymore. It looks like the concept of elements being intrinsically "block" or "inline" has been replaced by a CSS concept called "formatting context" that I don't currently understand. (https:/
So in the worst case, as you say, rendering the content in a CSS-aware way would seem to be necessary to see which text nodes are relevant. That's definitely off the table. As an approximation, something that uses notions from the current HTML spec such as "flow content" and "phrasing content" might work in most situations. (https:/
However it works, for the text extraction algorithm I would implement a kind of strategy pattern, similar to the pattern used to choose the markup parser. I am very interested in providing a way for people to plug in their own text extraction algorithms, and not very interested in supporting any particular algorithm indefinitely.
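A hypothetical sketch of such a strategy pattern (all names here are invented for illustration and are not actual Beautiful Soup API):

```python
from bs4 import BeautifulSoup

def join_everything(soup):
    # Strategy 1: concatenate all strings, like today's get_text().
    return "".join(soup.strings)

def join_paragraphs(soup):
    # Strategy 2: a block-aware alternative, one chunk per <p>.
    return " ".join(p.get_text() for p in soup.find_all("p"))

def extract_text(soup, strategy=join_everything):
    # The extraction algorithm is pluggable, analogous to how the
    # markup parser is chosen when constructing the soup.
    return strategy(soup)

soup = BeautifulSoup("<p>one.</p><p>two.</p>", "html.parser")
print(extract_text(soup))                   # one.two.
print(extract_text(soup, join_paragraphs))  # one. two.
```

The library would then ship one or two simple strategies and let users supply their own callable for anything smarter.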
Chris Papademetrious (chrispitude) wrote (last edit): #22
Our current algorithm to solve this problem was to use copy.copy() to make a copy of the soup, iterate through all block elements and insert special NavigableString separator strings ("<<BLOCK>>") before and after each one, call soup.text, then search-and-replace any sequence of one or more separator strings with a single space. I didn't like this approach because I had to copy and destructively modify the soup.
On a test set of about 30k HTML files, this new algorithm returns 100% identical results to our current algorithm, and no copy/modification was needed.
A note on how I arrived at this algorithm... Originally I thought about trying to iterate through Tag descendants and keep track of what I entered and left to maintain a "current" block context. Knowing when we cross start tags was fine - that's precisely what the iterator is - but knowing when we cross end tags was difficult. I could derive it by comparing this start tag to the last start tag, but to determine if the next Tag was *inside* or *after* the previous start tag, I had to query the ancestry of containing elements. And if I'm going to do that each time, I might as well skip the Tags and just check the block context of each NavigableString object.
Would inlining a hardcoded (and simplified!) version of find_parent() in this algorithm resolve your concerns? Is it the algorithmic inclusion of an inner loop within an outer loop, or is it some technical aspect of nesting iterable things that have to maintain context as the interpreter jumps between the execution locations? (And fortunately, there are typically very few (if any) levels separating NavigableStrings from their containing block element.)
I am glad you are looking to provide a generalized solution that can be configured. For example, if I am working with DITA XML content:
https:/
then the lists of block and inline elements will be different:
https:/
@turfurken - with this algorithm, "<p>ABC<br/>DEF</p>" would come out as "ABCDEF": <br/> contributes no text and is not in the block element list, so it neither separates nor joins anything.
Chris Papademetrious (chrispitude) wrote: #24
The call to find_parent() could be rewritten using next() as a more lightweight way to find the closest (lowest-level) enclosing block element (the rest of the code is unchanged):
====
def my_all_strings(soup, block_elements=True):
    strings = []
    last_container = None
    for element in soup.descendants:
        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if block_elements is True:
                # separate *every* string (current behavior)
                new_container = True
            elif block_elements:
                # must be a list; use block-element semantics
                this_container = next(
                    (parent for parent in element.parents
                     if parent.name in block_elements), None)
                new_container = (this_container is not last_container)
                last_container = this_container
            else:
                # return one big string
                new_container = False
            if new_container or not strings:
                # start a new string
                strings.append(str(element))
            else:
                # accumulate onto the current string
                strings[-1] += str(element)
    return strings
====
Chris Papademetrious (chrispitude) wrote: #25
There is one more refinement needed for my suggestion.
The code above gets the individual block strings well enough, but the problem remains of how to concatenate them into a single result string. If I directly concatenate them with
====
strings = my_all_strings(soup, block_elements=block_elements)
text = "".join(strings)
print(f"###{text}###")
====
then I get both missing separation (between <p>Hello World!</p> and <p>Test</p>) and extra newlines that are not needed:
====
###
sentence one.sentence two.
Hello World!Test
###
====
To resolve this, we can concatenate block strings with newlines, then remove leading/trailing and internally-duplicated newlines:
====
from bs4 import BeautifulSoup, NavigableString
import re
def my_get_text(soup: BeautifulSoup, block_elements=True):
    strings = []
    last_container = None
    for element in soup.descendants:
        # determine if we have entered a new string context or not
        if isinstance(element, NavigableString):
            if block_elements is True:
                # separate *every* string (current behavior)
                new_container = True
            elif block_elements:
                # must be a list; use block-element semantics
                this_container = next(
                    (parent for parent in element.parents
                     if parent.name in block_elements), None)
                new_container = (this_container is not last_container)
                last_container = this_container
            else:
                # return one big string
                new_container = False
            if new_container or not strings:
                # start a new string
                strings.append(str(element))
            else:
                # accumulate onto the current string
                strings[-1] += str(element)
    text = "\n".join(strings)
    text = re.sub(r'\n+', '\n', text).strip()
    return text
====
If I run the following code:
====
html_doc = """
<body>
<p>sentence one.</p><p>sentence two.</p>
<p>Hello W<b>orl</b>d!</p><p>Test</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
print(f"###{my_get_text(soup, block_elements=block_elements)}###")
====
I get:
====
###sentence one.
sentence two.
Hello World!
Test###
====
which looks quite reasonable to me.
Chris Papademetrious (chrispitude) wrote: #26
Eep, I forgot about preserving newlines in <pre> blocks:
====
html_doc = """
<body>
<p>line 1</p>
<pre>line 2
line 3
line 4</pre>
<p>line 5</p>
</body>
"""
soup = BeautifulSoup(html_doc, 'lxml')
print(f"###{my_get_text(soup, block_elements=block_elements)}###")
====
====
###line 1
line 2
line 3
line 4
line 5###
====
so if we want to preserve newlines inside block elements, we'll need to write a manual concatenation loop that considers the end of the previous string and the beginning of the next string. It's a solvable problem; we just need to decide what the desired behavior is, then implement it. My guess is to insert a newline between any two block strings where non-newline characters would come together.
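That concatenation rule can be sketched as a standalone helper (illustrative only; `join_block_strings` is a name invented here, operating on the kind of string list `my_all_strings()` produces):

```python
def join_block_strings(strings):
    # Insert a newline only where two non-newline characters would
    # otherwise touch, so literal newlines inside <pre> survive unchanged
    # and no extra blank lines are introduced.
    text = ""
    for s in strings:
        if text and not text.endswith("\n") and not s.startswith("\n"):
            text += "\n"
        text += s
    return text

result = join_block_strings(["line 1", "line 2\nline 3\nline 4", "line 5"])
print(repr(result))  # 'line 1\nline 2\nline 3\nline 4\nline 5'
```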
Sorry, I forgot to specify the version: python-beautifulsoup4-4.6.0-2.fc27.1