Beautiful Soup

text separated by <br> is joined by get_text() without spaces in between

Bug #2058695 reported by Andrés Herrera on 2024-03-21

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Beautiful Soup	New	Undecided	Unassigned

Bug Description

There are cases where you find something like this in HTML pages:
URL 1: www.example-url-1.com<br>URL 2: www.example-url-2.com
You can find a real example on this link (I am also attaching the html) (look for "Blog" in the html): https://www.linkedin.com/posts/sakana-ai_introducing-evolutionary-model-merge-a-new-activity-7176384016978178048-izIp?utm_source=share&utm_medium=member_desktop
(When I view that page with the inspector tool, the HTML is prettified, so the <br> is surrounded by newlines, but in the downloaded HTML it looks like the case above.)

The current implementation of text extraction (get_text()) ignores <br> tags, so the example above produces this string: URL 1: www.example-url-1.comURL 2: www.example-url-2.com

Here is code to reproduce it:
```
from bs4 import BeautifulSoup
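# Parser named explicitly; 'html.parser' is the one listed under "Additional information".
bs = BeautifulSoup("URL 1: www.example-url-1.com<br>URL 2: www.example-url-2.com", "html.parser")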
print(bs.text)
# URL 1: www.example-url-1.comURL 2: www.example-url-2.com
```
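Until get_text() handles this itself, one possible workaround (a sketch only, not the approach taken in the merge proposal) is to replace each <br> tag with a newline string before extracting text. It relies only on find_all() and replace_with(); note that it modifies the parsed tree in place.

```python
from bs4 import BeautifulSoup

markup = "URL 1: www.example-url-1.com<br>URL 2: www.example-url-2.com"
soup = BeautifulSoup(markup, "html.parser")

# Swap every <br> tag for a literal newline string.
# This mutates the tree, so do it on a copy if the original is still needed.
for br in soup.find_all("br"):
    br.replace_with("\n")

text = soup.get_text()
print(text)
```

With this change the two URLs end up on separate lines instead of being joined.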

I submitted a merge proposal that tries to fix this: https://code.launchpad.net/~andres-he/beautifulsoup/+git/beautifulsoup/+merge/462910

Additional information:
- Python version: 3.12.2
- html5lib installed: no
- lxml installed: no
- parser used: 'html.parser'

Thank you for your review

Andrés Herrera (andres-he) wrote on 2024-03-21:

Attachment: html containing a pattern as described (1.1 MiB, text/html)

Leonard Richardson (leonardr) wrote on 2024-03-22:

Thanks for taking the time to file this issue and write a patch.

The problem you've found is that certain markup tags should have whitespace appended when converting them to a string. In HTML, this includes the <br> tag and possibly others. In XML, it depends on the XML vocabulary in use--there are probably no tags with this property. So we can't solve this problem by changing the Tag class: a specific tag might or might not be equivalent to a newline.

This is similar to other situations where certain HTML tags or attributes have special semantics that don't apply to XML. For instance, the <br> tag is an empty-element tag in HTML, but in XML it's generally a regular tag that can contain other tags.

We deal with this by customizing the TreeBuilder object, either through a constructor argument (as with multi_valued_attributes and preserve_whitespace_tags) or as something that can be overridden in a subclass (as with empty_element_tags).
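As a rough illustration of the two customization routes mentioned above (the markup and variable names are made up for the example): multi_valued_attributes can be passed through the BeautifulSoup constructor, while empty_element_tags is a class-level set on the HTML tree builder that a subclass could override.

```python
from bs4 import BeautifulSoup
from bs4.builder import HTMLTreeBuilder

markup = '<p class="a b">one<br>two</p>'

# Route 1: a constructor argument forwarded to the TreeBuilder.
default_soup = BeautifulSoup(markup, "html.parser")
single_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)

default_classes = default_soup.p["class"]  # "class" is split into a list by default
single_classes = single_soup.p["class"]    # kept as one string when disabled

# Route 2: a class attribute that encodes HTML-specific semantics and
# could be overridden in a TreeBuilder subclass.
br_is_empty_element = "br" in HTMLTreeBuilder.empty_element_tags

print(default_classes, single_classes, br_is_empty_element)
```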

So, there are two open questions:

1. Is this customizable often enough that it should be made a constructor argument? Or is the difference mainly between HTML semantics and a default set of XML semantics? Probably the latter, so I think treating it like empty_element_tags is fine.
2. What are the relevant semantics as defined in the HTML spec, and what other HTML tags have semantics of the same sort?

The section on the <br> element itself says "The br element represents a line break."
https://html.spec.whatwg.org/#the-br-element

There's a step specific to <br> in the algorithm for generating innerText and outerText (the closest thing in the HTML spec to get_text())
https://html.spec.whatwg.org/#the-innertext-idl-attribute
"If node is a br element, then append a string containing a single U+000A LF code point to items."
(U+000A is the newline character)

Are there any other elements like this? Superficially, I don't think so; otherwise the HTML spec would define a type of element that should be treated this way, rather than giving <br> special treatment in the innerText algorithm. The <wbr> tag "represents a line break opportunity" but you don't have to take that opportunity.

You could argue that the <hr> element should also be turned into a newline, but I think that is covered by these other steps in the innerText algorithm:

8. If node is a p element, then append 2 (a required line break count) at the beginning and end of items.
9. If node's used value of 'display' is block-level or 'table-caption', then append 1 (a required line break count) at the beginning and end of items.

That's saying that a <p> tag should be preceded and followed by a newline, and any other block-level tag should be followed by a newline. (As usual, Beautiful Soup is ignoring the effects of CSS on these rules.) <hr> is a block-level tag, for example. So the real special case is that <br> is followed by a newline even though it's *not* a block-level tag.

Right now get_text() doesn't go through those steps:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("foo<hr>bar").get_text()
'foobar'
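For comparison, here is a minimal sketch of what following those innerText steps might look like. It is not Beautiful Soup's implementation and not the proposed patch: text_with_breaks and BLOCK_TAGS are hypothetical names, the block-level set is a small hand-picked sample, CSS is ignored, and whitespace is not collapsed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical, deliberately incomplete sample of block-level tags.
BLOCK_TAGS = {"div", "hr", "ul", "ol", "li", "h1", "h2", "h3", "blockquote", "pre"}


def _collect(node, items):
    """Append strings and required line break counts (ints) to items."""
    for child in node.children:
        if isinstance(child, Comment):
            continue  # comments are not rendered text
        if isinstance(child, NavigableString):
            items.append(str(child))
        elif isinstance(child, Tag):
            if child.name == "br":
                items.append("\n")  # the <br> special case
                continue
            # <p> requires 2 line breaks, other block-level tags 1.
            count = 2 if child.name == "p" else 1 if child.name in BLOCK_TAGS else 0
            if count:
                items.append(count)
            _collect(child, items)
            if count:
                items.append(count)


def text_with_breaks(node):
    items = []
    _collect(node, items)
    out, pending = [], 0
    for item in items:
        if isinstance(item, int):
            # Adjacent counts collapse to the largest one.
            pending = max(pending, item)
        elif item:
            # Counts at the very start are dropped; trailing ones are never emitted.
            if out and pending:
                out.append("\n" * pending)
            pending = 0
            out.append(item)
    return "".join(out)


hr_text = text_with_breaks(BeautifulSoup("foo<hr>bar", "html.parser"))
br_text = text_with_breaks(BeautifulSoup("foo<br>bar", "html.parser"))
p_text = text_with_breaks(BeautifulSoup("<p>x</p><p>y</p>", "html.parser"))
print(repr(hr_text), repr(br_text), repr(p_text))
```

In this sketch both "foo<hr>bar" and "foo<br>bar" come out with a single newline between the words, and adjacent paragraphs are separated by a blank line.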

Beautiful Soup knows which HTML tags are defined as block-level tags (or at least which ones were defined as block-leve...
