text separated by <br> is joined by get_text() without spaces in between

Bug #2058695 reported by Andrés Herrera
This bug affects 1 person
Affects: Beautiful Soup
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Some HTML pages contain markup like this:
URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com
You can find a real example at this link (I am also attaching the HTML; look for "Blog" in it): https://www.linkedin.com/posts/sakana-ai_introducing-evolutionary-model-merge-a-new-activity-7176384016978178048-izIp?utm_source=share&utm_medium=member_desktop
(When I view that page with the browser's inspector tool, the HTML is prettified, so the <br> is surrounded by newlines, but the downloaded HTML looks like the example above.)

The current implementation of text extraction (get_text()) ignores <br> tags, so the example above produces a string with no separator between the two parts: URL 1: www.example-url-1.comURL 2: www.example-url-2.com

Here is code to reproduce it:
```
from bs4 import BeautifulSoup

# Name the parser explicitly ('html.parser', as listed under "Additional information").
bs = BeautifulSoup("URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com", "html.parser")
print(bs.text)
# URL 1: www.example-url-1.comURL 2: www.example-url-2.com
```
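
For anyone hitting this before a fix is available, one possible workaround (a sketch, not part of the merge proposal) is to replace each <br> element with a newline string before extracting the text. It uses only find_all(), replace_with(), and get_text(); the variable names are illustrative.

```python
from bs4 import BeautifulSoup

markup = "URL 1: www.example-url-1.com<br/>URL 2: www.example-url-2.com"
soup = BeautifulSoup(markup, "html.parser")

# Swap every <br> element for a plain newline string so that get_text()
# sees a text node where the line break used to be.
for br in soup.find_all("br"):
    br.replace_with("\n")

text = soup.get_text()
print(text)
```

Calling get_text(separator="\n") is another option, but it puts the separator between every pair of text nodes, not only where a <br> appeared.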

I submitted a merge proposal that tries to fix this: https://code.launchpad.net/~andres-he/beautifulsoup/+git/beautifulsoup/+merge/462910

Additional information:
- Python version: 3.12.2
- html5lib installed: no
- lxml installed: no
- parser used: 'html.parser'

Thank you for your review

Leonard Richardson (leonardr) wrote :

Thanks for taking the time to file this issue and write a patch.

The problem you've found is that certain markup tags should have whitespace appended when converting them to a string. In HTML, this includes the <br> tag and possibly others. In XML, it depends on the XML vocabulary in use--there are probably no tags with this property. So we can't solve this problem by changing the Tag class: a specific <br> tag might or might not be equivalent to a newline.

This is similar to other situations where certain HTML tags or attributes have special semantics that don't apply to XML. For instance, the <br> tag is an empty-element tag in HTML, but in XML it's generally a regular tag that can contain other tags.

We deal with this by customizing the TreeBuilder object, either through a constructor argument (as with multi_valued_attributes and preserve_whitespace_tags) or as something that can be overridden in a subclass (as with empty_element_tags).
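
As a rough illustration of the two existing mechanisms, the sketch below passes multi_valued_attributes through the BeautifulSoup constructor and checks how the HTML tree builder treats <br> as an empty-element tag. The markup and variable names are made up for the example, and it shows current behavior only, not the proposed change.

```python
from bs4 import BeautifulSoup

markup = '<p class="a b">text<br></p>'

# Default HTML semantics: "class" is treated as a multi-valued attribute.
default_soup = BeautifulSoup(markup, "html.parser")
default_class = default_soup.p["class"]

# Constructor argument: switch off multi-valued attribute handling.
custom_soup = BeautifulSoup(markup, "html.parser", multi_valued_attributes=None)
custom_class = custom_soup.p["class"]

# Tree-builder semantics: in HTML, <br> is an empty-element tag.
br_is_empty = default_soup.br.is_empty_element

print(default_class, custom_class, br_is_empty)
```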

So, there are two open questions:

1. Is this customized often enough that it should be made a constructor argument? Or is the difference mainly between HTML semantics and a default set of XML semantics? Probably the latter, so I think treating it like empty_element_tags is fine.
2. What are the relevant semantics as defined in the HTML spec, and what other HTML tags have semantics of the same sort?

The section on the <br> element itself says "The br element represents a line break."
https://html.spec.whatwg.org/#the-br-element

There's a step specific to <br> in the algorithm for generating innerText and outerText (the closest thing in the HTML spec to get_text())
https://html.spec.whatwg.org/#the-innertext-idl-attribute
"If node is a br element, then append a string containing a single U+000A LF code point to items."
(U+000A is the newline character)

Are there any other elements like this? Superficially, I don't think so; otherwise the HTML spec would define a type of element that should be treated this way, rather than giving <br> special treatment in the innerText algorithm. The <wbr> tag "represents a line break opportunity" but you don't have to take that opportunity.

You could argue that the <hr> element should also be turned into a newline, but I think that is covered by these other steps in the innerText algorithm:

8. If node is a p element, then append 2 (a required line break count) at the beginning and end of items.
9. If node's used value of 'display' is block-level or 'table-caption', then append 1 (a required line break count) at the beginning and end of items.

That's saying that a <p> tag should be preceded and followed by two line breaks (a blank line), and any other block-level tag should be preceded and followed by one. (As usual, Beautiful Soup is ignoring the effects of CSS on these rules.) <hr> is a block-level tag, for example. So the real special case is that <br> is followed by a newline even though it's *not* a block-level tag.

Right now get_text() doesn't go through those steps:

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("<p>foo</p><hr><p>bar</p>").get_text()
'foobar'
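
For comparison, here is a minimal sketch of what following those innerText steps might look like, written against the standard library's html.parser rather than Beautiful Soup. It is an illustration under stated assumptions, not Beautiful Soup's implementation: BLOCK_TAGS is a small hand-picked subset, the class and function names are hypothetical, and CSS and whitespace processing are ignored.

```python
from html.parser import HTMLParser

# Assumption: a small, hand-picked subset of block-level tags for illustration.
BLOCK_TAGS = {"div", "hr", "h1", "h2", "h3", "ul", "ol", "li", "blockquote", "pre"}


class InnerTextSketch(HTMLParser):
    """Collects text plus 'required line break counts' (stored as ints)."""

    def __init__(self):
        super().__init__()
        self.items = []  # str = literal text, int = required line break count

    def _boundary(self, tag):
        if tag == "p":
            self.items.append(2)
        elif tag in BLOCK_TAGS:
            self.items.append(1)

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.items.append("\n")  # <br> always contributes one newline
        else:
            self._boundary(tag)

    def handle_endtag(self, tag):
        self._boundary(tag)

    def handle_data(self, data):
        self.items.append(data)

    def text(self):
        out, pending = [], 0
        for item in self.items:
            if isinstance(item, int):
                # A run of counts collapses to the largest count in the run.
                pending = max(pending, item)
            else:
                if out and pending:  # runs at the very start are dropped
                    out.append("\n" * pending)
                pending = 0
                out.append(item)
        return "".join(out)  # a trailing run is never emitted


def inner_text_sketch(markup):
    parser = InnerTextSketch()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()


print(repr(inner_text_sketch("<p>foo</p><hr><p>bar</p>")))
```

With this sketch, the <p>/<hr> example keeps "foo" and "bar" on separate lines, and the <br/> example from the bug description yields a single newline between the two URLs.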

Beautiful Soup knows which HTML tags are defined as block-level tags (or at least which ones were defined as block-leve...
