Beautiful Soup

soup.get_text() ignore previous unwrap()

Bug #1686408 reported by Billy Kong on 2017-04-26

This bug affects 1 person

Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
print('Before unwrap: \n')
print(soup)
print(soup.get_text('\n'))

soup.i.unwrap()
soup.a.unwrap()
print('After unwrap: \n')
print(soup)
print(soup.get_text('\n'))
# unwrap has no effect on get_text() even though the soup object is changed

"""
Before unwrap:

<html><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>
I linked to
example.com

After unwrap:

<html><body>I linked to example.com</body></html>
I linked to
example.com
"""

Billy Kong (billyklh) on 2017-04-26

description: updated

Leonard Richardson (leonardr) wrote on 2017-05-07:

I'm not sure what your expected behavior is. unwrap() replaces a tag with its own contents. In your example you effectively remove the <i> tag and the <a> tag from the document, while leaving everything else alone. That doesn't affect get_text(), because none of the text is removed, only the tags.

If you try using extract() instead of unwrap() you'll see the difference.
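
A minimal sketch of that comparison, assuming the original markup wrapped "example.com" in an <i> tag (as the soup.i.unwrap() call implies) and using the built-in html.parser instead of lxml so that no <html>/<body> wrapper is added. The helper name text_after is hypothetical and only serves the illustration.

```python
from bs4 import BeautifulSoup

# Assumed markup: the <i> tag is inferred from the soup.i.unwrap() call in the report.
MARKUP = '<a href="http://example.com/">I linked to <i>example.com</i></a>'


def text_after(method_name):
    """Parse MARKUP, call the named method on the <i> tag, and return the document text."""
    soup = BeautifulSoup(MARKUP, "html.parser")
    getattr(soup.i, method_name)()
    return soup.get_text()


# unwrap() removes only the tag; its text stays in the tree.
unwrapped = text_after("unwrap")

# extract() removes the tag together with everything inside it.
extracted = text_after("extract")

print(repr(unwrapped))
print(repr(extracted))
```

With unwrap() the full sentence survives; with extract() the text that was inside the <i> tag is gone.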

Changed in beautifulsoup:
status:	New → Invalid

Billy Kong (billyklh) wrote on 2017-05-07:

Hi Leonard,

Yes, the text is unchanged. My question is about the line-break.

After unwrapping:
<html><body>I linked to example.com</body></html>

Shouldn't get_text() return:
I linked to example.com

Instead of:
I linked to
example.com

Many thanks,
Billy

Leonard Richardson (leonardr) wrote on 2017-05-07:

Attachment: a (1.7 KiB, text/plain)

Thanks, I see what you're saying now. Your expected behavior is that strings should be combined whenever they become adjacent. Here's a simpler example that illustrates the same behavior:

from bs4 import _soup
soup = _soup("<b>foo</b>")
soup.b.append("bar")
soup.b.contents
# [u'foo', u'bar']

This is a reasonable request but I'm not going to make the change. It's easy to join strings yourself but impossible to separate them once they're joined. The way I use Beautiful Soup, it's more useful to keep track of strings separately and join them if necessary when outputting markup. So I'm not convinced that the users who notice this change would welcome it on balance.
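
One way to join the strings yourself, as a sketch rather than a recommended recipe: collect the strings under a tag, concatenate them, and assign the result back through the tag's .string attribute. The example assumes the simple case above (a <b> tag holding only text) and uses the built-in html.parser.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>foo</b>", "html.parser")
soup.b.append("bar")

# The tag now holds two adjacent string nodes, so a separator lands between them.
node_count_before = len(soup.b.contents)
text_before = soup.get_text("\n")

# Join the pieces and replace the tag's contents with one string.
# Note: assigning to .string discards any child tags, so this is only
# safe when the tag contains nothing but text.
joined = "".join(soup.b.strings)
soup.b.string = joined

node_count_after = len(soup.b.contents)
text_after_join = soup.get_text("\n")

print(node_count_before, repr(text_before))
print(node_count_after, repr(text_after_join))
```

If only the output matters, calling get_text() with no separator gives the same joined text without modifying the tree. Beautiful Soup 4.8.0 and later also provide a Tag.smooth() method that consolidates adjacent strings in place.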

This is a change of moderate complexity, which I'm trying to avoid in a project that's in the maintenance phase of its lifecycle. The change would go in a place that's likely to create a lot of subtle edge-case bugs. html5lib does something similar on the initial document parse and it's caused me a lot of grief over the years.

That said, here's the patch I wrote while investigating this issue. It works in the simple case, but like I said, edge-case bugs. In particular, it breaks unwrap().