soup.get_text() ignores previous unwrap()

Bug #1686408 reported by Billy Kong
This bug affects 1 person
Affects         Status     Importance  Assigned to  Milestone
Beautiful Soup  Won't Fix  Undecided   Unassigned

Bug Description

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'lxml')
print('Before unwrap: \n')
print(soup)
print(soup.get_text('\n'))

soup.i.unwrap()
soup.a.unwrap()
print('After unwrap: \n')
print(soup)
print(soup.get_text('\n'))
# unwrap has no effect on get_text() even though the soup object is changed

"""
Before unwrap:
<html><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>
I linked to
example.com

After unwrap:

<html><body>I linked to example.com</body></html>
I linked to
example.com
"""

Billy Kong (billyklh)
description: updated
Leonard Richardson (leonardr) wrote :

I'm not sure what your expected behavior is. unwrap() replaces a tag with its own contents. In your example you effectively remove the <i> tag and the <a> tag from the document, while leaving everything else alone. That doesn't affect get_text(), because none of the text is removed, only the tags.

If you try using extract() instead of unwrap() you'll see the difference.
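
A minimal sketch of that difference, using the markup from the report. It assumes the built-in 'html.parser' rather than lxml so that no <html><body> wrapper is added, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

# unwrap(): the <i> tag is removed, but its text stays in the tree.
unwrapped = BeautifulSoup(markup, 'html.parser')
unwrapped.i.unwrap()
unwrapped_text = unwrapped.get_text()

# extract(): the <i> tag is removed together with everything inside it.
extracted = BeautifulSoup(markup, 'html.parser')
extracted.i.extract()
extracted_text = extracted.get_text()

print(repr(unwrapped_text))  # 'I linked to example.com'
print(repr(extracted_text))  # 'I linked to '
```

With unwrap() the text returned by get_text() is unchanged; with extract() the words inside <i> disappear along with the tag.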

Changed in beautifulsoup:
status: New → Invalid
Billy Kong (billyklh) wrote :

Hi Leonard,

Yes, the text is unchanged. My question is about the line-break.

After unwrapping:
<html><body>I linked to example.com</body></html>

Shouldn't get_text() return:
I linked to example.com

Instead of:
I linked to
example.com

Many thanks,
Billy

Leonard Richardson (leonardr) wrote :
Attachment: a (1.7 KiB, text/plain)

Thanks, I see what you're saying now. Your expected behavior is that strings should be combined whenever they become adjacent. Here's a simpler example that illustrates the same behavior:

from bs4 import _soup
soup = _soup("<b>foo</b>")
soup.b.append("bar")
soup.b.contents
# [u'foo', u'bar']

This is a reasonable request but I'm not going to make the change. It's easy to join strings yourself but impossible to separate them once they're joined. The way I use Beautiful Soup, it's more useful to keep track of strings separately and join them if necessary when outputting markup. So I'm not convinced that the users who notice this change would welcome it on balance.
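
As an illustration of joining the strings yourself, here is a minimal sketch. merge_adjacent_strings is a hypothetical helper written for this example, not part of Beautiful Soup, and 'html.parser' is assumed in place of lxml so no <html><body> wrapper appears.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag


def merge_adjacent_strings(tag):
    """Hypothetical helper: merge runs of adjacent plain strings under `tag`."""
    # Iterate over a snapshot, because the tree is modified along the way.
    for child in list(tag.contents):
        if isinstance(child, Tag):
            merge_adjacent_strings(child)  # recurse into nested tags
        elif type(child) is NavigableString:  # skip comments and similar nodes
            prev = child.previous_sibling
            if type(prev) is NavigableString:
                merged = NavigableString(str(prev) + str(child))
                prev.replace_with(merged)
                child.extract()


markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, 'html.parser')
soup.i.unwrap()
soup.a.unwrap()

# Two separate string nodes remain, so the separator lands between them.
before = soup.get_text('\n')

merge_adjacent_strings(soup)

# One string node remains, so there is nothing to separate.
after = soup.get_text('\n')

print(repr(before))  # 'I linked to \nexample.com'
print(repr(after))   # 'I linked to example.com'
```

If only the output text matters, calling get_text() without a separator already concatenates the strings, and the tree can be left untouched.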

This is a change of moderate complexity, which I'm trying to avoid in a project that's in the maintenance phase of its lifecycle. The change would go in a place that's likely to create a lot of subtle edge-case bugs. html5lib does something similar on the initial document parse and it's caused me a lot of grief over the years.

That said, here's the patch I wrote while investigating this issue. It works in the simple case, but like I said, edge-case bugs. In particular, it breaks unwrap().

Changed in beautifulsoup:
status: Invalid → Won't Fix
Billy Kong (billyklh) wrote :

Understood. Thank you very much for your detailed response!!
