soup.get_text() ignores previous unwrap()
Bug #1686408 reported by Billy Kong
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Won't Fix | Undecided | Unassigned |
Bug Description
from bs4 import BeautifulSoup
markup = '<a href="http://
soup = BeautifulSoup(
print('Before unwrap: \n')
print(soup)
print(soup.
soup.i.unwrap()
soup.a.unwrap()
print('After unwrap: \n')
print(soup)
print(soup.
# unwrap has no effect on get_text() even though the soup object is changed
"""
Before unwrap:
<html><body><a href="http://
I linked to
example.com
After unwrap:
<html><body>I linked to example.
I linked to
example.com
"""
I'm not sure what your expected behavior is. unwrap() replaces a tag with its own contents. In your example you effectively remove the <i> tag and the <a> tag from the document, while leaving everything else alone. That doesn't affect get_text(), because none of the text is removed, only the tags.
If you try using extract() instead of unwrap() you'll see the difference.
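To make the difference concrete, here is a minimal sketch comparing the two calls. The markup string in the report is cut off, so the one below is a hypothetical stand-in, and the variable names are for illustration only. It assumes the built-in "html.parser" backend.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; not the reporter's original string.
markup = '<p>I linked to <a href="http://example.com/"><i>example.com</i></a></p>'

# unwrap(): the tag is removed, but its contents stay where they were.
unwrapped = BeautifulSoup(markup, "html.parser")
text_before = unwrapped.get_text()
unwrapped.i.unwrap()
unwrapped.a.unwrap()
text_after_unwrap = unwrapped.get_text()

# extract(): the tag is removed together with everything inside it.
extracted = BeautifulSoup(markup, "html.parser")
extracted.a.extract()
text_after_extract = extracted.get_text()

print(repr(text_before))         # 'I linked to example.com'
print(repr(text_after_unwrap))   # 'I linked to example.com' (same text, tags gone)
print(repr(text_after_extract))  # 'I linked to ' (the link text is gone too)
```

In this sketch the serialized tree changes after unwrap() because the tags disappear, yet get_text() returns the same string, since it only collects text nodes and none were removed.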