Some tags never contain 'text' per the HTML spec
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| Beautiful Soup | Fix Released | Undecided | Unassigned | |
Bug Description
First off, for what I'm about to describe, I'm using Python 3.6.9, bs4 4.8.2, and lxml 4.5.0.
I maintain a website that's built around a listserv for physicians around the world to discuss emerging infectious diseases. Users send emails to a designated email address, and those emails are parsed to pull out the content. That content is then thrown into a database for storage, and there's a whole UI around it so that users can browse and search and whatnot.
I use https:/
I recently encountered an email that was throwing errors. I tracked it down, and it turns out it was an HTML-only email, and bs4 wasn't properly cleaning up some of the HTML comments. I'm attaching a sanitized version of the HTML content portion of the email, and below is code you can run to see the problem for yourself:
from bs4 import BeautifulSoup

with open('email.html') as infile:
    soup = BeautifulSoup(infile, 'lxml')

print(soup.get_text())
What I expect to see is provided in expected_
It seems like the get_text() method needs to be a little more aggressive with cleaning HTML comments. Thanks a bunch for your help!
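In the meantime, one workaround is to remove `Comment` nodes explicitly before calling `get_text()`, since `Comment` is a distinct node type that bs4 can find and extract. A minimal sketch (the sample markup and the `html.parser` backend here are illustrative, not taken from the attached email):

```python
from bs4 import BeautifulSoup, Comment

# Invented sample markup standing in for the attached email content.
html = """<html><body>
<!-- hidden comment -->
<p>Visible paragraph.</p>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

# Find every Comment node and detach it from the tree.
for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

text = soup.get_text()
```

After the loop, `text` contains only the visible paragraph text, with the comment stripped out regardless of how the parser backend handled it.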
Changed in beautifulsoup:
status: Fix Committed → Fix Released

Changed in beautifulsoup:
status: Fix Released → New

Changed in beautifulsoup:
status: In Progress → Fix Released
Here's my expected output.