soup.findAll doesn't get all the data when <p></br><p> is in content
Bug #1597211 reported by Abhijeet Chauhan
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Invalid | Undecided | Unassigned |
Bug Description
code:
from bs4 import BeautifulSoup
page ='<div class="
soup = BeautifulSoup(page, 'html.parser')
mydivs = soup.findAll("div", { "class" : "entry-content" })
print mydivs[0]
output:
<div class="
summary:
- soup.findAll doesn't get all the data when <p><br><p> is in content
+ soup.findAll doesn't get all the data when <p></br><p> is in content
I understand why it looks like "line two" has disappeared, but here's what the document as a whole looks like when parsed with html.parser:
<div class="entry-content"><p>line one</p><p></p></div><p>line two</p>
find_all() is working correctly: it finds the one and only div in the document that matches your arguments. The second <p> tag isn't in the div because Python's html.parser moved it outside the <div> tag when parsing the document.
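Beautiful Soup's html.parser tree builder sits on top of the standard library's `html.parser` module, so one way to see what it has to work with is to log the raw events that module emits. The following is a minimal sketch for Python 3, not code from the report: `PAGE` is a hypothetical stand-in for the truncated markup (guessed from the bug title), and `EventLogger` is an illustrative name.

```python
from html.parser import HTMLParser

# Hypothetical markup: the page in the report is truncated, so this string is
# only a guess at its shape, based on the bug title.
PAGE = '<div class="entry-content"><p>line one</p><p></br><p>line two</p></div>'


class EventLogger(HTMLParser):
    """Record each start tag, end tag, and text node in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed(PAGE)
logger.close()

for kind, value in logger.events:
    print(kind, value)
```

In this sketch the stray `</br>` is reported as an end tag with no matching start tag. The standard-library parser passes it along as-is, which leaves any repair of the structure to the code that builds the tree.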
You're seeing the result of a decision made by the html.parser parser when faced with ambiguous HTML. The lxml parser makes a different decision: it puts both <p> tags inside the <div>:
soup = BeautifulSoup(page, 'lxml')
print soup
# <html><body><div class="entry-content"><p>line one</p><p></p><p>line two</p></div></body></html>
So does html5lib:
soup = BeautifulSoup(page, 'html5lib')
print soup
# <html><head></head><body><div class="entry-content"><p>line one</p><p><br/></p><p>line two</p></div></body></html>
Ultimately it's your choice. All three parsers do reasonable things, none of them lose data, and the differences between parsers lie beyond the scope of the Beautiful Soup project.
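The comparison above can be reproduced with a short script. This is a minimal sketch for Python 3, not code from the report: `PAGE` is a hypothetical stand-in for the truncated markup, and `paragraphs_in_div` is an illustrative helper. lxml and html5lib are optional third-party packages, so the script records `None` for any parser that is not installed. Results may differ between Beautiful Soup and parser versions, so no expected output is shown.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Hypothetical markup: the page in the report is truncated, so this string is
# only a guess at its shape, based on the bug title.
PAGE = '<div class="entry-content"><p>line one</p><p></br><p>line two</p></div>'


def paragraphs_in_div(markup, parser):
    """Return the text of every <p> the given parser places inside the div."""
    soup = BeautifulSoup(markup, parser)
    div = soup.find("div", class_="entry-content")
    return [p.get_text() for p in div.find_all("p")]


results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = paragraphs_in_div(PAGE, parser)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        results[parser] = None

for parser, texts in results.items():
    print(parser, texts)
```

If the lists differ, the parsers disagree about which <p> tags belong to the div. Choosing one parser explicitly, instead of relying on whichever is installed, keeps the result the same from one machine to the next.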