soup.findAll doesn't get all the data when <p></br><p> is in content

Bug #1597211 reported by Abhijeet Chauhan
This bug affects 1 person
Affects         Status    Importance   Assigned to   Milestone
Beautiful Soup  Invalid   Undecided    Unassigned

Bug Description

code:
from bs4 import BeautifulSoup
page ='<div class="entry-content"> <p>line one</p> <p></br></p> <p>line two</p></div>'
soup = BeautifulSoup(page, 'html.parser')
mydivs = soup.findAll("div", { "class" : "entry-content" })
print(mydivs[0])

output:
<div class="entry-content"> <p>line one</p> <p></p></div>

summary: - soup.findAll doesn't get all the data when <p><br><p> is in content
+ soup.findAll doesn't get all the data when <p></br><p> is in content
Leonard Richardson (leonardr) wrote :

I understand why it looks like "line two" has disappeared, but here's what the document as a whole looks like when parsed with html.parser:

<div class="entry-content"> <p>line one</p> <p></p></div> <p>line two</p>

find_all() is working correctly: it finds the one and only div in the document that matches your arguments. The <p> tag containing "line two" isn't in the div because Python's html.parser moved it outside the <div> tag when parsing the document.
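
To see where the ambiguity comes from, it can help to look at the raw tag events that Python's standard-library HTMLParser reports for this markup, before any tree is built. The sketch below is only an illustration: the TagEventLogger class and tag_events() helper are names made up here, and it does not show how Beautiful Soup's own tree builder reacts to those events.

```python
from html.parser import HTMLParser

PAGE = ('<div class="entry-content"> <p>line one</p> '
        '<p></br></p> <p>line two</p></div>')


class TagEventLogger(HTMLParser):
    """Hypothetical helper: records start/end tag events in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to the stdlib parser and return the recorded tag events."""
    logger = TagEventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


if __name__ == "__main__":
    for kind, tag in tag_events(PAGE):
        print(kind, tag)
```

On the CPython versions I know of, the event list includes an end-tag event for "br" with no start-tag event before it. A tree builder has to decide what a closing tag with nothing to close should do, and different parsers answer that differently.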

You're seeing the result of a decision made by the html.parser parser when faced with ambiguous HTML. The lxml parser makes a different decision: it keeps all three <p> tags inside the <div>:

soup = BeautifulSoup(page, 'lxml')
print(soup)
# <html><body><div class="entry-content"> <p>line one</p> <p></p> <p>line two</p></div></body></html>

So does html5lib:

soup = BeautifulSoup(page, 'html5lib')
print(soup)
# <html><head></head><body><div class="entry-content"> <p>line one</p> <p><br/></p> <p>line two</p></div></body></html>

Ultimately it's your choice. All three parsers do reasonable things, none of them lose data, and the differences between parsers lie beyond the scope of the Beautiful Soup project.
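
If you want to check which parser keeps a given piece of text inside the div for your own documents, something like the following could work. It is a rough sketch, not part of Beautiful Soup: div_keeps_text() and compare_parsers() are hypothetical helper names, and it assumes lxml and html5lib may or may not be installed (a missing parser is reported as None).

```python
from bs4 import BeautifulSoup, FeatureNotFound

PAGE = ('<div class="entry-content"> <p>line one</p> '
        '<p></br></p> <p>line two</p></div>')


def div_keeps_text(markup, parser, needle="line two"):
    """Return True if the entry-content div still contains `needle` after parsing."""
    soup = BeautifulSoup(markup, parser)
    div = soup.find("div", class_="entry-content")
    return div is not None and needle in div.get_text()


def compare_parsers(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Map each parser name to True/False, or None if that parser is not installed."""
    results = {}
    for name in parsers:
        try:
            results[name] = div_keeps_text(markup, name)
        except FeatureNotFound:
            results[name] = None  # parser library is not available
    return results


if __name__ == "__main__":
    for name, kept in compare_parsers(PAGE).items():
        print(name, kept)
```

The results depend on the parser and library versions you have installed, so treat the output as a description of your environment rather than a fixed rule.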

Changed in beautifulsoup:
status: New → Invalid