Beautiful Soup

soup.findAll doesn't get all the data when <p></br><p> is in content

Bug #1597211 reported by Abhijeet Chauhan on 2016-06-29

This bug affects 1 person

Affects         Status   Importance  Assigned to  Milestone
Beautiful Soup  Invalid  Undecided   Unassigned

Bug Description

code:
from bs4 import BeautifulSoup
page = '<div class="entry-content"> <p>line one</p> <p></br></p> <p>line two</p></div>'
soup = BeautifulSoup(page, 'html.parser')
mydivs = soup.findAll("div", { "class" : "entry-content" })
print(mydivs[0])

output:
<div class="entry-content"> <p>line one</p> <p></p></div>

Abhijeet Chauhan (abhijeet-chauhan) on 2016-06-29

summary:

- soup.findAll doesn't get all the data when <p><br><p> is in content
+ soup.findAll doesn't get all the data when <p></br><p> is in content

Leonard Richardson (leonardr) wrote on 2016-07-17:

I understand why it looks like "line two" has disappeared, but here's what the document as a whole looks like when parsed with html.parser:

find_all() is working correctly: it finds the one and only div in the document that matches your arguments. The <p> tag containing "line two" isn't in the div because Python's html.parser moved it outside the <div> tag when parsing the document.
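
To see where the ambiguity comes from, it can help to look at the raw events that Python's standard html.parser module reports for this markup. The following is only a minimal sketch using the standard library on its own; the EventLogger class is a hypothetical helper, and it does not reproduce how Beautiful Soup turns these events into a tree. It only shows that the stray </br> arrives as an end tag for an element that was never opened.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each tag event html.parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


page = '<div class="entry-content"> <p>line one</p> <p></br></p> <p>line two</p></div>'

logger = EventLogger()
logger.feed(page)
logger.close()

# The list contains an ("end", "br") event with no ("start", "br") before it.
for kind, tag in logger.events:
    print(kind, tag)
```

Whatever sits on top of the parser has to decide what an end tag with no matching start tag means, and that decision is where the parsers differ.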

You're seeing the result of a decision made by the html.parser parser when faced with ambiguous HTML. The lxml parser makes a different decision: it keeps every <p> tag inside the <div>:

soup = BeautifulSoup(page, 'lxml')
print(soup)
# <html><body><div class="entry-content"> <p>line one</p> <p></p> <p>line two</p></div></body></html>

So does html5lib:

soup = BeautifulSoup(page, 'html5lib')
print(soup)
# <html><head></head><body><div class="entry-content"> <p>line one</p> <p><br/></p> <p>line two</p></div></body></html>

Ultimately it's your choice. All three parsers do reasonable things, none of them lose data, and the differences between parsers lie beyond the scope of the Beautiful Soup project.
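
If you want to check which parsers keep "line two" inside the div on your own installation, something like the following sketch may help. The helper name line_two_in_div is hypothetical, and lxml and html5lib are optional third-party packages, so the sketch records None for any parser that isn't installed instead of assuming it is available.

```python
from bs4 import BeautifulSoup, FeatureNotFound

page = '<div class="entry-content"> <p>line one</p> <p></br></p> <p>line two</p></div>'


def line_two_in_div(parser_name):
    """Hypothetical helper: True if 'line two' ends up inside the matched div."""
    soup = BeautifulSoup(page, parser_name)
    div = soup.find("div", class_="entry-content")
    return div is not None and "line two" in div.get_text()


results = {}
for parser_name in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser_name] = line_two_in_div(parser_name)
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser isn't installed.
        results[parser_name] = None

print(results)
```

A parser that reports False here has not dropped the text; it has only placed that <p> tag somewhere other than inside the div.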

Changed in beautifulsoup:
status:	New → Invalid
