Reading <br> tags differs per parser

Bug #1676935 reported by paul weijtens
This bug affects 2 people
Affects: Beautiful Soup
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I noticed this problem while parsing web pages: the "unclosed" <br> tag is handled differently depending on which parser is used, even though that tag is perfectly valid HTML4, so the parsers shouldn't differ on it.

The full problem is described here: http://stackoverflow.com/questions/43022298/beautifulsoup-br-tag-handling-from-input

But in short:

<html><head><title>s</title></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></body></html>

is read as

<html><head></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></br></br></br></body></html>

using html.parser from the Python standard library, while html5lib gives the output:

<html><head></head><body>something <br/> <b> another thing</b> <br/> even more <br/> <b> end</b></body></html>

The input above is perfectly valid HTML, so Beautiful Soup, per its documentation, shouldn't give different outputs or interpretations for the same valid HTML.
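
To illustrate where such a difference can come from, here is a minimal sketch using only the standard library's html.parser module. It is not Beautiful Soup's actual tree-building code; EventRecorder, tag_events, close_void_elements and the partial VOID_ELEMENTS set are hypothetical names introduced for this example. The sketch shows that the stdlib parser reports a start event for a bare <br> and no end event, so whatever builds the tree has to know that <br> is a void element and close it immediately, otherwise it stays open until an enclosing tag ends.

```python
from html.parser import HTMLParser

# Void elements never take an end tag (partial list, for illustration only).
VOID_ELEMENTS = {"br", "hr", "img", "input", "link", "meta"}


class EventRecorder(HTMLParser):
    """Hypothetical helper: records the tag events html.parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Return the list of (kind, tag) events for a markup string."""
    recorder = EventRecorder()
    recorder.feed(markup)
    recorder.close()
    return recorder.events


def close_void_elements(events):
    """Add the implied end event after each void start tag that lacks one."""
    fixed = []
    for i, (kind, tag) in enumerate(events):
        fixed.append((kind, tag))
        next_event = events[i + 1] if i + 1 < len(events) else None
        if kind == "start" and tag in VOID_ELEMENTS and next_event != ("end", tag):
            fixed.append(("end", tag))
    return fixed


markup = "something <br> <b> another thing</b> <br> even more"
events = tag_events(markup)
print(events)                       # raw events: <br> has no matching end
print(close_void_elements(events))  # events with <br> closed right away
```

With the raw events, a naive tree builder would nest everything after the first <br> inside it, which matches the trailing </br></br></br> in the html.parser output above. Closing void elements as soon as they are seen gives the flat structure that html5lib produces.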

Note: although omitted here, I also tried the HTML above with an HTML4 (and an HTML5) doctype declaration.

PS: I strongly prefer the html5lib output and consider the html.parser output to be "wrong".

paul weijtens (pulli23)
description: updated
Leonard Richardson (leonardr) wrote :

Fixed in revision 446.

Changed in beautifulsoup:
status: New → Fix Committed
paul weijtens (pulli23)
Changed in beautifulsoup:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released