Beautiful Soup

Reading <br> tags differs per parser

Bug #1676935 reported by paul weijtens on 2017-03-28

This bug affects 2 people

Affects:	Beautiful Soup
Status:	Fix Released
Importance:	Undecided
Assigned to:	Unassigned

Bug Description

I noticed this problem while parsing web pages: the "unclosed" <br> tag is handled differently by different parsers, even though that tag is perfectly valid HTML4, so the parsers shouldn't differ on it.

The full problem is described here: http://stackoverflow.com/questions/43022298/beautifulsoup-br-tag-handling-from-input

But in short:

<html><head><title>s</title></head><body>something another thing even more end</body></html>

is read as

<html><head></head><body>something another thing even more end</body></html>

using html.parser from the Python standard library, while html5lib gives the output:

<html><head></head><body>something another thing even more end</body></html>

The input above is perfectly valid HTML, so Beautiful Soup, per its documentation, shouldn't give different outputs or interpretations for the same valid HTML.
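One way to see where such a difference can come from is to look at the raw events that the standard library's html.parser emits for an unclosed <br>. The sketch below is not Beautiful Soup's code; the EventLogger class and the tag_events helper are hypothetical names, and the markup is only illustrative (the positions of the <br> tags are assumed). It shows that html.parser reports a bare <br> as a start tag with no matching end tag, so any tree builder layered on top has to decide for itself where the element ends.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Record the start/end tag events emitted by html.parser (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def tag_events(markup):
    """Feed markup to the logger and return the recorded tag events."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events


# Illustrative markup; the <br> positions are an assumption.
markup = "<body>something <br> another thing <br> end</body>"

# A bare <br> yields only a "start" event, never an "end" event.
print(tag_events(markup))

# The self-closing form <br/> yields a "start" event followed by an "end" event,
# because the default handle_startendtag calls both handlers.
print(tag_events("<br/>"))
```

A tree builder that does not treat br as a void element would keep it open after the start event, which is one plausible way two parsers can end up with different trees for the same input.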

Note: although omitted here, I also tried the HTML above with an HTML4 (and an HTML5) doctype declaration.

PS: I strongly prefer the html5lib version, and consider the html.parser version to be "wrong"...

paul weijtens (pulli23) on 2017-03-28

description:

updated

Leonard Richardson (leonardr) wrote on 2017-05-07:

Fixed in revision 446.

Changed in beautifulsoup:
status:	New → Fix Committed

paul weijtens (pulli23) on 2017-05-08

Changed in beautifulsoup:
status:	Fix Committed → Fix Released
status:	Fix Released → Fix Committed

Leonard Richardson (leonardr) on 2018-07-28

Changed in beautifulsoup:
status:	Fix Committed → Fix Released

Duplicates of this bug

Bug #1681015
