Reading <br> tags differs per parser
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Beautiful Soup | Fix Released | Undecided | Unassigned |
Bug Description
I noticed this problem while parsing web pages: the "unclosed" <br> tag is handled differently by different parsers, even though that tag is perfectly valid HTML4 and the parsers therefore shouldn't differ on it.
The full problem is described here: http://
But in short:
<html><
is read as
<html><
using html.parser from the Python standard library, while html5lib gives the output:
<html><
The input above is perfectly valid HTML, so Beautiful Soup, per its documentation, shouldn't give different outputs.
Note: although omitted here, I also tried the HTML above with an HTML4 (and an HTML5) doctype declaration.
PS: I strongly prefer the html5lib version, and consider the html.parser version to be "wrong"...
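A minimal sketch of what the standard library tokenizer reports for an unclosed <br>, which may help when reproducing the difference. The sample markup and the names `EventLogger` and `tag_events` are assumptions made for illustration; the markup from the original report is truncated above and is not reconstructed here. The sketch only logs the events emitted by `html.parser.HTMLParser`. It does not show how Beautiful Soup builds its tree from them.

```python
from html.parser import HTMLParser

# Hypothetical sample input (an assumption): the original report's markup is truncated.
MARKUP = "<html><body><p>one<br>two</p></body></html>"


class EventLogger(HTMLParser):
    """Record the start-tag, end-tag, and text events the tokenizer emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def tag_events(markup):
    """Feed the markup to the logger and return the recorded events in order."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


events = tag_events(MARKUP)
for kind, value in events:
    print(kind, value)
```

The tokenizer emits a start event for <br> and never an end event, so whatever builds the tree on top of it has to know that <br> is a void element. To compare the resulting trees directly, and assuming both Beautiful Soup 4 and html5lib are installed, the same markup can be passed to `BeautifulSoup(MARKUP, "html.parser")` and `BeautifulSoup(MARKUP, "html5lib")` and the two results printed side by side.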
description: updated
Changed in beautifulsoup:
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Changed in beautifulsoup:
status: Fix Committed → Fix Released
Fixed in revision 446.