Beautiful Soup

Activity log for bug #1676935

Date	Who	What changed	Old value	New value	Message
2017-03-28 15:55:09	paul weijtens	bug			added bug
2017-03-28 16:00:30	paul weijtens	description	Well this is a problem I noticed while parsing web-pages. But it seems that the "unclosed" <br> tag is handled differently using different parsers. - Even though that tag is perfectly valid HTML4 and hence the parser shouldn't differ on those. The full problem is described here: http://stackoverflow.com/questions/43022298/beautifulsoup-br-tag-handling-from-input But in short: <html><head><title>s</title></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></body></html> is read as <html><head></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></br></br></br></body></html> using html.parser from the python library. While html5lib gives the output: <html><head></head><body>something <br/> <b> another thing</b> <br/> even more <br/> <b> end</b></body></html> Now above "input" is perfectly valid HTML, and hence beautifulsoup -per documentation- shouldn't give different outputs/interpretations for the same valid html. Notice, while omitted here, I also tried above HTML with a HTML4 (and html5) doctype declaration.	Well this is a problem I noticed while parsing web-pages. But it seems that the "unclosed" <br> tag is handled differently using different parsers. - Even though that tag is perfectly valid HTML4 and hence the parser shouldn't differ on those. The full problem is described here: http://stackoverflow.com/questions/43022298/beautifulsoup-br-tag-handling-from-input But in short: <html><head><title>s</title></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></body></html> is read as <html><head></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></br></br></br></body></html> using html.parser from the python library. While html5lib gives the output: <html><head></head><body>something <br/> <b> another thing</b> <br/> even more <br/> <b> end</b></body></html> Now above "input" is perfectly valid HTML, and hence beautifulsoup -per documentation- shouldn't give different outputs/interpretations for the same valid html. Notice, while omitted here, I also tried above HTML with a HTML4 (and html5) doctype declaration. PS: I strongly prefer the html5lib version, and consider the html.parser version to be "wrong"...
2017-05-07 01:32:43	Leonard Richardson	beautifulsoup: status	New	Fix Committed
2017-05-08 16:12:32	paul weijtens	beautifulsoup: status	Fix Committed	Fix Released
2017-05-08 16:12:37	paul weijtens	beautifulsoup: status	Fix Released	Fix Committed
2018-07-28 23:51:04	Leonard Richardson	beautifulsoup: status	Fix Committed	Fix Released
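
The behaviour in the description can be illustrated with the standard library alone. The sketch below is not Beautiful Soup's code and does not reproduce its fix; it only shows that Python's `html.parser` reports each `<br>` as a start tag with no matching end tag, so a tree builder layered on top has to know that `br` is a void element. The names `EventLogger`, `unclosed_tags`, and `VOID_ELEMENTS` are hypothetical helpers introduced for this illustration, and the sample markup is a shortened form of the reporter's input.

```python
from html.parser import HTMLParser

# Void elements as listed in the HTML standard: they never take a closing tag.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class EventLogger(HTMLParser):
    """Hypothetical helper: records the start/end tag events html.parser emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def unclosed_tags(markup):
    """Return the tags that were opened but never closed, in document order."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()

    open_tags = []
    for kind, tag in logger.events:
        if kind == "start":
            open_tags.append(tag)
        elif tag in open_tags:
            # Close the most recently opened tag with this name.
            last = len(open_tags) - 1 - open_tags[::-1].index(tag)
            del open_tags[last]
    return open_tags


markup = (
    "<body>something <br> <b> another thing</b> "
    "<br> even more <br> <b> end</b></body>"
)

pending = unclosed_tags(markup)
print(pending)  # ['br', 'br', 'br']

# Only non-void tags should ever be closed by a tree builder.
needs_close = [tag for tag in pending if tag not in VOID_ELEMENTS]
print(needs_close)  # []
```

A builder that ignores the void-element list would treat the three pending `br` tags as still open and close them at the end of `<body>`, which matches the `</br></br></br>` output quoted in the description.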