Well, this is a problem I noticed while parsing web pages. It seems that the "unclosed" <br> tag is handled differently by different parsers, even though that tag is perfectly valid HTML4 and hence the parsers shouldn't differ on it.
<html><head><title>s</title></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></body></html>
is read as
<html><head></head><body>something <br> <b> another thing</b> <br> even more <br> <b> end</b></br></br></br></body></html>
using html.parser from the Python standard library, while html5lib gives the output:
<html><head></head><body>something <br/> <b> another thing</b> <br/> even more <br/> <b> end</b></body></html>
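For anyone who wants to reproduce the comparison, here is a minimal sketch. The helper name compare_parsers is just something I made up for illustration, and it assumes bs4 is installed (and html5lib, if that parser should be included); parsers that are not available are simply skipped. The exact output may depend on the installed versions.

```python
try:
    from bs4 import BeautifulSoup, FeatureNotFound
except ImportError:  # bs4 not installed; compare_parsers() then returns {}
    BeautifulSoup = None

# The input from this report, unchanged.
MARKUP = (
    "<html><head><title>s</title></head><body>something <br> "
    "<b> another thing</b> <br> even more <br> <b> end</b></body></html>"
)


def compare_parsers(markup, parsers=("html.parser", "html5lib")):
    """Return {parser name: serialized tree} for every parser that is available.

    Hypothetical helper for this report, not part of BeautifulSoup.
    """
    results = {}
    if BeautifulSoup is None:
        return results
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(markup, name))
        except FeatureNotFound:
            # The requested parser library is not installed; skip it.
            continue
    return results


if __name__ == "__main__":
    for name, rendered in compare_parsers(MARKUP).items():
        print(f"{name}: {rendered}")
```

Comparing the printed strings side by side shows whether the <br> tags are serialized the same way by both parsers.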
Now the "input" above is perfectly valid HTML, and hence BeautifulSoup, per its documentation, shouldn't give different outputs/interpretations for the same valid HTML.
Note: while omitted here, I also tried the above HTML with an HTML4 (and HTML5) doctype declaration.
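To narrow down where a closing </br> could come from, one rough check is to look at the raw events that the standard library's html.parser.HTMLParser emits for this input, without BeautifulSoup involved. This is only a sketch; the class TagEventCounter and the function count_tag_events are names I made up for illustration.

```python
from html.parser import HTMLParser

# The input from this report, unchanged.
MARKUP = (
    "<html><head><title>s</title></head><body>something <br> "
    "<b> another thing</b> <br> even more <br> <b> end</b></body></html>"
)


class TagEventCounter(HTMLParser):
    """Counts start-tag and end-tag events per tag name (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.starts = {}
        self.ends = {}

    def handle_starttag(self, tag, attrs):
        self.starts[tag] = self.starts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        self.ends[tag] = self.ends.get(tag, 0) + 1


def count_tag_events(markup):
    """Feed the markup to the stdlib tokenizer and return (starts, ends)."""
    counter = TagEventCounter()
    counter.feed(markup)
    counter.close()
    return counter.starts, counter.ends


if __name__ == "__main__":
    starts, ends = count_tag_events(MARKUP)
    print("br start tags:", starts.get("br", 0))
    print("br end tags:", ends.get("br", 0))
```

The tokenizer reports three br start tags and no br end tag for this input, so any </br> in the serialized result would have to be added later, when the tree is built or written out.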
The full problem is described here: http://stackoverflow.com/questions/43022298/beautifulsoup-br-tag-handling-from-input