soupparser mishandles doctypes
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
lxml | Fix Released | Medium | Unassigned | | |
Bug Description
lxml.html.
Let's take a simple HTML document and parse it with soupparser:
example = \
'''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://
<HTML>
<HEAD>
<TITLE>My first HTML document</TITLE>
</HEAD>
<BODY>
<P>Hello world!
</BODY>
</HTML>'''
import lxml.html, lxml.html.
root = lxml.html.
tree = root.getroottree()
tree.docinfo.
Result:
u'<!DOCTYPE [document] PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://
So 'HTML' becomes '[document]'. Weird.
Okay, let's serialize the tree back to string.
lxml.html.
lxml.html.
Results:
'<html>
'<[document]
The first result is fine, the second one not so much. How about we specify the doctype manually?
lxml.html.
Result:
''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://
Okay, workaround found. Unless...
example = \
'''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://
<!-- comment -->
<HTML>
<HEAD>
<TITLE>My first HTML document</TITLE>
</HEAD>
<BODY>
<P>Hello world!
</BODY>
</HTML>'''
root = lxml.html.
lxml.html.
Result:
'<html>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n "http://
Well that's interesting.
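For anyone who wants to check their own output for the two symptoms seen so far, here is a minimal sketch using only the standard library. It is not part of lxml. The helper names `doctype_root_name` and `doctype_symptoms` are hypothetical, and the strings you pass in are assumed to be the doctype reported by the tree and the serialized document, respectively.

```python
import re

# Matches the root name that follows "<!DOCTYPE" (hypothetical helper pattern).
_DOCTYPE_NAME = re.compile(r'<!DOCTYPE\s+([^\s>]+)', re.IGNORECASE)


def doctype_root_name(doctype):
    """Return the root element name declared in a doctype string, or None."""
    match = _DOCTYPE_NAME.match(doctype.strip())
    return match.group(1) if match else None


def doctype_symptoms(doctype, serialized):
    """List which of the two symptoms described in this report are present."""
    symptoms = []
    # Symptom 1: the root name 'HTML' was replaced by '[document]'.
    if doctype_root_name(doctype) == '[document]':
        symptoms.append('root name replaced by [document]')
    # Symptom 2: the declaration text sits right after the opening <html> tag
    # as plain text instead of being a real "<!DOCTYPE ...>" declaration.
    if re.match(r'<html[^>]*>\s*DOCTYPE\b', serialized.lstrip(), re.IGNORECASE):
        symptoms.append('doctype leaked into <html> as text')
    return symptoms
```

Calling `doctype_symptoms(doctype_string, serialized_string)` returns an empty list when neither symptom is present. The checks are plain string matching, so they only flag the exact patterns shown in this report.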
It seems that the bug is specific to lxml:
import bs4
tree = bs4.BeautifulSo
bs4.__version__
tree
Results:
'4.3.2'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://
<!-- comment --><html>
<head>
<title>My first HTML document</title>
</head>
<body>
<p>Hello world!
</p></body>
</html>
Version info:
Python : sys.version_
lxml.etree : (3, 3, 5, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
Did you try the lxml backend in BS?