lxml

soupparser mishandles doctypes

Bug #1341964 reported by Olli Pottonen on 2014-07-15

This bug affects 1 person

Affects   Status         Importance   Assigned to   Milestone
lxml      Fix Released   Medium       Unassigned    lxml 3.5

Bug Description

lxml.html.soupparser mishandles doctypes.

Let's take a simple HTML document and parse it with soupparser:

example = \
'''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<HTML>
 <HEAD>
 <TITLE>My first HTML document</TITLE>
 </HEAD>
 <BODY>
 Hello world!
 </BODY>
</HTML>'''

import lxml.html, lxml.html.soupparser
root = lxml.html.soupparser.fromstring(example)
tree = root.getroottree()
tree.docinfo.doctype

Result:
u'<!DOCTYPE [document] PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'

So 'HTML' became '[document]'. Weird.
Okay, let's serialize the tree back to string.

lxml.html.tostring(root)
lxml.html.tostring(tree)

Results:
'<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\nHello world!\n </body>\n</html>'
'<[document]>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"\n<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\nHello world!\n </body>\n</html></[document]>'

First result is fine, second one not so much. How about we specify the doctype manually?

lxml.html.tostring(root, doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">')

Result:
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\nHello world!\n </body>\n</html>'
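
To avoid hard-coding the doctype in that call, it could be read from the raw markup first. The following is only a sketch: extract_doctype is a hypothetical helper, not part of lxml, and it assumes the declaration has no internal subset (no '>' inside it) and that the parsed root itself is intact.

```python
import re

# Hypothetical helper, not part of lxml: find the doctype declaration in the
# raw markup so it can be passed as lxml.html.tostring(root, doctype=...).
# Assumption: the declaration contains no '>' (no internal subset).
_DOCTYPE_RE = re.compile(r'<!DOCTYPE\b[^>]*>', re.IGNORECASE)


def extract_doctype(markup):
    """Return the first doctype declaration with whitespace collapsed, or None."""
    match = _DOCTYPE_RE.search(markup)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces.
    return ' '.join(match.group(0).split())


example = '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">
<HTML><BODY>Hello world!</BODY></HTML>'''

doctype = extract_doctype(example)
print(doctype)
```

The returned string can then be passed as the doctype argument shown above. This only restores the declaration on output; it does not change what the parser built.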

Okay, workaround found. Unless...

example = \
'''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
 "http://www.w3.org/TR/html4/strict.dtd">

<HTML>
 <HEAD>
 <TITLE>My first HTML document</TITLE>
 </HEAD>
 <BODY>
 Hello world!
 </BODY>
</HTML>'''
root = lxml.html.soupparser.fromstring(example)
lxml.html.tostring(root)

Result:
'<html>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n "http://www.w3.org/TR/html4/strict.dtd"\n\n<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\nHello world!\n </body>\n</html></html>'

Well that's interesting.
It seems that the bug is specific to lxml:
import bs4
tree = bs4.BeautifulSoup(example)
bs4.__version__
tree

Results:
'4.3.2'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>My first HTML document</title>
</head>
<body>
Hello world!
</body>
</html>

Version info:
Python : sys.version_info(major=2, minor=7, micro=7, releaselevel='final', serial=0)
lxml.etree : (3, 3, 5, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote on 2014-12-06:

Did you try the lxml backend in BeautifulSoup?

Changed in lxml:
status:	New → Triaged

scoder (scoder) wrote on 2015-02-14:

Proposed fix:
https://github.com/opottone/lxml/commit/1b0d3625523c6d389f47ac99b7b75f4b80a7bea8

Changed in lxml:
importance:	Undecided → Medium
milestone:	none → 3.5
status:	Triaged → In Progress

scoder (scoder) wrote on 2016-03-18:

Fix released in lxml 3.5.0.

Changed in lxml:
status:	In Progress → Fix Released
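
Code that has to run on both older and newer lxml releases could check for the fix by version. This is a minimal sketch, assuming the version is available as a tuple like the "lxml.etree" entry in the version info above (with lxml installed, lxml.etree.LXML_VERSION holds such a tuple). The helper name and the FIXED_IN constant are hypothetical.

```python
# Hypothetical helper: decide from a version tuple whether the soupparser
# doctype fix (released in lxml 3.5.0 according to this report) is present.
FIXED_IN = (3, 5, 0)


def has_soupparser_doctype_fix(version):
    """Return True if the lxml version tuple is at or above the fixed release."""
    return tuple(version) >= FIXED_IN


# The version from the original report predates the fix.
print(has_soupparser_doctype_fix((3, 3, 5, 0)))
# A 3.5.0 release tuple includes it.
print(has_soupparser_doctype_fix((3, 5, 0, 0)))
```

On versions without the fix, the explicit doctype argument to lxml.html.tostring remains an option for documents whose root parses correctly.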
