soupparser mishandles doctypes

Bug #1341964 reported by Olli Pottonen
This bug affects 1 person
Affects   Status         Importance   Assigned to   Milestone
lxml      Fix Released   Medium       Unassigned

Bug Description

lxml.html.soupparser mishandles doctypes.

Let's take a simple HTML document and parse it with soupparser:

example = \
 '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
 <HTML>
    <HEAD>
       <TITLE>My first HTML document</TITLE>
    </HEAD>
    <BODY>
       <P>Hello world!
    </BODY>
 </HTML>'''

import lxml.html, lxml.html.soupparser
root = lxml.html.soupparser.fromstring(example)
tree = root.getroottree()
tree.docinfo.doctype

Result:
u'<!DOCTYPE [document] PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'

So 'HTML' became '[document]'. Weird.
Okay, let's serialize the tree back to a string.

lxml.html.tostring(root)
lxml.html.tostring(tree)

Results:
'<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html>'
'<[document]>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"\n<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html></[document]>'

The first result is fine, the second one not so much: the doctype ends up as plain text inside a bogus <[document]> element. How about we specify the doctype manually?

lxml.html.tostring(root, doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">')

Result:
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html>'
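
Instead of retyping the doctype by hand, it could be recovered from the original source text. The following is only a minimal sketch using the standard library: the helper name extract_doctype is made up for illustration, and the regular expression is a simplification that assumes the declaration contains no '>' character (so it would not cope with an internal DTD subset).

```python
import re

# Simplified pattern (an assumption, not a full SGML/HTML grammar):
# '<!DOCTYPE' followed by anything up to the first '>'.
# The negated class [^>] also matches newlines, so multi-line
# declarations are covered.
_DOCTYPE_RE = re.compile(r'<!DOCTYPE\b[^>]*>', re.IGNORECASE)


def extract_doctype(source):
    """Return the first doctype declaration found in source, or None."""
    match = _DOCTYPE_RE.search(source)
    return match.group(0) if match else None


example = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
    '   "http://www.w3.org/TR/html4/strict.dtd">\n'
    '<HTML><BODY><P>Hello world!</BODY></HTML>'
)
doctype = extract_doctype(example)
print(doctype)
```

The returned string could then be passed as the doctype argument of lxml.html.tostring(root, doctype=doctype), as in the call above. It only restores the declaration in the output; it does not repair a tree that already contains stray doctype text.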

Okay, workaround found. Unless...

example = \
'''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<!-- comment -->
<HTML>
   <HEAD>
      <TITLE>My first HTML document</TITLE>
   </HEAD>
   <BODY>
      <P>Hello world!
   </BODY>
</HTML>'''
root = lxml.html.soupparser.fromstring(example)
lxml.html.tostring(root)

Result:
'<html>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n "http://www.w3.org/TR/html4/strict.dtd"\n\n<!-- comment --><html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html></html>'

Well that's interesting.
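
The symptom in the result above (the word DOCTYPE showing up as text rather than as part of a '<!DOCTYPE ...>' declaration) can be checked for mechanically, for example in a regression test. This is a rough heuristic sketch, not part of lxml; the function name has_leaked_doctype is hypothetical, and a page whose visible text legitimately contains the word DOCTYPE would trigger a false positive.

```python
import re

# 'DOCTYPE' that is NOT immediately preceded by '<!' is treated as leaked text.
_LEAKED_RE = re.compile(r'(?<!<!)DOCTYPE', re.IGNORECASE)


def has_leaked_doctype(serialized):
    """Heuristic: True if 'DOCTYPE' appears outside a '<!DOCTYPE' declaration."""
    return _LEAKED_RE.search(serialized) is not None


# Strings shortened from the results quoted in this report.
good = '<html>\n<head>\n<title>My first HTML document</title>\n</head>\n</html>'
bad = ('<html>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n '
       '"http://www.w3.org/TR/html4/strict.dtd"\n\n<!-- comment --><html>\n</html></html>')

print(has_leaked_doctype(good))
print(has_leaked_doctype(bad))
```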
It seems that the bug is specific to lxml, since BeautifulSoup itself handles the same document correctly:
import bs4
tree = bs4.BeautifulSoup(example)
bs4.__version__
tree

Results:
'4.3.2'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!-- comment --><html>
<head>
<title>My first HTML document</title>
</head>
<body>
<p>Hello world!
   </p></body>
</html>

Version info:
Python : sys.version_info(major=2, minor=7, micro=7, releaselevel='final', serial=0)
lxml.etree : (3, 3, 5, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)
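
For anyone trying to reproduce this, version information like the above can be gathered with a short script. This is a sketch: the helper name collect_versions is made up, and it falls back to reporting only the Python version when lxml is not importable. The lxml.etree attributes used are the version tuples that lxml exposes.

```python
import sys


def collect_versions():
    """Return a dict of version tuples for Python and, if available, lxml."""
    versions = {'Python': tuple(sys.version_info[:3])}
    try:
        from lxml import etree
    except ImportError:
        # lxml is not installed; report only the Python version.
        return versions
    versions['lxml.etree'] = etree.LXML_VERSION
    versions['libxml used'] = etree.LIBXML_VERSION
    versions['libxml compiled'] = etree.LIBXML_COMPILED_VERSION
    versions['libxslt used'] = etree.LIBXSLT_VERSION
    versions['libxslt compiled'] = etree.LIBXSLT_COMPILED_VERSION
    return versions


for name, value in collect_versions().items():
    print('%-18s: %s' % (name, value))
```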

Revision history for this message
scoder (scoder) wrote :

Did you try the lxml backend in BeautifulSoup?

Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :
Changed in lxml:
importance: Undecided → Medium
milestone: none → 3.5
status: Triaged → In Progress
scoder (scoder) wrote :

Fix released in lxml 3.5.0.

Changed in lxml:
status: In Progress → Fix Released