soupparser mishandles doctypes

Bug #1341964 reported by Olli Pottonen on 2014-07-15
This bug affects 1 person
Affects: lxml
Importance: Medium
Assigned to: Unassigned

Bug Description

lxml.html.soupparser mishandles doctypes.

Let's take a simple HTML document and parse it with soupparser:

example = \
 '''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
 <HTML>
    <HEAD>
       <TITLE>My first HTML document</TITLE>
    </HEAD>
    <BODY>
       <P>Hello world!
    </BODY>
 </HTML>'''

import lxml.html, lxml.html.soupparser
root = lxml.html.soupparser.fromstring(example)
tree = root.getroottree()
tree.docinfo.doctype

Result:
u'<!DOCTYPE [document] PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">'

So 'HTML' became '[document]'. Weird.
Okay, let's serialize the tree back to string.

lxml.html.tostring(root)
lxml.html.tostring(tree)

Results:
'<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html>'
'<[document]>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"\n<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html></[document]>'

The first result is fine; the second, not so much. How about we specify the doctype manually?

lxml.html.tostring(root, doctype='<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">')

Result:
'<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html>'
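
Passing the doctype by hand means copying it out of the input first. A minimal sketch of how that could be automated, assuming the declaration sits at the very start of the raw text and has no internal subset; the helper name extract_doctype is hypothetical and not part of lxml:

```python
import re

# Hypothetical helper, not part of lxml: grab the leading <!DOCTYPE ...>
# declaration from raw HTML text. Assumes there is no internal subset
# ('[' ... ']') and no '>' character inside the quoted identifiers.
_DOCTYPE_RE = re.compile(r'\s*(<!DOCTYPE[^>]*>)', re.IGNORECASE)


def extract_doctype(html_text):
    """Return the doctype declaration at the start of html_text, or None."""
    match = _DOCTYPE_RE.match(html_text)
    return match.group(1) if match else None
```

The return value could then be handed to the call above, as in lxml.html.tostring(root, doctype=extract_doctype(example)).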

Okay, workaround found. Unless...

example = \
'''<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<!-- comment -->
<HTML>
   <HEAD>
      <TITLE>My first HTML document</TITLE>
   </HEAD>
   <BODY>
      <P>Hello world!
   </BODY>
</HTML>'''
root = lxml.html.soupparser.fromstring(example)
lxml.html.tostring(root)

Result:
'<html>DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n "http://www.w3.org/TR/html4/strict.dtd"\n\n<!-- comment --><html>\n<head>\n<title>My first HTML document</title>\n</head>\n<body>\n<p>Hello world!\n </p></body>\n</html></html>'

Well that's interesting.
It seems that the bug is specific to lxml, since BeautifulSoup on its own handles the same document correctly:

import bs4
tree = bs4.BeautifulSoup(example)
bs4.__version__
tree

Results:
'4.3.2'
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<!-- comment --><html>
<head>
<title>My first HTML document</title>
</head>
<body>
<p>Hello world!
   </p></body>
</html>

Version info:
Python : sys.version_info(major=2, minor=7, micro=7, releaselevel='final', serial=0)
lxml.etree : (3, 3, 5, 0)
libxml used : (2, 9, 0)
libxml compiled : (2, 9, 1)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

Did you try the lxml backend in BS?

Changed in lxml:
status: New → Triaged
scoder (scoder) wrote :
Changed in lxml:
importance: Undecided → Medium
milestone: none → 3.5
status: Triaged → In Progress
scoder (scoder) wrote :

Fix released in lxml 3.5.0.
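
For code that has to run on both older and newer installations, a version check can decide whether the manual-doctype workaround is still needed. A rough sketch only; the function name is hypothetical, and it assumes the version is available as a tuple such as the (3, 3, 5, 0) shown in the version info above:

```python
# Hypothetical helper: True if the given lxml version tuple is at least
# 3.5.0, the release in which this bug is reported as fixed.
FIXED_IN = (3, 5, 0)


def has_doctype_fix(version):
    # Compare only major, minor, micro; ignore any trailing build number.
    return tuple(version[:3]) >= FIXED_IN

# Possible use, if lxml is installed:
#   import lxml.etree
#   has_doctype_fix(lxml.etree.LXML_VERSION)
```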

Changed in lxml:
status: In Progress → Fix Released