missing doctype when serialized

Bug #659367 reported by Tomasz Melcer on 2010-10-12
This bug affects 7 people
Affects  Status        Importance  Assigned to    Milestone
lxml     Fix Released  Medium      Olli Pottonen  3.5

Bug Description

    In [1]: from lxml import etree

I've got an HTML document:

    In [2]: root = etree.fromstring(u'''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>''', etree.HTMLParser())

Its doctype is parsed correctly:

    In [3]: root.getroottree().docinfo.doctype
    Out[3]: u'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">'

But when serializing it, I am losing it:

    In [4]: etree.tostring(root.getroottree(), method='html')
    Out[4]: '<html></html>'

I expected to get the doctype here too.

Python : (2, 6, 6, 'final', 0)
lxml.etree : (2, 2, 8, 0)
libxml used : (2, 7, 7)
libxml compiled : (2, 7, 7)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

scoder (scoder) wrote :

Makes sense. I'd accept a pull request that inserts the doctype in the _tostring() function (serializer.pxi) when no doctype argument was provided (None), the document has an internal or external subset, and "write_complete_document" is set.

Note that the doctype would have to be reconstructed from the DTD, as done in the DocInfo() class. This functionality would need to be factored out.
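
The reconstruction described in this comment can be sketched in plain Python. This is only an illustration of the idea, not lxml's actual code: build_doctype is a hypothetical helper, and it assumes the declaration is assembled from the root name, public ID and system URL, the same three values that docinfo exposes as root_name, public_id and system_url.

```python
def build_doctype(root_name, public_id=None, system_url=None):
    """Hypothetical helper: assemble a DOCTYPE declaration string.

    Returns an empty string when there is no root name to declare.
    """
    if not root_name:
        return ""
    if public_id:
        if system_url:
            # Both identifiers present: PUBLIC "<public id>" "<system url>"
            return '<!DOCTYPE %s PUBLIC "%s" "%s">' % (
                root_name, public_id, system_url)
        # Public identifier only, as in the document from this report
        return '<!DOCTYPE %s PUBLIC "%s">' % (root_name, public_id)
    if system_url:
        # System identifier only
        return '<!DOCTYPE %s SYSTEM "%s">' % (root_name, system_url)
    # Bare declaration, e.g. <!DOCTYPE html>
    return "<!DOCTYPE %s>" % root_name


# The doctype from the report: a public ID without a system URL.
print(build_doctype("html", "-//IETF//DTD HTML//EN"))
```

A serializer could call such a helper when no explicit doctype was passed in and write the result ahead of the root element. Where exactly that would hook into _tostring() is not shown here.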

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed

scoder (scoder) wrote :

Changed in lxml:
milestone: none → 3.5
status: Confirmed → In Progress

scoder (scoder) wrote :

Changed in lxml:
assignee: nobody → Olli Pottonen (olli-pottonen)
status: In Progress → Fix Committed

scoder (scoder) wrote :

Fixed in lxml 3.5.0.

Changed in lxml:
status: Fix Committed → Fix Released