Internal Subset Entities are not output correctly

Bug #364261 reported by Sidnei da Silva
Affects: Chameleon Core
Status: Fix Released
Importance: High
Assigned to: Sidnei da Silva

Bug Description

Entities declared in a document's internal DTD subset are not carried over to the output. This seems to be a bug in lxml, and an additional bug in chameleon makes it worse.

Here's some context from the lxml mailing list.

"""
Sidnei da Silva wrote:
> I am looking for a way to output internal entities that have been
> parsed from the original document when writing out a tree, but
> apparently this is not exposed in any attribute.
>
> Here's an example:
>
> {{{
> import lxml.etree
>
> document = """<?xml version="1.0"?>
> <!DOCTYPE application [
> <!ENTITY nbsp "&#160;">
> ]>
> <application>&nbsp;</application>
> """
>
>
> tree = lxml.etree.fromstring(document)
> print(tree.getroottree().docinfo.doctype)
> }}}
>
> I would expect this to output:
> {{{
> <!DOCTYPE application [
> <!ENTITY nbsp "&#160;">
> ]>
> }}}
>
> But instead it gives me:
>
> {{{
> <!DOCTYPE application>
> }}}
>
> Is it a bug, or am I looking in the wrong place?

What you are looking for is the internal subset of the document, which is
not (really) part of the DOCTYPE itself. It's available through the
"docinfo.internalDTD" property. However, lxml.etree doesn't expose the
content of the DTD, so this is currently only usable for validation (i.e.
not very helpful in your case).

What you could try is to parse the document without resolving the entities,
then traverse the Entity elements and collect their names in a set. That
will not give you the resolved entity values, though...

I think it would be nice if tostring() could serialise DTDs, but I doubt
that there are so many use cases for that. In your case, you'd then have to
parse the DTD yourself, which you could also do by clearing the root node
and serialising the document to unicode.

Stefan
"""
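
The workaround Stefan suggests in the reply above (parse without resolving entities, then collect the names of the Entity nodes) might look roughly like the sketch below. The helper name `referenced_entity_names` is made up for illustration, and the sample document is the one from the original question. As the reply notes, this yields only the entity names, not their declared values.

```python
from lxml import etree

DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE application [
<!ENTITY nbsp "&#160;">
]>
<application>&nbsp;</application>
"""


def referenced_entity_names(xml_bytes):
    """Return the names of entity references left unresolved in the tree.

    Hypothetical helper: it only sees entities that are actually
    referenced in the document body, not every declared entity.
    """
    # Keep &name; references as Entity nodes instead of expanding them.
    parser = etree.XMLParser(resolve_entities=False)
    root = etree.fromstring(xml_bytes, parser)
    # iter() accepts the Entity factory as a filter for entity nodes.
    return {node.name for node in root.iter(etree.Entity)}


names = referenced_entity_names(DOCUMENT)
print(sorted(names))
```

Entities that are declared but never referenced will not show up in the result, so this is not a substitute for serialising the internal subset itself.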

Sidnei da Silva (sidnei) wrote :

A fix was checked into chameleon: the initial parse with expat now keeps the 'doctype' around and passes that same doctype explicitly to the output, instead of relying on the lxml docinfo object. This could be improved as Stefan describes, but it should do for now.
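
For readers who want to see the general idea, here is a minimal sketch of capturing a doctype, including its internal entity declarations, during an expat parse. It is not the code that was checked into chameleon; the function name `capture_doctype` and the way the declaration is rebuilt are assumptions made for illustration. It handles only simple internal general entities and re-encodes non-ASCII characters in entity values as numeric character references.

```python
import xml.parsers.expat

DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE application [
<!ENTITY nbsp "&#160;">
]>
<application>&nbsp;</application>
"""


def capture_doctype(xml_text):
    """Parse xml_text with expat and rebuild its DOCTYPE declaration.

    Illustrative only: notations, parameter entities, external and
    unparsed entities, and element/attribute declarations are ignored.
    """
    info = {"name": None, "system_id": None, "public_id": None}
    entities = []

    def start_doctype(name, system_id, public_id, has_internal_subset):
        info.update(name=name, system_id=system_id, public_id=public_id)

    def entity_decl(name, is_parameter_entity, value, base,
                    system_id, public_id, notation_name):
        # value is None for external entities; skip those and %param; ones.
        if value is not None and not is_parameter_entity:
            entities.append((name, value))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = start_doctype
    parser.EntityDeclHandler = entity_decl
    parser.Parse(xml_text, True)

    if info["name"] is None:
        return None  # the document has no DOCTYPE

    head = "<!DOCTYPE " + info["name"]
    if info["public_id"]:
        head += ' PUBLIC "%s" "%s"' % (info["public_id"], info["system_id"])
    elif info["system_id"]:
        head += ' SYSTEM "%s"' % info["system_id"]
    if not entities:
        return head + ">"

    lines = [head + " ["]
    for name, value in entities:
        # expat reports the value with character references expanded,
        # so turn non-ASCII characters back into &#NNN; references.
        text = value.replace('"', "&#34;")
        text = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
        lines.append('<!ENTITY %s "%s">' % (name, text))
    lines.append("]>")
    return "\n".join(lines)


doctype = capture_doctype(DOCUMENT)
print(doctype)
```

The rebuilt string can then be written ahead of the serialised tree, which is the same overall shape as the fix described above: the doctype comes from the first parse rather than from the lxml docinfo object.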

Changed in chameleon.core:
assignee: nobody → Sidnei da Silva (sidnei)
importance: Undecided → High
status: New → Fix Committed
Sidnei da Silva (sidnei)
Changed in chameleon.core:
status: Fix Committed → Fix Released