Internal Subset Entities are not output correctly
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
Chameleon Core | Fix Released | High | Sidnei da Silva | | |
Bug Description
The Internal Subset of Entities is not carried over to the output. This seems to be a bug in lxml, and an additional bug in chameleon makes it worse.
Here's some context, from the lxml mailing list.
"""
Sidnei da Silva wrote:
> I am looking for a way to output internal entities that have been
> parsed from the original document when writing out a tree, but
> apparently this is not exposed in any attribute.
>
> Here's an example:
>
> {{{
> import lxml.etree
>
> document = """<?xml version="1.0"?>
> <!DOCTYPE application [
> <!ENTITY nbsp "\ ">
> ]>
> <application>
> """
>
>
> tree = lxml.etree.
> print tree.getroottre
> }}}
>
> I would expect this to output:
> {{{
> <!DOCTYPE application [
> <!ENTITY nbsp "\ ">
> ]>
> }}}
>
> But instead it gives me:
>
> {{{
> <!DOCTYPE application>
> }}}
>
> Is it a bug or I'm not looking at the right place?
What you are looking for is the internal subset of the document, which is
not (really) part of the DOCTYPE itself. It's available through the
"docinfo.
content of the DTD, so this is currently only usable for validation (i.e.
not very helpful in your case).
What you could try is to parse the document without resolving the entities,
then traverse the Entity elements and collect their names in a set. That
will not give you the resolved entity values, though...
I think it would be nice if tostring() could serialise DTDs, but I doubt
that there are so many use cases for that. In your case, you'd then have to
parse the DTD yourself, which you could also do by clearing the root node
and serialising the document to unicode.
Stefan
"""
Changed in chameleon.core:
status: Fix Committed → Fix Released
A fix was checked into chameleon so that the initial parsing with expat keeps the 'doctype' around, and the same doctype is passed explicitly to the output instead of relying on the lxml docinfo object. This could be improved as described by Stefan, but should do for now.
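For illustration only, here is a minimal sketch of how a doctype, including its internal entity declarations, might be captured during an expat parse so it can be written out again later. This is not the actual chameleon code: the function name `capture_doctype` and the way the declaration is rebuilt are assumptions made for this example, and it only handles internal general entities with simple text values (values containing markup or entity references are not reproduced faithfully).

```python
import xml.parsers.expat


def capture_doctype(source):
    """Return the DOCTYPE declaration of `source` as a string, including
    internal general entity declarations, or None if there is no DOCTYPE.

    Hypothetical helper for illustration; not part of chameleon or lxml.
    """
    doctype = {}
    entities = []

    def start_doctype(name, system_id, public_id, has_internal_subset):
        doctype.update(name=name, system_id=system_id, public_id=public_id)

    def entity_decl(name, is_parameter, value, base,
                    system_id, public_id, notation):
        # Only internal general entities carry a literal value.
        if value is not None and not is_parameter:
            entities.append((name, value))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = start_doctype
    parser.EntityDeclHandler = entity_decl
    parser.Parse(source, True)

    if not doctype:
        return None

    def quote(value):
        # Write double quotes and non-ASCII characters as numeric
        # character references so the result stays plain ASCII.
        return "".join(
            "&#%d;" % ord(ch) if ch == '"' or ord(ch) > 127 else ch
            for ch in value
        )

    head = "<!DOCTYPE " + doctype["name"]
    if doctype["public_id"]:
        head += ' PUBLIC "%s" "%s"' % (doctype["public_id"],
                                       doctype["system_id"])
    elif doctype["system_id"]:
        head += ' SYSTEM "%s"' % doctype["system_id"]

    if not entities:
        return head + ">"
    lines = ['<!ENTITY %s "%s">' % (n, quote(v)) for n, v in entities]
    return head + " [\n" + "\n".join(lines) + "\n]>"


example = (
    '<?xml version="1.0"?>\n'
    "<!DOCTYPE application [\n"
    '<!ENTITY greeting "hello">\n'
    "]>\n"
    "<application/>"
)
print(capture_doctype(example))
```

The captured string could then be handed to whatever serialises the tree, so the output no longer depends on what the lxml docinfo object exposes.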