DOCTYPE declaration might make an entity appear in the middle of the children elements
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
lxml | New | Undecided | Unassigned | |
Bug Description
The following XML file, article_
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublish
<article>
Stuff <sup>1</sup> stuff ⋯ stuff <sup>2</sup>
</article>
Was loaded with:
from lxml import etree
et = etree.parse(
elements = list(et.getroot())
With that DOCTYPE declaration, elements[1] is an lxml.etree._Entity object, not an Element. Removing the DOCTYPE line from the input XML avoids this, but then the "⋯" entity is lost regardless of the XMLParser options.
Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML? I mean, is there a way to:
- Always keep the Entity objects, even if there's no DOCTYPE in the XML file?
- Use an "entity_to_text" fallback function to replace all entities by text?
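Both points above can be sketched in a few lines. This is only an illustration under stated assumptions, not an lxml feature: the entity name `ctdot` and the inline `<!ENTITY>` declaration are stand-ins for the original file, and `entity_to_text` and `replace_entities` are hypothetical helpers. Parsing with `resolve_entities=False` keeps a declared entity as an `_Entity` node, and the helper then folds each such node into the surrounding text.

```python
import html.entities

from lxml import etree

# Hypothetical stand-in for the report's file: the entity name "ctdot" and
# the inline declaration are assumptions made for this sketch.
XML = b"""<!DOCTYPE article [<!ENTITY ctdot "&#x22EF;">]>
<article>Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup></article>"""


def entity_to_text(name):
    """Map an entity name to text, falling back to the raw reference."""
    return html.entities.html5.get(name + ";", "&" + name + ";")


def replace_entities(root, to_text=entity_to_text):
    """Replace every _Entity node under root with plain text, in place."""
    for ent in list(root.iter(etree.Entity)):
        parent = ent.getparent()
        text = to_text(ent.name) + (ent.tail or "")
        previous = ent.getprevious()
        if previous is not None:
            previous.tail = (previous.tail or "") + text
        else:
            parent.text = (parent.text or "") + text
        # lxml removes the tail together with the node; it was copied above.
        parent.remove(ent)


# resolve_entities=False keeps declared entities as _Entity nodes.
parser = etree.XMLParser(resolve_entities=False)
root = etree.fromstring(XML, parser)
kinds_before = [type(node).__name__ for node in root]

replace_entities(root)
tags_after = [node.tag for node in root]
```

After `replace_entities(root)`, only the two `sup` elements remain as children, and the replacement text ends up in the tail of the first one. Passing a different `to_text` callable allows a custom mapping for names that `html.entities.html5` does not know.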
EDIT: This description was changed to make it clearer that manually replacing the dangling _Entity instances with text isn't the main issue.
Versions:
Python : sys.version_
lxml.etree : (4, 3, 3, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)
With the "recover" option, you are asking the parser to ignore errors. Since you are not enabling DTD usage, an undeclared entity is an error, and asking the parser to recover from it makes it keep that entity instead of raising the error.
Basically, you are explicitly asking for trouble, and shouldn't be surprised if you get it.
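The difference between strict and recovering parsing can be checked with a small sketch. The input here is hypothetical (no DOCTYPE at all, and `ctdot` is an assumed entity name). What the recovering parser leaves in the tree for the undeclared entity may depend on the libxml2 version, so the sketch only records it rather than asserting a particular result.

```python
from lxml import etree

# Hypothetical input: no DOCTYPE, and "ctdot" is an assumed entity name.
XML = b"<article>Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup></article>"

# Default parser (recover=False): an undeclared entity is a syntax error.
try:
    etree.fromstring(XML)
    strict_error = None
except etree.XMLSyntaxError as exc:
    strict_error = exc

# Recovering parser: the error is swallowed and a tree is still returned.
lenient = etree.fromstring(XML, etree.XMLParser(recover=True))
children = [type(node).__name__ for node in lenient]
```

Inspecting `strict_error` shows the parser's complaint about the undefined entity, while `children` shows what the recovering parser kept for the same input.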