DOCTYPE declaration might make an entity appear in the middle of the children elements

Bug #1830661 reported by Danilo J. S. Bellini
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
lxml    | New    | Undecided  | Unassigned  |

Bug Description

The following XML file, article_example.xml:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article>
  Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>
</article>

Was loaded with:

from lxml import etree
et = etree.parse("article_example.xml", parser=etree.XMLParser(recover=True))
elements = list(et.getroot())

With that DOCTYPE declaration, elements[1] is an lxml.etree._Entity object, not an Element. If we remove the DOCTYPE line from the input XML, that no longer happens, but then the "&ctdot;" entity is lost entirely, no matter which XMLParser options are used.

Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML? I mean, is there a way to:

- Always keep the Entity objects, even if there's no DOCTYPE in the XML file?
- Use an "entity_to_text" fallback function to replace all entities by text?

EDIT: This description was changed to make it clearer that manually replacing the dangling _Entity instances with text isn't the main issue.

Versions:

Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 3, 3, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)

Revision history for this message
scoder (scoder) wrote :

With the "recover" option, you are asking the parser to ignore errors. Since you are not enabling the DTD usage, an undeclared entity is an error, and asking the parser to recover from it makes it keep that entity instead of raising that error.

Basically, you are explicitly asking for trouble, and shouldn't be surprised if you get it.

Changed in lxml:
status: New → Invalid
Revision history for this message
Danilo J. S. Bellini (danilobellini) wrote :

Is there a way to always keep all these entity objects, even if the input XML has some arbitrary DOCTYPE or no DOCTYPE at all? I can't control the XML files.

For now I'm applying a regex to remove the beginning of the file and replace it by an invalid external DOCTYPE before parsing it with lxml, so I can get these entity objects in a consistent way, to replace them by text afterwards.
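The workaround described above could look roughly like the sketch below. It is only an illustration under stated assumptions, not the code actually used: the names force_doctype, parse_forcing_doctype and PLACEHOLDER_SYSTEM_ID are hypothetical, the regular expressions are approximations (they can be fooled by comments or CDATA sections that contain markup-like text), and any internal subset in the original DOCTYPE is discarded along with its entity declarations.

```python
import re

# Approximate match for a DOCTYPE declaration, with or without an
# internal subset in [...]. Good enough for a sketch, not a full parser.
_DOCTYPE_RE = re.compile(rb"<!DOCTYPE\b[^\[>]*(?:\[.*?\]\s*)?>", re.DOTALL)
# Name of the first start tag, assumed to be the root element (ASCII names only).
_ROOT_RE = re.compile(rb"<([A-Za-z_][\w.:-]*)")
# XML declaration at the very beginning of the data.
_XMLDECL_RE = re.compile(rb"\A\s*<\?xml\b.*?\?>", re.DOTALL)

# Hypothetical system identifier that is not expected to resolve to a real DTD.
PLACEHOLDER_SYSTEM_ID = b"placeholder.invalid.dtd"


def force_doctype(data: bytes) -> bytes:
    """Drop the first DOCTYPE found (if any) and insert a placeholder one."""
    data = _DOCTYPE_RE.sub(b"", data, count=1)
    decl = _XMLDECL_RE.match(data)
    head_end = decl.end() if decl else 0
    root = _ROOT_RE.search(data, head_end)
    if root is None:
        # No element found: return the data without adding a DOCTYPE.
        return data
    doctype = b'<!DOCTYPE %s SYSTEM "%s">' % (root.group(1), PLACEHOLDER_SYSTEM_ID)
    separator = b"\n" if decl else b""
    return data[:head_end] + separator + doctype + b"\n" + data[head_end:].lstrip()


def parse_forcing_doctype(path):
    """Parse a file with lxml after imposing the placeholder DOCTYPE."""
    # Imported here so that force_doctype has no third-party dependency.
    from lxml import etree

    parser = etree.XMLParser(
        recover=True,            # tolerate broken input, as in the report
        resolve_entities=False,  # keep entity references as nodes
        load_dtd=False,          # never fetch the placeholder DTD
        no_network=True,
    )
    with open(path, "rb") as xml_file:
        return etree.fromstring(force_doctype(xml_file.read()), parser=parser)
```

Whether undeclared entities then appear as _Entity nodes depends on the libxml2 behaviour described in this report, so the result of parse_forcing_doctype should be checked against the lxml and libxml2 versions in use.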

Revision history for this message
scoder (scoder) wrote :

Look at the "load_dtd" and "resolve_entities" parser options.

Revision history for this message
Danilo J. S. Bellini (danilobellini) wrote :

I've already tried them, but I found no way they could help me. The help text for load_dtd says "referenced by the document", which is exactly the opposite of what I need: I want to impose everything from my code, no matter the XML input. The "resolve_entities" option doesn't let me control the entity-to-text conversion, and disabling it doesn't suffice: in the example above, I removed the DOCTYPE declaration and used etree.XMLParser(recover=True, resolve_entities=False) to open the file, but then I see no entity at all (not even its text).

Changed in lxml:
status: Invalid → New
Revision history for this message
Danilo J. S. Bellini (danilobellini) wrote :

Is there a way to force/impose a DOCTYPE **before** parsing (not from the XML, but as a parameter in the code, either for the XMLParser or for the etree.parse function), instead of using the one from the XML input (if any)? The main issue is the lack of control over the [acceptable/used] DOCTYPE, which is related to the lack of a customizable entity-to-text replacement (without updating the XML files themselves). I still might be missing something, though.

I see no problem if the answer is "No, there's no way to do that, and we won't update lxml to do it!", but that doesn't make this issue invalid.

Allowing arbitrary external DTDs to be loaded might be a security issue, so I think I shouldn't do that, and even that wouldn't solve anything. Many XML files I receive don't have a DOCTYPE, and sometimes the DOCTYPE is invalid.

The context where I need that is in a robust loader of arbitrary manually created [and perhaps broken] XML (where the input DOCTYPE shouldn't be trusted): https://github.com/scieloorg/clea/blob/v0.4.0/clea/core.py#L94

Perhaps I'm not being clear, but what fixes this issue is either:

- A way to always keep the Entity objects, even if there's no DOCTYPE in the XML file
- A way to give an arbitrary "entity_to_text" fallback function to replace all entities with text on parsing (I'm thinking of an entity-to-text rule, not an exhaustive table)

Actually, if the XML file declares some entities, I'd like to use them. What I need is a way to resolve every other entity left undefined by the DOCTYPE declaration.
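To make the requested "entity_to_text" fallback concrete, here is a minimal sketch that applies such a rule to the raw text before it reaches the parser. It is not an lxml feature: replace_named_entities, html5_or_bracketed and the declared parameter are hypothetical names, and using Python's html.entities.html5 table as the fallback is just one possible rule. The sketch works on plain text, so it also rewrites references that appear inside comments or CDATA sections.

```python
import re
from html.entities import html5
from xml.sax.saxutils import escape

# The five entities predefined by XML itself; these are left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# A named entity reference such as &ctdot; (numeric references do not match).
_ENTITY_RE = re.compile(r"&([A-Za-z_][\w.-]*);")


def html5_or_bracketed(name):
    """Hypothetical rule: the HTML5 character if known, else a visible marker."""
    return html5.get(name + ";", "[" + name + "]")


def replace_named_entities(text, entity_to_text=html5_or_bracketed, declared=()):
    """Replace named entity references that the document does not declare.

    `declared` lists entity names defined by the document's own DOCTYPE,
    which are kept as references so that the parser can resolve them.
    """
    skip = XML_PREDEFINED | set(declared)

    def substitute(match):
        name = match.group(1)
        if name in skip:
            return match.group(0)
        # Escape the replacement so that it cannot introduce markup.
        return escape(entity_to_text(name))

    return _ENTITY_RE.sub(substitute, text)


if __name__ == "__main__":
    print(replace_named_entities("Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>"))
```

Passing a different callable as entity_to_text changes the rule without touching the rest of the function, which matches the "rule, not an exhaustive table" idea above.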

description: updated