lxml

DOCTYPE declaration might make an entity appear in the middle of the children elements

Bug #1830661 reported by Danilo J. S. Bellini on 2019-05-28

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
lxml	New	Undecided	Unassigned

Bug Description

The following XML file, article_example.xml:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article>
Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>
</article>

Was loaded with:

from lxml import etree
et = etree.parse("article_example.xml", parser=etree.XMLParser(recover=True))
elements = list(et.getroot())

With that DOCTYPE declaration, elements[1] is an lxml.etree._Entity object, not an Element. That doesn't happen if we remove the DOCTYPE line from the input XML, but then the "&ctdot;" entity is lost, whatever XMLParser options are used.

Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML? I mean, is there a way to:

- Always keep the Entity objects, even if there's no DOCTYPE in the XML file?
- Use an "entity_to_text" fallback function to replace all entities by text?

EDIT: This description was changed to be more clear that manually replacing the dangling _Entity instances by text isn't the main issue.

Versions:

Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 3, 3, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)

scoder (scoder) wrote on 2019-08-11:

#1

With the "recover" option, you are asking the parser to ignore errors. Since you are not enabling the DTD usage, an undeclared entity is an error, and asking the parser to recover from it makes it keep that entity instead of raising that error.

Basically, you are explicitly asking for trouble, and shouldn't be surprised if you get it.
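
A minimal sketch of the strict case mentioned above, assuming an input with no DOCTYPE at all and a parser without "recover". The sample string and the helper name parse_strict are illustrative only, not taken from the report:

```python
from lxml import etree

# Illustrative input: no DOCTYPE, and "ctdot" is not declared anywhere.
XML_WITHOUT_DOCTYPE = b"<article>stuff &ctdot; stuff</article>"


def parse_strict(data):
    """Return (root, None) on success or (None, message) on a syntax error."""
    try:
        return etree.fromstring(data, etree.XMLParser()), None
    except etree.XMLSyntaxError as exc:
        return None, str(exc)


root, error = parse_strict(XML_WITHOUT_DOCTYPE)
print(error)  # the parser's complaint about the undeclared entity
```

Here the undeclared entity is reported as an error rather than kept in the tree.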

Changed in lxml:
status:	New → Invalid

Danilo J. S. Bellini (danilobellini) wrote on 2019-08-11:

#2

Is there a way to always keep all these entity objects, even if the input XML has some arbitrary DOCTYPE or no DOCTYPE at all? I can't control the XML files.

For now I'm applying a regex to remove the beginning of the file and replace it with an invalid external DOCTYPE before parsing it with lxml, so I can get these entity objects in a consistent way and replace them with text afterwards.
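
The workaround described here could look roughly like the sketch below. It is not the code from the linked project: the placeholder DOCTYPE, the regex, and the function names force_doctype and parse_keeping_entities are all assumptions made for illustration. It keeps the XML declaration (so the encoding survives), swaps a simple DOCTYPE for a placeholder, and parses with resolve_entities=False so that undeclared references stay in the tree as entity nodes.

```python
import re
from lxml import etree

# Hypothetical placeholder DOCTYPE; the name and system ID are illustrative.
# The DTD is never fetched because load_dtd is False below.
FORCED_DOCTYPE = b'<!DOCTYPE article SYSTEM "placeholder.dtd">'

# Optional XML declaration, then an optional DOCTYPE without internal subset.
_PROLOG = re.compile(rb'\A(?P<decl><\?xml[^>]*\?>)?\s*(?:<!DOCTYPE[^>\[]*>)?')


def force_doctype(xml_bytes):
    """Swap the document's DOCTYPE (if any) for FORCED_DOCTYPE."""
    match = _PROLOG.match(xml_bytes)
    rest = xml_bytes[match.end():]
    if rest.lstrip().startswith(b"<!DOCTYPE"):
        # A DOCTYPE with an internal subset was not matched; leave it alone.
        return xml_bytes
    decl = match.group("decl") or b""
    return decl + b"\n" + FORCED_DOCTYPE + b"\n" + rest


def parse_keeping_entities(xml_bytes):
    """Parse with the forced DOCTYPE, keeping entity references as nodes."""
    parser = etree.XMLParser(recover=True, load_dtd=False,
                             resolve_entities=False, no_network=True)
    return etree.fromstring(force_doctype(xml_bytes), parser)


root = parse_keeping_entities(
    b"<article>Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup></article>"
)
print([child.tag for child in root])
```

The regex only covers a DOCTYPE without an internal subset; comments or processing instructions before the DOCTYPE are not handled in this sketch.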

scoder (scoder) wrote on 2019-08-11:

#3

Look at the "load_dtd" and "resolve_entities" parser options.
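
As a rough illustration of what these two options do, the sketch below parses a document whose entity is declared in an internal DTD subset, once with resolve_entities=True and once with resolve_entities=False. The sample document and the helper name parse_with are assumptions for illustration; they are not from the report, which deals with undeclared entities.

```python
from lxml import etree

# Illustrative document: the entity is declared in the internal subset.
XML = (b'<!DOCTYPE article [<!ENTITY ctdot "[ctdot]">]>'
       b'<article>a &ctdot; b</article>')


def parse_with(resolve):
    """Parse XML without loading any external DTD or touching the network."""
    parser = etree.XMLParser(load_dtd=False, resolve_entities=resolve,
                             no_network=True)
    return etree.fromstring(XML, parser)


resolved = parse_with(True)   # entity replaced by its declared text
kept = parse_with(False)      # entity kept as a child node of <article>

print(etree.tostring(resolved))
print(etree.tostring(kept))
print([child.name for child in kept if child.tag is etree.Entity])
```

With resolve_entities=False the reference appears as a child whose tag is the etree.Entity factory, which is how such nodes can be told apart from ordinary elements.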

Danilo J. S. Bellini (danilobellini) wrote on 2019-08-11:

#4

I've already tried them, but I found no way they could help me. The help for load_dtd says "referenced by the document", which is exactly the opposite of what I need: I want to impose everything from my code, no matter the XML input. The "resolve_entities" option doesn't give me a way to control the entity-to-text conversion, and disabling it doesn't suffice: in the example above I removed the DOCTYPE declaration and used etree.XMLParser(recover=True, resolve_entities=False) to open it, but then I see no entity at all (not even its text).

Danilo J. S. Bellini (danilobellini) on 2019-08-14

Changed in lxml:
status:	Invalid → New

Danilo J. S. Bellini (danilobellini) wrote on 2019-08-14:

#5

Is there a way to force/impose a DOCTYPE **before** parsing (not from the XML, but as a parameter in the code, either for the XMLParser or for the etree.parse function), instead of using the one from the XML input (if any)? The main issue is the lack of control over the [acceptable/used] DOCTYPE, and that has to do with the lack of a customizable ENTITY-to-text replacement (without updating the XML files themselves). I still might be missing something, though.

I see no problem if the answer is "No, there's no way to do that, and we won't update lxml to do it!", but that doesn't make this issue invalid.

Loading arbitrary external DTDs might be a security issue, so I think I shouldn't do that, and even that wouldn't solve anything: many XML files I receive don't have a DOCTYPE, and sometimes the DOCTYPE is invalid.

The context where I need that is in a robust loader of arbitrary manually created [and perhaps broken] XML (where the input DOCTYPE shouldn't be trusted): https://github.com/scieloorg/clea/blob/v0.4.0/clea/core.py#L94

Perhaps I'm not being clear, but what fixes this issue is either:

- A way to always keep the Entity objects, even if there's no DOCTYPE in the XML file
- A way to give an arbitrary "entity_to_text" fallback function to replace all entities with text on parsing (I'm thinking of an entity-to-text rule, not an exhaustive table)

Actually, if the XML file declares some entities, I'd like to use them. What I need is a way to resolve every other entity left undefined by the DOCTYPE declaration.
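
lxml has no such hook as far as this thread establishes, but an "entity_to_text" fallback can be approximated after parsing by walking the tree and replacing whatever entity nodes are left. The sketch below is one possible shape for that; replace_entities and html5_fallback are hypothetical names, and the HTML5 name table from the standard library stands in for whatever rule the caller prefers.

```python
from html.entities import html5
from lxml import etree


def html5_fallback(name):
    """Example rule: HTML5 named characters, else the literal reference."""
    return html5.get(name + ";", "&" + name + ";")


def replace_entities(root, entity_to_text):
    """Replace every entity node under root by entity_to_text(name), in place."""
    for ent in list(root.iter(etree.Entity)):
        # The replacement text absorbs the tail, which leaves with the node.
        text = entity_to_text(ent.name) + (ent.tail or "")
        previous = ent.getprevious()
        parent = ent.getparent()
        if previous is not None:
            previous.tail = (previous.tail or "") + text
        else:
            parent.text = (parent.text or "") + text
        parent.remove(ent)
    return root


# Usage on a hand-built tree, so the result does not depend on parser options.
article = etree.Element("article")
article.text = "Stuff "
sup = etree.SubElement(article, "sup")
sup.text = "1"
sup.tail = " stuff "
entity = etree.Entity("ctdot")
entity.tail = " stuff"
article.append(entity)

replace_entities(article, html5_fallback)
print(etree.tostring(article, encoding="unicode"))
```

Entities that the parser already resolved from the document's own declarations never reach this function, so it only affects the references that were left undefined.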

Danilo J. S. Bellini (danilobellini) on 2019-08-14

description:	updated
