Comment 5 for bug 1830661

Danilo J. S. Bellini (danilobellini) wrote :

Is there a way to force/impose a DOCTYPE **before** parsing (not from the XML, but as a parameter in the code, either for the XMLParser or for the etree.parse function), instead of using the one from the XML input (if any)? The main issue is the lack of control over the [acceptable/used] DOCTYPE, and that has something to do with the lack of a way to customize ENTITY-to-text replacement (without updating the XML files themselves). I still might be missing something, though.

I see no problem if the answer is "No, there's no way to do that, and we won't update lxml to do it!", but that doesn't make this issue invalid.

Allowing arbitrary external DTDs to be loaded might be a security issue, so I think I shouldn't do that, but even that wouldn't solve the problem: many XML files I receive don't have a DOCTYPE, and sometimes the DOCTYPE is invalid.

The context where I need that is in a robust loader of arbitrary manually created [and perhaps broken] XML (where the input DOCTYPE shouldn't be trusted): https://github.com/scieloorg/clea/blob/v0.4.0/clea/core.py#L94

Perhaps I'm not being clear, but either of these would fix this issue:

- A way to always keep the Entity objects, even if there's no DOCTYPE in the XML file
- A way to give an arbitrary "entity_to_text" fallback function to replace all entities with text on parsing (I'm thinking of an entity-to-text rule, not an exhaustive table)

Actually, if the XML file declares some entities, I'd like to use them. What I need is a way to resolve every other entity left undefined by the DOCTYPE declaration.
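
To make the second option concrete, here is a rough sketch of the behaviour I mean, done as text preprocessing outside the parser rather than as an lxml feature. It uses only the standard library (`re`, `xml.sax.saxutils.escape`, `xml.etree.ElementTree`); the names `entity_to_text` and `resolve_unknown_entities` are hypothetical, and the bracketed replacement is just an example rule. It assumes entities are declared only in the internal subset of the document's own DOCTYPE, and the regex is naive: it also rewrites references inside comments, CDATA sections and entity declaration values.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The five entities predefined by XML itself; the parser resolves these.
PREDEFINED = {"lt", "gt", "amp", "apos", "quot"}

# Named entity references only. Numeric references such as &#65; do not
# match, because the name has to start with a letter, "_" or ":".
ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.\-:]*);")

# General entity names declared in the document's internal DTD subset.
ENTITY_DECL = re.compile(r"<!ENTITY\s+([A-Za-z_:][\w.\-:]*)\s")


def entity_to_text(name):
    """Hypothetical fallback rule: show the entity name in brackets."""
    return "[" + name + "]"


def resolve_unknown_entities(xml_text, fallback=entity_to_text):
    """Replace every undeclared named entity reference with fallback text.

    Entities declared by the document (or predefined by XML) are left
    untouched, so the parser still resolves them as usual.
    """
    declared = set(ENTITY_DECL.findall(xml_text)) | PREDEFINED

    def replace(match):
        name = match.group(1)
        if name in declared:
            return match.group(0)  # keep the reference for the parser
        # Escape the result so the fallback text cannot inject markup.
        return escape(fallback(name))

    return ENTITY_REF.sub(replace, xml_text)


if __name__ == "__main__":
    broken = "<p>caf&eacute;&nbsp;&amp; bar</p>"  # no DOCTYPE at all
    fixed = resolve_unknown_entities(broken)
    print(fixed)
    print(ET.fromstring(fixed).text)
```

The cleaned-up string is ordinary XML, so it should be possible to hand it to lxml's `etree.fromstring` as well. Still, a parser-level hook would be preferable, since it wouldn't need a second pass over the raw text.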