Comment 0 for bug 1830661

Danilo J. S. Bellini (danilobellini) wrote :

The following XML file, article_example.xml:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article>
  Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>
</article>

Was loaded with:

from lxml import etree
et = etree.parse("article_example.xml", parser=etree.XMLParser(recover=True))
elements = list(et.getroot())

With that DOCTYPE declaration, elements[1] is an lxml.etree._Entity object, not an Element. As a result, elements[1].tag isn't a string: it's a Cython function that takes a string and returns another _Entity object such as "&your_input_string;". This breaks code that expects tag to always be a string, and that expects iteration over an Element (directly or with iterchildren) to yield only elements, not entities or text. The problem doesn't occur if we remove the DOCTYPE line from the input XML.
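As a possible stopgap on the calling side, here is a minimal sketch that filters the children by checking whether tag is a string. It doesn't change how the DOCTYPE is handled; it only skips the non-element nodes. The helper name element_children is made up for this example, and the document is inlined as bytes instead of being read from article_example.xml.

```python
from lxml import etree

# Same document as in the report, inlined as bytes (lxml wants bytes
# when the input carries an XML encoding declaration).
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article>
  Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>
</article>
"""


def element_children(parent):
    """Return only the real element children of `parent`.

    Hypothetical helper, not part of lxml. Entity references, comments
    and processing instructions expose a factory function as `.tag`,
    so a plain string check is enough to filter them out.
    """
    return [child for child in parent if isinstance(child.tag, str)]


root = etree.fromstring(XML, parser=etree.XMLParser(recover=True))
children = element_children(root)
print([child.tag for child in children])
```

This keeps the entity node in the tree, so code that needs the text around it still has to deal with the tail of each child.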

Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML?

Versions:

Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 3, 3, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33)