lxml

Activity log for bug #1830661

Date	Who	What changed	Old value	New value	Message
2019-05-28 00:40:41	Danilo J. S. Bellini	bug			added bug
2019-08-11 07:49:39	scoder	lxml: status	New	Invalid
2019-08-14 19:51:15	Danilo J. S. Bellini	lxml: status	Invalid	New
2019-08-14 20:39:08	Danilo J. S. Bellini	description

Old value:

The following XML file, article_example.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
    <article> Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup> </article>

Was loaded with:

    from lxml import etree
    et = etree.parse("article_example.xml", parser=etree.XMLParser(recover=True))
    elements = list(et.getroot())

With that doctype declaration, elements[1] is an lxml.etree._Entity object, not an Element. Therefore, elements[1].tag isn't a string (it's a cython function that receives a string and returns another _Entity object like "&your_input_string;"). That's breaking some code that expects that the tag should always be strings and that the iteration (with the Element object or with iterchildren) is just through elements, not entities/text.

On the other hand, that doesn't happen if we remove the DOCTYPE line from the input XML. Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML?

Versions:

    Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
    lxml.etree : (4, 3, 3, 0)
    libxml used : (2, 9, 9)
    libxml compiled : (2, 9, 9)
    libxslt used : (1, 1, 33)
    libxslt compiled : (1, 1, 33)

New value:

The following XML file, article_example.xml:

    <?xml version="1.0" encoding="utf-8"?>
    <!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
    <article> Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup> </article>

Was loaded with:

    from lxml import etree
    et = etree.parse("article_example.xml", parser=etree.XMLParser(recover=True))
    elements = list(et.getroot())

With that doctype declaration, elements[1] is an lxml.etree._Entity object, not an Element. On the other hand, that doesn't happen if we remove the DOCTYPE line from the input XML, but we lose the "&ctdot;" entity no matter the XMLParser options.

Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML? I mean, is there a way to:

- Always keep the Entity objects, even if there's no DOCTYPE in the XML file?
- Use an "entity_to_text" fallback function to replace all entities by text?

EDIT: This description was changed to be more clear that manually replacing the dangling _Entity instances by text isn't the main issue.

Versions:

    Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
    lxml.etree : (4, 3, 3, 0)
    libxml used : (2, 9, 9)
    libxml compiled : (2, 9, 9)
    libxslt used : (1, 1, 33)
    libxslt compiled : (1, 1, 33)
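For readers who hit the same symptom, here is a minimal sketch of the two manual workarounds the description alludes to: iterating over real elements only, and replacing dangling entity nodes with text. It is not a confirmed answer from the lxml maintainers. The tree is built by hand to mimic the parse result described above, so it does not depend on the DTD or on parser options. The names `ENTITY_TEXT`, `entity_to_text`, and `replace_entities` are hypothetical helpers and not part of lxml, and the mapping of `ctdot` to U+22EF is only illustrative.

```python
from lxml import etree

# Hand-built stand-in for the parsed article: <sup>, an entity node, <sup>.
article = etree.Element("article")
article.text = " Stuff "
sup1 = etree.SubElement(article, "sup")
sup1.text = "1"
sup1.tail = " stuff "
entity = etree.Entity("ctdot")  # serializes as "&ctdot;"
entity.tail = " stuff "
article.append(entity)
sup2 = etree.SubElement(article, "sup")
sup2.text = "2"

# Workaround 1: ask for Element nodes only, so every .tag is a string.
element_tags = [child.tag for child in article.iterchildren(etree.Element)]
print(element_tags)

# Illustrative mapping only (assumption); a real table would come from the DTD.
ENTITY_TEXT = {"ctdot": "\u22ef"}


def entity_to_text(name):
    """Hypothetical fallback: known entities map to text, others stay literal."""
    return ENTITY_TEXT.get(name, "&%s;" % name)


def replace_entities(root, to_text):
    """Replace each entity node under root by text, preserving its tail."""
    for node in list(root.iter(etree.Entity)):
        replacement = to_text(node.name) + (node.tail or "")
        parent = node.getparent()
        previous = node.getprevious()
        if previous is not None:
            # Text after a sibling lives in that sibling's tail.
            previous.tail = (previous.tail or "") + replacement
        else:
            # No preceding sibling: the text belongs to the parent.
            parent.text = (parent.text or "") + replacement
        # remove() takes the node's tail with it, which was copied above.
        parent.remove(node)


replace_entities(article, entity_to_text)
print(etree.tostring(article, encoding="unicode"))
```

This only addresses the manual replacement, which the edited description says is not the main issue. Whether a parser option can force or ignore the DOCTYPE is left open here.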