2019-08-14 20:39:08 |
Danilo J. S. Bellini |
description |
The following XML file, article_example.xml:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article>
Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>
</article>
was loaded with:
from lxml import etree
et = etree.parse("article_example.xml", parser=etree.XMLParser(recover=True))
elements = list(et.getroot())
With that DOCTYPE declaration, elements[1] is an lxml.etree._Entity object, not an Element. Therefore, elements[1].tag isn't a string: it's a Cython function that takes a string and returns another _Entity object, such as "&your_input_string;". That breaks code that expects every tag to be a string and expects iteration (over the Element object itself or with iterchildren) to yield only elements, not entities or text. On the other hand, this doesn't happen if we remove the DOCTYPE line from the input XML.
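As a minimal sketch of the situation described above, the snippet below parses the same document from an inline byte string (an assumption made so the example is self-contained, instead of reading article_example.xml) and shows two ways to iterate over child elements only. It assumes the entity reference in the file is &ctdot;. Whether the parser keeps the undeclared entity as an _Entity node may depend on the libxml2 version in use, so the sketch only relies on the filtering step.

```python
from lxml import etree

# Same document as in the report, inlined as bytes.
# Assumption: the entity reference in the original file is &ctdot;.
XML = b"""<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article>
Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>
</article>"""

root = etree.fromstring(XML, parser=etree.XMLParser(recover=True))

# Plain iteration may yield non-element nodes (entities, comments,
# processing instructions) whose .tag is a factory function, not a string.
all_children = list(root)

# Option 1: ask lxml to yield Element nodes only.
only_elements = list(root.iterchildren(tag=etree.Element))

# Option 2: keep only the children whose tag is a string.
string_tagged = [child for child in root if isinstance(child.tag, str)]

print([child.tag for child in only_elements])
```

Both options skip the entity node without touching the DOCTYPE, but they only hide the entity from iteration; they do not turn it into text.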
Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML?
Versions:
Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 3, 3, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33) |
The following XML file, article_example.xml:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v3.0 20080202//EN" "journalpublishing3.dtd">
<article>
Stuff <sup>1</sup> stuff &ctdot; stuff <sup>2</sup>
</article>
was loaded with:
from lxml import etree
et = etree.parse("article_example.xml", parser=etree.XMLParser(recover=True))
elements = list(et.getroot())
With that DOCTYPE declaration, elements[1] is an lxml.etree._Entity object, not an Element. On the other hand, that doesn't happen if we remove the DOCTYPE line from the input XML, but then the "&ctdot;" (⋯) entity is lost regardless of the XMLParser options.
Is there a way to force a DOCTYPE for parsing, or even to disable it, instead of loading it from the XML? I mean, is there a way to:
- Always keep the Entity objects, even if there's no DOCTYPE in the XML file?
- Use an "entity_to_text" fallback function to replace all entities by text?
EDIT: This description was edited to make it clearer that manually replacing the dangling _Entity instances with text isn't the main issue.
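For reference, the manual replacement mentioned above could look roughly like the sketch below. It is a post-processing workaround, not the parser-level option being asked about. The names replace_entities_with_text, entity_to_text and ENTITY_TEXT are hypothetical, and the mapping of "ctdot" to U+22EF is an assumption made for illustration. The tree is built by hand so the example doesn't depend on how the parser treats the DOCTYPE.

```python
from lxml import etree

# Hypothetical lookup table: entity name -> replacement text (assumption).
ENTITY_TEXT = {"ctdot": "\u22ef"}


def replace_entities_with_text(root, entity_to_text):
    """Replace every entity node below root with the text returned by
    entity_to_text(name), keeping the surrounding text intact."""
    for entity in list(root.iter(etree.Entity)):
        parent = entity.getparent()
        if parent is None:
            continue
        # The replacement text followed by whatever came after the entity.
        replacement = entity_to_text(entity.name) + (entity.tail or "")
        previous = entity.getprevious()
        if previous is not None:
            previous.tail = (previous.tail or "") + replacement
        else:
            parent.text = (parent.text or "") + replacement
        # lxml removes the tail together with the node; it was copied above.
        parent.remove(entity)


# Hand-built equivalent of the <article> from the report.
root = etree.Element("article")
root.text = "Stuff "
sup1 = etree.SubElement(root, "sup")
sup1.text = "1"
sup1.tail = " stuff "
entity = etree.Entity("ctdot")
root.append(entity)
entity.tail = " stuff "
sup2 = etree.SubElement(root, "sup")
sup2.text = "2"

replace_entities_with_text(root, lambda name: ENTITY_TEXT.get(name, ""))
result = etree.tostring(root, encoding="unicode")
print(result)
```

Unknown entity names are replaced by an empty string in this sketch; a real fallback might prefer to raise an error or keep the "&name;" form instead.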
Versions:
Python : sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
lxml.etree : (4, 3, 3, 0)
libxml used : (2, 9, 9)
libxml compiled : (2, 9, 9)
libxslt used : (1, 1, 33)
libxslt compiled : (1, 1, 33) |
|