dtd resolver resolves from parent directory

Bug #1905558 reported by Cardinal Kracker
This bug affects 1 person
Affects: lxml
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Python : sys.version_info(major=3, minor=8, micro=5, releaselevel='final', serial=0)
lxml.etree : (4, 5, 0, 0)
libxml used : (2, 9, 10)
libxml compiled : (2, 9, 10)
libxslt used : (1, 1, 34)
libxslt compiled : (1, 1, 34)

Hi there,

I have a very weird case where the DTD is looked up in the wrong place:

when I use a parameter entity reference inside the internal DTD subset,
the DTD itself is searched for in the parent(!) directory.

It works if there is no internal subset.

It works if the external entities are declared
directly inside the internal subset.

--------------- TEST PROGRAM
#!/usr/bin/env python3
import sys
from lxml import etree

# Read the document as bytes; note that no base URL is passed to the parser.
with open(sys.argv[1], "rb") as f:
    doc = f.read()

parser = etree.XMLParser(dtd_validation=True)
tree = etree.fromstring(doc, parser)
res = etree.tostring(tree, encoding="unicode")

print(res)

--------------- DOCUMENT (rama.xml)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE buch PUBLIC "-//Testing//DTD Buch//DE" "buch.dtd" [
<!ENTITY % parts SYSTEM "parts.ent" >
%parts;
]>
<buch>
<titel>Rendezvous mit Rama</titel>
&kap1;
<kapitel nr="review">
<absatz>Sch&ouml;nes Buch.</absatz>
</kapitel>
</buch>

--------------- buch.dtd
<!ELEMENT buch (titel?,(kapitel)*) >

<!ELEMENT kapitel (absatz)* >
<!ATTLIST kapitel nr CDATA #IMPLIED >

<!ENTITY % plaintext "(#PCDATA)*" >

<!ELEMENT titel %plaintext; >
<!ELEMENT absatz %plaintext; >

<!ENTITY auml "ä">
<!ENTITY ouml "ö">
<!ENTITY uuml "ü">

-------------- parts.ent
<!ENTITY kap1 SYSTEM "kapitel1.xml">

-------------- kapitel1.xml
<kapitel nr="1">
<absatz>exitement!</absatz>
</kapitel>

--------------- working without parameter entity reference (rama2.xml)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE buch PUBLIC "-//Testing//DTD Buch//DE" "buch.dtd" [
<!ENTITY kap1 SYSTEM "kapitel1.xml">
]>
<buch>
<titel>Rendezvous mit Rama</titel>
&kap1;
<kapitel nr="review">
<absatz>Sch&ouml;nes Buch.</absatz>
</kapitel>
</buch>

-------------- working with different directory (rama3.xml)
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE buch PUBLIC "-//Testing//DTD Buch//DE" "lxmlbug/buch.dtd" [
<!ENTITY % parts SYSTEM "parts.ent" >
%parts;
]>
<buch>
<titel>Rendezvous mit Rama</titel>
&kap1;
<kapitel nr="review">
<absatz>Sch&ouml;nes Buch.</absatz>
</kapitel>
</buch>

Cardinal Kracker (launchpap-user) wrote :

After some more digging I found out that the DTD resolution
mechanism prefixes the system ID with the path of the parent directory,
whereas parameter and general entities do not get that treatment.

from lxml import etree

class DTDResolver(etree.Resolver):
    def resolve(self, system_id, public_id, context):
        # Log every identifier the parser asks for, then return the base
        # class result so that the default resolution still applies.
        print(f"*** SYSTEM {system_id} PUBLIC {public_id}")
        return super().resolve(system_id, public_id, context)

with open("rama.xml", "rb") as f:
    doc = f.read()
parser = etree.XMLParser(dtd_validation=True, load_dtd=True)
parser.resolvers.add(DTDResolver())
tree = etree.fromstring(doc, parser)

/home/em/Workbench/beautifulsoup> ./dtdbug.py
*** SYSTEM parts.ent PUBLIC None
*** SYSTEM /data/home/em/Workbench/buch.dtd PUBLIC -//Testing//DTD Buch//DE
*** SYSTEM kapitel1.xml PUBLIC None
Traceback (most recent call last):
  File "./dtdbug.py", line 16, in <module>
    tree = etree.fromstring( doc, parser )
  File "src/lxml/etree.pyx", line 3235, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1876, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1764, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1127, in lxml.etree._BaseParser._parseDoc
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "<string>", line 5
lxml.etree.XMLSyntaxError: failed to load external entity "/data/home/em/Workbench/buch.dtd", line 5, column 3
/home/em/Workbench/beautifulsoup>
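
The path in the output above would be consistent with the relative system ID "buch.dtd" being resolved against the working directory used as a base URI without a trailing slash. That is only a guess on my part, not something confirmed in this report. Under standard relative-reference resolution (RFC 3986), the last path segment of the base is dropped, which lands in the parent directory. A minimal stdlib sketch of that effect, where the base path is taken from the output above and the helper name `resolve_system_id` is made up for illustration:

```python
from urllib.parse import urljoin

# Assumed base: the working directory from the output above, written
# without a trailing slash (a guess at what the parser might use).
BASE_WITHOUT_SLASH = "file:///data/home/em/Workbench/beautifulsoup"
BASE_WITH_SLASH = BASE_WITHOUT_SLASH + "/"


def resolve_system_id(base: str, system_id: str) -> str:
    """Resolve a relative system ID against a base URI (RFC 3986 rules)."""
    return urljoin(base, system_id)


# Without the trailing slash, "beautifulsoup" is treated like a file name
# and replaced, so the result points into the parent directory.
print(resolve_system_id(BASE_WITHOUT_SLASH, "buch.dtd"))
# -> file:///data/home/em/Workbench/buch.dtd

# With the trailing slash, the result stays inside the directory.
print(resolve_system_id(BASE_WITH_SLASH, "buch.dtd"))
# -> file:///data/home/em/Workbench/beautifulsoup/buch.dtd
```

If that is what happens, it would also explain why entities resolved without any base (reported as plain "parts.ent" and "kapitel1.xml") are unaffected.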

However, at least I can work around that using an explicit catalog.xml:

<?xml version="1.0"?>
<!DOCTYPE catalog PUBLIC "-//OASIS//DTD XML Catalogs V1.0//EN"
  "file:///usr/share/xml/schema/xml-core/catalog.dtd">
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <public publicId="-//Testing//DTD Buch//DE" uri="buch.dtd"/>
  <system systemId="parts.ent" uri="parts.ent"/>
  <system systemId="kapitel1.xml" uri="kapitel1.xml"/>
  <system systemId="kapitel2.xml" uri="kapitel2.xml"/>
</catalog>

> XML_CATALOG_FILES=catalog.xml ./dtdbug.py
*** SYSTEM parts.ent PUBLIC None
*** SYSTEM /data/home/em/Workbench/buch.dtd PUBLIC -//Testing//DTD Buch//DE
*** SYSTEM kapitel1.xml PUBLIC None
>

It still gets the wrong system ID, but no longer throws exceptions.
