lxml not including entities defined in another file

Bug #1911928 reported by Paul Higgs
This bug affects 1 person
Affects  Status  Importance  Assigned to  Milestone
lxml     New     Undecided   Unassigned

Bug Description

I have created a file of entity definitions (entities.dtd) that is included in many schemas:
  <!ENTITY lowalpha "a-z">
  <!ENTITY hialpha "A-Z">
  <!ENTITY alpha "&lowalpha;&hialpha;">
  <!ENTITY digit "0-9">

  <!ENTITY uword "([&digit;]{1,4}|[1-5][&digit;]{4}|6[0-4][&digit;]{3}|65[0-4][&digit;]{2}|655[0-2][&digit;]|6553[0-5])">
  <!ENTITY Port ":&uword;">

One of the schemas that includes it:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE schema SYSTEM "entities.dtd">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="urn:paul:types1" targetNamespace="urn:paul:types1" elementFormDefault="qualified" attributeFormDefault="unqualified">
 <xs:simpleType name="PortType">
  <xs:restriction base="xs:string">
   <xs:pattern value="&Port;"/>
  </xs:restriction>
 </xs:simpleType>
</xs:schema>

Calling etree.parse() on this schema fails with the message:
lxml.etree.XMLSyntaxError: Entity 'Scheme' not defined, line 10, column 36

-version info------------------------
Python : sys.version_info(major=3, minor=9, micro=1, releaselevel='final', serial=0)
lxml.etree : (4, 6, 2, 0)
libxml used : (2, 9, 5)
libxml compiled : (2, 9, 5)
libxslt used : (1, 1, 30)
libxslt compiled : (1, 1, 30)
-------------------------------------
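
One possible explanation, offered as an assumption rather than a confirmed diagnosis: lxml's XMLParser leaves load_dtd at its default of False, so the external subset named in the DOCTYPE is never read and the entities declared in entities.dtd stay unknown to the parser. The sketch below recreates entities.dtd and the schema in a temporary directory and parses the schema by file path, so that the relative SYSTEM identifier can be resolved. The helper name read_pattern and the file name types.xsd are hypothetical.

```python
import os
import tempfile

from lxml import etree

ENTITIES_DTD = """<!ENTITY digit "0-9">
<!ENTITY uword "([&digit;]{1,4}|[1-5][&digit;]{4}|6[0-4][&digit;]{3}|65[0-4][&digit;]{2}|655[0-2][&digit;]|6553[0-5])">
<!ENTITY Port ":&uword;">
"""

TYPES_XSD = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE schema SYSTEM "entities.dtd">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="urn:paul:types1" targetNamespace="urn:paul:types1" elementFormDefault="qualified" attributeFormDefault="unqualified">
 <xs:simpleType name="PortType">
  <xs:restriction base="xs:string">
   <xs:pattern value="&Port;"/>
  </xs:restriction>
 </xs:simpleType>
</xs:schema>
"""

XS_PATTERN = "{http://www.w3.org/2001/XMLSchema}pattern"


def read_pattern():
    """Parse the schema with DTD loading enabled and return the pattern value."""
    with tempfile.TemporaryDirectory() as tmp:
        dtd_path = os.path.join(tmp, "entities.dtd")
        xsd_path = os.path.join(tmp, "types.xsd")
        with open(dtd_path, "w", encoding="utf-8") as f:
            f.write(ENTITIES_DTD)
        with open(xsd_path, "w", encoding="utf-8") as f:
            f.write(TYPES_XSD)

        # load_dtd=True asks the parser to read the external subset
        # (entities.dtd); resolve_entities=True substitutes the references.
        parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
        tree = etree.parse(xsd_path, parser)
        return tree.find(".//" + XS_PATTERN).get("value")


pattern = read_pattern()
print(pattern)
```

This only covers documents parsed directly with that parser. Schemas pulled in through xs:import are loaded by libxml2 itself and may not honor these parser options, which would be consistent with the import scenario described in the later comment.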

Paul Higgs (paul-higgs) wrote :

Test script

Paul Higgs (paul-higgs) wrote :

In Oxygen and XMLSpy these schemas validate OK, and XML instance documents are also checked against the pattern correctly. We want to use lxml for CI on GitHub!

Paul Higgs (paul-higgs) wrote :

I did some more analysis. In some situations this may work; however, the following scenario always fails:

1-types.xsd includes the entities.dtd
1-main.xsd defines elements that use types defined in 1-types.xsd
1.xml instantiates the element from 1-main.xsd

entities.dtd
<!ENTITY digit "0-9">
<!ENTITY uword "([&digit;]{1,4}|[1-5][&digit;]{4}|6[0-4][&digit;]{3}|65[0-4][&digit;]{2}|655[0-2][&digit;]|6553[0-5])">
<!ENTITY Port ":&uword;">

1-types.xsd
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE schema SYSTEM "entities.dtd">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="urn:paul:types1" targetNamespace="urn:paul:types1" elementFormDefault="qualified" attributeFormDefault="unqualified">
 <xs:simpleType name="PortType">
  <xs:restriction base="xs:string">
   <xs:pattern value="&Port;"/>
  </xs:restriction>
 </xs:simpleType>
</xs:schema>

1-main.xsd
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE schema SYSTEM "entities.dtd">
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="urn:paul:main1" xmlns:t="urn:paul:types1" targetNamespace="urn:paul:main1" elementFormDefault="qualified" attributeFormDefault="unqualified">
 <xs:import namespace="urn:paul:types1" schemaLocation="1-types.xsd"/>
 <xs:element name="Port" type="t:PortType"/>
</xs:schema>

1.xml
<?xml version="1.0" encoding="UTF-8"?>
<Port xmlns="urn:paul:main1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="urn:paul:main1 1-main.xsd">:101001</Port>

test1.py
import lxml
from lxml import etree

myParser = lxml.etree.XMLParser()

with open('1-types.xsd', 'r') as baseschema_file:
    baseschema=etree.parse(baseschema_file, parser=myParser)
    baseschema.xinclude()
    my_baseschema=etree.XMLSchema(baseschema)

with open('1-main.xsd', 'r') as schema_file:
    mainschema=etree.parse(schema_file, parser=myParser)
    my_schema=etree.XMLSchema(mainschema)

with open('1.xml') as file:
    my_xml=etree.parse(file)
    my_schema.assertValid(my_xml)

==> error
Traceback (most recent call last):
  File "G:\lxml-test\test1.py", line 18, in <module>
    my_schema.assertValid(my_xml)
  File "src\lxml\etree.pyx", line 3623, in lxml.etree._Validator.assertValid
lxml.etree.DocumentInvalid: Element '{urn:paul:main1}Port': [facet 'pattern'] The value ':101001' is not accepted by the pattern ''., line 2
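
The empty pattern '' in the message suggests that &Port; was never expanded in the imported schema. As a cross-check that is independent of lxml, the sketch below expands the entities from entities.dtd by hand using only the standard library and tests a few values against the result. The helper names load_entities, expand and is_valid_port are hypothetical, and Python's re.fullmatch stands in for the implicitly anchored XSD pattern facet, which is an approximation that holds for this simple expression.

```python
import re

ENTITIES_DTD = """
<!ENTITY digit "0-9">
<!ENTITY uword "([&digit;]{1,4}|[1-5][&digit;]{4}|6[0-4][&digit;]{3}|65[0-4][&digit;]{2}|655[0-2][&digit;]|6553[0-5])">
<!ENTITY Port ":&uword;">
"""


def load_entities(dtd_text):
    """Collect simple internal entity declarations as a name -> value dict."""
    return dict(re.findall(r'<!ENTITY\s+(\w+)\s+"([^"]*)"\s*>', dtd_text))


def expand(text, entities):
    """Recursively replace &name; references with their declared values."""
    def substitute(match):
        return expand(entities[match.group(1)], entities)
    return re.sub(r"&(\w+);", substitute, text)


entities = load_entities(ENTITIES_DTD)
port_pattern = expand("&Port;", entities)


def is_valid_port(value):
    """Match the whole value, as an XSD pattern facet would."""
    return re.fullmatch(port_pattern, value) is not None


print(port_pattern)
for candidate in (":8080", ":65535", ":65536", ":101001"):
    print(candidate, is_valid_port(candidate))
```

Under the expanded pattern the sample value ':101001' is rejected as well, since 101001 exceeds 65535. The symptom to look at is therefore the empty pattern in the error message, not the rejection itself; a value such as ':8080' would be a better positive test case.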
