XMLSyntaxError when parsing an XML with a schema, where an attribute is declared

Bug #1642950 reported by Andrew Pashkin
This bug affects 2 people
Affects: lxml
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Here is a simple schema that declares an element `root` with a single boolean attribute named `foo`:

    <?xml version="1.0"?>

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        <xs:element name="root">
            <xs:complexType>
                <xs:attribute name="foo" type="xs:boolean" default="false"/>
            </xs:complexType>
        </xs:element>
    </xs:schema>

Here is an XML document that complies with the schema:

    <?xml version="1.0"?>

    <root
        foo="true"
    />

And here is the Python code that parses the XML and validates it against the schema:

    import lxml.etree

    # Parse the schema document and compile it into a validator.
    schema_parser = lxml.etree.XMLParser(load_dtd=True)
    schema_doc = lxml.etree.parse('test.xsd', parser=schema_parser)
    schema = lxml.etree.XMLSchema(schema_doc)

    # Parse the XML document, validating against the schema while parsing.
    parser = lxml.etree.XMLParser(
        load_dtd=True,
        dtd_validation=False,
        attribute_defaults=True,
        schema=schema,
    )
    settings = lxml.etree.parse('test.xml', parser=parser)

It produces a strange exception: `XMLSyntaxError: Element 'root', attribute 'foo': '' is not a valid value of the atomic type 'xs:boolean'.`

Notice the `'foo': ''` part: it means that lxml thinks the attribute is empty, for some reason.

P.S. Version info:

    Python : sys.version_info(major=3, minor=5, micro=2, releaselevel='final', serial=0)
    lxml.etree : (3, 6, 4, 0)
    libxml used : (2, 9, 4)
    libxml compiled : (2, 9, 4)
    libxslt used : (1, 1, 29)
    libxslt compiled : (1, 1, 29)
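
For anyone who wants to compare their own setup against the versions in this thread, here is a minimal sketch that collects the same fields. The helper name `version_report` is made up for illustration; the version constants are the ones `lxml.etree` exposes.

```python
import sys

from lxml import etree


def version_report():
    """Collect the interpreter, lxml, libxml2 and libxslt versions in use."""
    return {
        "Python": sys.version_info,
        "lxml.etree": etree.LXML_VERSION,
        "libxml used": etree.LIBXML_VERSION,
        "libxml compiled": etree.LIBXML_COMPILED_VERSION,
        "libxslt used": etree.LIBXSLT_VERSION,
        "libxslt compiled": etree.LIBXSLT_COMPILED_VERSION,
    }


# Print one "name : value" line per component.
for name, value in version_report().items():
    print("%s : %r" % (name, value))
```

The "used" value is the library loaded at runtime and the "compiled" value is the one lxml was built against; the two can differ.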

description: updated
Aaron Schumacher (ajschumacher) wrote :

This looks like a more general problem with attributes.

For example, changing `type="xs:boolean" default="false"` to `type="xs:short"` and `foo="true"` to `foo="1"` above gives `lxml.etree.XMLSyntaxError: Element 'root', attribute 'foo': '' is not a valid value of the atomic type 'xs:short'.`

The failures are on Python 2.7.11, libxslt 1.1.29, libxml2 2.9.4, lxml 3.7.3.

If I switch to libxslt 1.1.28, libxml2 2.9.2, lxml 3.6.4, then both examples work fine.
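
To try several attribute types without editing files by hand, something like the following sketch may help. It is an illustration, not the exact test from this comment: the helper name `parse_time_error` and the schema template are assumptions, the `default` is left out for every type, and everything is kept in memory.

```python
import io

from lxml import etree

# Same shape as the schema in the description, with the attribute type
# left as a placeholder (and no default value).
SCHEMA_TEMPLATE = """<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="root">
        <xs:complexType>
            <xs:attribute name="foo" type="{attr_type}"/>
        </xs:complexType>
    </xs:element>
</xs:schema>
"""


def parse_time_error(attr_type, value):
    """Parse <root foo="value"/> with schema validation enabled.

    Returns the XMLSyntaxError message, or None if parsing succeeds.
    """
    xsd = SCHEMA_TEMPLATE.format(attr_type=attr_type).encode("ascii")
    schema = etree.XMLSchema(etree.parse(io.BytesIO(xsd)))
    parser = etree.XMLParser(schema=schema)
    xml = '<root foo="{}"/>'.format(value).encode("ascii")
    try:
        etree.fromstring(xml, parser)
    except etree.XMLSyntaxError as exc:
        return str(exc)
    return None


for attr_type, value in [("xs:boolean", "true"), ("xs:short", "1")]:
    print(attr_type, value, parse_time_error(attr_type, value))
```

On an affected setup both calls would be expected to return the `'' is not a valid value` message; on a working one they should return `None`.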

summary:
- XMLSyntaxError when parsing an XML with a schema, where an attribute with boolean type is declared
+ XMLSyntaxError when parsing an XML with a schema, where an attribute is declared
Henry S Thompson (hsthst) wrote :

I can reproduce this with the following configuration (on Windows 10, Cygwin):

    Python : sys.version_info(major=2, minor=7, micro=13, releaselevel='final', serial=0)
    lxml.etree : (3, 7, 3, 0)
    libxml used : (2, 9, 4)
    libxml compiled : (2, 9, 4)
    libxslt used : (1, 1, 29)
    libxslt compiled : (1, 1, 29)

Furthermore, the following suggests the problem is on the validation side, not etree as such:

    >>> from lxml import etree
    >>> x = etree.parse("testa.xml")
    >>> x.getroot().attrib['foo']
    'true'
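
If plain parsing reads the attribute correctly, one thing worth trying is to parse without a schema and validate the finished tree afterwards. The sketch below compares the two paths in memory. The function name `check_both_paths` is made up, and it is only an assumption that post-parse validation sidesteps the problem; nobody in this thread has confirmed that.

```python
import io

from lxml import etree

XSD = b"""<?xml version="1.0"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="root">
        <xs:complexType>
            <xs:attribute name="foo" type="xs:boolean" default="false"/>
        </xs:complexType>
    </xs:element>
</xs:schema>
"""

XML = b"""<?xml version="1.0"?>
<root
    foo="true"
/>
"""


def check_both_paths():
    """Return (parse_time_ok, post_parse_ok, value of the foo attribute)."""
    schema = etree.XMLSchema(etree.parse(io.BytesIO(XSD)))

    # Path 1: validate while parsing (the path that fails in this report).
    try:
        etree.parse(io.BytesIO(XML), parser=etree.XMLParser(schema=schema))
        parse_time_ok = True
    except etree.XMLSyntaxError:
        parse_time_ok = False

    # Path 2: parse without a schema, then validate the finished tree.
    doc = etree.parse(io.BytesIO(XML))
    post_parse_ok = schema.validate(doc)

    return parse_time_ok, post_parse_ok, doc.getroot().get("foo")


print(check_both_paths())
```

If the second path fails, `schema.error_log` holds the details.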

Henry S Thompson (hsthst) wrote :

I've done some bisection: the problem is in libxml2. Running with 2.9.3 works; 2.9.4 does not.
In fact, it's down to a single change in xmlschemas.c:

    < value = xmlStrndup(attributes[j+3], attributes[j+4] - attributes[j+3]);
    ---
    > value = xmlStringLenDecodeEntities(vctxt->parserCtxt, attributes[j+3],
    > attributes[j+4] - attributes[j+3], XML_SUBSTITUTE_REF, 0, 0, 0);

There's a commit fixing this:

https://git.gnome.org/browse/libxml2/commit/?id=3169602058bd2d04913909e869c61d1540bc7fb4
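
Until a fixed libxml2 is installed, a small guard like the sketch below could flag the version reported as broken here. The function and constant names are hypothetical. The version sets contain only what this thread reports (2.9.4 fails; 2.9.2 and 2.9.3 work), and every other release is deliberately treated as unknown.

```python
# Versions taken from the reports in this thread only.
KNOWN_AFFECTED = {(2, 9, 4)}
KNOWN_WORKING = {(2, 9, 2), (2, 9, 3)}


def classify_libxml2(version):
    """Classify a libxml2 version tuple as 'affected', 'working' or 'unknown'."""
    version = tuple(version)
    if version in KNOWN_AFFECTED:
        return "affected"
    if version in KNOWN_WORKING:
        return "working"
    return "unknown"


print(classify_libxml2((2, 9, 4)))
```

In practice the argument would be `lxml.etree.LIBXML_VERSION`, which is the runtime libxml2 version as a tuple.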

scoder (scoder) wrote :

Bug is in libxml2 (and probably resolved there already).

Changed in lxml:
status: New → Invalid