Support or warn about XML 1.1

Bug #2056314 reported by masklinn
This bug affects 1 person
Affects: lxml
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

XML 1.0 forbids most control characters (everything below #x20 except tab, LF and CR): https://www.w3.org/TR/xml/#charsets

XML 1.1 allows them: https://www.w3.org/TR/xml11/#charsets (note how only #x0 and the surrogate codepoint ranges are excluded from `Char`)
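To make the difference concrete, here is a minimal sketch of the two `Char` productions as plain Python predicates. The function names are made up for illustration and are not part of lxml or libxml2. Note that XML 1.1 additionally defines `RestrictedChar`: most control characters are only allowed there when written as character references (e.g. `&#1;`), which the third predicate reflects.

```python
def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 `Char` production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, LF, CR are the only C0 controls allowed
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_char(ch: str) -> bool:
    """True if ch matches the XML 1.1 `Char` production (only #x0, surrogates, #xFFFE/#xFFFF excluded)."""
    cp = ord(ch)
    return (
        0x1 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def is_xml11_restricted(ch: str) -> bool:
    """True if ch is an XML 1.1 `RestrictedChar` (legal only as a character reference)."""
    cp = ord(ch)
    return (
        0x1 <= cp <= 0x8
        or cp in (0xB, 0xC)
        or 0xE <= cp <= 0x1F
        or 0x7F <= cp <= 0x84
        or 0x86 <= cp <= 0x9F
    )


def xml10_invalid_chars(text: str) -> list[tuple[int, str]]:
    """Return (index, character) pairs for everything an XML 1.0 parser must reject."""
    return [(i, ch) for i, ch in enumerate(text) if not is_xml10_char(ch)]
```

Running `xml10_invalid_chars` over decoded text content before handing it to a 1.0-only parser gives a cheap way to see which characters would trigger the error.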

Currently, lxml tries to parse XML 1.1 documents with no warning, but if the document contains any control character it chokes and raises:

    lxml.etree.XMLSyntaxError: PCDATA invalid Char

Looking for the error, it seems various sources these days are emitting XML 1.1 (or at least putting control characters in documents), which randomly blows up parsers. This is not great. A common workaround is to use a recovering parser, but that seems way overkill and could have nasty side-effects around differential parsing.

- ideally, lxml would allow control characters (maybe as an opt-in?), possibly only for documents declared as XML 1.1, to play well with *good* content producers. That may not be possible due to the dependency on libxml2, which as far as I can see only advertises XML 1.0 support (but maybe somewhere in the bowels of libxml2 there's an option for XML 1.1 support?)

- alternatively, and for the same reasons, if the previous option is not possible lxml *should* warn when it encounters an XML 1.1 document, so that users are aware of the limitation / risk before actually hitting it
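Until something like the second option exists, a user-side check can approximate it. The following is only a sketch under stated assumptions: the helper names `declared_xml_version` and `warn_if_xml11` are hypothetical (not lxml API), and the regex assumes an ASCII-compatible encoding such as UTF-8, so UTF-16 documents would not be detected.

```python
import re
import warnings

# XML declaration at the very start of the document, optionally after a UTF-8 BOM.
# Assumption: the input is in an ASCII-compatible encoding.
_XML_DECL = re.compile(
    rb"\A(?:\xef\xbb\xbf)?<\?xml\s+version\s*=\s*(['\"])(1\.[0-9]+)\1"
)


def declared_xml_version(data: bytes):
    """Return the version from the XML declaration, or None if there is none."""
    match = _XML_DECL.match(data)
    return match.group(2).decode("ascii") if match else None


def warn_if_xml11(data: bytes) -> bool:
    """Warn (and return True) if the document declares a version other than 1.0."""
    version = declared_xml_version(data)
    if version is not None and version != "1.0":
        warnings.warn(
            f"document declares XML {version}; an XML 1.0-only parser "
            "may reject control characters it contains",
            stacklevel=2,
        )
        return True
    return False
```

Calling `warn_if_xml11(raw_bytes)` before parsing surfaces the limitation up front instead of at the first control character. It only inspects the declaration, so it does not catch producers that label a document as 1.0 and emit control characters anyway.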

It might also be useful if lxml.de more clearly advertised the lack of XML 1.1 support. The only reference I found is by omission, in the FAQ (https://lxml.de/FAQ.html#what-standards-does-lxml-implement), where only XML 1.0 support is advertised.
