Support or warn about XML 1.1
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
lxml | New | Undecided | Unassigned |
Bug Description
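XML 1.0 forbids most control characters (everything below U+0020 except tab, line feed and carriage return): https:/
XML 1.1 allows them, except NUL, when written as character references: https:/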
Currently, lxml parses XML 1.1 documents without any warning, but if the document contains a control character it chokes and raises
lxml.
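To check ahead of time whether a document would hit this, one option is to scan the text for the C0 control characters that the XML 1.0 `Char` production excludes. The following is a minimal sketch using only the standard library. The helper name `find_forbidden_controls` is hypothetical and not part of lxml, and it only looks at literal characters, not at character references such as `&#1;`.

```python
import re
from typing import List, Tuple

# C0 control characters excluded by the XML 1.0 `Char` production.
# Tab (\x09), line feed (\x0a) and carriage return (\x0d) are allowed.
_XML10_FORBIDDEN_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_forbidden_controls(text: str) -> List[Tuple[int, str]]:
    """Return (offset, code point) pairs for control characters XML 1.0 rejects.

    Hypothetical helper for illustration; it does not parse the document.
    """
    return [
        (match.start(), f"U+{ord(match.group()):04X}")
        for match in _XML10_FORBIDDEN_CONTROLS.finditer(text)
    ]
```

An empty result means the document contains no literal control characters of this kind; it says nothing about whether the document is otherwise well-formed.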
Searching for the error suggests that various sources these days emit XML 1.1 (or at least put control characters in documents), which randomly blows up parsers. This is not great. A common workaround is to use a recovering parser, but that seems like overkill and could have nasty side effects around differential parsing.
- Ideally, lxml would allow control characters (maybe as an opt-in?), possibly only for documents declared as XML 1.1, to play well with *good* content producers. That may not be possible due to the dependency on libxml2, which as far as I can see only advertises XML 1.0 support (but maybe somewhere in the bowels of libxml2 there's an option for XML 1.1 support?).
- Alternatively, and for the same reasons, if the previous option is not feasible, lxml *should* warn when it encounters an XML 1.1 document, so that users are aware of the limitation and risk before actually hitting it.
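Until something like the second option exists, a caller could approximate it by inspecting the XML declaration before handing the bytes to the parser. This is only a sketch under stated assumptions: `declared_xml_version` and `warn_if_xml11` are hypothetical helpers, not lxml APIs, and the regex only handles ASCII-compatible encodings (a UTF-16 document would not match).

```python
import re
import warnings

# Matches the version pseudo-attribute of an XML declaration, which must
# sit at the very start of the document (after an optional UTF-8 BOM).
_XML_DECL_VERSION = re.compile(rb"""^<\?xml\s+version\s*=\s*["']([0-9.]+)["']""")


def declared_xml_version(data: bytes) -> str:
    """Return the version named in the XML declaration, or "1.0" if absent."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        data = data[3:]
    match = _XML_DECL_VERSION.match(data)
    return match.group(1).decode("ascii") if match else "1.0"


def warn_if_xml11(data: bytes) -> bool:
    """Emit a warning when the document declares a version other than 1.0.

    Returns True if a warning was issued, False otherwise.
    """
    version = declared_xml_version(data)
    if version != "1.0":
        warnings.warn(
            f"document declares XML {version}; the parser may reject "
            "characters that this version allows",
            stacklevel=2,
        )
        return True
    return False
```

Calling `warn_if_xml11(raw_bytes)` right before parsing surfaces the risk early without changing how the document is parsed, unlike switching to a recovering parser.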
It might also be useful if lxml.de advertised the lack of XML 1.1 support more clearly; the only reference I found is a negative one in the FAQ (https:/