Incorrect encoding declaration not detected

Bug #1463609 reported by Olli Pottonen on 2015-06-09
This bug affects 1 person
Affects: lxml
Importance: Undecided
Assigned to: Unassigned

Bug Description

According to section 4.3.3 of the XML 1.0 standard, "In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration."

So if we encode a document in UTF-16 but declare it as UTF-8, an exception should be raised. However, that does not happen:

>>> import lxml.etree
>>> docsrc = u'<?xml version="1.0" encoding="utf-8"?><root/>'.encode('utf-16')
>>> lxml.etree.fromstring(docsrc)
<Element root at 0x1095f4fc8>
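
For illustration, the kind of check I would expect can be sketched in plain Python. This is only a rough sketch using the standard library: `check_declared_encoding` is a hypothetical helper name, not part of lxml or libxml2, and it only compares the byte order mark (if any) against the encoding named in the XML declaration. It does nothing when there is no byte order mark.

```python
import codecs
import re

# UTF-32 is listed first because the UTF-32-LE BOM begins with the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

# Matches the encoding pseudo-attribute of an XML declaration at the start of the text.
_DECL_RE = re.compile(
    r'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']')


def check_declared_encoding(data):
    """Hypothetical helper: raise ValueError if BOM and declaration disagree."""
    family = next((name for bom, name in _BOMS if data.startswith(bom)), None)
    if family is None:
        return  # no BOM, so there is nothing to compare against
    # 'utf-8-sig' strips the UTF-8 BOM; the utf-16/utf-32 codecs consume theirs.
    codec = 'utf-8-sig' if family == 'utf-8' else family
    head = data[:400].decode(codec, 'replace')
    match = _DECL_RE.match(head)
    if match is None:
        return  # no encoding declaration present
    declared = match.group(1)
    try:
        normalized = codecs.lookup(declared).name  # e.g. 'us-ascii' -> 'ascii'
    except LookupError:
        normalized = declared.lower()
    if not normalized.startswith(family):
        raise ValueError('document has a %s BOM but declares encoding %r'
                         % (family, declared))
```

With this sketch, `check_declared_encoding(docsrc)` raises `ValueError` for the document above, while the same document declared as `utf-16` passes.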

Also, this does not strictly have to raise an exception, but it could:

>>> docsrc = '<root/>'.encode('utf-16le')
>>> lxml.etree.fromstring(docsrc, lxml.etree.XMLParser(encoding='utf-16'))
<Element root at 0x1095f4cf8>

The document is in UTF-16 but has no byte order mark. Section 4.3.3 states that this is an error, although not a fatal one.
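
The missing byte order mark is easy to confirm from the raw bytes. A minimal sketch with the standard library follows; `has_utf16_bom` is a hypothetical helper name, not an lxml function.

```python
import codecs


def has_utf16_bom(data):
    """Hypothetical helper: True if the bytes start with a UTF-16 byte order mark."""
    # bytes.startswith accepts a tuple of candidate prefixes.
    return data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))


# 'utf-16le' writes no BOM, whereas the generic 'utf-16' codec prepends one.
without_bom = u'<root/>'.encode('utf-16le')
with_bom = u'<root/>'.encode('utf-16')
```

Here `has_utf16_bom(without_bom)` is false and `has_utf16_bom(with_bom)` is true, so a parser told to expect UTF-16 could at least warn about the first input.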

Might be a bug in libxml2.

Python : sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
lxml.etree : (3, 5, 0, -99)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

Olli Pottonen (olli-pottonen) wrote :

Also this:

>>> docsrc = u'\ufeff<?xml version="1.0" encoding="us-ascii"?><root/>'.encode('utf-8')
>>> lxml.etree.fromstring(docsrc)
<Element root at 0x10c7c1ef0>

Because of the byte order mark '\ufeff', the document is not ASCII.
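
This can be verified without lxml. The following sketch, using only the standard library, shows that the bytes begin with the UTF-8 byte order mark and cannot be decoded as ASCII:

```python
import codecs

docsrc = u'\ufeff<?xml version="1.0" encoding="us-ascii"?><root/>'.encode('utf-8')

# The encoded document begins with the three-byte UTF-8 BOM (EF BB BF).
starts_with_bom = docsrc.startswith(codecs.BOM_UTF8)

# Those bytes are outside the 7-bit range, so strict ASCII decoding fails.
try:
    docsrc.decode('ascii')
    is_ascii = True
except UnicodeDecodeError:
    is_ascii = False
```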
