UTF-16 BOM not accepted when encoding is specified

Bug #1463610 reported by Olli Pottonen on 2015-06-09
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Low
Assigned to: Unassigned

Bug Description

UTF-16 can be serialized in little endian byte order (UTF-16LE) or in big endian byte order (UTF-16BE). As Section 4 of RFC 2781 explains, 'UTF-16' means 'either little endian or big endian, as indicated by the byte order mark (BOM)'. However, when encoding='utf-16' is passed explicitly, lxml always assumes little endian.

>>> import lxml.etree
>>> docsrc = u'\ufeff<root/>'.encode('utf-16be')
>>> # u'\ufeff' is a manually added BOM
>>>
>>> lxml.etree.fromstring(docsrc,
...     lxml.etree.XMLParser(encoding='utf-16'))
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "lxml.etree.pyx", line 3169, in lxml.etree.fromstring (src/lxml/lxml.etree.c:71679)
  File "parser.pxi", line 1828, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:107518)
  File "parser.pxi", line 1716, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:106309)
  File "parser.pxi", line 1086, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:100991)
  File "parser.pxi", line 580, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:95470)
  File "parser.pxi", line 690, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:96906)
  File "parser.pxi", line 620, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:95973)
lxml.etree.XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

This should not raise an exception; the document should parse.

Absurdly, while lxml is not able to process the document when encoding is correctly specified, it can parse the same document when encoding is not specified.

>>> lxml.etree.fromstring(docsrc,
... lxml.etree.XMLParser())
<Element root at 0x1095f44d0>

Note also that little endian utf-16 works fine:

>>> docsrc = u'\ufeff<root/>'.encode('utf-16le')
>>> lxml.etree.fromstring(docsrc,
... lxml.etree.XMLParser(encoding='utf-16'))
<Element root at 0x1095f4f80>
>>> lxml.etree.fromstring(docsrc,
... lxml.etree.XMLParser())
<Element root at 0x1095f7440>

By Section 4.3.3 of the XML 1.0 standard, utf-16 must be supported.

The bug seems to be in libxml2, which appears to treat utf-16 as utf-16le with a BOM.
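
One way to see why the parser would complain about a missing '<' is to decode the big endian bytes as little endian by hand. This is only a sketch using Python's own codecs, on the assumption (not verified against the libxml2 source) that libxml2 does something equivalent; the variable names are illustrative.

```python
# Same document as in the report: a BOM followed by <root/>, big endian.
docsrc = u'\ufeff<root/>'.encode('utf-16be')

# Python's generic 'utf-16' codec reads the BOM and picks the byte order.
honoured = docsrc.decode('utf-16')

# Forcing little endian byte-swaps every 16-bit code unit instead.
misread = docsrc.decode('utf-16le')
first_units = [hex(ord(ch)) for ch in misread[:2]]
```

Under that assumption the first code unit the parser sees is U+FFFE rather than a BOM, and the '<' arrives as U+3C00, which would be consistent with the "Start tag expected" error above.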

Python : sys.version_info(major=2, minor=7, micro=9, releaselevel='final', serial=0)
lxml.etree : (3, 5, 0, -99)
libxml used : (2, 9, 2)
libxml compiled : (2, 9, 2)
libxslt used : (1, 1, 28)
libxslt compiled : (1, 1, 28)

scoder (scoder) wrote :

Yes, my guess is that libxml2 takes UTF-16 as meaning the current platform's endianness (which tends to work in most cases, I guess).

lxml could special case this encoding for in-memory data. It already handles BOMs 'manually' when you call parser.feed():

https://github.com/lxml/lxml/blob/ed6af05991ca39910c575e7177a8389244a2cc4f/src/lxml/parser.pxi#L1247
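
In the meantime, a caller could apply the same kind of special casing before handing in-memory data to the parser. The following is a rough sketch, not lxml code: resolve_utf16 is a hypothetical helper, and falling back to big endian for input without a BOM is an assumption based on RFC 2781, Section 4.3.

```python
import codecs


def resolve_utf16(data, default='UTF-16BE'):
    """Return (encoding_name, payload) for bytes declared as 'UTF-16'.

    Hypothetical helper: inspects the first two bytes for a BOM, strips
    it, and names the byte order explicitly. Input without a BOM is
    returned unchanged together with the default encoding name.
    """
    if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return 'UTF-16BE', data[len(codecs.BOM_UTF16_BE):]
    if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return 'UTF-16LE', data[len(codecs.BOM_UTF16_LE):]
    return default, data


# Usage with the document from the report.
name, payload = resolve_utf16(u'\ufeff<root/>'.encode('utf-16be'))
text = payload.decode(name)
```

The returned name could then be passed as XMLParser(encoding=name) together with the payload. Whether libxml2 handles 'UTF-16BE' and 'UTF-16LE' correctly in that position has not been checked here.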

Changed in lxml:
importance: Undecided → Low
status: New → Triaged
scoder (scoder) on 2017-08-13
summary: - UTF-16 endianness not detected
+ UTF-16 BOM not accepted when encoding is specified
Changed in lxml:
status: Triaged → Confirmed