lxml does not support Unicode when built for Python 3 on OS X

Bug #1687236 reported by Matt Bachmann on 2017-04-30
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

Howdy!

While working on https://bugs.launchpad.net/lxml/+bug/1658169 I noticed that Mac builds simply fail when building against Python 3. The whole log is below, but here is what I found in my investigation.

I'll help in any way I can, but I'm afraid I'm a bit out of my element.

iconv versions:

Mac 10.12.2
iconv (GNU libiconv 1.11) (Though the same result under libiconv 1.15)

Linux
iconv (Ubuntu EGLIBC 2.15-0ubuntu10.18) 2.15

I also inspected the function _setupPythonUnicode in parser.pxi to see what the enc value was on various Python versions.

Mac
2.7.13 = UTF-16LE
3.3.6 = UCS-4LE

Linux
2.7.13 = UCS-4LE
3.3.6 = UCS-4LE
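
The pattern in these values is consistent with how CPython sizes its Unicode code units: Python 2 could be compiled as a "narrow" (UTF-16) or "wide" (UCS-4) build, while Python 3.3 and later always report the full code point range. The sketch below is only an approximation for checking an interpreter by hand. The helper name guess_internal_unicode_encoding is hypothetical, and this is not how lxml's _setupPythonUnicode computes enc.

```python
import sys


def guess_internal_unicode_encoding():
    """Guess an iconv-style name for the interpreter's Unicode width.

    Hypothetical helper for illustration only; it does not reproduce
    lxml's own detection logic.
    """
    # Narrow Python 2 builds report 0xFFFF; wide builds and
    # Python 3.3+ report 0x10FFFF.
    width = "UTF-16" if sys.maxunicode == 0xFFFF else "UCS-4"
    suffix = "LE" if sys.byteorder == "little" else "BE"
    return width + suffix


print(sys.version_info[:3], guess_internal_unicode_encoding())
```

Running this under each interpreter on each platform should make it easy to compare against the enc values listed above.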

What perplexes me is that libiconv should be able to handle both encodings (to my... limited understanding):

$ iconv -l | grep UTF-16LE
UTF-16LE

$ iconv -l | grep UCS-4LE
UCS-4LE

I've seen this behavior on my machine and on Travis CI.

Matt Bachmann (bachmann.matt) wrote :
scoder (scoder) wrote :

This isn't due to libiconv; it's an incomplete implementation in lxml. See the difference between

https://github.com/lxml/lxml/blob/ebafce689ae62704b1c0944bcd5b84e34f275a2d/src/lxml/parser.pxi#L1014

and

https://github.com/lxml/lxml/blob/ebafce689ae62704b1c0944bcd5b84e34f275a2d/src/lxml/parser.pxi#L1251

This isn't easy to fix, because the incremental parser can receive arbitrary Unicode strings in different memory buffer formats (PEP 393) across its lifetime. The data might therefore need to be copied into a 4-byte format before it is passed into libxml2, as we cannot repeatedly switch encodings at a per-byte level while parsing.
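
To illustrate the problem in plain Python: under PEP 393, each string is stored with 1, 2, or 4 bytes per code point depending on its highest character, so successive chunks fed to a parser can differ in layout. The sketch below uses two hypothetical helpers, pep393_width and normalize_chunks, and the utf-32-le codec to show the kind of copying into a uniform 4-byte format described above. It is not lxml's implementation, only a model of the idea.

```python
def pep393_width(text):
    """Bytes per code point that PEP 393 prescribes for this string."""
    highest = max(map(ord, text), default=0)
    if highest <= 0xFF:
        return 1
    if highest <= 0xFFFF:
        return 2
    return 4


def normalize_chunks(chunks):
    """Copy every chunk into a fixed 4-byte little-endian layout."""
    for chunk in chunks:
        # utf-32-le writes exactly 4 bytes per code point and no BOM.
        yield chunk.encode("utf-32-le")


# Chunks as an incremental parser might receive them over its lifetime.
chunks = ["<root>", "\u20ac", "\U0001F600", "</root>"]
widths = [pep393_width(c) for c in chunks]
data = b"".join(normalize_chunks(chunks))

print(widths)     # per-chunk storage width differs
print(len(data))  # uniform 4 bytes per code point after copying
```

The copy costs extra memory for ASCII-only input, which is presumably part of why the fix is not trivial.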

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed