lxml does not support Unicode when building for Python 3 on OS X

Bug #1687236 reported by Matt Bachmann
This bug affects 1 person
Affects: lxml
Status: Confirmed
Importance: Medium
Assigned to: Unassigned
Milestone: (none)

Bug Description

Howdy!

While working on https://bugs.launchpad.net/lxml/+bug/1658169, I noticed that Mac builds fail when building for Python 3. The whole log is below, but here is what I found in my investigation.

I'll help in any way I can, but I'm afraid I'm a bit out of my element.

iconv versions:

Mac 10.12.2
iconv (GNU libiconv 1.11) (Though the same result under libiconv 1.15)

Linux
iconv (Ubuntu EGLIBC 2.15-0ubuntu10.18) 2.15

I also inspected the function _setupPythonUnicode in parser.pxi to see what the enc value was on various Python versions.

Mac
2.7.13 = UTF-16LE
3.3.6 = UCS-4LE

Linux
2.7.13 = UCS-4LE
3.3.6 = UCS-4LE
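
The values above are consistent with the width of the interpreter's Unicode representation: a narrow Python 2 build reports 2-byte characters, while wide builds and Python 3.3+ report the full code point range. The following is a minimal sketch of that relationship, not the actual logic of _setupPythonUnicode; the function name guess_internal_unicode_encoding is hypothetical, and the mapping from width to encoding name is an assumption based on the table above.

```python
import sys


def guess_internal_unicode_encoding():
    """Guess an iconv-style encoding name for this interpreter's Unicode strings.

    Hypothetical helper for illustration only; lxml's own detection
    may work differently.
    """
    # Narrow Python 2 builds report sys.maxunicode == 0xFFFF (2-byte units).
    # Wide builds and Python 3.3+ report 0x10FFFF.
    width = 2 if sys.maxunicode == 0xFFFF else 4
    suffix = "LE" if sys.byteorder == "little" else "BE"
    return ("UTF-16" if width == 2 else "UCS-4") + suffix


if __name__ == "__main__":
    print(sys.version_info[:3], guess_internal_unicode_encoding())
```

Running this under each interpreter listed above should show whether the reported enc value tracks the build's character width.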

What perplexes me is that libiconv should be able to handle this (to my... limited understanding):

$ iconv -l | grep UTF-16LE
UTF-16LE

$ iconv -l | grep UCS-4LE
UCS-4LE

I've seen this behavior on my machine and on Travis CI.

Matt Bachmann (bachmann.matt) wrote :
scoder (scoder) wrote :

This isn't due to libiconv; it's an incomplete implementation in lxml. See the difference between

https://github.com/lxml/lxml/blob/ebafce689ae62704b1c0944bcd5b84e34f275a2d/src/lxml/parser.pxi#L1014

and

https://github.com/lxml/lxml/blob/ebafce689ae62704b1c0944bcd5b84e34f275a2d/src/lxml/parser.pxi#L1251

This isn't easy to fix because the incremental parser can receive arbitrary Unicode strings in different memory buffer formats (PEP 393) across its lifetime. The data might therefore need to be copied into a 4-byte format before being passed into libxml2, as we cannot repeatedly switch encodings at a per-byte level while parsing.
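
To illustrate the point about buffer formats, here is a rough Python-level sketch. Under PEP 393, CPython stores each string with 1, 2, or 4 bytes per character depending on its highest code point, so successive chunks fed to a parser can differ in layout. The helper names pep393_kind and to_ucs4_le are hypothetical, and encoding to UTF-32LE stands in for the "copy into a 4-byte format" step; this is not how lxml implements it.

```python
def pep393_kind(text):
    """Return the bytes per character CPython 3.3+ would use for this string."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1  # Latin-1 range
    if highest < 0x10000:
        return 2  # Basic Multilingual Plane
    return 4      # supplementary planes


def to_ucs4_le(text):
    """Copy a string into a fixed 4-byte little-endian representation."""
    return text.encode("utf-32-le")


if __name__ == "__main__":
    # Chunks as an incremental parser might receive them over its lifetime.
    chunks = ["<root>", "\u20ac", "\U0001F600", "</root>"]
    for chunk in chunks:
        data = to_ucs4_le(chunk)
        print(pep393_kind(chunk), len(data))
```

Each chunk has a different native width, but after the copy every character occupies exactly 4 bytes, so a single encoding could be declared for the whole stream.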

Changed in lxml:
importance: Undecided → Medium
status: New → Confirmed