iconv/gconv: "illegal input sequence at position"/incomplete implementation
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
glibc (Ubuntu) | New | Undecided | Unassigned |
Bug Description
While migrating from HP-UX to Linux (Red Hat/Ubuntu), I found that some encodings on Linux appear to be incomplete, or at least behave unexpectedly.
The test input is "\xc3\xbc\x73", the UTF-8 encoding of "ü" (u umlaut) followed by an "s".
The following works:
printf "\xc3\xbc\x73" | iconv -f utf8 -t ISO-8859-15
The following doesn't work:
printf "\xc3\xbc\x73" | iconv -f utf8 -t EUC-KR
It outputs "iconv: illegal input sequence at position 0",
while the following works:
printf "\xc3\xbc\x73" | iconv -f utf8 -t EUC-CN
On HP-UX all of the above generate proper output. Since UTF-8 is used as input in all cases it seems strange iconv/gconv thinks the input is wrong (errno 84) in the EUC-KR case. Converting to US-ASCII has the same problem as converting to EUC-KR.
On Mon, Mar 22, 2010 at 09:16:45PM -0000, Sven Boden wrote:
> On HP-UX all of the above generate proper output. Since UTF-8 is used as
> input in all cases it seems strange iconv/gconv thinks the input is
> wrong (errno 84) in the EUC-KR case. Converting to US-ASCII has the same
> problem as converting to EUC-KR.
Well, it's entirely unsurprising that converting to US-ASCII would fail,
as U+00FC "ü" has no representation in US-ASCII. The error code is just
a slightly awkward way to say that conversion is impossible; the iconv()
function doesn't intrinsically distinguish between "this text isn't
valid in your input encoding" and "this text can't be converted to your
output encoding". Thus this just means that iconv thinks that there's
no mapping for U+00FC in EUC-KR.
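
One way to tell the two cases apart from the shell is to run the input through iconv twice: first to an encoding that can represent every valid code point (UTF-32 here), then to the real target. This is only a minimal sketch. `classify_conversion` is a hypothetical helper name, and it assumes the installed iconv exits non-zero on failure and knows the name UTF-32. Octal escapes are used because `\x` escapes are not portable across printf implementations.

```shell
#!/bin/sh
# classify_conversion TARGET  (hypothetical helper, not part of iconv)
# Reads UTF-8 from stdin and reports whether a failed conversion is due to
# invalid input or to a character with no mapping in TARGET.
classify_conversion() {
    target=$1
    tmp=$(mktemp) || return 2
    cat > "$tmp"
    if ! iconv -f UTF-8 -t "$target" < /dev/null > /dev/null 2>&1; then
        # Converting empty input fails only if the encoding name is unknown.
        echo "iconv does not know $target"
    elif ! iconv -f UTF-8 -t UTF-32 < "$tmp" > /dev/null 2>&1; then
        # UTF-32 can hold any valid code point, so a failure here
        # means the input itself is not valid UTF-8.
        echo "invalid UTF-8 input"
    elif ! iconv -f UTF-8 -t "$target" < "$tmp" > /dev/null 2>&1; then
        echo "valid input, but not representable in $target"
    else
        echo "converts cleanly to $target"
    fi
    rm -f "$tmp"
}

# \303\274\163 is the same byte string as "\xc3\xbc\x73" in the report.
printf '\303\274\163' | classify_conversion US-ASCII
printf '\303\274\163' | classify_conversion ISO-8859-15
printf '\377'         | classify_conversion ISO-8859-15
```

Running the same helper with EUC-KR and EUC-CN as targets would show which of the three categories each of the reported conversions falls into on a given system.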
So, the question is, what byte sequence does iconv on HP-UX output for
this string? And does it actually match the Korean character set
standards? That is, there are two possibilities here: either this is a
bug in glibc for failing to perform a correct conversion, or it's
actually a bug in HP-UX for performing an incorrect conversion rather
than returning an error.
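
To answer the byte-sequence question, something like the following could be run on both HP-UX and Linux and the results compared. It is a sketch under assumptions: `show_bytes` is a hypothetical helper name, the encoding names are the ones used in this report (HP-UX may spell them differently), and `od -An -tx1` is assumed to be available on both systems.

```shell
#!/bin/sh
# show_bytes ENCODING  (hypothetical helper)
# Converts the test string from the report to ENCODING and prints the
# resulting bytes in hex, or a failure note if iconv exits non-zero.
show_bytes() {
    enc=$1
    tmp=$(mktemp) || return 2
    # In "if a | b", the tested status is that of the last command (iconv).
    if printf '\303\274\163' | iconv -f UTF-8 -t "$enc" > "$tmp" 2>/dev/null; then
        printf '%s:' "$enc"
        od -An -tx1 < "$tmp"
    else
        printf '%s: conversion failed\n' "$enc"
    fi
    rm -f "$tmp"
}

for enc in ISO-8859-15 EUC-CN EUC-KR US-ASCII; do
    show_bytes "$enc"
done
```

For ISO-8859-15 the bytes should be `fc 73`, since "ü" is 0xFC in that character set. The EUC-KR line is the one to compare against the HP-UX output and the Korean character set standard.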