Comment 5 for bug 666565

Colin Watson (cjwatson) wrote : Re: [Bug 666565] Re: "utf8" charmap in locale name is wrong

I still believe that the best option is to use UTF-8 as the primary
user-visible name in environment variables and such (since it's what's
in /usr/share/i18n/SUPPORTED), even though it's an alias, but to fix the
small handful of things that have trouble when you use one of the other
valid spellings. It's not that hard - I've implemented software that
does it. The vast majority of locale-aware software doesn't need to
care, because it will just do setlocale(LC_ALL, "") and get on with
things regardless of whether an alias is in use. It's only a tiny
minority of software that does more sophisticated things with locale
strings that needs to care.

The reason to take this approach is that software that parses locale
strings in ways that only handle particular spellings of them tends to be
buggy in other ways. For example, such buggy software can easily fail
to handle LANG=en_IN as a UTF-8 locale, even though it's defined as such
in /usr/share/i18n/SUPPORTED (the .UTF-8 suffix is mainly for dealing
with locales that previously had a non-UTF-8 version, and some newer
locales just went UTF-8 from the start). This sort of thing is easily
fixed by (for example) using nl_langinfo(CODESET) rather than trying to
parse locale strings.
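
As a rough illustration of the nl_langinfo(CODESET) approach, here is a
minimal Python sketch. It assumes a POSIX system where Python's locale
module exposes nl_langinfo; the helper names current_codeset and
is_utf8_locale are made up for this example and do not come from any
existing package.

```python
import locale


def current_codeset() -> str:
    """Return the codeset of the current LC_CTYPE locale, as reported by libc."""
    try:
        # Adopt whatever the environment (LANG, LC_ALL, ...) asks for.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The requested locale is not installed; fall back to the C locale.
        locale.setlocale(locale.LC_CTYPE, "C")
    return locale.nl_langinfo(locale.CODESET)


def is_utf8_locale() -> bool:
    """True if the current locale's codeset is UTF-8, however it is spelled."""
    # Compare leniently so that "UTF-8", "utf8" and "utf-8" all match.
    return current_codeset().replace("-", "").lower() == "utf8"
```

The point is that the locale string itself is never inspected, so a
locale such as en_IN is handled the same way as en_US.UTF-8 or
en_US.utf8.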

Fundamentally, locale strings are supposed to be opaque, and anything
that parses them had better (a) have a good excuse and (b) read the
documentation very carefully to understand what it can and can't do.

Getting back to the original patch, the general idea seems OK to me, but
I think it would be helpful for it to take a slightly different approach
to implementation. Rather than just appending .UTF-8, I suggest
searching /usr/share/i18n/SUPPORTED for a suitable match for the
language, country, and variant which has "UTF-8" as the second column.
That way, language-selector will always select the canonical
user-visible name for the locale, even if it's one of the interesting
cases such as en_IN where the canonical name doesn't have an encoding
suffix.
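
To make the suggestion concrete, here is a minimal Python sketch of the
lookup. It is not the actual language-selector code. It assumes the
SUPPORTED file has one "locale-name charmap" pair per line, separated
by whitespace, with optional "#" comment lines; the function name
find_utf8_locale and its parameters are hypothetical.

```python
import re
from typing import Optional

# language[_COUNTRY][.codeset][@variant]
_LOCALE_RE = re.compile(
    r"([^_.@]+)(?:_([^.@]+))?(?:\.([^@]+))?(?:@(.+))?"
)


def find_utf8_locale(
    supported_text: str,
    language: str,
    country: Optional[str] = None,
    variant: Optional[str] = None,
) -> Optional[str]:
    """Return the first locale name in supported_text whose charmap column
    is UTF-8 and whose language, country and variant all match exactly.
    Returns None if there is no such entry."""
    for line in supported_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            # Not the assumed two-column layout; skip rather than guess.
            continue
        name, charmap = fields
        if charmap != "UTF-8":
            continue
        match = _LOCALE_RE.fullmatch(name)
        if match is None:
            continue
        lang, ctry, _codeset, var = match.groups()
        if (lang, ctry, var) == (language, country, variant):
            # Return the name exactly as listed, suffix or no suffix.
            return name
    return None
```

A caller would read /usr/share/i18n/SUPPORTED into a string and pass it
in along with the language, country and variant it wants. The name is
returned exactly as it appears in the file, so en_IN comes back without
an encoding suffix while a locale listed as en_US.UTF-8 keeps it.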