"utf8" charmap in locale name is wrong

Bug #666565 reported by Lauri Tirkkonen on 2010-10-26
This bug affects 2 people
Affects                      Importance  Assigned to
Ubuntu Translations          Low         Unassigned
eglibc (Ubuntu)              Undecided   Unassigned
gdm (Ubuntu)                 Undecided   Gunnar Hjalmarsson
langpack-locales (Ubuntu)    Undecided   Unassigned
language-selector (Ubuntu)   Undecided   Gunnar Hjalmarsson
vim (Ubuntu)                 Undecided   Unassigned

Bug Description

Binary package hint: language-selector

LanguageSelector/macros.py explicitly sets the charmap part of locale strings to "utf8" - it should be set to "UTF-8" instead. This is relevant because not all systems alias locale names with the former to the latter, and compatibility with those systems is broken.

Rationale for this change is that the 'locales' package uses the uppercase hyphenated format everywhere, even going as far as replacing '.utf8' with it in one case:
% dpkg -L locales | xargs grep '\.utf8'
/usr/sbin/locale-gen: elif [ $IS_LANG = no ] && L=`grep "^${1/%.utf8/.UTF-8} " /usr/share/i18n/SUPPORTED`; then

Aron Xu (happyaron) wrote :

The problem of "utf8" and "UTF-8" has been there for some time, and there were arguments about it. Let's see:

$ locale -a
C
POSIX
zh_CN.utf8
zh_SG.utf8

# locale-gen
Generating locales...
  zh_CN.UTF-8... up-to-date
  zh_SG.UTF-8... up-to-date
Generation complete.

The problem is that we have mixed up the use of "utf8" and "UTF-8", and I think language-selector isn't the root of the problem; the cause should lie at a lower level.

Many Ubuntu developers tend to use utf8 instead of UTF-8 because it is "defined in eglibc and UTF-8 is now an alias". But we may consider that a bug in the eglibc package in Ubuntu. It has broken many things, confused many people, and made work more complicated. I don't mean that changing it to UTF-8 would solve every problem, but it might be the right thing to do.

Changed in ubuntu-translations:
importance: Undecided → High
status: New → Triaged

But "utf8" has been the canonical form in eglibc for as long as I can
remember (at least ten years or so I believe). This isn't something
specific to Ubuntu. Changing it seems risky.

Aron Xu (happyaron) wrote :

Yes, changing it is risky. An alternative is to "fix" this in langpack-locales and make everything in the system use "utf8" wherever problems occur. A temporary solution that accepts both utf8 and UTF-8 is of course needed, but it should be just a workaround. This kind of problem costs more and more time whenever users need to change locale settings, and dealing with the related problems is now more complex than ever before.

Colin Watson (cjwatson) wrote :

I still believe that the best option is to use UTF-8 as the primary
user-visible name in environment variables and such (since it's what's
in /usr/share/i18n/SUPPORTED), even though it's an alias, but to fix the
small handful of things that have trouble when you use one of the other
valid spellings. It's not that hard - I've implemented software that
does it. The vast majority of locale-aware software doesn't need to
care, because it will just do setlocale(LC_ALL, "") and get on with
things regardless of whether an alias is in use. It's only a tiny
minority of software that does more sophisticated things with locale
strings that needs to care.

The reason to take this approach is that software that parses locale
strings in ways that only handle particular spellings of them tends to be
buggy in other ways. For example, such buggy software can easily fail
to handle LANG=en_IN as a UTF-8 locale, even though it's defined as such
in /usr/share/i18n/SUPPORTED (the .UTF-8 suffix is mainly for dealing
with locales that previously had a non-UTF-8 version, and some newer
locales just went UTF-8 from the start). This sort of thing is easily
fixed by (for example) using nl_langinfo(CODESET) rather than trying to
parse locale strings.
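
A minimal Python sketch of that approach, for illustration only: it asks the C library for the active codeset through nl_langinfo(CODESET) instead of inspecting the locale name, so LANG=en_IN, en_IN.utf8 and en_IN.UTF-8 are all treated alike. The helper names codeset_is_utf8 and current_locale_is_utf8 are hypothetical and are not taken from any package discussed in this bug.

```python
import codecs
import locale


def codeset_is_utf8(codeset):
    """Return True if a codeset name is any spelling of UTF-8."""
    try:
        # codecs.lookup() normalises spellings such as "utf8" and "UTF-8".
        return codecs.lookup(codeset).name == "utf-8"
    except LookupError:
        # Unknown or empty codeset name: treat it as not UTF-8.
        return False


def current_locale_is_utf8():
    """Check the codeset of the environment's locale without parsing its name."""
    try:
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The requested locale is not installed; keep the current locale.
        pass
    return codeset_is_utf8(locale.nl_langinfo(locale.CODESET))


if __name__ == "__main__":
    print(current_locale_is_utf8())
```

The comparison goes through codecs.lookup() so that the codeset string itself is never matched against one fixed spelling.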

Fundamentally, locale strings are supposed to be opaque, and anything
that parses them had better (a) have a good excuse and (b) read the
documentation very carefully to understand what it can and can't do.

Getting back to the original patch, the general idea seems OK to me, but
I think it would be helpful for it to take a slightly different approach
to implementation. Rather than just appending .UTF-8, I suggest
searching /usr/share/i18n/SUPPORTED for a suitable match for the
language, country, and variant which has "UTF-8" as the second column.
That way, language-selector will always select the canonical
user-visible name for the locale, even if it's one of the interesting
cases such as en_IN where the canonical name doesn't have an encoding
suffix.
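
The lookup suggested above could be sketched roughly as follows in Python. This is not the code that went into language-selector; the function name canonical_utf8_name and the sample data are assumptions made for illustration. The sample only mimics the two-column "locale charmap" layout of /usr/share/i18n/SUPPORTED.

```python
import re

# Illustrative lines in the "<locale> <charmap>" layout of /usr/share/i18n/SUPPORTED.
SAMPLE_SUPPORTED = """\
# sample data, not the full file
en_IN UTF-8
zh_CN.UTF-8 UTF-8
zh_CN GB2312
"""

# language_COUNTRY, optional .codeset, optional @variant
_NAME_RE = re.compile(r"^([^.@]+)(?:\.[^@]+)?(@.+)?$")


def canonical_utf8_name(wanted, supported_text):
    """Return the SUPPORTED entry for `wanted` whose charmap column is UTF-8.

    `wanted` carries no encoding suffix, e.g. "en_IN" or "zh_CN".
    Returns None when no UTF-8 entry matches.
    """
    for line in supported_text.splitlines():
        fields = line.split()
        # Skip blank lines, comments, and anything that is not two columns.
        if len(fields) != 2 or fields[0].startswith("#"):
            continue
        name, charmap = fields
        if charmap != "UTF-8":
            continue
        match = _NAME_RE.match(name)
        # Compare language/country/variant with the codeset part removed.
        if match and match.group(1) + (match.group(2) or "") == wanted:
            return name
    return None
```

On a real system the text would come from reading /usr/share/i18n/SUPPORTED rather than from the sample string. With the sample, "en_IN" maps to itself while "zh_CN" maps to "zh_CN.UTF-8".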

Lauri Tirkkonen (lotheac) wrote :

Colin's right, of course -- my main issue with this isn't software running locally, but remote systems. That's not trivial though: ssh into some legacy machine, and they might not have compiled your locale at all, or perhaps it's different (such as the case with en_IN). Of course, that's not an Ubuntu bug, but rather a problem with POSIX not separating charmaps from locales.

tags: added: patch
kk19881201 (kk19881201) wrote :

When I was using GVIM in China, it could not display Chinese characters properly. It only recognizes UTF-8, not utf8.

Colin Watson (cjwatson) wrote :

Then that's a vim bug - I've opened a task for it.

Aron Xu (happyaron) wrote :

No, it wouldn't be. I think an application that doesn't work with .UTF-8 has a bug, but one that merely doesn't work with .utf8 does not.

Aron Xu (happyaron) wrote :

The gettext documentation, which may not be a real standard but does show their attitude toward locale names, gives .UTF-8 as an example and doesn't mention .utf8 at all.
http://www.gnu.org/software/hello/manual/gettext/Locale-Names.html#Locale-Names

I haven't done detailed research, but some other major distributions use .UTF-8 in their official documentation, e.g. http://www.gentoo.org/doc/en/utf-8.xml

Colin Watson (cjwatson) wrote :

On Fri, Jan 28, 2011 at 12:13:56PM -0000, Aron Xu wrote:
> No, it wouldn't be. I think an application that doesn't work with
> .UTF-8 has a bug, but one that merely doesn't work with .utf8 does not.

I entirely disagree. .utf8 is a valid spelling of the locale and it's a
bug for applications to fail to work with it.

ZhengPeng Hou (zhengpeng-hou) wrote :

If we shift to using utf8, what about other distros that still use UTF-8? Are
we going to ignore interoperability? In addition, how can we
convince all other user-space applications to adopt utf8?
I found that UTF-8 is still being used in eglibc, so what's the
advantage of using utf8?

Colin Watson (cjwatson) wrote :

I am not saying that we should shift to use .utf8. I am saying that
when the locale ends up as .utf8 for one reason or another, applications
must not break.

This does not have to be an either/or thing! The primary names for the
locales are still generally .UTF-8 and should remain that way. But
locale aliases exist and it's only a small number of buggy applications
that fail to cope with them.

Aron Xu (happyaron) wrote :

If we fix it for users and applications so that the only spelling visible to them is .UTF-8, we avoid many issues. It also improves compatibility when people connect to other distros (e.g. via ssh).

Colin Watson (cjwatson) wrote :

I agree that it makes sense to present UTF-8 as the primary spelling.
Note that I already said above that I agreed that language-selector
should be fixed. But it is clearly wrong to deny the existence of
locale aliases.

Gunnar Hjalmarsson (gunnarhj) wrote :

As of version 2.32.0-0ubuntu2, gdm (Ubuntu) may assign a locale name to LC_MESSAGES.

Changed in gdm (Ubuntu):
assignee: nobody → Gunnar Hjalmarsson (gunnarhj)
Gunnar Hjalmarsson (gunnarhj) wrote :

There seems to be a consensus that the encoding part of locale names
assigned to the LANG or LC_* environment variables should be .UTF-8
rather than .utf8. I'm currently working on other language/locale
related matters in language-selector and GDM, so I can include the
necessary changes in a couple of merge proposals that are in the
pipeline. Before I do so, and since I have no idea of my own to what
extent the changes would create new issues, I'd like someone to triage
the bug with respect to language-selector and gdm (Ubuntu). I'd also
need help drawing a conclusion from the reasoning below.

On 2011-01-25 12:54, Colin Watson wrote:
> ... software that parses locale strings in ways that only handle
> particular spellings of them tends to be buggy in other ways. For
> example, such buggy software can easily fail to handle LANG=en_IN as
> a UTF-8 locale, even though it's defined as such in
> /usr/share/i18n/SUPPORTED (the .UTF-8 suffix is mainly for dealing
> with locales that previously had a non-UTF-8 version, and some newer
> locales just went UTF-8 from the start).

Doesn't that point towards simply appending .UTF-8 to e.g. en_IN,
irrespective of the name according to /usr/share/i18n/SUPPORTED?

I did this test:

  [gunnar@gunnar-laptop ~/sandbox]$ sh
  $ cat mytest.po
  msgid "hello"
  msgstr "hello from India"
  $ dir=/usr/share/locale/en_IN/LC_MESSAGES
  $ sudo mkdir -p $dir
  $ sudo msgfmt mytest.po -o $dir/mytest.mo
  $ LANGUAGE=''
  $ LC_MESSAGES=en_IN
  $ echo $( gettext -d mytest hello )
  hello from India
  $ LC_MESSAGES=en_IN.utf8
  $ echo $( gettext -d mytest hello )
  hello from India
  $ LC_MESSAGES=en_IN.UTF-8
  $ echo $( gettext -d mytest hello )
  hello from India
  $ exit
  [gunnar@gunnar-laptop ~/sandbox]$

No complaints, and gettext found the Indian 'translation' in all three
cases, so en_IN.UTF-8 seems to work. Or would that name cause other apps
to fail?

> Getting back to the original patch, the general idea seems OK to me,
> but I think it would be helpful for it to take a slightly different
> approach to implementation. Rather than just appending .UTF-8, I
> suggest searching /usr/share/i18n/SUPPORTED for a suitable match for
> the language, country, and variant which has "UTF-8" as the second
> column. That way, language-selector will always select the canonical
> user-visible name for the locale, even if it's one of the
> interesting cases such as en_IN where the canonical name doesn't have
> an encoding suffix.

Even if we went for the canonical names, I don't think it's
necessary to parse /usr/share/i18n/SUPPORTED.

  [gunnar@gunnar-laptop ~]$ locale -a | grep -F en_IN
  en_IN
  en_IN.utf8
  [gunnar@gunnar-laptop ~]$

As you can see, the special case en_IN is represented by two items in
the 'locale -a' output. We ought to be able to make use of that info.

This example shows how the English locale names might be grabbed:

  [gunnar@gunnar-laptop ~]$ sh
  $ tmp=$( locale -a | grep -xvE C\|POSIX )
  $ no_enc=$( echo "$tmp" | grep -vF .utf8 )
  $ for locale in $( echo "$tmp" | grep -F .utf8 | sed 's/\.utf8//' )
  > do...


Changed in language-selector (Ubuntu):
assignee: nobody → Gunnar Hjalmarsson (gunnarhj)
Colin Watson (cjwatson) wrote :

I prefer the canonical names myself (i.e. en_IN rather than
en_IN.UTF-8), but either should be OK.

Parsing /usr/share/i18n/SUPPORTED is *easier* than parsing the
output of 'locale -a', and I think it's safer than trying to draw
inferences from details of the latter's output.

Gunnar Hjalmarsson (gunnarhj) wrote :

Copied from #ubuntu-devel, for the record:

Gunnar Hjalmarsson:
Thanks! Then, how about just replacing .utf8 with .UTF-8 to start with, and introduce parsing of .../SUPPORTED later on, if the simplistic solution proves to not suffice?

Colin Watson:
it would likely be an improvement, at least

Gunnar Hjalmarsson:
Ok, then I include .utf8 => .UTF-8 in a couple of MPs, so we get it confirmed that it's an improvement, to start with.
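
For illustration, the simple replacement agreed on above might look something like this in Python. It is a sketch under assumptions and not the actual change made to LanguageSelector/macros.py; the function name use_uppercase_utf8 is hypothetical.

```python
import re

# ".utf8" or ".UTF-8" in any case, at the end of the name or before an @variant.
_UTF8_SUFFIX_RE = re.compile(r"\.utf-?8(?=@|$)", re.IGNORECASE)


def use_uppercase_utf8(locale_name):
    """Rewrite a '.utf8' charmap suffix as '.UTF-8'; leave other names untouched."""
    return _UTF8_SUFFIX_RE.sub(".UTF-8", locale_name)
```

Names without an encoding suffix, such as en_IN, pass through unchanged, which is the case that parsing SUPPORTED would handle more precisely later on.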

Changed in language-selector (Ubuntu):
status: New → In Progress
Changed in gdm (Ubuntu):
status: New → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package gdm - 2.32.0-0ubuntu8

---------------
gdm (2.32.0-0ubuntu8) natty; urgency=low

  [ Gunnar Hjalmarsson ]
  * debian/patches/36_language_environment_settings.patch:
    - Use locale names with '.UTF-8' instead of '.utf8' when setting
      the LC_MESSAGES environment variable (LP: #666565).
  * debian/patches/40_one_lang_option_per_translation.patch:
    - Modification of /usr/share/gdm/language-options so an absent
      translation directory won't cause it to exit.
 -- Evan Dandrea <email address hidden> Mon, 14 Feb 2011 15:53:38 +0000

Changed in gdm (Ubuntu):
status: In Progress → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package language-selector - 0.13

---------------
language-selector (0.13) natty; urgency=low

  [ Gunnar Hjalmarsson ]
  * LanguageSelector/gtk/GtkLanguageSelector.py:
    - Ensure that main or origin country is included when country
      specific options for a language are shown (LP: #710148).
    - Do not let an absent translation directory make the program crash
      (LP: #714093).
  * data/LanguageSelector.ui:
    - Shorter label to describe the second tab (LP: #709855).
  * LanguageSelector/macros.py:
    - Use locale names with '.UTF-8' instead of '.utf8' when setting
      LC_* or LANG environment variables (LP: #666565, #700619).
      Thanks to Lauri Tirkkonen for the patch!
 -- Evan Dandrea <email address hidden> Mon, 14 Feb 2011 16:13:04 +0000

Changed in language-selector (Ubuntu):
status: In Progress → Fix Released
Gunnar Hjalmarsson (gunnarhj) wrote :

Fixes of this bug for Lucid and Maverick are now available in official backports packages. To make Synaptic check for backports updates you can do:

o System -> Administration -> Update Manager -> Settings...

o Select the "Updates" tab and check the "Unsupported updates" option.

More about Ubuntu backports:
https://help.ubuntu.com/community/UbuntuBackports

Martin Pitt (pitti) wrote :

Not going to apply a large Ubuntu specific patch for this in langpack-locales. This should get fixed in upstream glibc or not at all IMHO.

Changed in langpack-locales (Ubuntu):
status: New → Won't Fix
David Planella (dpm) wrote :

From the latest comments, I'm unsure about the status. Is there anything else needed to fix this bug?

Changed in ubuntu-translations:
status: Triaged → Incomplete
David Planella (dpm) on 2012-10-18
Changed in ubuntu-translations:
importance: High → Low
Lauri Tirkkonen (lotheac) wrote :

This was fixed in language-selector, which is what I originally reported it against. I'm not sure why it's marked as affecting ubuntu-translations.

David Planella (dpm) wrote :

We track all i18n and l10n bugs under the ubuntu-translations project to keep a better overview of them. Thanks a lot for the feedback; I've marked it as Fix Released there.

Changed in ubuntu-translations:
status: Incomplete → Fix Released
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in eglibc (Ubuntu):
status: New → Confirmed
Changed in vim (Ubuntu):
status: New → Confirmed