Comment 17 for bug 666565

Gunnar Hjalmarsson (gunnarhj) wrote :

There seems to be consensus that the encoding part of locale names
assigned to the LANG or LC_* environment variables should be .UTF-8
rather than .utf8. I'm currently working on language-selector and GDM
for other language/locale related matters, so I can include the
necessary changes in a couple of merge proposals in the pipeline.
Before I do so, and since I have no idea of my own to what extent the
changes would create new issues, I'd like someone to triage the bug
with respect to language-selector and gdm (ubuntu). I'd also need help
drawing a conclusion from the reasoning below.

On 2011-01-25 12:54, Colin Watson wrote:
> ... software that parses locale strings in ways that only handle
> particular spellings of them tend to be buggy in other ways. For
> example, such buggy software can easily fail to handle LANG=en_IN as
> a UTF-8 locale, even though it's defined as such in
> /usr/share/i18n/SUPPORTED (the .UTF-8 suffix is mainly for dealing
> with locales that previously had a non-UTF-8 version, and some newer
> locales just went UTF-8 from the start).

Doesn't that point towards simply appending .UTF-8 to e.g. en_IN,
irrespective of the name according to /usr/share/i18n/SUPPORTED?
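
To make the "simply append" idea concrete, here is a minimal sketch. The
function name with_utf8_suffix is made up for illustration and is not
part of language-selector. It assumes locale names of the form
lang_COUNTRY[.codeset][@modifier] and replaces whatever codeset is
there with .UTF-8:

```shell
#!/bin/sh
# Hypothetical helper: force the encoding part of a locale name to .UTF-8.
# Any existing codeset (.utf8, .ISO-8859-1, ...) is replaced, and an
# @modifier, if present, is kept after the codeset.
with_utf8_suffix() {
    printf '%s\n' "$1" |
        sed -E 's/^([^.@]+)(\.[^@]*)?(@.*)?$/\1.UTF-8\3/'
}

with_utf8_suffix en_IN
with_utf8_suffix en_IN.utf8
with_utf8_suffix ca_ES.utf8@valencia
```

The three calls should give en_IN.UTF-8, en_IN.UTF-8 and
ca_ES.UTF-8@valencia, i.e. the suffix is added regardless of how the
locale is listed in /usr/share/i18n/SUPPORTED.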

I did this test:

  [gunnar@gunnar-laptop ~/sandbox]$ sh
  $ cat mytest.po
  msgid "hello"
  msgstr "hello from India"
  $ dir=/usr/share/locale/en_IN/LC_MESSAGES
  $ sudo mkdir -p $dir
  $ sudo msgfmt mytest.po -o $dir/mytest.mo
  $ LANGUAGE=''
  $ LC_MESSAGES=en_IN
  $ echo $( gettext -d mytest hello )
  hello from India
  $ LC_MESSAGES=en_IN.utf8
  $ echo $( gettext -d mytest hello )
  hello from India
  $ LC_MESSAGES=en_IN.UTF-8
  $ echo $( gettext -d mytest hello )
  hello from India
  $ exit
  [gunnar@gunnar-laptop ~/sandbox]$

No complaints, and gettext found the Indian 'translation' in all three
cases, so en_IN.UTF-8 seems to work. Or would that name cause other apps
to fail?

> Getting back to the original patch, the general idea seems OK to me,
> but I think it would be helpful for it to take a slightly different
> approach to implementation. Rather than just appending .UTF-8, I
> suggest searching /usr/share/i18n/SUPPORTED for a suitable match for
> the language, country, and variant which has "UTF-8" as the second
> column. That way, language-selector will always select the canonical
> user-visible name for the locale, even if it's one of the
> interesting cases such as en_IN where the canonical name doesn't have
> an encoding suffix.

Even if we went for the canonical names, I don't think it's necessary
to parse /usr/share/i18n/SUPPORTED.

  [gunnar@gunnar-laptop ~]$ locale -a | grep -F en_IN
  en_IN
  en_IN.utf8
  [gunnar@gunnar-laptop ~]$

As you can see, the special case en_IN is represented by two items in
the 'locale -a' output. We ought to be able to make use of that info.
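
For a single locale, the lookup could be sketched roughly like this.
The function name utf8_name_for is hypothetical, and the sketch assumes
that a bare name listed next to a .utf8 entry (as with en_IN) denotes a
UTF-8 locale, which holds for the output above but is not checked:

```shell
#!/bin/sh
# Hypothetical helper: print the name to use for locale base $1, given
# `locale -a`-style lines on stdin. If the bare name is listed in addition
# to the .utf8 entry (like en_IN), the bare name is printed; if only the
# .utf8 entry is listed, the name is printed with .UTF-8 inserted.
# Note: plain sh has no local variables, so base/list/lang/mod are global.
utf8_name_for() {
    base=$1
    list=$(cat)
    # Split off an optional @modifier: sr_RS@latin -> sr_RS + @latin
    lang=${base%%@*}
    mod=${base#"$lang"}
    if ! printf '%s\n' "$list" | grep -qxF "$lang.utf8$mod"; then
        return 1    # no .utf8 entry listed for this name
    elif printf '%s\n' "$list" | grep -qxF "$base"; then
        printf '%s\n' "$base"
    else
        printf '%s\n' "$lang.UTF-8$mod"
    fi
}

# Sample input standing in for real `locale -a` output
printf '%s\n' en_GB.utf8 en_IN en_IN.utf8 | utf8_name_for en_IN
printf '%s\n' en_GB.utf8 en_IN en_IN.utf8 | utf8_name_for en_GB
```

With the sample input this should print en_IN and en_GB.UTF-8. On a
real system the input would be `locale -a | utf8_name_for en_IN`.
Matching whole lines with grep -x avoids accidental substring hits.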

This example shows how the English locale names might be grabbed:

  [gunnar@gunnar-laptop ~]$ sh
  $ tmp=$( locale -a | grep -xvE C\|POSIX )
  $ no_enc=$( echo "$tmp" | grep -vF .utf8 )
  $ for locale in $( echo "$tmp" | grep -F .utf8 | sed 's/\.utf8//' )
  > do
  > if ! expr $locale : en > /dev/null ; then
  > continue
  > elif expr "$no_enc" : .*$locale > /dev/null ; then
  > echo $locale
  > else
  > echo $( echo $locale | sed -r 's/([^@]+)/\1.UTF-8/' )
  > fi
  > done
  en_AG
  en_AU.UTF-8
  en_BW.UTF-8
  en_CA.UTF-8
  en_DK.UTF-8
  en_GB.UTF-8
  en_HK.UTF-8
  en_IE.UTF-8
  en_IN
  en_NG
  en_NZ.UTF-8
  en_PH.UTF-8
  en_SG.UTF-8
  en_US.UTF-8
  en_ZA.UTF-8
  en_ZW.UTF-8
  $ exit
  [gunnar@gunnar-laptop ~]$

As you can see, the English locale names for Antigua/Barbuda and Nigeria
are the same kind of special case as en_IN.
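
For comparison, the SUPPORTED lookup suggested in the quoted text could
be sketched as below. The function name canonical_from_supported is
hypothetical, and the sketch assumes the file has whitespace-separated
lines such as "en_IN UTF-8" or "en_GB.UTF-8 UTF-8", with the charset in
the second column:

```shell
#!/bin/sh
# Hypothetical helper: find the canonical UTF-8 name for locale base $1 in
# a SUPPORTED-style file ($2, default /usr/share/i18n/SUPPORTED).
# Assumed line format: "<locale name> <charset>", e.g. "en_IN UTF-8".
# Returns non-zero if no UTF-8 entry matches.
canonical_from_supported() {
    awk -v want="$1" '
        $2 == "UTF-8" {
            name = $1
            sub(/\.[^@]*/, "", name)   # drop the codeset, keep any @modifier
            if (name == want) { print $1; found = 1; exit }
        }
        END { exit !found }
    ' "${2:-/usr/share/i18n/SUPPORTED}"
}
```

Called as `canonical_from_supported en_IN`, it should print en_IN on a
system where SUPPORTED lists that locale without a suffix, and a name
like en_GB.UTF-8 for the ordinary cases. Whether this or the
'locale -a' approach is preferable is the question I'd like input on.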