There seems to be a consensus that the encoding part of locale names
assigned to the LANG or LC_* environment variables should be .UTF-8
rather than .utf8. I'm currently working on language-selector and GDM
with other language/locale related matters, so I can include the
necessary changes in a couple of merge proposals in the pipeline.
Before I do so, and since I have no clear idea of the extent to which
the changes would create new issues, I'd like someone to triage the bug
with respect to language-selector and gdm (ubuntu). I'd also need help
drawing a conclusion from the reasoning below.
On 2011-01-25 12:54, Colin Watson wrote:
> ... software that parses locale strings in ways that only handle
> particular spellings of them tend to be buggy in other ways. For
> example, such buggy software can easily fail to handle LANG=en_IN as
> a UTF-8 locale, even though it's defined as such in
> /usr/share/i18n/SUPPORTED (the .UTF-8 suffix is mainly for dealing
> with locales that previously had a non-UTF-8 version, and some newer
> locales just went UTF-8 from the start).
Doesn't that point towards simply appending .UTF-8 to e.g. en_IN,
irrespective of the name according to /usr/share/i18n/SUPPORTED?
I did this test:
[gunnar@gunnar-laptop ~/sandbox]$ sh
$ cat mytest.po
msgid "hello"
msgstr "hello from India"
$ dir=/usr/share/locale/en_IN/LC_MESSAGES
$ sudo mkdir -p $dir
$ sudo msgfmt mytest.po -o $dir/mytest.mo
$ LANGUAGE=''
$ LC_MESSAGES=en_IN
$ echo $( gettext -d mytest hello )
hello from India
$ LC_MESSAGES=en_IN.utf8
$ echo $( gettext -d mytest hello )
hello from India
$ LC_MESSAGES=en_IN.UTF-8
$ echo $( gettext -d mytest hello )
hello from India
$ exit
[gunnar@gunnar-laptop ~/sandbox]$
No complaints, and gettext found the Indian 'translation' in all three
cases, so en_IN.UTF-8 seems to work. Or would that name cause other apps
to fail?
> Getting back to the original patch, the general idea seems OK to me,
> but I think it would be helpful for it to take a slightly different
> approach to implementation. Rather than just appending .UTF-8, I
> suggest searching /usr/share/i18n/SUPPORTED for a suitable match for
> the language, country, and variant which has "UTF-8" as the second
> column. That way, language-selector will always select the canonical
> user-visible name for the locale, even if it's one of the
> interesting cases such as en_IN where the canonical name doesn't have
> an encoding suffix.
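For reference, the lookup suggested above could be sketched roughly as
follows. This is only my reading of the suggestion, not code from
language-selector: the function name canonical_utf8_locale is made up
for illustration, and I assume the SUPPORTED file consists of
"NAME CHARSET" lines plus '#' comments.

```shell
#!/bin/sh
# Hypothetical helper (name invented for illustration): print the
# canonical UTF-8 locale name for a locale given without encoding,
# e.g. en_IN or ca_ES@valencia.
# $1 = locale without encoding
# $2 = SUPPORTED-style file (optional; assumed "NAME CHARSET" per line)
canonical_utf8_locale() {
    awk -v want="$1" '
        /^#/ || NF < 2 { next }         # skip comments and blank lines
        $2 == "UTF-8" {
            name = $1
            sub(/\.[^@]*/, "", name)    # drop ".CODESET", keep "@variant"
            if (name == want) { print $1; exit }
        }
    ' "${2:-/usr/share/i18n/SUPPORTED}"
}
```

With the entries quoted in this thread, 'canonical_utf8_locale en_IN'
should print en_IN, while a locale that also has a non-UTF-8 version
would come back with the .UTF-8 suffix. Nothing is printed if no UTF-8
entry matches.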
Even if we went for the canonical names, I don't think it's
necessary to parse /usr/share/i18n/SUPPORTED.
[gunnar@gunnar-laptop ~]$ locale -a | grep -F en_IN
en_IN
en_IN.utf8
[gunnar@gunnar-laptop ~]$
As you can see, the special case en_IN is represented by two items in
the 'locale -a' output. We ought to be able to make use of that info.
This example shows how the English locale names might be grabbed:
[gunnar@gunnar-laptop ~]$ sh
$ tmp=$( locale -a | grep -xvE C\|POSIX )
$ no_enc=$( echo "$tmp" | grep -vF .utf8 )
$ for locale in $( echo "$tmp" | grep -F .utf8 | sed 's/\.utf8//' )
> do
> if ! expr $locale : en > /dev/null ; then
> continue
> elif expr "$no_enc" : .*$locale > /dev/null ; then
> echo $locale
> else
> echo $( echo $locale | sed -r 's/([^@]+)/\1.UTF-8/' )
> fi
> done
en_AG
en_AU.UTF-8
en_BW.UTF-8
en_CA.UTF-8
en_DK.UTF-8
en_GB.UTF-8
en_HK.UTF-8
en_IE.UTF-8
en_IN
en_NG
en_NZ.UTF-8
en_PH.UTF-8
en_SG.UTF-8
en_US.UTF-8
en_ZA.UTF-8
en_ZW.UTF-8
$ exit
[gunnar@gunnar-laptop ~]$
As you can see, English locale names for Antigua/Barbuda and Nigeria are
the same kind of special cases as en_IN.
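The same idea can be written as a small filter that is not limited to
English. This is a minimal sketch, not a tested patch: the function
name preferred_locale_names is made up, it reads 'locale -a'-style
names on stdin, and it assumes (as on my system) that a name listed
without encoding next to its .utf8 twin is itself a UTF-8 locale. On a
system with legacy non-UTF-8 locales generated, that assumption would
not hold. Unlike the loop above, it compares whole names instead of
substrings.

```shell
#!/bin/sh
# Hypothetical helper (name invented for illustration): read
# `locale -a`-style names on stdin and print one preferred name per
# .utf8 entry: the bare name if it is also listed without encoding,
# otherwise the name with ".utf8" rewritten to ".UTF-8".
preferred_locale_names() {
    awk '
        $0 == "C" || $0 == "POSIX" || $0 == "" { next }
        /\.utf8/ { utf8[++n] = $0; next }   # remember .utf8 entries in order
        { bare[$0] = 1 }                    # names listed without encoding
        END {
            for (i = 1; i <= n; i++) {
                base = utf8[i]
                sub(/\.utf8/, "", base)     # en_GB.utf8 -> en_GB
                if (base in bare) {
                    print base              # special case such as en_IN
                } else {
                    name = utf8[i]
                    sub(/\.utf8/, ".UTF-8", name)
                    print name
                }
            }
        }
    '
}
```

It would be used as 'locale -a | preferred_locale_names', with a grep
for '^en' appended to get just the English names.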