Cannot display GB2312/GB18030-encoded Chinese files

Bug #819714 reported by An Yang
This bug affects 4 people
Affects         Status   Importance  Assigned to  Milestone
gedit           New      Undecided   Unassigned
gedit (Ubuntu)  Triaged  High        Unassigned
Nominated for Precise by Eric Miao

Bug Description

GB2312 and GB18030 are national standard encodings in China, so gedit should support them.

An Yang (euroford) wrote :

My locale settings:

LANG=zh_CN.UTF-8
LANGUAGE=zh_CN:en_US:en
LC_CTYPE="zh_CN.UTF-8"
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES=zh_CN.UTF-8
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

An Yang (euroford) wrote :

The content of gb18030.txt should look like this picture.

Kyle Nitzsche (knitzsche) wrote :

Hi,

Is it possible that the characters do not display simply because there is no font installed that provides a glyph for the code point?

(Looking in the Character Map application, I notice that many Chinese characters appear to have no glyph; for example, on natty <UFACE> has a glyph but <UFACF> does not.)

An Yang (euroford) wrote :

On Tue, 2011-08-02 at 19:41 +0000, Kyle Nitzsche wrote:

> Hi,
>
> Is it possible that the characters do not display simply because there
> is no font installed that provides a glyph for the code point?

Sure, it's possible.
But the wqy fonts include all glyphs in CJK Unified Ideographs and Extension B.
That is enough according to China's national standards.

In this bug, gedit displays none of the characters if the file is GB18030-encoded.
GB18030 is the encoding standard in China, so I think it's a fatal bug.

>
> (I notice when looking in the character map application, there are many
> Chinese characters that appear to have no glyph, for example <UFACE> has
> a glyph but <UFACF> does not (natty).
>

An Yang (euroford) wrote :

Sorry, I meant that the wqy fonts include all glyphs in CJK Unified Ideographs and Extension A.
CJK Unified Ideographs Extensions B and C are optional.

An Yang (euroford) wrote :

Gedit itself can indeed display all GB18030-encoded files; it supports the GB18030 encoding very well.
But gedit in Ubuntu cannot do this :-(

Sebastien Bacher (seb128) wrote :

The file opens fine on my Oneiric installation if I do this:
- run gedit
- click open
- select "add" in the encoding combo and add GB18030
- select that encoding in the combo
- select the file

It then renders as in the screenshot.

Is there a way to detect programmatically that a file is GB18030? How do other editors deal with that example?

An Yang (euroford) wrote :

Sebastien,

Yes, you are right: if users know the encoding of the file, they can open it with gedit.
But not all of them know what the encoding is, so automatic detection is the most common use case.

Gedit has an auto-detect sequence recorded in gconf; the correct value is [CURRENT,GB18030,GBK,GB2312,UTF-6,UTF-16]. You can try this configuration.

An Yang (euroford) wrote :

Sorry, that was a typo; the correct value is

[CURRENT,GB18030,GBK,GB2312,UTF-8,UTF-16]
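
To illustrate how an ordered auto-detect list like this one behaves, here is a minimal Python sketch. It assumes the detector simply accepts the first candidate that decodes the bytes without error; gedit's real logic may differ. The name `decode_with_candidates` and the sample text are made up for illustration, and CURRENT (the locale's own encoding) is left out so the result does not depend on the machine.

```python
def decode_with_candidates(data, candidates):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    Assumption: "first strict decode that succeeds wins". This is only a
    model of an ordered auto-detect list, not gedit's actual code.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate rejected the bytes, try the next one
    raise ValueError("no candidate encoding could decode the data")


# Order proposed for zh_CN in this report, minus the locale-dependent CURRENT.
ZH_CN_ORDER = ["gb18030", "gbk", "gb2312", "utf-8", "utf-16"]

original = "中文编码测试"
sample = original.encode("gb18030")  # stand-in for the bytes of a GB18030 file

encoding, text = decode_with_candidates(sample, ZH_CN_ORDER)
print(encoding, text == original)
```

With GB18030 near the front of the list, the sample bytes are decoded by the right codec before any UTF candidate is tried.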

Sebastien Bacher (seb128) wrote :

Can the encoding be detected automatically, though? The gconf key you list is per locale and suggests that a Chinese install should try the GB encodings before UTF-8, so the example should open fine?

An Yang (euroford) wrote :

/usr/share/gconf/schemas/gedit.schemas can set a different value for the auto_detected key depending on the locale settings.

For example, when LANG=zh_CN, the following is set:
<locale name="zh_CN">
        <default>[CURRENT,GB18030,GBK,GB2312,UTF-8,UTF-16]</default>
</locale>

The postinst script of gedit:
if [ "$1" = "configure" ]; then
        gconf-schemas --register gedit-file-browser.schemas gedit.schemas
fi

An Yang (euroford) wrote :

So I guess that if the LANG environment variable is set to zh_CN, you can create a release CD with the right gedit settings.

Sebastien Bacher (seb128) wrote :

Right, but this is getting confusing. What issue are you trying to solve, or what are you asking for here? gedit should already do the right thing under a zh_CN locale and open files in the GB encodings, which are ranked before UTF for that locale.

An Yang (euroford) wrote :

Sebastien,

I just want to contribute to the Qin-ubuntu project (a Chinese locale edition of Ubuntu), but I do not know who the right person to notify about this problem is: Martin Pitti or somebody else?

And of course, this bug should affect any other localized edition of Ubuntu; I hope the people there will notice the problem.

An Yang (euroford) wrote :

The default value of auto_detected is [UTF-8,CURRENT,ISO-8859-15,UTF-16], no matter which language is used.
I think something is wrong in Ubuntu, but I don't know who should be involved.
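
One possible explanation for why that default order fails, sketched in Python under the assumption that the first candidate to decode without error is accepted (gedit's real detection may apply extra checks): GB18030 bytes are rejected by UTF-8, but ISO-8859-15 maps every byte value to some character, so it "succeeds" with garbage and no GB encoding is ever tried. `first_clean_decode` is a hypothetical helper, and CURRENT is taken to be UTF-8, as in a zh_CN.UTF-8 locale.

```python
def first_clean_decode(data, candidates):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    Assumption: a simple "first success wins" model, not gedit's real code.
    """
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # rejected, fall through to the next candidate
    raise ValueError("no candidate encoding could decode the data")


# [UTF-8,CURRENT,ISO-8859-15,UTF-16] with CURRENT assumed to be UTF-8.
DEFAULT_ORDER = ["utf-8", "utf-8", "iso-8859-15", "utf-16"]

sample = "中文".encode("gb18030")  # the four bytes D6 D0 CE C4

encoding, text = first_clean_decode(sample, DEFAULT_ORDER)
print(encoding, text)  # the text is Latin-9 mojibake, not the original Chinese
```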

Sebastien Bacher (seb128) wrote :

is the issue specific to the liveCD or also on the installed system? What version of Ubuntu do you use?

An Yang (euroford) wrote :

I tested releases from lucid to natty, both CD and DVD editions, both x86 and x86_64; all of them have this bug.

Sebastien Bacher (seb128) wrote :

is the issue specific to liveCD sessions or also on the installed system?

An Yang (euroford) wrote :

both of them.

ZhengPeng Hou (zhengpeng-hou) wrote :
An Yang (euroford) wrote :

Hi Hou,

I just tested the default setting in the gedit package, [CURRENT,GB18030,GBK,GB2312,UTF-8,UTF-16].

All GB18030, GBK, GB2312, UTF-8 and UTF-16 characters can be displayed.

[UTF-8,CURRENT,ISO-8859-15,UTF-16,GB2312,GBK,GB18030]
Your config may have a problem. Did you test the case where the file contains GB18030 characters that are not in GB2312?
I'm not sure.
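
To test that case, one needs sample text that GB18030 can represent but GB2312 (or GBK) cannot. A small Python sketch using the standard codecs follows; the two sample characters are arbitrary picks for illustration, not taken from gb18030.txt. 龍 (U+9F8D) is a traditional form covered by GBK but not by GB2312, and U+20000 is a CJK Extension B character that GB18030 encodes as four bytes.

```python
def can_decode(data, encoding):
    """Return True if the bytes decode strictly under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


gbk_only = "龍".encode("gb18030")            # two-byte code outside GB2312
four_byte = "\U00020000".encode("gb18030")   # four-byte code outside GBK

for label, data in (("GBK-range", gbk_only), ("four-byte", four_byte)):
    results = {enc: can_decode(data, enc) for enc in ("gb2312", "gbk", "gb18030")}
    print(label, len(data), results)
```

A test file containing both kinds of characters would show whether a given detection order really reaches GB18030 rather than stopping at a narrower GB codec.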

Kyle Nitzsche (knitzsche) wrote :

I just tested opening the gb18030.txt file in oneiric alpha3. Here are my findings:

For auto detect of a new encoding to work in gedit, one must do two things:
 * Add the encoding in gedit (Open > Character Encodings > Add/Remove > add desired encoding)
   - Note that after this step the value of Character Encoding is still "Automatically Detected". This will not work to open the file yet.
 * Set Character Encoding specifically to the encoding you added, and open the file

After this, a Character Encoding of "Automatically Detected" works.

So, perhaps the fix is to change the Character Encoding widget to select the encoding that one just added (instead of remaining at "Automatically Detected") for this ONE open and then, perhaps, revert to "Automatically Detected".

Sebastien Bacher (seb128) wrote :

I'm not sure how gedit 3 is supposed to work; it seems the old gconf key that holds the encoding order was deprecated.

Changed in gedit (Ubuntu):
importance: Undecided → High
Changed in gedit (Ubuntu Oneiric):
importance: Undecided → High
Sebastien Bacher (seb128) wrote :

OK, in fact the settings are still there. Could you run this on Oneiric with a Chinese installation:
gsettings get org.gnome.gedit.preferences.encodings auto-detected

Changed in gedit (Ubuntu Oneiric):
status: New → Incomplete
Sebastien Bacher (seb128) wrote :

The key should be set to something similar to what was pointed out before, i.e. "[CURRENT,GB18030,GBK,GB2312,UTF-8,UTF-16]", so that the GB18030 encoding is tried before the UTF ones.
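
A small helper for checking the output of that gsettings command, sketched in Python. It assumes the value is printed in GVariant text form, e.g. `['UTF-8', 'CURRENT', ...]`, which for a plain list of strings also parses as a Python literal. The function name and the sample strings are illustrative, not real output from a Chinese installation.

```python
import ast


def gb_tried_before_utf(gsettings_output):
    """Return True if some GB* encoding appears before every UTF-* encoding.

    Assumption: the input looks like "['CURRENT', 'GB18030', 'UTF-8']",
    i.e. a non-empty list of quoted strings as printed by `gsettings get`.
    """
    names = [str(n).upper() for n in ast.literal_eval(gsettings_output.strip())]
    gb = [i for i, n in enumerate(names) if n.startswith("GB")]
    utf = [i for i, n in enumerate(names) if n.startswith("UTF")]
    if not gb:
        return False  # no GB encoding listed at all
    return not utf or min(gb) < min(utf)


expected = "['CURRENT', 'GB18030', 'GBK', 'GB2312', 'UTF-8', 'UTF-16']"
reported_default = "['UTF-8', 'CURRENT', 'ISO-8859-15', 'UTF-16']"

print(gb_tried_before_utf(expected))
print(gb_tried_before_utf(reported_default))
```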

Gary Ekker (gekker)
tags: added: qin
Changed in gedit (Ubuntu):
status: New → Incomplete
Eric Miao (eric.y.miao) wrote :

Well, I'd say an ideal solution would be for gedit to detect the encoding by itself, and thus avoid all these tricky configurations. I've experimented a bit with universalchardet, which comes from the Mozilla project, and its standalone library, libuchardet. I found it to be smart enough in most cases. Attached is a preliminary patch that adds uchardet support to gedit, as an early preview.

I'll come up with a testing package a bit later.

Eric Miao (eric.y.miao) wrote :

I've uploaded testing packages to http://people.canonical.com/~ycmiao/lp819714/, please help test.

Note that the packages are for precise, and libuchardet0 needs to be installed first.

  $> sudo apt-get install libuchardet0
  $> sudo dpkg --install gedit*~lp819714_*.deb

Ubuntu Foundations Team Bug Bot (crichton) wrote :

The attachment "uchardet.diff" of this bug report has been identified as being a patch. The ubuntu-reviewers team has been subscribed to the bug report so that they can review the patch. In the event that this is in fact not a patch you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are member of the ubuntu-reviewers team please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch
Ma Hsiao-chun (mahsiaochun) wrote :
Changed in gedit (Ubuntu):
status: Incomplete → Confirmed
Changed in gedit (Ubuntu Oneiric):
status: Incomplete → Confirmed
tags: added: precise quantal raring
Ma Hsiao-chun (mahsiaochun) wrote :

This problem is partially worked around by a not-so-recent translation change upstream.
https://git.gnome.org/browse/gedit/tree/po/zh_CN.po#n424

That translation is already included in the 3.6.1 tarball, but unfortunately Ubuntu 12.10, even though it claims to ship gedit 3.6.1, doesn't seem to have picked it up from upstream.

Alberto Salvia Novella (es20490446e) wrote :

Since this bug:

- Is valid.
- Is well described.
- Is reported in the upstream project.
- Is ready to be worked on by a developer.

It's already triaged.

Changed in gedit (Ubuntu):
status: Confirmed → Triaged
Alberto Salvia Novella (es20490446e) wrote :

Oneiric reached EOL.

Changed in gedit (Ubuntu Oneiric):
status: Confirmed → Won't Fix
no longer affects: gedit (Ubuntu Oneiric)