Auto-detect of Japanese Character Encoding does not work

Bug #217613 reported by Jun Kobayashi
6
Affects Status Importance Assigned to Milestone
Mozilla Firefox
Fix Released
High
firefox-3.0 (Ubuntu)
Fix Released
Undecided
Unassigned
Nominated for Hardy by Jun Kobayashi

Bug Description

Binary package hint: firefox-3.0

Auto detectiong for Japanese character encodings does not work on Firefox 3 Beta even if this function works well on Firefox 2.
This is hassle for users who browse Japanese web pages.

A patch to fix this problem is already in upstream bug tracker.

Revision history for this message
In , Masa141421356 (masa141421356) wrote :

At this case, both of HTTP response header and META in html contains only
"Content-Type: text/html", "charset" is not contained.

Revision history for this message
In , Alice0775 (alice0775) wrote :

Regression window is :
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5pre) Gecko/2008032512 :
work fine
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5pre) Gecko/2008032513 : NG

Revision history for this message
In , M-wada (m-wada) wrote :

Confirmed with Fx trunk 2008033105(Win-XP SP2).
No problem when Auto Detect/Universal, with all EUC-JP,UTF-8,Shift_JIS test pages.
> View/Character Encoding/Auto Detect/Universal
> (intl.charset.detector=universal_charset_detector)

Revision history for this message
In , Masa141421356 (masa141421356) wrote :
Revision history for this message
In , hidenosuke (hidenosuke) wrote :

I can reproduce with recent trunk Linux.
In the case of 2 URLs in comment 0 auto-detect doesn't work at all.

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9pre) Gecko/2008040120 Firefox 3.0pre

Revision history for this message
In , Vladimir Vukicevic (vvuk) wrote :

Marking blocking, based on regression... need an automated test for this.

Revision history for this message
In , hidenosuke (hidenosuke) wrote :

I tried to back the patch out of bug 424916
and it seeme to have solved this problem.
I think this is a regression of bug 424916.

Revision history for this message
In , Smontagu (smontagu) wrote :

Created attachment 313966
Patch

So based on comment 3 I think the way to fix this is to get rid of the old CJK parallel state machine detectors in intl/chardet and just use the universal detector with a language filter. The universal detector has been much better maintained, and it will remove a lot of duplicated data.

The patch is a bit large and scary, but most of it is just moving XPCOM stuff around. Since time is short, I'm requesting code review already while I work on testcases.

The patch doesn't include the cvs removes:
intl/chardet/src/Big5Statistics.h
intl/chardet/src/EUCJPStatistics.h
intl/chardet/src/EUCKRStatistics.h
intl/chardet/src/EUCTWStatistics.h
intl/chardet/src/GB2312Statistics.h
intl/chardet/src/nsBIG5Verifier.h
intl/chardet/src/nsCP1252Verifier.h
intl/chardet/src/nsEUCJPVerifier.h
intl/chardet/src/nsEUCKRVerifier.h
intl/chardet/src/nsEUCTWVerifier.h
intl/chardet/src/nsGB18030Verifier.h
intl/chardet/src/nsGB2312Verifier.h
intl/chardet/src/nsHZVerifier.h
intl/chardet/src/nsISO2022CNVerifier.h
intl/chardet/src/nsISO2022JPVerifier.h
intl/chardet/src/nsISO2022KRVerifier.h
intl/chardet/src/nsPSMDetectors.cpp
intl/chardet/src/nsPSMDetectors.h
intl/chardet/src/nsPkgInt.h
intl/chardet/src/nsSJISVerifier.h
intl/chardet/src/nsUCS2BEVerifier.h
intl/chardet/src/nsUCS2LEVerifier.h
intl/chardet/src/nsUTF8Verifier.h
intl/chardet/src/nsVerifier.h

Revision history for this message
In , Smontagu (smontagu) wrote :

Created attachment 313968
Corrected some silly typos

Revision history for this message
In , Smontagu (smontagu) wrote :

Created attachment 314152
Mochitests

Revision history for this message
In , L. David Baron (dbaron) wrote :

I knew there was another review I really meant to get to yesterday... sorry about missing this one.

I'm having trouble understanding the filtering stuff. It seems when FILTER_JAPANESE, etc., is used, the universal detector will consider all Japanese *plus* all non-CJK encodings, when we really only want the Japanese ones (plus UTF-8), or something like that. When FILTER_NONE is used, it seems like we'll consider all encodings except for CJK, when we really want to consider all of them.

Revision history for this message
In , L. David Baron (dbaron) wrote :

Er, never mind about the first problem -- I didn't read the nsUniversalDetector.cpp changes closely enough. But I still think the second problem exists, and it might help to fix it by replacing FILTER_NONE by FILTER_ALL (with at least one additional constant for FILTER_NON_CJK) -- and then you could always do Latin1 and UTF8, regardless of filter (which looks like what you intended).

Revision history for this message
In , Smontagu (smontagu) wrote :

Created attachment 314495
Address David's comments

Revision history for this message
In , Smontagu (smontagu) wrote :

Created attachment 314496
Address David's comments

The right patch this time

Revision history for this message
In , L. David Baron (dbaron) wrote :

Comment on attachment 314496
Address David's comments

r+sr=dbaron

Revision history for this message
In , Smontagu (smontagu) wrote :

Checked in.

Revision history for this message
In , Masa141421356 (masa141421356) wrote :

Works fine. Thank you.
-> Verified.

Revision history for this message
Jun Kobayashi (jkbys) wrote :

Binary package hint: firefox-3.0

Auto detectiong for Japanese character encodings does not work on Firefox 3 Beta even if this function works well on Firefox 2.
This is hassle for users who browse Japanese web pages.

A patch to fix this problem is already in upstream bug tracker.

Changed in firefox:
status: Unknown → Fix Released
Revision history for this message
Rolf Leggewie (r0lf) wrote :
Revision history for this message
Rolf Leggewie (r0lf) wrote :

Kobayashi-san, I am having some other trouble with fonts, though. Many times in Thunderbird and Firefox the display is not as good as in gutsy (mixing different fonts, for example). If you can see above pages (or force them to the correct encoding), do you get different fonts to display the page as in the attached screenshot? If this is not a configuration error on my part, I'd like to open a new bug report about that.

Revision history for this message
Alexander Sack (asac) wrote :

should be fixed everywhere now

Changed in firefox-3.0:
status: New → Fix Released
Changed in firefox:
importance: Unknown → High
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.