Default charsets handling for Windows archives in CJKV+th locale
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| unzip (Debian) | Confirmed | Unknown | | |
| unzip (Ubuntu) | | Medium | Unassigned | |
Bug Description
With the current unzip package in Ubuntu, we need to specify the charset explicitly to extract zip files sent from localized Windows systems.
For example, for zip files sent from a Japanese-localized Windows system:
$ zipinfo -O CP932 sent-from-
$ unzip -O CP932 sent-from-
This method does not work for GUI applications such as file-roller, because users have no way to specify the charset from the GUI.
The attached branch adds default charset handling for Windows archives in CJKV+th locales, inspired by the Ubuntu Kylin approach.
As a result of bug #580961, two options have been added as an Ubuntu patch:
> -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
> -I CHARSET specify a character encoding for UNIX and other archives
Then Ubuntu Kylin added default encoding as environment variables for their distribution.
http://
In Ubuntu, we can go further with a better approach:
- per-user settings derived from each user's locale instead of global settings
- no interference with other locales, thanks to a locale guard
I only add "-O", so there is no behavior change for zip files created on Ubuntu or other Linux/UNIX systems. This branch just handles zip files created on localized Windows systems seamlessly.
The charset list is taken from:
https:/
and
msdos/msdos.c in unzip package:
case 932: /* Japanese */
case 949: /* Korean */
case 936: /* Chinese, simple */
case 950: /* Chinese, traditional */
case 874: /* Thai */
case 1258: /* Vietnamese */
(Copied from @nobuto's branch description.)
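The per-locale fallback described above can be sketched roughly as follows. This is a minimal illustration, not the script from the attached branch: the function name `fallback_charset`, the exact locale patterns, and the choice to leave an already-set `UNZIP`/`ZIPINFO` untouched are all assumptions made for the example. Only the code page numbers come from the list above.

```shell
#!/bin/sh
# Sketch of a profile.d-style fallback; NOT the script from the branch.

# Hypothetical helper: print the Windows code page for a locale name,
# or print nothing when the locale is outside the CJKV+th set.
fallback_charset() {
    case "$1" in
        ja_JP*) echo CP932 ;;   # Japanese
        ko_KR*) echo CP949 ;;   # Korean
        zh_CN*) echo CP936 ;;   # Chinese, simplified
        zh_TW*) echo CP950 ;;   # Chinese, traditional
        th_TH*) echo CP874 ;;   # Thai
        vi_VN*) echo CP1258 ;;  # Vietnamese
        *) ;;                   # locale guard: other locales get no default
    esac
}

# Effective LC_CTYPE: LC_ALL overrides LC_CTYPE, which overrides LANG.
ctype_locale=${LC_ALL:-${LC_CTYPE:-${LANG:-}}}
charset=$(fallback_charset "$ctype_locale")

# Assumption: a value the user has already exported wins over the fallback.
if [ -n "$charset" ]; then
    [ -n "${UNZIP:-}" ] || export UNZIP="-O $charset"
    [ -n "${ZIPINFO:-}" ] || export ZIPINFO="-O $charset"
fi
```

Because only "-O" is set, archives created on UNIX-like systems would be unaffected, and a user in, say, an en_US locale gets no default at all.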
Related branches
- Mathieu Trudel-Lapierre: Needs Information on 2015-05-11
- Sebastien Bacher: Needs Information on 2015-02-16
- Aron Xu (community): Approve on 2015-02-15
- Diff: 65 lines (+42/-0), 3 files modified:
  - debian/changelog (+7/-0)
  - debian/profile.unzip-default-charset.sh (+32/-0)
  - debian/rules (+3/-0)
- Steve Langasek: Needs Fixing on 2015-09-04
- Aron Xu: Pending requested 2015-08-23
- Diff: 147 lines (+106/-0), 6 files modified:
  - debian/changelog (+9/-0)
  - debian/control (+1/-0)
  - debian/tests/control (+2/-0)
  - debian/tests/fallback-encoding (+57/-0)
  - debian/unzip-fallback-charset.sh (+36/-0)
  - debian/unzip.install (+1/-0)
| Changed in unzip (Ubuntu): | |
| importance: | Undecided → Medium |
| status: | New → Triaged |
| Aron Xu (happyaron) wrote : | #1 |
| description: | updated |
| Changed in unzip (Debian): | |
| status: | Unknown → Confirmed |
| Yuan Chao (yuanchao) wrote : | #2 |
It would be nice to have some auto-detect mechanism on top of this locale fallback. In my personal case, most zip files that need an explicit encoding do not use the encoding that corresponds to my locale setting.
| Nobuto Murata (nobuto) wrote : | #3 |
@Yuan,
My patch refers to LC_CTYPE first, so you can, for example, set LC_CTYPE and LC_MESSAGES to different locales. And of course you can manually export the UNZIP and ZIPINFO variables in your ~/.profile. I understand my patch is a short-term workaround.
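A minimal sketch of the manual override mentioned above, assuming most of the archives you receive come from Japanese Windows systems. CP932 is only an example value; the "-O" option is the one added by the Ubuntu patch described in this bug.

```shell
# ~/.profile fragment: default options picked up by unzip and zipinfo.
# CP932 is an example; substitute the code page you receive most often.
export UNZIP="-O CP932"
export ZIPINFO="-O CP932"
```

An explicit "-O" on the command line can still be given for archives in another encoding.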
FWIW, unar supports encoding auto-detection, but unzip does not. You can see the auto-detection result with:
$ sudo apt-get install unar
$ lsar -l -pe /PATH/TO/ZIPFILE
I'm not sure whether file-roller supports an unar backend.
| Yuan Chao (yuanchao) wrote : | #4 |
Dear @Nobuto,
I appreciate the patch work very much, but it simply doesn't fit my use case. Quite frequently, I get zip files with CJK file names from zh_CN and ja_JP. (My environment is either zh_TW or en_US; the latter is for my office desktop PC.) Changing LC_CTYPE to something other than UTF-8 is definitely *no good* here. This would generate more "mojibake" file names. Changing LC_MESSAGES is not necessary since I can read either. Adding support in the GUI front-end is really needed here.
My use case may not be close to that of general end users. But I know many experienced users who adopt the en_US locale to be able to use type-and-search for launching applications. (Surely this is another issue.)
Another thing I have run into before is using CJKV characters in the archive password. Not sure if this is (still) a problem? (unrar and unzip)
| Aron Xu (happyaron) wrote : | #5 |
@yuanchao, with or without this trick, running unzip would give you garbled file names, so I don't think this change would bother you as much as you describe, would it?
| Yuan Chao (yuanchao) wrote : | #6 |
Well, without this trick, the file names could be recovered with 'convmv'. But with this trick, they would be scrambled further... Still, I personally prefer auto-detection plus this fallback, or an option in a GUI such as file-roller.
| Aron Xu (happyaron) wrote : | #7 |
@yuanchao, you cannot recover the file names after extracting with unzip (because the characters are replaced by question marks), but you can when using 7zip.
| Yuan Chao (yuanchao) wrote : | #8 |
This is from one of my machines running Lubuntu:
$ export |grep LANG
declare -x LANG="en_US.UTF-8"
$ export |grep LC
declare -x LC_ADDRESS=
declare -x LC_IDENTIFICATI
declare -x LC_MEASUREMENT=
declare -x LC_MONETARY=
declare -x LC_NAME=
declare -x LC_NUMERIC=
declare -x LC_PAPER=
declare -x LC_TELEPHONE=
declare -x LC_TIME=
$ unzip -h
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
...
Use the file from here: http://
$ unzip celluloid.zip
Archive: celluloid.zip
inflating: celluloid/
inflating: celluloid/
inflating: celluloid/
inflating: celluloid/
$ unzip -O cp932 celluloid.zip
Archive: celluloid.zip
inflating: celluloid/
inflating: celluloid/せるらうど.ust
inflating: celluloid/
inflating: celluloid/
$ unzip -O cp936 celluloid.zip
Archive: celluloid.zip
inflating: celluloid/
inflating: celluloid/偣傞傜偆偳.ust
inflating: celluloid/
inflating: celluloid/
$ unzip -O cp950 celluloid.zip
Archive: celluloid.zip
inflating: celluloid/
inflating: celluloid/
inflating: celluloid/
inflating: celluloid/
Another file from here http://
$ unzip -L 王妃.zip
Archive: 王妃.zip
inflating: ═їх·_a.ust
inflating: ═їх·_b.ust
$ unzip -O cp932 王妃.zip
Archive: 王妃.zip
inflating: ヘ銈A.ust
inflating: ヘ銈B.ust
$ unzip -O cp936 王妃.zip
Archive: 王妃.zip
inflating: 王妃_A.ust
inflating: 王妃_B.ust
$ unzip -O cp950 王妃.zip
Archive: 王妃.zip
inflating: 卼漦_A.ust
inflating: 卼漦_B.ust
Actually, not all the wrong cases map to illegal UTF-8 strings (question marks). I guess that is why auto-detection is not so straightforward?
| Nobuto Murata (nobuto) wrote : | #9 |
@Yuan,
Take the "王妃.zip" you posted, for example: it has short file names in the archive. Even unar/lsar fails to detect its encoding (you expect CP932, but lsar reports ISO-8859-8). Auto-detection of encoding is not 100% reliable, especially with short file names (fewer hints for the encoding detector).
====
$ lsar -l -pe 王妃.zip
王妃.zip: Zip
Flags File size Ratio Mode Date Time Name
===== ========== ===== ==== ========== ===== ====
0. ----- 40344 82.9% Defl 2014-10-03 13:40 %cd%f5%e5%fa_A.ust
1. ----- 20311 80.4% Defl 2014-10-03 13:40 %cd%f5%e5%fa_B.ust
(Flags: D=Directory, R=Resource fork, L=Link, E=Encrypted, @=Extended attributes)
(Mode: Defl=Deflate)
Encoding: ISO-8859-8 (76% confidence)
====
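To illustrate why a short name gives a detector so little to work with, here is a small sketch that decodes the four raw bytes from the lsar listing (%cd%f5%e5%fa) under a few candidate charsets with iconv. The helper names are hypothetical, and the iconv charset names are assumptions standing in for the code pages discussed here (GBK for cp936, BIG5 for cp950, SHIFT_JIS for cp932); iconv's tables may not match "unzip -O" byte for byte.

```shell
# The four raw bytes lsar printed as %cd%f5%e5%fa, written as octal escapes.
raw_name() { printf '\315\365\345\372'; }

# Hypothetical helper: decode the bytes as the given charset and print the
# result as UTF-8; returns non-zero when the bytes are invalid in that charset.
decode_as() { raw_name | iconv -f "$1" -t UTF-8 2>/dev/null; }

for cs in GBK BIG5 SHIFT_JIS; do
    if name=$(decode_as "$cs"); then
        printf '%s: %s\n' "$cs" "$name"
    else
        printf '%s: invalid byte sequence\n' "$cs"
    fi
done
```

The GBK line should match the cp936 output shown earlier in this thread. Any other charset that also decodes without error remains a plausible candidate, so "decodes cleanly" alone cannot pick the right encoding.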
Anyway, enabling auto-detection or specifying the encoding in file-roller is out of scope for this bug report. Please open separate bugs if needed. I would like to proceed with the fallback setting in the attached branch for vivid.
Any progress?
| Sebastien Bacher (seb128) wrote : | #11 |
It seems that no Ubuntu developers feel like reviewing those changes. It would be good to get them reviewed upstream and/or in Debian...
| Nobuto Murata (nobuto) wrote : | #12 |
I have sent an enhancement request to upstream through http://
Putting a copy of the request here for your reference.
====
This is an enhancement request. Thanks to ICONV_MAPPING (the -O/-I options), we can specify the character encoding when extracting zip files. However, in combination with a GUI application (e.g. file-roller on Linux), there is no way for a user to specify -I/-O. Therefore we cannot properly extract zip files created on localized Windows systems with a GUI.
A workaround would be exporting UNZIP and ZIPINFO variables with "-O <local charset on Windows>" per locale on login by putting [1] under /etc/profile.d/.
It would be nice if unzip had fallback charset mapping per locale out of the box. I have created a test case to handle 3 types of zip files in ja_JP locale.
[2] http://
(without [1], 3rd test case, fat and CP932, will fail.)
$ unzip -v
UnZip 6.1c19-BETA (2015-04-15) by Info-ZIP. Maintainer: Steven M. Schweda
Copyright (c) 1990-2015 Info-ZIP. For software license: unzip --license
See README for details. More info: http://
Compiled with GCC 4.9.2 for Unix (GNU/Linux x86_64).
UnZip special compilation options:
SYMLINKS (Symbolic links supported, if RTL and file sys do)
TIMESTAMP (Restoring file timestamps supported)
UNIXBACKUP (-B creates backup files)
Traditional Zip Encryption notice:
The traditional zip encryption code of this program is not
written in Europe, and, to the best of our knowledge, can be freely
...
| Iain Lane (laney) wrote : | #13 |
Did upstream say anything?
What is "GBK" that Kylin uses and why is it different from the one we have here?
Sorry for being clueless. :)
| Aron Xu (happyaron) wrote : | #14 |
Upstream won't apply such a behavior as they regard it as locale hacks.
GBK is a superset of cp936, but it is not so large that it covers portions of UTF-8 (so it can be detected reliably, unlike GB18030). From this point of view, GBK is a better choice than cp936.
| Nobuto Murata (nobuto) wrote : | #15 |
> Did upstream say anything?
I've got a reply from an unzip developer, but he is also not familiar with these charset issues. I need to discuss it further upstream. However, what I'm trying to do here is a relatively short-term solution. I believe the request in the attached branch is still valid as a downstream workaround for a real-life problem that users (of non-Latin charsets) see in daily use.
> What is "GBK" that Kylin uses and why is it different from the one we have here?
I took the charset list from:
https:/
and
msdos/msdos.c in unzip package:
case 932: /* Japanese */
case 949: /* Korean */
case 936: /* Chinese, simple */
case 950: /* Chinese, traditional */
case 874: /* Thai */
case 1258: /* Vietnamese */
I'm not very familiar with Chinese charsets. I thought CP936 was suitable because we were trying to solve the issue with zip files made on localized Windows. GBK may have wider coverage than CP936, though.
| Steve Langasek (vorlon) wrote : | #16 |
Followed up on https:/
Bug #1462848 could be a duplicate.


Additional background:
On Windows, file names are encoded with a different encoding for each CJKV+th locale, while a ZIP archive does not store file name encoding information. When the archive is decompressed on a system with another encoding (e.g. UTF-8 on Linux), the file names come out as garbage and the unzip command replaces those characters with "???". In practice, no algorithm can detect the encoding reliably, all the more so because file names are short (which makes detection less reliable than, say, in browsers).
The upstream solution to this problem is documented in bug #580961, but it is not a direct path that works for ordinary users. Hence we are adding a -O switch, which specifies the encoding for archives created on Windows, as a locale hack in the distribution.