Default charsets handling for Windows archives in CJKV+th locale

Bug #1422290 reported by Aron Xu on 2015-02-16
This bug affects 3 people
Affects: unzip (Debian). Status: Confirmed. Importance: Unknown.
Affects: unzip (Ubuntu). Importance: Medium. Assigned to: Unassigned.

Bug Description

With the current unzip package in Ubuntu, we need to specify the charset explicitly to extract zip files sent from localized Windows systems.

For example, for zip files sent from a Japanese-localized Windows system:
$ zipinfo -O CP932 sent-from-localized-windows.zip
$ unzip -O CP932 sent-from-localized-windows.zip

This method doesn't work for GUI applications like file-roller, because users have no way to specify the charset from the GUI.

The attached branch adds default charset handling for Windows archives in CJKV+th locales, inspired by the Ubuntu Kylin approach.

As a result of bug #580961, two options were added as an Ubuntu patch.
> -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
> -I CHARSET specify a character encoding for UNIX and other archives

Ubuntu Kylin then added a default encoding through environment variables for their distribution.
http://bazaar.launchpad.net/~ubuntukylin-members/ubuntukylin-default-settings/trunk/revision/171

Now, as Ubuntu, we can go further in a better way:
 - per-user settings based on each user's locale instead of global settings
 - a locale guard, so that other locales are not affected

I only add "-O", so there is no behavior change for zip files created on Ubuntu or other Linux/UNIX systems. This branch just handles zip files created on localized Windows systems seamlessly.

The charset list is taken from:
https://msdn.microsoft.com/en-us/goglobal/bb964654
and
msdos/msdos.c in unzip package:
   case 932: /* Japanese */
   case 949: /* Korean */
   case 936: /* Chinese, simple */
   case 950: /* Chinese, traditional */
   case 874: /* Thai */
   case 1258: /* Vietnamese */

(Copied from @nobuto's branch description.)
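
The per-locale fallback described above can be sketched roughly as follows. This is an illustration only, not the code in the branch (which is a shell script): the names FALLBACK_CHARSETS, fallback_charset and unzip_env are hypothetical, mapping only zh_CN and zh_TW for Chinese is an assumption, and so is the choice not to override a value the user has already set.

```python
# Hypothetical sketch; none of these names come from the attached branch.
# Windows code pages for the CJKV+th locales listed in the description.
FALLBACK_CHARSETS = {
    "ja": "CP932",     # Japanese
    "ko": "CP949",     # Korean
    "zh_CN": "CP936",  # Chinese, simplified (territory mapping is an assumption)
    "zh_TW": "CP950",  # Chinese, traditional (territory mapping is an assumption)
    "th": "CP874",     # Thai
    "vi": "CP1258",    # Vietnamese
}


def fallback_charset(locale_name):
    """Return the Windows code page for a locale like 'ja_JP.UTF-8', or None."""
    base = locale_name.split(".")[0].split("@")[0]  # drop codeset and modifier
    lang = base.split("_")[0]
    # Chinese needs language_TERRITORY; the other locales match on language alone.
    return FALLBACK_CHARSETS.get(base) or FALLBACK_CHARSETS.get(lang)


def unzip_env(locale_name, env):
    """Return a copy of env with default UNZIP/ZIPINFO options for CJKV+th locales."""
    new_env = dict(env)
    charset = fallback_charset(locale_name)
    if charset is not None:  # locale guard: every other locale is left untouched
        # Assumption: an option string the user already exported wins.
        new_env.setdefault("UNZIP", f"-O {charset}")
        new_env.setdefault("ZIPINFO", f"-O {charset}")
    return new_env


print(unzip_env("ja_JP.UTF-8", {}))
print(unzip_env("en_US.UTF-8", {}))
```

Only "-O" is set, matching the description: archives created on UNIX-like systems are not affected.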

Related branches

Aron Xu (happyaron) on 2015-02-16
Changed in unzip (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
Aron Xu (happyaron) wrote :

Additional background:

On Windows, file names are encoded with a different encoding for each of the CJKV+th locales, while the ZIP archive does not store file name encoding information. When the ZIP archive is decompressed on a system with another encoding (e.g. UTF-8 on Linux), the file names are garbage and those characters are replaced with ??? by the unzip command. In reality, no concrete algorithm can detect the encoding reliably, not to mention that file names are too short (which makes detection even less reliable than in browsers).

The upstream solution to this problem is documented in bug #580961, but it is not a direct path that works for ordinary users. Hence we are adding a -O switch, which specifies the encoding for archives created on Windows, as a locale hack in the distribution.
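
A minimal sketch of the problem at the byte level, using Python's built-in codecs rather than unzip itself. The file name is taken from the test archive discussed later in this report; the codec names are Python's aliases for the Windows code pages.

```python
name = "せるらうど.ust"
raw = name.encode("cp932")  # bytes as a Japanese Windows system would store them

# The archive carries only these bytes; the reader has to guess the code page.
right = raw.decode("cp932")
wrong = raw.decode("cp936")  # also decodes cleanly, but to different characters

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False  # the first byte, 0x82, cannot start a UTF-8 sequence

print(right)
print(wrong)
print(utf8_ok)
```

The same bytes are valid in more than one legacy code page, which is why the reader needs a hint such as -O.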

Nobuto Murata (nobuto) on 2015-02-16
description: updated
Changed in unzip (Debian):
status: Unknown → Confirmed
Yuan Chao (yuanchao) wrote :

It would be nice to have some auto-detect mechanism on top of this locale fallback. In my personal case, most zip files that need an explicit encoding do not match the encoding implied by my locale setting.

Nobuto Murata (nobuto) wrote :

@Yuan,

My patch refers to LC_CTYPE first, so you can, for example, set LC_CTYPE and LC_MESSAGES to different locales. And of course you can manually export the UNZIP and ZIPINFO variables in your ~/.profile. I understand my patch is a short-term workaround.

FWIW, unar supports encoding auto-detection, but unzip does not. You can see the auto-detection result with:
$ sudo apt-get install unar
$ lsar -l -pe /PATH/TO/ZIPFILE
I'm not sure if file-roller supports unar backend or not.
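
As a rough illustration of what "refers to LC_CTYPE first" can mean, the sketch below resolves the effective character-type locale from a set of environment variables. The function name effective_ctype is hypothetical, and checking LC_ALL before LC_CTYPE follows the usual POSIX precedence, which is an assumption about a sensible order rather than a description of the patch.

```python
def effective_ctype(env):
    """Pick the locale that governs character handling, POSIX-style."""
    # Assumption: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:  # unset and empty variables are skipped
            return value
    return "C"


# English messages, Japanese character handling.
env = {"LANG": "en_US.UTF-8", "LC_MESSAGES": "en_US.UTF-8", "LC_CTYPE": "ja_JP.UTF-8"}
print(effective_ctype(env))
```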

Yuan Chao (yuanchao) wrote :

Dear @Nobuto,

I appreciate the patch work very much, but it simply doesn't fit my use case. Quite frequently, I get
zip files with CJK file names from zh_CN and ja_JP (my environment is either zh_TW or en_US, the
latter being for my office desktop PC). Changing LC_CTYPE to something other than UTF-8 is definitely
*no good* here. It would generate more "mojibake" file names. Changing LC_MESSAGES is not
necessary since I can read either. Adding support in the GUI front-end is really needed here.

My use case may not be close to that of general end users. But I know many experienced users adopt
the en_US locale to be able to use type-n-search for launching applications. (Surely this is another issue.)

Another thing I ran into before is using CJKV characters in the archive password. Not sure if this is (still) a problem
(unrar and unzip)?

Aron Xu (happyaron) wrote :

@yuanchao, with or without this trick, running unzip would give you garbled file names, so I don't think this change would bother you as much as you describe, would it?

Yuan Chao (yuanchao) wrote :

Well, without this trick, the file names could be recovered with 'convmv'. But with this trick, they would be scrambled further... Still, I personally prefer auto-detection plus this fallback, or an option in a GUI like file-roller.

Aron Xu (happyaron) wrote :

@yuanchao, you cannot recover the file names after decompressing with unzip (because characters are replaced by question marks), but you can when using 7zip.
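
The difference between the two situations can be sketched with Python's codecs. This is not how convmv or unzip are implemented; it only shows that a lossless wrong decoding can be undone, while replaced characters cannot.

```python
original = "せるらうど.ust"
raw = original.encode("cp932")

# Case 1: the name was decoded with the wrong code page, but without loss.
# Re-encoding with that same wrong code page gives back the original bytes.
garbled = raw.decode("cp936")
recovered = garbled.encode("cp936").decode("cp932")

# Case 2: undecodable bytes were replaced during extraction.
# Every replaced byte looks the same afterwards, so the name is gone.
lossy = raw.decode("ascii", errors="replace")

print(recovered)
print(lossy)
```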

Yuan Chao (yuanchao) wrote :

This is from one of my machines running Lubuntu:

$ export |grep LANG
declare -x LANG="en_US.UTF-8"

$ export |grep LC
declare -x LC_ADDRESS="en_US.UTF-8"
declare -x LC_IDENTIFICATION="en_US.UTF-8"
declare -x LC_MEASUREMENT="en_US.UTF-8"
declare -x LC_MONETARY="en_US.UTF-8"
declare -x LC_NAME="en_US.UTF-8"
declare -x LC_NUMERIC="en_US.UTF-8"
declare -x LC_PAPER="en_US.UTF-8"
declare -x LC_TELEPHONE="en_US.UTF-8"
declare -x LC_TIME="en_US.UTF-8"

$ unzip -h
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
...

Use the file from here: http://www1.axfc.net/uploader/Sc/so/325701.zip (passwd: backer) (CP932)

$ unzip celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/В╣ВщВчВдВ╟.ust
  inflating: celluloid/В╣ВщВчВдВ╟2Ф╘.ust
  inflating: celluloid/В╣ВщВчВдВ╟СхГTГrСOВйВч.ust

$ unzip -O cp932 celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/せるらうど.ust
  inflating: celluloid/せるらうど2番.ust
  inflating: celluloid/せるらうど大サビ前から.ust

$ unzip -O cp936 celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/偣傞傜偆偳.ust
  inflating: celluloid/偣傞傜偆偳2斣.ust
  inflating: celluloid/偣傞傜偆偳戝僒價慜偐傜.ust

$ unzip -O cp950 celluloid.zip
Archive: celluloid.zip
  inflating: celluloid/readme.txt
  inflating: celluloid/�����炤��.ust
  inflating: celluloid/�����炤��2��.ust
  inflating: celluloid/�����炤�Ǒ��T�r�O����.ust

Another file from here http://3jf.wodemo.com/file/310894 (CP936)

$ unzip -L 王妃.zip
Archive: 王妃.zip
  inflating: ═їх·_a.ust
  inflating: ═їх·_b.ust

$ unzip -O cp932 王妃.zip
Archive: 王妃.zip
  inflating: ヘ銈A.ust
  inflating: ヘ銈B.ust

$ unzip -O cp936 王妃.zip
Archive: 王妃.zip
  inflating: 王妃_A.ust
  inflating: 王妃_B.ust

$ unzip -O cp950 王妃.zip
Archive: 王妃.zip
  inflating: 卼漦_A.ust
  inflating: 卼漦_B.ust

Actually, not all of the wrong cases map to illegal UTF-8 strings (question marks). I guess that is why auto-detection is not so straightforward?
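
A small sketch of why detection is ambiguous: the bytes of a short name often decode without error under several code pages. The candidate list and the function name plausible_decodings are illustrative, and Python's codecs stand in for iconv here, so the exact set of successful decodings may differ from what unzip shows.

```python
CANDIDATES = ["utf-8", "cp932", "cp936", "cp949", "cp950", "cp874", "cp1258"]


def plausible_decodings(raw):
    """Return {codec: text} for every candidate that decodes raw without error."""
    results = {}
    for codec in CANDIDATES:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            pass  # this codec rejects the bytes, so it can be ruled out
    return results


raw = "王妃".encode("cp936")  # four bytes: cd f5 e5 fa
for codec, text in plausible_decodings(raw).items():
    print(codec, text)
```

A clean decode only rules encodings out; it does not say which of the remaining candidates is the intended one.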

Nobuto Murata (nobuto) wrote :

@Yuan,

Take the "王妃.zip" example you posted: it has short file names in the archive. Even unar/lsar fails to detect the encoding (you expect CP936, but lsar reports ISO-8859-8). Auto-detection of encoding is not 100% reliable, especially with short file names (fewer hints for the encoding detector).
====
$ lsar -l -pe 王妃.zip
王妃.zip: Zip
     Flags File size Ratio Mode Date Time Name
     ===== ========== ===== ==== ========== ===== ====
  0. ----- 40344 82.9% Defl 2014-10-03 13:40 %cd%f5%e5%fa_A.ust
  1. ----- 20311 80.4% Defl 2014-10-03 13:40 %cd%f5%e5%fa_B.ust
(Flags: D=Directory, R=Resource fork, L=Link, E=Encrypted, @=Extended attributes)
(Mode: Defl=Deflate)
Encoding: ISO-8859-8 (76% confidence)
====

Anyway, enabling auto-detection or specifying the encoding in file-roller is out of scope for this bug report. Please open separate bugs if needed. I would like to proceed with the fallback setting in the attached branch for vivid.

Any progress?

Sebastien Bacher (seb128) wrote :

It seems like there are no Ubuntu developers that feel like reviewing those changes, it would be good to get that reviewed upstream and/or in Debian...

Nobuto Murata (nobuto) wrote :

I have sent an enhancement request to upstream through http://www.info-zip.org/zip-bug.html since the issue is still reproducible with 6.1c19-BETA which you can try from: https://launchpad.net/~nobuto/+archive/ubuntu/build-test/+build/7630500

Putting a copy of the request here for your reference.

====

This is an enhancement request. Thanks to ICONV_MAPPING (-O/-I options), we can specify the character encoding when extracting zip files. However, in combination with a GUI application (e.g. file-roller on Linux), there is no way to specify -I/-O from a user perspective. Therefore we cannot properly extract zip files created on a localized Windows system with a GUI.

A workaround would be exporting UNZIP and ZIPINFO variables with "-O <local charset on Windows>" per locale on login by putting [1] under /etc/profile.d/.

[1] http://bazaar.launchpad.net/~nobuto/ubuntu/vivid/unzip/fallback-encoding/view/head:/debian/unzip-default-charset.sh

It would be nice if unzip had a fallback charset mapping per locale out of the box. I have created a test case that handles 3 types of zip files in the ja_JP locale.

[2] http://bazaar.launchpad.net/~nobuto/ubuntu/vivid/unzip/fallback-encoding/view/head:/debian/tests/fallback-encoding
(without [1], 3rd test case, fat and CP932, will fail.)

$ unzip -v
UnZip 6.1c19-BETA (2015-04-15) by Info-ZIP. Maintainer: Steven M. Schweda
 Copyright (c) 1990-2015 Info-ZIP. For software license: unzip --license
 See README for details. More info: http://info-zip.org/UnZip.html

Compiled with GCC 4.9.2 for Unix (GNU/Linux x86_64).

UnZip special compilation options:
        ARCHIVE_STDIN (Allow streaming archive from stdin)
        ICONV_MAPPING (ISO/OEM (iconv, -I/-O) conversion supported)
        IZ_HAVE_UXUIDGID (UID, GID > 16-bit ("ux" extra block) supported)
        SET_DIR_ATTRIB (Setting directory attributes supported)
        SYMLINKS (Symbolic links supported, if RTL and file sys do)
        TIMESTAMP (Restoring file timestamps supported)
        UNIXBACKUP (-B creates backup files)
        USE_EF_UT_TIME (Use Universal Time, if available)
        UNSHRINK_SUPPORT (PKZIP/Zip 1.x Shrink compression)
        DEFLATE64_SUPPORT (PKZIP 4.x Deflate64(tm) compression)
        UNICODE_SUPPORT [wide-chars, char coding: UTF-8] (handle UTF-8 paths)
        MBCS-support (Multibyte character support, MB_CUR_MAX = 6)
        LARGE_FILE_SUPPORT (Large files over 2 GiB supported)
        ZIP64_SUPPORT (Archives using Zip64 for large files supported)
        BZIP2_SUPPORT (PKZIP 4.6+, bzip2 lib ver 1.0.6, 6-Sept-2010)
        LZMA_SUPPORT (PKZIP 6.3+, LZMA compression, ver 9.20)
        PPMD_SUPPORT (PKZIP 6.3+, PPMd compression, ver 9.20)
        VMS_TEXT_CONV (Conversion of VMS var-len rec fmt text supported)
        IZ_CRYPT_TRAD (Traditional (weak) encryption, ver 3.0)

Traditional Zip Encryption notice:
        The traditional zip encryption code of this program is not
        copyrighted, and is put in the public domain. It was originally
        written in Europe, and, to the best of our knowledge, can be freely
        ...

Iain Lane (laney) wrote :

Did upstream say anything?

What is "GBK" that Kylin uses and why is it different from the one we have here?

Sorry for being clueless. :)

Aron Xu (happyaron) wrote :

Upstream won't adopt such behavior, as they regard it as a locale hack.

GBK is a superset of cp936, but it is not so big that it covers portions of UTF-8 (so it can be reliably detected, unlike GB18030). From this point of view, it is better to use GBK than cp936.

Nobuto Murata (nobuto) wrote :

> Did upstream say anything?

I've got a reply from an unzip developer, but he is also not familiar with these charset issues. I need to discuss it further upstream. However, what I'm trying to do here is a relatively short-term solution. I believe the request in the attached branch is still valid downstream as a workaround for a real-life problem that users (of non-Latin charsets) see in daily use.

> What is "GBK" that Kylin uses and why is it different from the one we have here?

I took the charset list from:
https://msdn.microsoft.com/en-us/goglobal/bb964654
and
msdos/msdos.c in unzip package:
   case 932: /* Japanese */
   case 949: /* Korean */
   case 936: /* Chinese, simple */
   case 950: /* Chinese, traditional */
   case 874: /* Thai */
   case 1258: /* Vietnamese */

I'm not so familiar with Chinese charsets. I thought CP936 was suitable because we were trying to solve the issue with zip files made on localized Windows. GBK may have wider coverage than CP936, though.

Steve Langasek (vorlon) wrote :

Followed up on https://code.launchpad.net/~nobuto/ubuntu/wily/unzip/fallback-encoding/+merge/268850 with some feedback about the patch. There are better ways to achieve this than through profile.d.

Bug #1462848 could be a duplicate.

Unxed (unxed) wrote :

I wrote a patch for unzip that fixes this issue:
https://sourceforge.net/p/infozip/patches/29/

The same patch for p7zip:
https://sourceforge.net/p/p7zip/bugs/187/
