unzip does not support UTF-8 filenames

Bug #10979 reported by Young-Ho Cha on 2004-12-06
This bug affects 20 people
Affects               Status        Importance  Assigned to     Milestone
unzip (Debian)        Confirmed     Unknown
unzip (Gentoo Linux)  Fix Released  Unknown
unzip (Mandriva)      Confirmed     Unknown
unzip (Ubuntu)                      Medium      Matthias Klose

Bug Description

When unzip extracts files, it handles filenames as 7-bit data,
so filenames with non-Latin-1 characters come out broken.

I described this in Gentoo Bugzilla #69945 and reported it through the zip-bug form.

http://bugs.gentoo.org/show_bug.cgi?id=69945

Hello.

Another report from Dan Jacobson:

---------- Forwarded message ----------
From: Dan Jacobson <email address hidden>
Date: Sun, 15 Jun 2003 03:55:28 +0800
Subject: Bug#197428: unzip: zipinfo (and unzip) can't deal with chinese
    filenames like miniunzip can

Package: unzip
Version: 5.50-1
Severity: normal
File: /usr/bin/zipinfo
Tags: upstream

miniunzip extracts this archive from a windows system perfectly, but
as we can see, zipinfo gets the file names wrong.

07:37 liuyujiao$ miniunzip ~/kejia/liuyujiao/%AAF%B6%D5%A6r%A8%E5-1.zip
MiniUnz 0.15, demo of zLib + Unz package written by Gilles Vollant
more info at http://wwww.winimage.com/zLibDll/unzip.html

/home/jidanni/kejia/liuyujiao/%AAF%B6%D5%A6r%A8%E5-1.zip opened
 extracting: ªF¶Õ¦r¨å-1/G±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/z±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/M±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/N±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/S±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/I±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/O±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/P±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/H±Æ§Ç.doc
creating directory: ªF¶Õ¦r¨å-1/
07:38 liuyujiao$ zipinfo %AAF%B6%D5%A6r%A8%E5-1.zip
Archive: %AAF%B6%D5%A6r%A8%E5-1.zip 1453161 bytes 10 files
-rw-rw-rw- 2.0 fat 2091008 b- defN 16-Mar-03 14:26 ¬FÂiªr¿Õ-1/G¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1638912 b- defN 14-May-03 17:57 ¬FÂiªr¿Õ-1/z¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1190912 b- defN 27-Apr-03 14:43 ¬FÂiªr¿Õ-1/M¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1346560 b- defN 3-May-03 11:25 ¬FÂiªr¿Õ-1/N¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 2365952 b- defN 16-May-03 09:45 ¬FÂiªr¿Õ-1/S¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1546240 b- defN 21-Apr-03 18:42 ¬FÂiªr¿Õ-1/I¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 154112 b- defN 7-May-03 09:27 ¬FÂiªr¿Õ-1/O¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 729600 b- defN 7-May-03 10:39 ¬FÂiªr¿Õ-1/P¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1233408 b- defN 22-Apr-03 10:31 ¬FÂiªr¿Õ-1/H¦ãºÃ.doc
drwxrwxrwx 2.0 fat 0 b- stor 19-May-03 15:08 ¬FÂiªr¿Õ-1/
10 files, 12296704 bytes uncompressed, 1451997 bytes compressed: 88.2%

-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux debian 2.4.20-k7 #1 Tue Jan 14 00:29:06 EST 2003 i686
Locale: LANG=zh_TW.Big5, LC_CTYPE=zh_TW.Big5

Versions of packages unzip depends on:
ii libc6 2.3.1-10 GNU C Library: Shared libraries an

-- no debconf information

Young-Ho Cha (ganadist-gmail) wrote :

When unzip extracts files, it handles filenames as 7-bit data,
so filenames with non-Latin-1 characters come out broken.

I described this in Gentoo Bugzilla #69945 and reported it through the zip-bug form.

tags 197427 + patch
tags 197428 + patch
thanks

patch proposal at http://ftp.mizi.com/~ganadist/unzip-locale.diff

Matthias Klose (doko) wrote :

The patch takes the encoding of the filenames in the archive from the current
locale (LANG), which is most likely correct if the archive stays in the same
country. But you get wrong results when extracting an archive from Russia in Japan.

Are these cases uncommon enough that we should apply the patch anyway?
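
The trade-off can be sketched in Python (this is not the patch itself). ZIP stores filenames as raw bytes, so the result depends entirely on which encoding the extractor assumes. The function name decode_member_name and the sample filename are hypothetical; CP866 and CP932 stand in for a typical Russian DOS code page and a Japanese locale encoding.

```python
import locale


def decode_member_name(raw, encoding=None):
    """Decode raw filename bytes.

    With no explicit encoding, fall back to the current locale's encoding,
    which is roughly what the proposed patch is described as doing (an
    assumption based on the comment above, not on the patch source).
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    # errors="replace" keeps the sketch from raising on undecodable bytes.
    return raw.decode(encoding, errors="replace")


# Hypothetical name, as a Russian DOS-era tool might have stored it.
raw_name = "Привет.txt".encode("cp866")

same_country = decode_member_name(raw_name, "cp866")   # assumption matches
other_country = decode_member_name(raw_name, "cp932")  # Japanese code page

print(same_country == "Привет.txt")   # name survives
print(other_country == "Привет.txt")  # name is mangled
```

The second comparison fails because the same bytes mean different characters in the two code pages, which is the Russia-to-Japan case described above.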

Martin Bergner (martin-bergner) wrote :

What is the progress on this? Is the patch in there?

Rocco Stanzione (trappist) wrote :

Closing after several months without a reply. If this is still a problem, please reply or reopen the bug.

Changed in unzip:
status: Needs Info → Rejected
Matthias Klose (doko) wrote :

reopening, bogus reason for rejection. did you even see the report as confirmed for upstream?

Changed in unzip:
status: Rejected → Confirmed
Rocco Stanzione (trappist) wrote :

The policy is to reject a bug after more than a month of inactivity with a status of 'needs info'. Thank you for your attention.

Matthias Klose (doko) wrote :

please add a reference to that policy.

As far as I can see the bug is still there, and zip is showing the exact same behaviour as unzip so maybe we should file yet another bug or stick to killing 2 birds with one stone.

Maybe an even better suggestion is to go through all
packages and figure out what is needed for UTF-8
compatibility, since this lack of UTF-8 awareness
is shown not only by simple command-line utilities
(like zip and unzip) but also by user-oriented packages
like Abiword.

Everything may seem to work as long as you never
ever exchange files with someone on a different
character encoding (has anyone heard of Microsoft
Windows?!).

Pavel Volkovitskiy (int) wrote :

I found a working patch in the ALT Linux Bugzilla:

https://bugzilla.altlinux.org/long_list.cgi?buglist=4871

Sorry, it is in Russian only. The most important part is:

A patch that works with UTF-8:
https://bugzilla.altlinux.org/attachment.cgi?id=1402

+ -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives\n\
+ -I CHARSET specify a character encoding for UNIX and other archives\n\n";

It is also possible to add the following to /etc/profile.d/unzip.sh:
# Set default encoding for filenames inside DOS/Windows Zip archives
export UNZIP="-O CP866"
export ZIPINFO="-O CP866"

so that this encoding is used by default.
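
For readers unfamiliar with the mechanism: Info-ZIP's unzip treats the contents of the UNZIP environment variable (ZIPINFO for zipinfo) as default options, much as if they had been typed before the command-line arguments. A rough Python illustration of that behaviour follows; effective_args is a hypothetical helper, not unzip's actual parsing code.

```python
import shlex


def effective_args(environ, argv, var="UNZIP"):
    """Prepend the options found in an environment variable to argv."""
    # shlex.split handles quoting the way a POSIX shell would.
    defaults = shlex.split(environ.get(var, ""))
    return defaults + list(argv)


print(effective_args({"UNZIP": "-O CP866"}, ["archive.zip"]))
# ['-O', 'CP866', 'archive.zip']
```

One consequence is that a build of unzip that does not know the -O option would see it on every invocation once the variable is set.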

Matthias Klose (doko) wrote :

fixed in unzip_5.52-9ubuntu3

Changed in unzip:
status: Confirmed → Fix Released
Changed in unzip:
status: Unknown → Invalid
Changed in unzip:
status: Invalid → Fix Released

Hi, if there is a patch, why is the bug still open? It is resolved in Gentoo and I don't know how to reproduce it. Can someone who has experienced the problem comment on that?

Ahh, I misread the Debian line; it was fixed correctly in Ubuntu. Sorry for the noise!

The behaviour in 5.52-10ubuntu2 is still broken. I have a file with filenames in UTF-8, but if I try to unpack it in a UTF-8 locale, the filenames get corrupted.

Example:
  ó is stored in the file as \303\263 (raw UTF-8), but is extracted as \342\224\234\342\224\202

Workaround:
  using a Latin1 environment will make a verbatim copy of the octets:
    env LC_CTYPE=en_US unzip ...
  (obviously you first need to generate a Latin1 locale)
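
The octets in the example are consistent with the UTF-8 bytes being read as an IBM PC code page and then converted to UTF-8 a second time: in CP437, bytes 0xC3 and 0xB3 are the box-drawing characters ├ and │. A short Python check of that reading (an inference from the byte values, not confirmed against the unzip source):

```python
stored = "ó".encode("utf-8")         # b'\xc3\xb3', i.e. \303\263
misread = stored.decode("cp437")     # each byte taken as a CP437 character
extracted = misread.encode("utf-8")  # converted to UTF-8 a second time

print(stored)
print(extracted == b"\342\224\234\342\224\202")
```

This would also explain why the Latin1 workaround helps: with no conversion to UTF-8 applied, the original octets pass through unchanged.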

My Ubuntu version is 8.04.2

Yinon Ehrlich (yinone) wrote :

Hi,

On Ubuntu 8.10 with a Hebrew locale, I tried to extract a zip file that was created on Hebrew Windows.
The extracted files got Russian names...

Solved this way:
 unzip -O 862 -x ...

However:
* I didn't find any documentation for the -O option; I found the working value only by trial and error (these failed: 1255, windows-1255 and "ISO-8859-8").
* Is there any way to (automatically) identify the original Zip file encoding and environment (e.g. Hebrew on Windows)?
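
On the second question, a partial answer as far as the ZIP format itself goes: general-purpose flag bit 11 (0x800) marks a member whose name is stored as UTF-8. When it is not set, the header does not, as far as I know, record which code page was used, so the encoding has to be known or guessed. A small Python sketch that reports the flag per member; utf8_flags and the sample names are made up for illustration.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11


def utf8_flags(archive):
    """Map each member name to True if it is flagged as UTF-8."""
    with zipfile.ZipFile(archive) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
                for info in zf.infolist()}


# Build a small in-memory archive to try it on. Python's zipfile writes
# non-ASCII names as UTF-8 and sets the flag; ASCII names stay unflagged.
hebrew_name = "\u05e9\u05dc\u05d5\u05dd.txt"  # Hebrew "shalom" + .txt
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"a")
    zf.writestr(hebrew_name, b"b")
buf.seek(0)

flags = utf8_flags(buf)
print(flags["plain.txt"], flags[hebrew_name])
```

An archive from an older Windows tool would typically show the flag unset even for non-ASCII names, which is the case where -O is needed.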

Thanks in advance,
   Yinon

In 9.04, these two lines in ~/.profile worked just fine (the lines below are for the Greek locale):
export UNZIP="-O CP737"
export ZIPINFO="-O CP737"

But now in Karmic it is broken again.
My ~/.profile is unchanged, with the two lines above appended just as they were under Jaunty, but Karmic doesn't seem to honour them :(

Lev Abashkin (lev-abashkin) wrote :

It looks like the Ubuntu maintainers forgot to include the patch from the ALT Linux folks in unzip 6.0. Here it is: http://sisyphus.ru/ru/srpm/Sisyphus/unzip/patches/0

proDOOMman (prodoomman) wrote :

Please, add natspec or any other patch to the unzip!

Murz (murznn) wrote :

Same problem on Karmic: unzip doesn't accept the '-o' parameter.
After setting:
export UNZIP="-O CP737"
export ZIPINFO="-O CP737"
the unzip command stops accepting arguments.

Changed in unzip (Mandriva):
status: Unknown → Confirmed
Еггог (sergey-nr) wrote :

Same problem on Lucid 10.04 beta 2: unzip doesn't know the '-O' and '-I' parameters! Please bring back the patch from ALT Linux! The old version, unzip 5.52 (Jaunty), did not have this problem.

Yannis Tsop (ogiannhs) on 2010-04-20
Changed in unzip (Ubuntu):
status: Fix Released → Confirmed
proDOOMman (prodoomman) wrote :

WTF? This bug affects not only file-roller but every archiver on Linux that uses unzip.
This is not a duplicate of bug #177929!

@proDOOMman Right.
I guess we should mark this bug as a duplicate of https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/477755

Ubuntu 12.10 (UI in US English, UTF-8 locale)

It seems that if you KNOW which platform the zip file comes from and which code page it used, you can successfully unzip the archive without losing non-ASCII filenames that are not encoded in UTF-8.

I ran one experiment: unpacking a zip file that was created on Korean Windows 7 and has Korean characters both in the archive name and in the names of the compressed files.

First, let's get the locale-specific info:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Let's check the version of unzip utility:

$ unzip --help
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
...
Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
Default action is to extract files in list, except those in xlist, to exdir;
file[.zip] may be a wildcard. -Z => ZipInfo mode ("unzip -Z" for usage).
...
-O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
-I CHARSET specify a character encoding for UNIX and other archives

Look at options with the following modifier:

-O CHARSET specify a character encoding for DOS, Windows and OS/2 archives

It is not -"zero", it is -O (capital O letter)!

In my case, Korean Windows uses the EUC-KR code page. The zip file is named "2013년 설날".

That means my command line looks like this:

$ unzip -O EUC-KR "2013년 설날"

I checked the unpacked files and it works! All filenames are in correct Korean, without strange characters.
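
Where a patched unzip is not available, roughly the same effect can be had in Python. This is a sketch resting on the standard zipfile module's behaviour of decoding names without the UTF-8 flag as CP437, so they can be turned back into the original bytes and decoded again with the known code page. decoded_names and make_legacy_zip are illustrative helpers; the second one only fabricates a test archive by patching an ASCII placeholder name.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def decoded_names(archive, legacy_encoding):
    """List member names, re-decoding unflagged names with legacy_encoding."""
    names = []
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.flag_bits & UTF8_NAME_FLAG:
                names.append(info.filename)
            else:
                # zipfile decoded these bytes as CP437; undo that first.
                raw = info.filename.encode("cp437")
                names.append(raw.decode(legacy_encoding))
    return names


def make_legacy_zip(name, encoding):
    """Test fixture: an archive whose member name is raw bytes in `encoding`."""
    raw = name.encode(encoding)
    placeholder = "X" * len(raw)  # same byte length, plain ASCII
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(placeholder, b"data")
    # Swap the placeholder for the legacy bytes in both headers.
    patched = buf.getvalue().replace(placeholder.encode("ascii"), raw)
    return io.BytesIO(patched)


sample = make_legacy_zip("2013년 설날.txt", "euc_kr")
names = decoded_names(sample, "euc_kr")
print(names == ["2013년 설날.txt"])
```

Python 3.11 and later also accept a metadata_encoding argument to zipfile.ZipFile for reading such archives, which may be simpler where that version is available.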

sae.area (saearea-test) wrote :

For anybody coming past here and looking for file-roller issues related to unzip, please check: https://bugzilla.gnome.org/show_bug.cgi?id=306403#c42

Thank you.

Stefan

affects: file-roller → ubuntu-translations
no longer affects: ubuntu-translations