unzip does not support UTF-8 filenames

Bug #10979 reported by Young-Ho Cha
This bug affects 20 people
Affects               Status        Importance  Assigned to     Milestone
unzip (Debian)        Confirmed     Unknown
unzip (Gentoo Linux)  Fix Released  Unknown
unzip (Mandriva)      Confirmed     Unknown
unzip (Ubuntu)        Confirmed     Medium      Matthias Klose

Bug Description

When unzip extracts files, it handles filenames as 7-bit,
so filenames with non-Latin-1 characters are broken.

I described this in Gentoo Bugzilla #69945 and reported it through the zip-bug form.

http://bugs.gentoo.org/show_bug.cgi?id=69945

Santiago Vila Doncel (sanvila-unex) wrote : Bug#197428: unzip: zipinfo (and unzip) can't deal with chinese filenames like miniunzip can (fwd)

Hello.

Another report from Dan Jacobson:

---------- Forwarded message ----------
From: Dan Jacobson <email address hidden>
Date: Sun, 15 Jun 2003 03:55:28 +0800
Subject: Bug#197428: unzip: zipinfo (and unzip) can't deal with chinese
    filenames like miniunzip can

Package: unzip
Version: 5.50-1
Severity: normal
File: /usr/bin/zipinfo
Tags: upstream

miniunzip extracts this archive from a windows system perfectly, but
as we can see, zipinfo gets the file names wrong.

07:37 liuyujiao$ miniunzip ~/kejia/liuyujiao/%AAF%B6%D5%A6r%A8%E5-1.zip
MiniUnz 0.15, demo of zLib + Unz package written by Gilles Vollant
more info at http://wwww.winimage.com/zLibDll/unzip.html

/home/jidanni/kejia/liuyujiao/%AAF%B6%D5%A6r%A8%E5-1.zip opened
 extracting: ªF¶Õ¦r¨å-1/G±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/z±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/M±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/N±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/S±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/I±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/O±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/P±Æ§Ç.doc
 extracting: ªF¶Õ¦r¨å-1/H±Æ§Ç.doc
creating directory: ªF¶Õ¦r¨å-1/
07:38 liuyujiao$ zipinfo %AAF%B6%D5%A6r%A8%E5-1.zip
Archive: %AAF%B6%D5%A6r%A8%E5-1.zip 1453161 bytes 10 files
-rw-rw-rw- 2.0 fat 2091008 b- defN 16-Mar-03 14:26 ¬FÂiªr¿Õ-1/G¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1638912 b- defN 14-May-03 17:57 ¬FÂiªr¿Õ-1/z¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1190912 b- defN 27-Apr-03 14:43 ¬FÂiªr¿Õ-1/M¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1346560 b- defN 3-May-03 11:25 ¬FÂiªr¿Õ-1/N¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 2365952 b- defN 16-May-03 09:45 ¬FÂiªr¿Õ-1/S¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1546240 b- defN 21-Apr-03 18:42 ¬FÂiªr¿Õ-1/I¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 154112 b- defN 7-May-03 09:27 ¬FÂiªr¿Õ-1/O¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 729600 b- defN 7-May-03 10:39 ¬FÂiªr¿Õ-1/P¦ãºÃ.doc
-rw-rw-rw- 2.0 fat 1233408 b- defN 22-Apr-03 10:31 ¬FÂiªr¿Õ-1/H¦ãºÃ.doc
drwxrwxrwx 2.0 fat 0 b- stor 19-May-03 15:08 ¬FÂiªr¿Õ-1/
10 files, 12296704 bytes uncompressed, 1451997 bytes compressed: 88.2%

-- System Information:
Debian Release: testing/unstable
Architecture: i386
Kernel: Linux debian 2.4.20-k7 #1 Tue Jan 14 00:29:06 EST 2003 i686
Locale: LANG=zh_TW.Big5, LC_CTYPE=zh_TW.Big5

Versions of packages unzip depends on:
ii libc6 2.3.1-10 GNU C Library: Shared libraries an

-- no debconf information
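
The two listings above are consistent with a codepage mix-up: miniunzip passes the stored bytes through untouched, while zipinfo appears to treat them as a DOS codepage and translate them for display. A minimal Python sketch of that reading, assuming the DOS codepage is CP850 (the report does not say which table unzip used) and taking the raw bytes from the percent-escapes in the archive name; the function names are made up for illustration:

```python
# Raw bytes of the directory name, taken from the %AA F %B6 %D5 %A6 r %A8 %E5
# escapes in the archive name shown in the report.
RAW = bytes.fromhex("aa46b6d5a672a8e5")


def as_latin1(raw: bytes) -> str:
    """What a byte-transparent tool shows on a Latin-1 display."""
    return raw.decode("latin-1")


def as_oem(raw: bytes, oem: str = "cp850") -> str:
    """What a tool shows if it assumes the bytes are in a DOS (OEM) codepage.

    cp850 is an assumption made for this sketch.
    """
    return raw.decode(oem)


def as_big5(raw: bytes) -> str:
    """What the sender intended, given the reporter's zh_TW.Big5 locale."""
    return raw.decode("big5")
```

Under these assumptions, as_latin1(RAW) gives the ªF¶Õ¦r¨å seen in the miniunzip output, and as_oem(RAW) gives ¬FÂıªr¿Õ, which matches the zipinfo listing apart from the dotless ı that zipinfo prints as a plain i.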

Matthias Klose (doko-cs) wrote : tagging unzip reports + patch

tags 197427 + patch
tags 197428 + patch
thanks

patch proposal at http://ftp.mizi.com/~ganadist/unzip-locale.diff

Matthias Klose (doko) wrote :

Created an attachment (id=1511)
patch found at http://ftp.mizi.com/~ganadist/unzip-locale.diff

Matthias Klose (doko) wrote :

The patch takes the encoding of the filenames in the archive from the current
locale (LANG), which is most likely correct as long as archives stay in the same
country. But you get wrong results when extracting an archive from Russia in Japan.

Are these cases uncommon enough that we should apply the patch anyway?
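
To make the trade-off concrete, here is a minimal Python sketch of locale-based decoding. It mirrors the patch only in spirit: the function name and the charsets (CP866 for a Russian archive, CP932 for a Japanese user) are illustrative assumptions, not taken from the patch.

```python
def decode_with_locale(raw_name: bytes, locale_charset: str) -> str:
    """Decode an archive member name using the extracting user's charset.

    Undecodable bytes are replaced, so a wrong guess yields garbage
    instead of an exception.
    """
    return raw_name.decode(locale_charset, errors="replace")


# A name as it might be stored by a Russian DOS/Windows zip tool (assumed CP866).
original = "\u0444\u0430\u0439\u043b.txt"   # Cyrillic for "file" plus .txt
stored = original.encode("cp866")

same_country = decode_with_locale(stored, "cp866")    # user with a matching charset
other_country = decode_with_locale(stored, "cp932")   # user with a Japanese charset
```

Here same_country round-trips to the original name, while other_country does not: the bytes are the same, only the guess about their charset differs.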

Martin Bergner (martin-bergner) wrote :

What is the progress on this? Has the patch been applied?

Rocco Stanzione (trappist) wrote :

Closing after several months without a reply. If this is still a problem, please reply or reopen the bug.

Changed in unzip:
status: Needs Info → Rejected

Matthias Klose (doko) wrote :

Reopening; bogus reason for rejection. Did you even see that the report is confirmed upstream?

Changed in unzip:
status: Rejected → Confirmed

Rocco Stanzione (trappist) wrote :

The policy is to reject a bug after more than a month of inactivity with a status of 'needs info'. Thank you for your attention.

Matthias Klose (doko) wrote :

Please add a reference to that policy.

Roland Ronquist (roland-ronquist) wrote :

As far as I can see the bug is still there, and zip shows exactly the same behaviour as unzip, so maybe we should file yet another bug, or stick to killing two birds with one stone.

Maybe an even better suggestion is to go through all
packages and figure out what is needed for UTF-8
compatibility, since this lack of UTF-8 awareness
is shown not only by simple command-line utilities
(like zip and unzip) but also by user-oriented packages
like Abiword.

Everything may seem to work as long as you never
ever exchange files with someone on a different
character encoding (has anyone heard of Microsoft
Windows?!).

Pavel Volkovitskiy (int) wrote :

I found a working patch in the Altlinux Bugzilla:

https://bugzilla.altlinux.org/long_list.cgi?buglist=4871

Sorry, it is in Russian only. The most important part is:

a patch that works with UTF-8:
https://bugzilla.altlinux.org/attachment.cgi?id=1402

+ -O CHARSET specify a character encoding for DOS, Windows and OS/2 archives\n\
+ -I CHARSET specify a character encoding for UNIX and other archives\n\n";

It would also be possible to add the following to /etc/profile.d/unzip.sh:
# Set default encoding for filenames inside DOS/Windows Zip archives
export UNZIP="-O CP866"
export ZIPINFO="-O CP866"

so that this encoding is used by default.
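
The help text quoted above distinguishes archives by where they were made. The zip format records this in the "version made by" host byte, which Python's zipfile module exposes as ZipInfo.create_system. A rough Python sketch of how a tool might choose between the two charsets follows; the function name, the set of host codes treated as DOS-like, and the default charsets are assumptions for illustration, not the patch's actual logic.

```python
import zipfile

# Host codes from the zip "version made by" field that this sketch treats
# as DOS-like: 0 = MS-DOS/FAT, 6 = OS/2 HPFS, 10 = Windows NTFS.
# Which codes the real patch checks is not stated in this thread.
DOS_LIKE_HOSTS = {0, 6, 10}


def pick_charset(info: zipfile.ZipInfo,
                 dos_charset: str = "cp866",
                 unix_charset: str = "utf-8") -> str:
    """Return the charset to assume for one archive member's name."""
    if info.create_system in DOS_LIKE_HOSTS:
        return dos_charset      # analogous to the -O CHARSET option
    return unix_charset         # analogous to the -I CHARSET option
```

A caller would pass each entry from ZipFile.infolist() to pick_charset and decode the stored name accordingly.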

Matthias Klose (doko) wrote :

Fixed in unzip_5.52-9ubuntu3.

Changed in unzip:
status: Confirmed → Fix Released
Changed in unzip:
status: Unknown → Invalid
Changed in unzip:
status: Invalid → Fix Released

Martin Bergner (martin-bergner) wrote :

Hi, if there is a patch, why is the bug still open? It is resolved in Gentoo and I don't know how to reproduce it. Can someone who has experienced the problem comment on that?

Martin Bergner (martin-bergner) wrote :

Ahh, I misread the "in Debian" line; it was fixed correctly in Ubuntu. Sorry for the noise!

Kjetil Torgrim Homme (kjetilho) wrote :

The behaviour in 5.52-10ubuntu2 is still broken. I have an archive with filenames in UTF-8, but if I try to unpack it in a UTF-8 locale, the filenames get corrupted.

Example:
  ó is stored in the archive as \303\263 (raw UTF-8), but is extracted as \342\224\234\342\224\202

Workaround:
  using a Latin1 environment will make a verbatim copy of the octets:
    env LC_CTYPE=en_US unzip ...
  (obviously you first need to generate a Latin1 locale)

My Ubuntu version is 8.04.2
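
The octal values in the example are what you get if the stored UTF-8 bytes are read as a DOS codepage and the result is written back out as UTF-8. A short Python sketch that reproduces them, assuming the codepage is CP437 (CP850 maps these two bytes to the same box-drawing characters, so the thread does not settle which one unzip used):

```python
stored = b"\303\263"   # raw UTF-8 for "ó", as stored in the archive

# Misinterpret the bytes as CP437, then write the result out as UTF-8.
mangled = stored.decode("cp437").encode("utf-8")

# The damage is reversible as long as every character maps back to CP437.
repaired = mangled.decode("utf-8").encode("cp437")
```

With these inputs, mangled is b"\342\224\234\342\224\202" and repaired equals the stored bytes again, so already-extracted names could in principle be renamed back.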

Yinon Ehrlich (yinone) wrote :

Hi,

I tried to extract a zip file that was created on Hebrew Windows, on Ubuntu 8.10 with a Hebrew locale.
The extracted files got Russian names...

Solved this way:
 unzip -O 862 -x ...

However:
* I didn't find any documentation for the -O option; I found the value by trial and error (these failed: 1255, windows-1255 and "ISO-8859-8").
* Is there any way to (automatically) identify the original Zip file encoding and environment (e.g. Hebrew on Windows)?

Thanks in advance,
   Yinon
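
One plausible explanation for the Russian names, sketched in Python: the DOS Hebrew codepage (CP862) and the DOS Cyrillic codepage (CP866) place their alphabets in the same byte range, so Hebrew names read as CP866 come out as Cyrillic letters. That unzip was defaulting to CP866 here is an assumption, and the sample name is made up for the sketch.

```python
hebrew = "\u05e9\u05dc\u05d5\u05dd"   # sample Hebrew name ("shalom")
stored = hebrew.encode("cp862")       # bytes as a Hebrew DOS-codepage tool would store them

as_cp866 = stored.decode("cp866")     # read with the Cyrillic codepage
as_cp862 = stored.decode("cp862")     # read with the matching codepage


def is_cyrillic(text: str) -> bool:
    """True if every character lies in the basic Cyrillic Unicode block."""
    return all("\u0400" <= ch <= "\u04ff" for ch in text)
```

Here is_cyrillic(as_cp866) holds, while as_cp862 is the original Hebrew string, which matches the observation that -O 862 fixes the names.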

Andreas Chatziagapiou (xagapiou) wrote :

In 9.04 these two lines in ~/.profile worked just fine (the lines below are for the Greek locale):
export UNZIP="-O CP737"
export ZIPINFO="-O CP737"

but now in Karmic it is broken again.
My ~/.profile is unchanged, with the two lines above appended just as it was with Jaunty, but Karmic doesn't seem to care :(

Lev Abashkin (lev-abashkin) wrote :

It looks like the Ubuntu maintainers forgot to include the patch from the Altlinux folks in unzip 6.0. Here it is: http://sisyphus.ru/ru/srpm/Sisyphus/unzip/patches/0

proDOOMman (prodoomman) wrote :

Please add natspec or any other patch to unzip!

Murz (murznn) wrote :

Same problem on Karmic: unzip doesn't accept the '-O' parameter.
After setting:
export UNZIP="-O CP737"
export ZIPINFO="-O CP737"
the unzip command stops accepting arguments.

Changed in unzip (Mandriva):
status: Unknown → Confirmed

Еггог (sergey-nr) wrote :

Same problem on Lucid 10.04 beta2: unzip doesn't know the '-O' and '-I' parameters! Please bring back the patch from Altlinux! The old version, unzip 5.52 (Jaunty), did not have this problem.

Yannis Tsop (ogiannhs)
Changed in unzip (Ubuntu):
status: Fix Released → Confirmed

proDOOMman (prodoomman) wrote :

WTF? This bug affects not only file-roller but every archiver on Linux that uses unzip.
This is not a duplicate of bug #177929!

Dmitry Agafonov (dmitry-agafonov) wrote :

@proDOOMman Right
I guess we should mark this bug as a duplicate of https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/477755

Vladimir Skvortsov (vskvortsoff) wrote :

Ubuntu 12.10 (UI with US English-UTF-8 codepage)

It seems that if you KNOW which software platform and codepage the zip file comes from, you can successfully unzip the archive without losing non-ASCII filenames that are not encoded in UTF-8.

I just ran an experiment, unpacking a zip file that was created on Korean Windows 7 and contains Korean characters in both the archive name and the names of the compressed files.

First, let's get the locale-specific info:

$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Let's check the version of the unzip utility:

$ unzip --help
UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
...
Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
Default action is to extract files in list, except those in xlist, to exdir;
file[.zip] may be a wildcard. -Z => ZipInfo mode ("unzip -Z" for usage).
...
-O CHARSET specify a character encoding for DOS, Windows and OS/2 archives
-I CHARSET specify a character encoding for UNIX and other archives

Look at the following option:

-O CHARSET specify a character encoding for DOS, Windows and OS/2 archives

It is not -0 (zero); it is -O (capital letter O)!

In my case, Korean Windows uses the EUC-KR codepage. The zip file is named "2013년 설날".

So my command line looks like:

$ unzip -O EUC-KR "2013년 설날"

After checking the unpacked files: it works! All files have correctly encoded Korean names without strange characters.
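
The same kind of recovery can be sketched in Python, whose zipfile module decodes member names as CP437 unless the archive sets the UTF-8 flag (general-purpose bit 11). The helper names fix_name and list_names and the constant UTF8_FLAG are made up for this sketch; the charset argument plays the role of -O above.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def fix_name(info: zipfile.ZipInfo, charset: str) -> str:
    """Return the member name, re-decoded from `charset` for legacy entries."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename  # already decoded correctly as UTF-8
    # zipfile decoded the raw bytes as CP437; undo that, then decode properly.
    return info.filename.encode("cp437").decode(charset)


def list_names(path: str, charset: str) -> list:
    """List the member names of the archive at `path`, assuming `charset`."""
    with zipfile.ZipFile(path) as zf:
        return [fix_name(info, charset) for info in zf.infolist()]
```

For the archive in this experiment, a call such as list_names("archive.zip", "euc_kr") would be the counterpart of unzip -O EUC-KR, with "archive.zip" standing in for the real file name. As with -O, the charset still has to be known in advance.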

sae.area (saearea-test) wrote :

For anybody coming past here and looking for file-roller issues related to unzip, please check: https://bugzilla.gnome.org/show_bug.cgi?id=306403#c42

Thank you.

Stefan

Mathew Hodson (mhodson)
affects: file-roller → ubuntu-translations
no longer affects: ubuntu-translations

Unxed (unxed) wrote :

I wrote a patch for unzip that fixes this issue:
https://sourceforge.net/p/infozip/patches/29/

The same patch for p7zip:
https://sourceforge.net/p/p7zip/bugs/187/
