File Roller cannot handle archive that doesn't encode filenames in UTF-8

Bug #495880 reported by markusd112 on 2009-12-12
This bug affects 35 people
Affects: File Roller | Status: Confirmed | Importance: Medium
Affects: file-roller (Ubuntu) | Importance: Low | Assigned to: Unassigned

Bug Description

Binary package hint: file-roller

I received a zip containing a file with a German umlaut in the filename. I cannot extract the file because I get the following error message:

caution: filename not matched: Liste Verwaltung und Verk\?ndigung Dezember 2009.xls

I have no way to change the filename and remove the umlaut from it...

ProblemType: Bug
Architecture: i386
CheckboxSubmission: e27141b8feed9a0134eefdd87f008818
CheckboxSystem: 558fbfb2a1258711a37bb7e23c5d4e6e
Date: Sat Dec 12 11:48:49 2009
DistroRelease: Ubuntu 9.10
ExecutablePath: /usr/bin/file-roller
NonfreeKernelModules: nvidia
Package: file-roller 2.28.1-0ubuntu1
ProcEnviron:
 LANGUAGE=de_DE.UTF-8
 PATH=(custom, no user)
 LANG=de_DE.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.31-16.53-386
SourcePackage: file-roller
Uname: Linux 2.6.31-16-386 i686
XsessionErrors:
 (gnome-settings-daemon:3121): GLib-CRITICAL **: g_propagate_error: assertion `src != NULL' failed
 (gnome-settings-daemon:3121): GLib-CRITICAL **: g_propagate_error: assertion `src != NULL' failed
 (polkit-gnome-authentication-agent-1:3161): GLib-CRITICAL **: g_once_init_leave: assertion `initialization_value != 0' failed
 (nautilus:3155): Eel-CRITICAL **: eel_preferences_get_boolean: assertion `preferences_is_initialized ()' failed

markusd112 (markusd112) wrote :
summary: - Extracting of a file with "umlaut" in the filename doesn't work
+ Extracting a file with german "Umlaut" in the filename doesn't work
description: updated

Thanks for the bug report. This particular bug has already been reported, but feel free to report any other bugs you find.

Changed in file-roller (Ubuntu):
assignee: nobody → Ubuntu Desktop Bugs (desktop-bugs)
importance: Undecided → Low
status: New → Invalid
dabeatle28 (davagui) wrote :

I'm also affected by this bug. Files in zip archives whose names contain accented or special characters such as á, é, í, ó, ú cannot be extracted (I get the error message "caution: filename not matched"), nor can they be renamed.

Also, if I edit a file with spaces in its filename, File Roller refuses to update it inside the archive. I get this error message:
zip warning: name not matched: Doc%20con%20espacios.txt
zip error: Nothing to do! (/media/Documentos/Prueba con acentos y espacios.zip)

dabeatle28 (davagui) wrote :

Some info on my system. If any other info is needed, let me know:

Ubuntu 2.6.31-17.54-generic
file-roller 2.28.1-0ubuntu1
Linux 2.6.31-17-generic i686

LANG=es_MX.UTF-8
SHELL=/bin/bash

Ubuntu 9.10
Executable path: /usr/bin/file-roller

Dear Developers,

I am very strongly convinced that this bug is a highly critical one. I cannot unzip most .zip files in Ubuntu in the GNOME desktop environment due to this bug. I can work around this problem because I can use the command-line version (unzip), but most users cannot. After a few attempts MANY users will say that Ubuntu is useless because such an everyday task cannot be done in a convenient way.

PLEASE FIX THIS PROBLEM AS SOON AS POSSIBLE!

Thanks!

markusd112 (markusd112) wrote :

I fully agree with Kovacs: this bug makes the Ubuntu environment relatively useless for a non-English user. Especially for non-technical users who cannot unzip such a file via the command line, this is an additional argument to switch back to Windows, where this is absolutely no problem. In Ubuntu 10.04 this bug still exists...

fundriver (fundriver) wrote :

I'm also affected by this bug and agree with the two people above me. For users of the German language the problem is critical if you don't use the terminal.

.cobnet (mattias-campe) wrote :

Can somebody explain why the status of this bug is 'Invalid'? On my 10.10 system (file-roller 2.32.0-0ubuntu1) it's still not working. I also see that the importance of this bug is 'Low', but for some languages it's very critical, e.g. German (as already mentioned) and Dutch.

While unzipping http://openofficeplaza.org/images/lesmateriaal/nl/writer/cursus_writer.zip with file-roller I get:
caution: filename not matched: oo_oefen/Oefen_OpenOffice.org/Calc/Kopi\?ren.ods

Some people suggested to use the command line utility unzip, but that also gave errors:
error: cannot create oo_oefen/Oefen_OpenOffice.org/Calc/Kopi?ren.ods
        Invalid argument

These bugs seem to be related, but I don't know if they are really the same:
https://bugs.launchpad.net/ubuntu/+source/file-roller/+bug/674956
https://bugs.launchpad.net/ubuntu/+source/file-roller/+bug/177929
https://bugs.launchpad.net/ubuntu/+source/file-roller/+bug/225002

sokolov (daniel-sokolov) wrote :

Affects file-roller 3.2.1 on a German-localized Ubuntu 11.10 64 bit as well. It affects not only .zip-Files from Windows, but also .zip-files from Google Drive (even if the original .zip-file was uploaded to Google Drive from an Ubuntu system).

Changed in file-roller (Ubuntu):
status: Invalid → Confirmed
Ibrahim M. Ghazal (imgx64) wrote :

Note that right-clicking these files in Nautilus to compress and extract them *does work*, which is a handy workaround.

Lukas Aichhorn (luk-aichhorn) wrote :

This bug also affects me, running ubuntu 12.10, file roller version 3.6.1, in german language. Right clicking in Nautilus actually works, although filenames change to "Filename (invalid encoding/ungültiger code)".

peter.muster (durtreg) wrote :

This bug has been unresolved for 3 years now. It's quite annoying actually.
Right-clicking in Nautilus > Extract Here doesn't work for me.
I have to go to the command line each time and enter $ unzip *

Herbert (no.herbert) wrote :

I was able to solve that bug for rar-files by downloading a new version of WinRAR, see
https://bugs.launchpad.net/ubuntu/+source/file-roller/+bug/632737

but that makes me think that maybe the problem isn't directly in file-roller.

Ma Hsiao-chun (mahsiaochun) wrote :

Hey guys :)

It is worth knowing that ZIP archives can use different encodings for file names.

The old standard encoding for ZIP is CP437 [1]. Since CP437 only covers the needs of certain regions of the world, people on Windows began to use whatever local encoding was available. For example, ZIP archives created on a Simplified Chinese version of Windows use CP936 [2].

In 2007, optional UTF-8 support was added to the ZIP standard [3]. Unfortunately, the unzip pre-installed on Linux/Mac OS X and the built-in ZIP support of MS Windows don't support the new standard well.

I know some people want unzip to be fixed, but the unzip upstream seems inactive. And unzip is a program supporting so many platforms (including VMS!) that it may be a bit hard to hack.

I would recommend 7z archives for cross-platform archive exchange, since the format seems to have supported Unicode-based filenames from day one.

1. http://en.wikipedia.org/wiki/Code_page_437
2. http://en.wikipedia.org/wiki/Code_page_936
3. http://www.pkware.com/documents/casestudies/APPNOTE.TXT
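
To make the mismatch described in this comment concrete, here is a minimal Python sketch. It is not taken from unzip or File Roller. It encodes the word from the original report under CP437, CP1252 and UTF-8, then decodes a raw entry name while honouring the UTF-8 flag, which APPNOTE.TXT defines as bit 11 of the general purpose bit flag. The helper name decode_zip_name and the use of CP1252 as the "local Windows encoding" are assumptions made for illustration.

```python
# The word from the original report, stored under three encodings.
name = "Verkündigung"

raw_cp437 = name.encode("cp437")    # legacy ZIP default
raw_cp1252 = name.encode("cp1252")  # assumed Western European Windows code page
raw_utf8 = name.encode("utf-8")     # used when the UTF-8 flag is set

# Same text, three different byte sequences.
print(raw_cp437.hex(), raw_cp1252.hex(), raw_utf8.hex())

# Reading CP1252 bytes with the CP437 table silently yields the wrong text.
wrong = raw_cp1252.decode("cp437")
print(wrong)

UTF8_FLAG = 0x800  # bit 11 of the general purpose bit flag


def decode_zip_name(raw, flag_bits, fallback="cp437"):
    """Decode a raw ZIP entry name, honouring the UTF-8 flag if it is set."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    # No flag: the archive does not say which encoding was used, so the
    # caller has to supply (or guess) one.
    return raw.decode(fallback)
```

When the flag is absent, nothing in the archive identifies the encoding, which is why a tool has to either assume one or be told which one to use.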

Ma Hsiao-chun (mahsiaochun) wrote :

I tested the file from comment 3 on Ubuntu 12.04, and it appears to be encoded in CP437.

$ unar -e cp437 Prueba\ con\ acentos\ y\ espacios.zip #Works perfectly
(You can install "unar" by "sudo apt-get install unar")

The result of unzip is a bit strange, as it asks for a password for one file. Is this correct?

Anyway, the big picture is:
* ZIP filename encoding is tricky.
* unzip has broken UTF-8 filename support, and only limited support for encoding conversion once patched.
* File Roller is just a frontend for tools like unzip. If you install p7zip, it will invoke p7zip instead, which can result in working UTF-8 ZIP filename support.

sokolov (daniel-sokolov) wrote :

Ma said:
> I would recommend 7Z archvie to do cross-platform archive exchange since
>it seems to support Unicode-based filename from day one.

I didn't even DO a cross-platform exchange. I ZIPed it on one Ubuntu machine and tried to unzip it on another. No luck. Now I can't even change the names of the files in the ZIP-archive. So, just because they have a special character in the file name, the files seem to be stuck in the archive. Can't get them out. It is frustrating.

Ma Hsiao-chun (mahsiaochun) wrote :

Hi, sokolov

The round-trip issue you mentioned stems from bugs in unzip (from the Info-ZIP project).

I already tried to reach the Info-ZIP people a few months ago, because exactly the same issue affects my language (Chinese).
http://www.info-zip.org/phpBB3/viewtopic.php?f=4&t=403

So if you use File Roller, please install p7zip so that File Roller can use p7zip as its backend, and there won't be a round-trip issue any more.

If you work on the command line, you may consider switching to other tools. zip/unzip seem to provide some unique and useful features, though.

summary: - Extracting a file with german "Umlaut" in the filename doesn't work
+ File Roller cannot handle archives that doesn't encode filenames in
+ UTF-8
summary: - File Roller cannot handle archives that doesn't encode filenames in
- UTF-8
+ File Roller cannot handle archive that doesn't encode filenames in UTF-8
Ma Hsiao-chun (mahsiaochun) wrote :

I have marked some bugs as duplicates of this one.

To repeat: the workaround for 12.04+ is to use the "unar" command-line tool.

Since the encoding issue usually happens with ZIP archives, the -O/-I options of unzip (available in 11.10+) are also worth a try.
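
For anyone who would rather script this, the idea behind telling the extractor which charset to use can be approximated in Python. This is only a sketch under stated assumptions, not how unzip implements -O. It relies on the fact that Python's zipfile module decodes names without the UTF-8 flag as CP437, so it reverses that step and decodes again with the encoding passed in. The function name extract_with_encoding is hypothetical.

```python
import os
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def extract_with_encoding(zip_path, dest_dir, encoding):
    """Extract zip_path into dest_dir, decoding legacy names with `encoding`."""
    root = os.path.realpath(dest_dir)
    written = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = info.filename
            if not info.flag_bits & UTF8_FLAG:
                # zipfile decoded the raw bytes as CP437; undo that and
                # decode again with the encoding the caller asked for.
                name = name.encode("cp437").decode(encoding)
            target = os.path.realpath(os.path.join(root, name))
            # Refuse entries that would land outside the destination.
            if not target.startswith(root + os.sep):
                raise ValueError(f"unsafe path in archive: {name!r}")
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            with archive.open(info) as src, open(target, "wb") as dst:
                dst.write(src.read())
            written.append(name)
    return written
```

A call such as extract_with_encoding("archive.zip", "out", "cp1252") would then write the files with correctly decoded names, provided the guessed encoding is the one the archive's creator actually used.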

Misaki (myjunkmail311006) wrote :

The undocumented -O/-I option~

So unzip can handle different encodings, at least when I tested it. But it doesn't handle them automatically. p7zip doesn't work because, as noted in Bug #269482, it only handles UTF-8 and ASCII (or maybe ISO-8859).

When I uninstalled p7zip and p7zip-full and looked at archives, some had the pattern of mojibake that I'm used to, and some had a new type. None of them were correct. So it's possible there is some automatic correction and this is why Bug #580961 for unzip has been marked as fixed.

Automatic detection can probably usually be correct, but it could also sometimes be wrong. Either file-roller or unzip could probably improve their automatic detection, and in case certainty is low communicate this to users so they know there might be a problem.

Two other bugs, #592109 and #1199239, refer to non-ASCII filenames encoded in UTF-8, which is likely to be the output from Unix-like systems.

So non-UTF-8 encodings are probably from computers running Windows (?).

Some open-source programs that do automatic detection of encodings include Firefox and, I think, gedit. Also maybe a bit similar to magic numbers used to determine file types..?

This might be how those programs already do this, but I think the way to detect the proper encoding is to interpret the filenames as a certain type, and then look for characters, or character combinations, that are unlikely to appear in a normal filename, or that can't even be output on the file system type.

For example, the character '‚', U+0082 <control> BREAK PERMITTED HERE, often appears in some languages when common encodings are interpreted as iso8859(-number?). Either file-roller or unzip could test various combinations to see what looks valid.
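
The heuristic described in the last two paragraphs can be sketched in Python as follows. This is an illustration only, and it is not claimed to be what Firefox, gedit, file-roller or unzip do. It decodes the raw filename bytes under each candidate encoding, discards candidates that fail to decode, and counts characters that are unlikely in a real filename (control, unassigned and private-use code points). The candidate list and the names penalty and guess_encoding are assumptions; ties go to the earlier candidate, so the order of the list matters.

```python
import unicodedata


def penalty(text):
    """Count characters that rarely belong in a real filename."""
    bad = 0
    for ch in text:
        # Cc = control (includes U+0082), Cn = unassigned, Co = private use
        if unicodedata.category(ch) in ("Cc", "Cn", "Co"):
            bad += 1
    return bad


def guess_encoding(raw, candidates=("utf-8", "shift_jis", "cp1252", "latin-1")):
    """Return (encoding, penalty, text) for the most plausible candidate."""
    best = None
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        score = penalty(text)
        if best is None or score < best[1]:
            best = (enc, score, text)
    return best
```

Several legacy encodings often decode the same bytes without any suspicious character, so a low penalty is only weak evidence. A real tool would probably need language statistics as well, and should tell the user when the guess is uncertain.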

This is worse than an archive file saying what encoding it uses, but basically it seems like some regions (like Japan) don't feel like using UTF-8. The zip format, which is probably worse than other formats in handling filenames, is probably still used because it encodes the contents of files separately, which means it's faster but gives worse compression than other archive formats. Maybe there are other reasons it's faster too.

Having a unique file suffix for a certain set of compression options or quality means that people don't have to worry about which options to choose, and can't argue about which ones other people should use for that suffix. There is probably also a sort of stigma attached to having a poor compression ratio for many types of files, compared to other formats. (For example, some non-zip archive formats can compress a windows bmp file so that it's smaller than a png of the same image, especially if the image has repeating portions.)

So either people can agree on a 'low-quality' archive suffix to use in cases where actual compression isn't important, that's also operating-system independent, or we will continue to encounter zips from different languages that don't tell you how to decode the file names.

Misaki (myjunkmail311006) wrote :

Along with the command-line version of unzip with the -O option, you can also use the convmv command to change filenames of previously extracted files. This works on an ext3 filesystem, but NTFS may give an error because filenames are invalid. ext3 says the encoding is invalid but still lets them be renamed to that.

So for anyone who encounters this bug report and is concerned specifically with extracting filenames from shift-jis encoded archives, these are the commands you can use:

(navigate to directory, and...)
unzip -O shift-jis <filename>
or
convmv * -f utf8 -t iso8859-1 -r
convmv * -f utf8 -t iso8859-1 --notest -r ; convmv -f shift-jis -t utf8 * --notest -r
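
For reference, the two convmv passes above amount to re-encoding each name with the encoding it was wrongly read as and decoding the resulting bytes with the encoding it really uses. A minimal Python sketch of the same idea follows. The names repair_name and repair_tree are hypothetical, and the defaults assume Shift-JIS bytes that were read as ISO 8859-1. Like convmv without --notest, it only reports what it would do unless dry_run is turned off.

```python
import os


def repair_name(name, assumed="latin-1", actual="shift_jis"):
    """Undo a wrong decoding: re-encode with `assumed`, decode with `actual`."""
    return name.encode(assumed).decode(actual)


def repair_tree(root, assumed="latin-1", actual="shift_jis", dry_run=True):
    """Rename entries under root, deepest first; return the (old, new) pairs."""
    changes = []
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            try:
                fixed = repair_name(name, assumed, actual)
            except (UnicodeEncodeError, UnicodeDecodeError):
                continue  # not mojibake of the expected kind; leave it alone
            if fixed == name:
                continue  # plain ASCII names come back unchanged
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, fixed)
            changes.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return changes
```

Running repair_tree(".") first and checking the returned pairs before calling it again with dry_run=False mirrors the test run followed by --notest.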

One of the other bug reports links to this, which lists solutions and problems:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=483290

One consideration might be if files in the same archive use different encoding types. It seems reasonable that they appeared that way to the archive's creator, and thus they shouldn't be interpreted separately, but it could lead to the wrong conversion method being selected.

The patch linked in that bug report describes the options -O and -I, which aren't documented in unzip's manual pages, so it's possible that patch is already applied. But it still didn't detect 'proper' encodings for me when I tested file-roller, using unzip, on several archives after uninstalling p7zip.

The patch also talks about the current locale charset. Making assumptions about the encoding used on files could be correct for many people, most of the time, but will be incorrect for other people, and so is at best only a partial solution. I don't know what the patch does after that though.

Just for reference, these are other Debian bugs mentioned in that report:

> Bug#197427: unzip: chinese filenames unwrapped on unix wrongly
> Bug#197428: unzip: zipinfo (and unzip) can't deal with chinese filenames like miniunzip can
> Bug#339021: unzip: incorrectly converts cyrillic file names from Windows-created ZIPs

Changed in file-roller:
importance: Unknown → Medium
status: Unknown → Confirmed
Changed in file-roller (Ubuntu):
status: Confirmed → Triaged
assignee: Ubuntu Desktop Bugs (desktop-bugs) → nobody