Cannot load files whose path uses non-ASCII chars

Bug #1011999 reported by Jean-Michel Philippe
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Linux Stopmotion
Fix Released
Medium
Tim Band

Bug Description

This is a duplicate of a very annoying bug I posted on Debian, see http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=646871. We are using Stopmotion in DoudouLinux and this bug means children and parents have to take care of the characters they use when saving files. This is a bit contradictory with the ease of use we're trying to promote in DoudouLinux! So now let's talk about it…

Stopmotion can save files using any UTF-8 character in file names, but it cannot load them. The saved file appears with strange characters in the list of files recently used. Tests showed stopmotion also fails if any parent directory has such characters while file names are using pure ASCII chars. Trying to load such file gives the following error:

----------------------
I/O warning : failed to load external entity "(null)(null).dat"
Couldn't load XML file
error : string is not in UTF-8
encoding error : output conversion failed due to conv error, bytes 0xE9 0x2E 0x73 0x74
I/O error : encoder error
encoding error : output conversion failed due to conv error, bytes 0xE9 0x2E 0x73 0x74
I/O error : encoder error
encoding error : output conversion failed due to conv error, bytes 0xE9 0x2E 0x73 0x74
I/O error : encoder error
----------------------

As a result the file cannot be loaded anymore, you have to uncompress the STO file and change file names or directory paths in order to retrieve your work (example file attached). Note that trying to load such file also empties the configuration file $HOME/.stopmotion/preferences.xml so that Stopmotion is unable to restart afterwards.

Finally note that I've started to look at the source code to determine how to fix the issue. In my opinion this is due to the extensive use of C strings instead of QStrings for file names. Unfortunately I'm discovering Qt and I feel that many source files have to be patched to remove C strings and lists of C strings in favour of Qt objects. The work is then harder than expected.

Revision history for this message
Jean-Michel Philippe (jean-michel-philippe) wrote :
Revision history for this message
Tim Band (tim-band) wrote :

I have a patch for this (I will test it further then post it; it's not very impressive), but libtar is also bugged in a way that stops this from working (it writes out its checksums wrongly when a file has non-ASCII charcters in it). I will post my libtar patch to their mailing list (seems they have no bugtracker) and see how far I get. Oddly, the tar command-line tool can untar the files stopmotion produces... perhaps tar ignores broken checksums.

Revision history for this message
Tim Band (tim-band) wrote :

I will send this patch to the libtar mailing list; it is necessary to have it applied to libtar for my fix for this bug to work. Ubuntu is way behind on libtar, however. I suspect they aren't using the up-to-date repository (http://repo.or.cz/w/libtar.git), which would mean another round of chasing to get this working...

Revision history for this message
Tim Band (tim-band) wrote :

...and here is the patch for Stopmotion. It seems to work. Note that the libtar patch affects the way that tar files are SAVED, so any files saved with the unpatched version still won't load. This can be worked around by untarring the file and tarring it up again from the command-line, as the command-line tar program seems to tolerate incorrect checksums.

Revision history for this message
Tim Band (tim-band) wrote :

By the way, Jean-Michel, I disagree about the cause of this. I'm an ex-internationalization architect, and Qt is sadly just as bad as everything else.

The correct way to deal with path names would be to have a special path name class, specialized into all the different file systems that might exist. Many file systems have no idea about internationalization, storing file and directory names as strings of bytes with no defined or stored encoding (FAT, ext2, ext3 and ext4 are all like this, for example). Such a class would have a function for obtaining a Unicode "display name". This Unicode name would not be used to refer to the file later and may indeed be incorrect; but at least the underlying file name object would still refer to the file correctly even if we can't work out how to display its name properly.

Qt treats file names as Unicode strings, which not only requires (or at least encourages) the user to know (or guess) what the OS conventions are (/, \ or : for example) but also requires a totally unjustified translation to Unicode using the conversion specified in the locale. If the file name is in some other encoding, you might not be able to access the file, because you don't actually know its name.

My fix here will therefore fail to work for any file that has a name that has characters that fail to be decoded by the locale's specified converter. There is no way around this; Qt's code throws away the required information before we can get our hands on it.

Revision history for this message
Jean-Michel Philippe (jean-michel-philippe) wrote :

Thanks Tim for your efficiency. I've tested the patches on an ARM laptop with Debian Squeeze and could not reproduce the bug anymore. I tested a filename with non-ASCII chars as well as a directory name. In both cases I could reload the file after restarting the application. Note however that there is still an encoding issue with the menu list of previously opened files, certainly another conversion function to insert at the right place.

You're right the cause is much simpler than I initially thought. When I read your patches I realized that it was not a C-string issue. Indeed I don't know enough of C libraries and Qt to be able to find the kind of solution you found. Your explanations are perfectly clear and I think your patches should work in every case for us. Thanks again.

Revision history for this message
Tim Band (tim-band) wrote :

Thanks for testing it Jean-Michel! I'll raise another defect about Open Recent... really I should have fixed that too.

The libtar project have accepted my patch, and released v1.2.17 with it in! Now that's efficiency! As I say, I don't think Debian will pick up on it, unless we pester them, though.

Revision history for this message
Tim Band (tim-band) wrote :

I have provided a fix to the Open Recent issue here: https://bugs.launchpad.net/lsm/+bug/1028835

Sorry for not going the extra mile (well, 50 meters perhaps) on this bug!

Revision history for this message
Jean-Michel Philippe (jean-michel-philippe) wrote :

Ok, I'll recompile for x86(_64) and will put my children to contribution for deep testing!

Concerning Debian, if they want to solve the issue with Stopmotion in Wheezy, which is not blocking but very annoying, they'll have to apply the patch to their version of libtar. Wouldn't it be a sufficient motivation? Maybe not, Stopmotion is not that critical…

However what's the strategy, open a Debian bug for libtar?

Revision history for this message
Tim Band (tim-band) wrote :

I suspect that Debian hasn't noticed that there is a more up-to-date libtar available, or doesn't think it is important. So perhaps the first thing is to raise a bug against libtar explaining what the issue is. It would help if we had an idea when Stopmotion is going to be released and to know for sure that this patch will be in it.

Shall I ask on the mailing list or do you want to? The info is here: https://lists.sourceforge.net/lists/listinfo/linuxstopmotion-users-devs

Revision history for this message
Jean-Michel Philippe (jean-michel-philippe) wrote :

I haven't subscribed this list so if you have, I prefer you ask the question. However, as Wheezy is frozen, I fear that knowing when a new release is out will be of little help, don't you? Unless you want to indicate that next release of Stopmotion will require the latest release of libtar?

Concerning libtar, I think you should be more at ease to write the Debian bug report since you fixed it. Just link me the bug report page so that I can follow its resolution.

Revision history for this message
Jean-Michel Philippe (jean-michel-philippe) wrote :

I noticed today another issue related to non-ASCII characters in file names. I thought I had tested this too but it seems that not. When you have saved a project using non-ASCII characters in its file name, you can't export it as a video file, whatever the video file name. To build the video, Stopmotion places image files into a directory whose name is the project file name. When it calls the program that builds the video from image files, the command line is not using the correct encoding, leading to an error message. Here is an example of such message:

----------------------------------------
FFmpeg version SVN-r0.5.9-4:0.5.9-1, Copyright (c) 2000-2009 Fabrice Bellard, et al.
  configuration: --extra-version=4:0.5.9-1 --prefix=/usr --enable-avfilter --enable-avfilter-lavf --enable-vdpau --enable-bzlib --enable-libdirac --enable-libgsm --enable-libopenjpeg --enable-libschroedinger --enable-libspeex --enable-libtheora --enable-libvorbis --enable-pthreads --enable-zlib --disable-stripping --disable-vhook --enable-runtime-cpudetect --enable-gpl --enable-postproc --enable-swscale --enable-x11grab --enable-libfaad --enable-libdc1394 --enable-shared --disable-static
  libavutil 49.15. 0 / 49.15. 0
  libavcodec 52.20. 1 / 52.20. 1
  libavformat 52.31. 0 / 52.31. 0
  libavdevice 52. 1. 0 / 52. 1. 0
  libavfilter 0. 4. 0 / 0. 4. 0
  libswscale 0. 7. 1 / 0. 7. 1
  libpostproc 51. 2. 0 / 51. 2. 0
  built on Jun 10 2012 08:33:06, gcc: 4.4.5
/home/jm/.stopmotion/packer/teste ça/images/%06d.jpg: I/O error occurred
Usually that means that input file is truncated and/or corrupted.
----------------------------------------

It seems there are some erroneous toLatin1 conversions in the source code. I did a grep and saw that there are still many occurences of this function. Most of them are used to read/write the content of the preference file (should then not be so problematic) while few of them seem to apply to a file path. Do you think all the occurrences of toLatin1 should be replaced by toLocal8Bit?

Revision history for this message
Tim Band (tim-band) wrote :

I did a grep myself recently, and thought that some were correct and some were probably wrong, although if memory serves all the correct ones should probably have been toAscii. I'll have another look when I'm back from holiday.

Thanks for testing this.

Revision history for this message
Tim Band (tim-band) wrote :

OK, here's my opinion having looked into it. Here is a list of things that don't need to (or shouldn't change):

A1) in languagehandler.cpp, all the locale strings are in ASCII.
A2) in devicetab.cpp, importtab.cpp, exporttab.cpp, frameview.cpp and qtfrontend.cpp preference names (in the preferences.xml file) are ASCII. The reason QStrings are often used here is to make use of QStrings text manipulation functions. I did try converting them to std::strings, but the result was not pretty! Best to leave these as they are; there's no real problem here.
A3) launching GIMP in modelhandler.cpp; prepoll, startdeamon and stopdaemon commands in importtab.cpp, exporttab.cpp, frameview.cpp and qtfrontend.cpp; device names in devicetab.cpp. All these strings are commands that come from the preferences file and will be sent to the shell. What are we supposed to do if someone enters Unicode here? What does it mean? It probably depends on what is actually stored on the file system, and we have no information about that. Probably best to ignore the problem and not support Unicode in shell commands.
A4) in devicetab.cpp, similarly, device names are ASCII.

Some things that definitely should change:

B1) in modelhandler.cpp, getting the image name to pass to GIMP should use fromLocal8Bit, I assume, but this is because Qt uses QStrings for shell commands when it should not.
B2) in soundhandler.cpp, sound file names got from Qt GUI components should use toLocal8Bit for the same stupid reason.
B3) in mainwindowgui.cpp, filenames export to Cinelerra and export to video should use toLocal8Bit for the same reason.
B4) everywhere where Unicode might be stored in the preferences file, it should be converted to UTF8. I never tire of telling people that local encodings should NOT be used for storage (unless you also store what the encoding is).

Some things that would be better if they changed:

C1) in externalchangemonitor.h/.cpp, Unicode is used entirely gratuitously. Use std::sets of std::strings.
C2) in mainwindowgui.cpp, saveProjectAs converts to and from Unicode for no reason. Should use std::string.
C3) in qtfrontend.cpp, getenv("HOME") is converted to and from Unicode for no reason. Should use std::string.

Revision history for this message
Tim Band (tim-band) wrote :

Patch that fixes B1, B2, B3, C1, C2 and C3.

Revision history for this message
Tim Band (tim-band) wrote :

And a patch that fixes B4... but not properly! Making the preferences file UTF-8 requires a bump in the version number and some code to upgrade older files. Also, should probably change the Latin1 bits to ASCII just to make the point...

Revision history for this message
Tim Band (tim-band) wrote :

Oops, here's the patch. Try it out if you like.

Revision history for this message
Herman Robak (herman-skolelinux) wrote :

Is it just me, or are full file paths used in many places where relative paths or just filenames should be used instead? openat() instead of open() comes to mind...

Changed in lsm:
status: New → In Progress
importance: Undecided → Medium
Revision history for this message
Tim Band (tim-band) wrote :

openat() is a new one on me. That would probably be a better answer than using full paths and having to worry about encodings.

Revision history for this message
Tim Band (tim-band) wrote :

My apologies. I missed one. This allows Export To Video to work from projects with non-ASCII names.

Revision history for this message
Jean-Michel Philippe (jean-michel-philippe) wrote :

Hi Tim,

I've tested the patches but still have the issue while exporting the project to a video file, although it seems to have changed of nature. The error message is now slightly different:

=============
/home/jm/.stopmotion/packer/teste ça/images/%06d.jpg: No such file or directory
=============

The image directory actually exists:

===========
$ ls .stopmotion/packer/
teste ça
$ ls .stopmotion/packer/teste\ ça/images/
000001.jpg 000002.jpg 000003.jpg 000004.jpg 000005.jpg 000006.jpg 000007.jpg 000008.jpg
===========

I had a look at the source code but can't figure out how the temporary images can be successfully created while the command line for ffmpeg gets wrongly encoded characters.

Revision history for this message
Tim Band (tim-band) wrote :

Arrrgh! Is there no way out of this ridiculous situation?

Thanks Jean-Michel for testing this. I have reproduced your problem. Oddly, it only occurs for me with ffmpeg, not mencoder. Does it work for you with mencoder?

Removing the quotes from the ffmpeg command line makes it work, but not if your project name or video name has spaces. Altering the code to add backslashes before any inserted spaces, weirdly, does not work; the path is still interpreted as being just the part before the space.

I really don't know what Qt (or, for that matter, the system calls underlying it) are doing here.

Revision history for this message
Tim Band (tim-band) wrote :

Correction: Actually it does work for me with double quotes. I had a -r 10 option, and this is invalid. with -r 30 it works even with a Thai character and a space in my project name. Here is my command line:

ffmpeg -r 30 -b 1800 -i "$IMAGEPATH/%06d.jpg" "$VIDEOFILE"

Jean-Michel: Are you sure you applied the Part 3 patch, in comment #20 (the patch in comment #17 is not necessary)? Your error message looks like the one I fixed with Part 3.

The reason it can work in one place and not in others is the fact that every time you convert between entries in the preference file (in Latin1 at the moment -- it should be in UTF8), results from standard C libraries (in local 8-bit) and Qt QStrings (UTF16) you have the possibility of error. It's the classic "I don't have any internationalization problems because I use Unicode" fallacy. You still have to get it right at every possible point of failure -- and there are lots.

Revision history for this message
Jean-Michel Philippe (jean-michel-philippe) wrote :

You're right Tim, I had missed the latest patch from comment #20. Sorry this was the reason of failure. I now have it work even with the initial option “-r 10”. I think Stopmotion should then work like a charm, which is very good news. I'm going to test on i386 and armel since we're using these platforms for DoudouLinux. Then I will test the “unicode in preference file” patch.

Revision history for this message
Tim Band (tim-band) wrote :

I have finally gathered this all up and pushed it to branch bug1011999 (along with the fix to bug 1028835). It works for me (as long as I have built my own libtar; I use v1.2.20, but Ubuntu uses 1.2.16, which is too old).

Tim Band (tim-band)
Changed in lsm:
status: In Progress → Fix Committed
Revision history for this message
Tim Band (tim-band) wrote :

released in 0.7

Changed in lsm:
status: Fix Committed → Fix Released
Tim Band (tim-band)
Changed in lsm:
assignee: nobody → Tim Band (tim-band)
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.