errors on some filenames

Bug #872376 reported by Carl Karsten
This bug affects 1 person
Affects   Status  Importance  Assigned to  Milestone
Inkscape  New     Undecided   Unassigned

Bug Description

Steps:
1. Create a valid SVG.
2. Give it a new name using a bashism (still a valid file name).
3. Try to open it; Inkscape reports an error.

juser@trist:~$ echo "<svg />" > x.svg
juser@trist:~$ inkscape x.svg
(file opens, no errors)

juser@trist:~$ cp x.svg $'\xdf.svg'

juser@trist:~$ inkscape $'\xdf.svg'

** (inkscape:5942): CRITICAL **: Inkscape::XML::Document* sp_repr_read_file(const gchar*, const gchar*): assertion `Inkscape::IO::file_test( filename, G_FILE_TEST_EXISTS )' failed

A simple example of another program that accepts this file name on the command line, opens the file, and reads the data:

juser@trist:~$ file $'\xdf.svg'
�.svg: ASCII text

And maybe related: saving a png using the same bashism results in a different file name:

juser@trist:~$ inkscape x.svg --export-png $'\xdf.png'
Background RRGGBBAA: ffffff00
Area 0:0:744.094:1052.36 exported to 744 x 1052 pixels (90 dpi)
Bitmap saved as: �.png

juser@trist:~$ ls ?.svg | hexdump -C
00000000 df 2e 73 76 67 0a 78 2e 73 76 67 0a |�.svg.x.svg.|

juser@trist:~$ ls ??.png | hexdump -C
00000000 c3 9f 2e 70 6e 67 0a |�..png.|

In the hexdump above, I would expect the single byte df, not the two bytes c3 9f.

Fairly vanilla Ubuntu install, Inkscape from the trunk PPA (but not sure how current it really is).

juser@trist:~$ apt-cache policy inkscape
inkscape:
  Installed: 0.48+devel+10324+10~natty1
  Candidate: 0.48.1-2ubuntu2
  Version table:
     0.48.1-2ubuntu2 0
        500 http://us.archive.ubuntu.com/ubuntu/ natty/main amd64 Packages
 *** 0.48+devel+10324+10~natty1 0
        500 http://ppa.launchpad.net/inkscape.dev/trunk/ubuntu/ natty/main amd64 Packages
        100 /var/lib/dpkg/status

juser@trist:~$ uname -a
Linux trist 2.6.38-11-generic #50-Ubuntu SMP Mon Sep 12 21:17:25 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

Tags: encoding linux
Carl Karsten (carlfk) wrote :

Similar bugs:

https://bugs.launchpad.net/inkscape/+bug/167923 Open dialog crashes when in non-ASCII path name

https://bugs.launchpad.net/inkscape/+bug/166340 yet another crash with non-ascii filenames

juser@trist:~$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

su_v (suv-lp) wrote :

Based on the lengthy discussion on #inkscape (IRC): could you attach the original Python script that created those file names with mixed or undefined encodings in the first place, along with its input (file, variables, command-line parameters) and the generated SVG file?

> 15:28 < CarlFK> su-v: no AI... process is this:
> 1) create pyconde.svg with id="title" and id="authors" on 2 text elements.
> 2) use python to read svg, set title/authors, save as <title>.svg,
> 3. shell out inkscape -- <title>.svg --export-png <title>.png

More details about step 1 could possibly be helpful, too:
- how is 'pyconde.svg' created?
- is it UTF-8 (default Inkscape SVG file)?
  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
- is the content (string) of the <title> tag (which the script seems to reuse as file name) properly UTF-8 encoded as well?
- or is 'pyconde.svg' generated by a third-party tool (possibly encoded in ISO-8859-1)?
  <?xml version="1.0" encoding="ISO-8859-1" standalone="no"?>

It seems to me that, if anything, this issue needs to be addressed in the Python script, so that it produces file names consistent with the user's locale setting (you originally reported the locale to be "en_US", a non-UTF-8 locale: <http://paste.ubuntu.com/706185/>). However, the default encoding of file names on Ubuntu is UTF-8 (whichever version "Fairly vanilla Ubuntu install" refers to).

AFAICT your bashism "cp x.svg $'\xdf.svg'" produces a Latin-1 (ISO-8859-1) encoded character, whereas to generate UTF-8 encoded output you'd have to use "cp x.svg $'\xC3\x9F'.svg" in bash.
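To make the byte-level difference concrete, here is a minimal Python sketch of the two encodings of U+00DF (ß). It only illustrates the byte arithmetic. The last step shows that reading the lone byte as Latin-1 and re-encoding it as UTF-8 reproduces the bytes seen in the exported PNG name; whether Inkscape actually performs such a conversion internally is an assumption, not something confirmed in this report.

```python
# U+00DF (LATIN SMALL LETTER SHARP S) in the two encodings discussed above.
sharp_s = "\u00df"

latin1_bytes = sharp_s.encode("latin-1")  # one byte
utf8_bytes = sharp_s.encode("utf-8")      # two bytes

print(latin1_bytes.hex(" "))  # df
print(utf8_bytes.hex(" "))    # c3 9f

# Assumption for illustration: the byte 0xdf is interpreted as Latin-1
# and then written back out as UTF-8.
converted = b"\xdf.png".decode("latin-1").encode("utf-8")
print(converted.hex(" "))     # c3 9f 2e 70 6e 67
```

The final line matches the `c3 9f 2e 70 6e 67` sequence from the `ls ??.png | hexdump -C` output in the bug description.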

tags: added: encoding
tags: added: linux
Carl Karsten (carlfk) wrote :

$ head pyconde.svg
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!-- Created with Inkscape (http://www.inkscape.org/) -->
<svg
   xmlns:dc="http://purl.org/dc/elements/1.1/"

> - is the content (string) of the <title> tag (which the script seems to reuse as file name) properly UTF-8 encoded as well?
Don't know. I think not, but it seems to be working anyway: Inkscape renders the expected characters.

Here is the code that creates the <title>.svg:

open(cooked_svg_name,'w').write(cooked_svg)
https://github.com/CarlFK/veyepar/blob/master/dj/scripts/enc.py#L197
The value of cooked_svg_name can be traced back to a .json file.

I think what needs to be figured out is: what is and isn't a valid file name?

According to #linux,
(11:20:44 AM) CarlFK: what are valid chars for ext3 fs file names?
(11:20:55 AM) amrit|SEA: anything but /
I searched for docs stating that, but couldn't find anything either way.
http://e2fsprogs.sourceforge.net/ext2intro.html seemed like the obvious place to look.

Assuming that is correct,
cp x.svg $'\xdf.svg' creates a valid file, and there should be some way for Inkscape to open it.
There seems to be no way to pass it on the command line or to select it in the GUI open-file dialog.

Also from #linux:
(11:28:30 AM) amrit|SEA: CarlFK: from a quick google search, it appears it uses the boost library
(11:29:13 AM) amrit|SEA: CarlFK: if they're using the boost::fs, that lib is more restrictive than the base OS, in favor of cross platform compatibility
(11:29:31 AM) amrit|SEA: so that *could* be why they're claiming that's not a valid filename

If this is the case, then this is just a limitation we have to live with; it should be documented, and no more time needs to be spent on the code.

Carl Karsten (carlfk) wrote :

Continuing my assumptions, this looks like the same outcome:

"""
NFSv4 requires that all filenames be exchanged using UTF-8 over the wire. The NFSv4 specification, RFC 3530, says that filenames should be UTF-8 encoded in section 1.4.3: "In a slight departure, file and directory names are encoded with UTF-8 to deal with the basics of internationalization." The same text is also found in the newer NFS 4.1 RFC (RFC 5661) section 1.7.3. The current Linux NFS client simply passes filenames straight through, without any conversion from the current locale to and from UTF-8. Using non-UTF-8 filenames could be a real problem on a system using a remote NFSv4 system; any NFS server that follows the NFS specification is supposed to reject non-UTF-8 filenames. So if you want to ensure that your files can actually be stored from a Linux client to an NFS server, you must currently use UTF-8 filenames. In other words, although some people think that Linux doesn't force a particular character encoding on filenames, in practice it already requires UTF-8 encoding for filenames in certain cases.
"""
http://lwn.net/Articles/71472/ "The kernel and character set encodings"

Carl Karsten (carlfk) wrote :

Whoops, the above quote is from
http://www.dwheeler.com/essays/fixing-unix-linux-filenames.html#utf8
Fixing Unix/Linux/POSIX Filenames:
Control Characters (such as Newline), Leading Dashes, and Other Problems
David A. Wheeler
2009-03-24 (revised 2011-07-29)

which states:
"""
Traditionally, Unix/Linux/POSIX pathnames and filenames can be almost any sequence of bytes. A pathname lets you select a particular file, and may include zero or more “/” characters. Each pathname component (separated by “/”) is a filename; filenames cannot contain “/”. Neither filenames nor pathnames can contain the ASCII NUL character (\0), because that is the terminator.
"""

Jon A. Cruz (jon-joncruz) wrote :

I'll have to look into the specifics here, but we follow glib's behavior in validating names.

There are three main encodings in play: internal (UTF-8), system, and filesystem. The OS settings can affect the latter. To work without problems, the configured filesystem encoding needs to match the encoding actually used for names on the filesystem. Modern Linux configurations set these to UTF-8. If the filesystem encoding glib is using has been set to UTF-8, then all byte sequences must be valid UTF-8.

If random bytes are desired, then the OS needs to be set to report a different filesystem encoding - one that allows random byte sequences. ISO-8859-1 might be suitable.

Bottom line is that, for code and user safety, Inkscape enforces what the system has been configured to use. If one needs sequences that are illegal UTF-8, then one needs to configure the system not to report UTF-8 as the filesystem encoding.
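As an illustration of why the name from this report is rejected when the filesystem encoding is UTF-8, the following sketch tests byte sequences with Python's strict UTF-8 decoder. It is an analogous check, not glib's own validation code, and the function name is hypothetical.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if `name` decodes cleanly as strict UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# 0xdf announces a two-byte sequence, but the next byte (0x2e, ".")
# is not a continuation byte, so the name is not valid UTF-8.
print(is_valid_utf8(b"\xdf.svg"))      # False
print(is_valid_utf8(b"\xc3\x9f.svg"))  # True
```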
