Bug #1317883 “calibre should care of 'decomposed UTF-8' filename...” : Bugs : calibre

Revision history for this message

Robert Błaut (1-robert) wrote on 2014-05-09:

#1

Test case showing problem (824.9 KiB, application/octet-stream)

Robert Błaut (1-robert) wrote on 2014-05-09:

#2

calibre-windows.png (111.9 KiB, image/png)

Robert Błaut (1-robert) wrote on 2014-05-09:

#3

calibre-mac-os-x.png (306.4 KiB, image/png)

description:

updated

Kovid Goyal (kovid) wrote on 2014-05-09:

#4

This is on my TODO list, but it's not a priority. You should absolutely
not be using unicode characters in filenames in EPUB, they will cause
endless problems.

Changed in calibre:
status:	New → Won't Fix

Robert Błaut (1-robert) wrote on 2014-05-12:

#5

Kovid, I know that using unicode is problematic in EPUBs, but I often edit books bought elsewhere that have the problem described above. Even if I want to correct it, the errors reported by calibre are misleading.

Kovid Goyal (kovid) wrote on 2014-05-12: Re: calibre bug 1317883

#6

Then you need to fix the filenames on a non-OS X computer first.

Robert Błaut (1-robert) wrote on 2014-05-12:

#7

But I usually work on Mac OS X :( Is it really that much work to write NFD normalization for Mac OS X in ebook-edit?

Kovid Goyal (kovid) wrote on 2014-05-12:

#8

The problem is much larger than unicode normalization. Edit book needs to
match filenames referred to in XML, which can be arbitrary unicode, to
filenames in the file system, which can be

1) In a different unicode normalization
2) case insensitive/sensitive depending on OS/filesystem driver
3) have other restrictions on the characters allowed in them, their
total length and so on

The only way to robustly solve all those issues is to implement a
virtual filesystem layer, to conceal the inadequacies of file systems
from the rest of the code. Anything less than that would just be a
temporary bandaid, and not something I am willing to waste time on.
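
As a rough illustration of the first two mismatches in the list above, the following Python 3 sketch compares a name as it might be written in an OPF or HTML file with the same name as a file system might return it. The helpers `match_key` and `names_match` and the sample file names are hypothetical and are not part of calibre. The sketch does not address the third point (character and length restrictions).

```python
import unicodedata


def match_key(name, case_sensitive=True):
    """Return a comparison key for a book-internal file name (hypothetical helper)."""
    # Normalize to NFC so composed and decomposed spellings compare equal.
    key = unicodedata.normalize('NFC', name)
    if not case_sensitive:
        # casefold() can produce text that is no longer NFC, so normalize again.
        key = unicodedata.normalize('NFC', key.casefold())
    return key


def names_match(xml_name, fs_name, case_sensitive=True):
    """Compare a name taken from XML with a name taken from the file system."""
    return match_key(xml_name, case_sensitive) == match_key(fs_name, case_sensitive)


# The same visible name: 'ó' as a single code point (NFC) ...
xml_name = 'Text/rozdzia\u0142_\u00f3smy.xhtml'
# ... and as 'o' followed by a combining acute accent (NFD).
fs_name = 'Text/rozdzia\u0142_o\u0301smy.xhtml'

print(xml_name == fs_name)             # plain comparison of code points
print(names_match(xml_name, fs_name))  # comparison after NFC normalization
```

A comparison key like this only answers "do these two names refer to the same file"; it does not by itself provide the transparent mapping described above.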

Robert Błaut (1-robert) wrote on 2014-05-12:

#9

Kovid, what about automatically transliterating all unicode filenames, URLs, etc. to ASCII, using for example https://pypi.python.org/pypi/Unidecode ?

Kovid Goyal (kovid) wrote on 2014-05-12:

#10

If you transliterate filenames you still need to maintain some kind of
mapping from the original to the transliterated and back, in other words
a VFS. And if you want a much more sophisticated implementation of file
name sanitization than unidecode look at calibre's source code, grep for
ascii_filename and sanitize_file_name.

The hard part is not sanitizing filenames, the hard part is implementing
the mapping in a way that is transparent to existing code.
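
To make the "mapping from the original to the transliterated and back" concrete, here is a minimal Python 3 sketch of a two-way name map. It is only an illustration: `to_ascii` is a deliberately crude, hypothetical stand-in (it is neither Unidecode nor calibre's ascii_filename), and `NameMap` is an invented class, not anything in the calibre source.

```python
import unicodedata


def to_ascii(name):
    """Crude, hypothetical sanitizer used only for this illustration."""
    # Decompose, drop combining marks, then replace whatever is still non-ASCII.
    decomposed = unicodedata.normalize('NFKD', name)
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return ''.join(c if ord(c) < 128 else '_' for c in stripped)


class NameMap:
    """Two-way map between original names and their ASCII replacements."""

    def __init__(self):
        self.to_safe = {}      # original (NFC) name -> ASCII name
        self.to_original = {}  # ASCII name -> original (NFC) name

    def add(self, original):
        original = unicodedata.normalize('NFC', original)
        if original in self.to_safe:
            return self.to_safe[original]
        safe = to_ascii(original)
        candidate = safe
        n = 1
        # Two different originals can sanitize to the same ASCII name,
        # so add a numeric suffix until the candidate is unused.
        while candidate in self.to_original:
            stem, dot, ext = safe.rpartition('.')
            if dot:
                candidate = '%s_%d.%s' % (stem, n, ext)
            else:
                candidate = '%s_%d' % (safe, n)
            n += 1
        self.to_safe[original] = candidate
        self.to_original[candidate] = original
        return candidate


names = NameMap()
first = names.add('\u0105.xhtml')   # 'ą.xhtml'
second = names.add('a.xhtml')       # collides with the sanitized first name
print(first, second)
print(names.to_original[first] == '\u0105.xhtml')
```

Keeping the table is the easy part. Every place that reads or writes a file would have to go through it, which is the transparency problem raised above.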

Rather than suggesting stuff, all of which, I can assure you, I am
already aware of, I suggest you check out the calibre source code and
start working on a fix yourself. I will not look at this further till I
am ready to implement a VFS. If all you want is to work around OS X's
limitations, it should be a trivial patch, as long as you can
assume that all text in the XML files is pre-normalized to NFC.

Here's something to get you started:

diff --git a/src/calibre/ebooks/oeb/polish/container.py b/src/calibre/ebooks/oeb/polish/container.py
index b8fd8e2c18..7abbeb6416 100644
--- a/src/calibre/ebooks/oeb/polish/container.py
+++ b/src/calibre/ebooks/oeb/polish/container.py
@@ -7,7 +7,7 @@ __license__   = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 
-import os, logging, sys, hashlib, uuid, re, shutil
+import os, logging, sys, hashlib, uuid, re, shutil, unicodedata
 from collections import defaultdict
 from io import BytesIO
 from urlparse import urlparse
@@ -125,6 +125,7 @@ class Container(object):  # {{{
             for f in filenames:
                 path = join(dirpath, f)
                 name = self.abspath_to_name(path)
+                name = unicodedata.normalize('NFC', name)
                 self.name_path_map[name] = path
                 self.mime_map[name] = guess_type(path)
                 # Special case if we have stumbled onto the opf
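
The idea behind the patch can be tried outside calibre. The Python 3 sketch below walks a directory and keys each file by its NFC-normalized relative name, under the same assumption the patch makes: that the names in the XML files are already in NFC. `build_name_path_map` is a hypothetical stand-in for the loop in `Container`, and the temporary directory and file name are invented for the demonstration.

```python
import os
import tempfile
import unicodedata


def build_name_path_map(root):
    """Map NFC-normalized, '/'-separated relative names to real paths."""
    name_path_map = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for f in filenames:
            path = os.path.join(dirpath, f)
            # Book-internal names use '/' regardless of the OS separator.
            name = os.path.relpath(path, root).replace(os.sep, '/')
            # The one-line fix: look names up in NFC form, whatever form
            # the file system handed back.
            name = unicodedata.normalize('NFC', name)
            name_path_map[name] = path
    return name_path_map


with tempfile.TemporaryDirectory() as root:
    # Create a file whose name is spelled in decomposed form.
    decomposed = 'o\u0301smy.xhtml'
    with open(os.path.join(root, decomposed), 'w') as fh:
        fh.write('')
    demo_map = build_name_path_map(root)

# Look the file up by the composed spelling an XML file would typically use.
print('\u00f3smy.xhtml' in demo_map)
```

The map values keep the path exactly as the file system reported it, so the file can still be opened; only the lookup key is normalized.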

Robert Błaut (1-robert) wrote on 2014-05-12:

#11

Kovid, all XML files have URLs properly UTF-8 encoded.
Only filenames have 'decomposed UTF-8'.

So after applying your patch calibre on Mac OS X behaves the same as calibre on Windows. Thank you :)

I think it would be safe to apply this to the calibre tree:

if sys.platform == 'darwin':
    name = unicodedata.normalize('NFC', name)

Any chance?

Kovid Goyal (kovid) wrote on 2014-05-12:

#12

UTF-8 is irrelevant. The UTF-8 encodings of the NFC and NFD normalizations
of the same text are not equal. That patch only works because it so
happens that your book has its filenames in the XML files in NFC form.
If it had them in NFD form it would not work. As I said, this solution
is simply a temporary bandaid that does not actually fix anything.

And just by the way: XML (and HTML) can be in any encoding whatsoever,
it does not have to be UTF-8. Although to repeat, the byte encoding of
the XML files is irrelevant. What matters is the unicode code points
that byte encoding represents. And those unicode code points could be
in either NFC or NFD.
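
A short Python 3 sketch of the distinction drawn above, using the standard `unicodedata` module. The sample character is chosen only for illustration.

```python
import unicodedata

text = '\u00f3'  # LATIN SMALL LETTER O WITH ACUTE
nfc = unicodedata.normalize('NFC', text)  # one code point: U+00F3
nfd = unicodedata.normalize('NFD', text)  # two code points: U+006F U+0301

# Same byte encoding, different normalization: the bytes differ.
print(nfc.encode('utf-8'))  # b'\xc3\xb3'
print(nfd.encode('utf-8'))  # b'o\xcc\x81'

# Different byte encodings, same normalization: the bytes differ,
# but they decode to identical code points.
as_utf8 = nfc.encode('utf-8')
as_utf16 = nfc.encode('utf-16')
print(as_utf8 == as_utf16)                                   # False
print(as_utf8.decode('utf-8') == as_utf16.decode('utf-16'))  # True

# The comparison that matters is made on code points after normalization.
print(nfc == nfd)                                # False
print(nfc == unicodedata.normalize('NFC', nfd))  # True
```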

Anyway, I am done posting on this bug report. I have other work to do :)

As I said, I will implement a proper solution for this when I am ready
to implement a VFS. Until then, use that patch and hope that all your
books use the NFC form.

Kovid Goyal (kovid) wrote on 2014-06-22: Fixed in master

#13

Fixed in branch master. The fix will be in the next release. calibre is usually released every Friday.

Changed in calibre:
status:	Won't Fix → Fix Released

calibre

calibre should care of 'decomposed UTF-8' filenames on Darwin platform
