calibre should take care of 'decomposed UTF-8' filenames on the Darwin platform

Bug #1317883 reported by Robert Błaut on 2014-05-09
This bug affects 1 person
Affects Status Importance Assigned to Milestone
calibre
Undecided
Unassigned

Bug Description

Mac OS X uses a sort of 'decomposed UTF-8' for storing filenames. calibre incorrectly "thinks" it's a 'simple UTF-8'.

Details: http://stackoverflow.com/questions/9757843/unicode-encoding-for-filesystem-in-mac-os-x-not-correct-in-python

So ebook-edit checking is completely useless if any filename contains unicode characters. Check the attached screenshots and compare the Windows and Mac OS X calibre output.

I've also attached a test case (an epub file) demonstrating the problem.

Robert Błaut (1-robert) wrote :
description: updated
Kovid Goyal (kovid) wrote :

This is on my TODO list. But it's not a priority. You should absolutely
not be using unicode characters in filenames in EPUB, they will cause
endless problems.

Changed in calibre:
status: New → Won't Fix
Robert Błaut (1-robert) wrote :

Kovid, I know that using unicode is problematic in epubs, but I often edit books bought elsewhere that have the problem described above. Even when I want to correct it, the errors reported by calibre are misleading.

Kovid Goyal (kovid) wrote :

Then you need to fix the filenames on a non-OS X computer first.

Robert Błaut (1-robert) wrote :

But I usually work on Mac OS X :( Is it really a huge amount of work to write NFD normalization for Mac OS X in ebook-edit?

Kovid Goyal (kovid) wrote :

The problem is much larger than unicode normalization. Edit book needs to
match filenames referred to in XML which can be arbitrary unicode to
filenames in the file system, which can be

1) In a different unicode normalization
2) case insensitive/sensitive depending on OS/filesystem driver
3) have other restrictions on the characters allowed in them, their
total length and so on

The only way to robustly solve all those issues is to implement a
virtual filesystem layer, to conceal the inadequacies of file systems
from the rest of the code. Anything less than that would just be a
temporary bandaid, and not something I am willing to waste time on.
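
To make the first two mismatches in the list above concrete, here is a minimal Python 3 sketch. It is not calibre code: `match_key` and `build_lookup` are hypothetical helpers, and the file name is made up. It only shows how a name as written in the XML can fail to equal the name returned by the file system until both are reduced to a common key.

```python
import unicodedata


def match_key(name, case_insensitive=False):
    """Hypothetical helper: reduce a name to a key usable for comparison."""
    key = unicodedata.normalize('NFC', name)
    if case_insensitive:
        # Case folding can denormalize the text, so normalize again afterwards.
        key = unicodedata.normalize('NFC', key.casefold())
    return key


def build_lookup(fs_names, case_insensitive=False):
    """Map comparison keys to the names actually present on disk."""
    return {match_key(n, case_insensitive): n for n in fs_names}


# Name as an XML file might spell it (NFC, written with escapes for clarity).
xml_ref = 'text/za\u017c\u00f3\u0142\u0107.xhtml'
# The same name as a file system that stores decomposed forms would return it.
fs_name = unicodedata.normalize('NFD', xml_ref)

# A naive comparison fails even though both spell the same text.
naive_match = (xml_ref == fs_name)

# Comparing through a common key finds the file.
lookup = build_lookup([fs_name])
found = lookup.get(match_key(xml_ref))

# With a case-insensitive file system the key must also fold case.
ci_lookup = build_lookup([fs_name], case_insensitive=True)
found_ci = ci_lookup.get(match_key(xml_ref.upper(), case_insensitive=True))
```

This covers only normalization and case. The third item (forbidden characters, length limits) cannot be handled by a comparison key alone, which is part of why a mapping layer is needed.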

Robert Błaut (1-robert) wrote :

Kovid, what about automatically transliterating all unicode filenames, URLs, etc. to ASCII, using for example https://pypi.python.org/pypi/Unidecode ?

Kovid Goyal (kovid) wrote :

If you transliterate filenames you still need to maintain some kind of
mapping from the original to the transliterated and back, in other words
a VFS. And if you want a much more sophisticated implementation of file
name sanitization than unidecode look at calibre's source code, grep for
ascii_filename and sanitize_file_name.

The hard part is not sanitizing filenames, the hard part is implementing
the mapping in a way that is transparent to existing code.
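
As a rough illustration of the "mapping from the original to the transliterated and back", here is a Python 3 sketch. Everything in it is an assumption made for illustration: `to_ascii` is a crude stand-in, not calibre's `ascii_filename` or `sanitize_file_name`, and `NameMap` is a hypothetical class. It shows the bookkeeping only, not how to make it transparent to existing code, which is the hard part.

```python
import unicodedata


def to_ascii(name):
    """Crude, hypothetical sanitizer: drop accents, replace other non-ASCII."""
    out = []
    for ch in unicodedata.normalize('NFKD', name):
        if unicodedata.combining(ch):
            continue  # drop combining marks left over from decomposition
        out.append(ch if ord(ch) < 128 else '_')
    return ''.join(out)


class NameMap:
    """Two-way mapping between original names and sanitized on-disk names."""

    def __init__(self):
        self.to_safe = {}      # original name -> sanitized name
        self.to_original = {}  # sanitized name -> original name

    def add(self, original):
        if original in self.to_safe:
            return self.to_safe[original]
        base = to_ascii(original)
        safe, n = base, 1
        # Two different originals may sanitize to the same name, so
        # disambiguate with a numeric suffix before the extension.
        while safe in self.to_original:
            n += 1
            stem, dot, ext = base.rpartition('.')
            safe = f'{stem}_{n}{dot}{ext}' if dot else f'{base}_{n}'
        self.to_safe[original] = safe
        self.to_original[safe] = original
        return safe


names = NameMap()
first = names.add('\u017c\u00f3\u0142w.xhtml')  # 'żółw.xhtml'
second = names.add('zo_w.xhtml')                # collides after sanitizing
```

Every place that reads a name from the XML or writes a file to disk would have to go through such a map, in both directions.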

Rather than suggesting stuff, all of which, I can assure you, I am
already aware of, I suggest you check out the calibre source code and
start working on a fix yourself. I will not look at this further till I
am ready to implement a VFS. If all you want is to work around OS X's
limitations, it should be a trivial patch, as long as you can
assume that all text in the XML files is pre-normalized to NFC.

Here's something to get you started:

diff --git a/src/calibre/ebooks/oeb/polish/container.py b/src/calibre/ebooks/oeb/polish/container.py
index b8fd8e2c18..7abbeb6416 100644
--- a/src/calibre/ebooks/oeb/polish/container.py
+++ b/src/calibre/ebooks/oeb/polish/container.py
@@ -7,7 +7,7 @@ __license__ = 'GPL v3'
 __copyright__ = '2013, Kovid Goyal <kovid at kovidgoyal.net>'
 __docformat__ = 'restructuredtext en'
 
-import os, logging, sys, hashlib, uuid, re, shutil
+import os, logging, sys, hashlib, uuid, re, shutil, unicodedata
 from collections import defaultdict
 from io import BytesIO
 from urlparse import urlparse
@@ -125,6 +125,7 @@ class Container(object): # {{{
             for f in filenames:
                 path = join(dirpath, f)
                 name = self.abspath_to_name(path)
+                name = unicodedata.normalize('NFC', name)
                 self.name_path_map[name] = path
                 self.mime_map[name] = guess_type(path)
                 # Special case if we have
...


Robert Błaut (1-robert) wrote :

Kovid, all XML files have URLs properly UTF-8 encoded.
Only filenames have 'decomposed UTF-8'.

So after applying your patch calibre on Mac OS X behaves the same as calibre on Windows. Thank you :)

I think it would be safe to apply this to the calibre tree:

                if sys.platform == 'darwin':
                    name = unicodedata.normalize('NFC', name)

Any chance?

Kovid Goyal (kovid) wrote :

UTF-8 is irrelevant. The UTF-8 encodings of the NFC and NFD normalizations
of the same text are not equal. That patch only works because it so
happens that your book has its filenames in the XML files in NFC form.
If it had them in NFD form it would not work. As I said, this solution
is simply a temporary bandaid that does not actually fix anything.

And just by the way: XML (and HTML) can be in any encoding whatsoever,
it does not have to be UTF-8. Although to repeat, the byte encoding of
the XML files is irrelevant. What matters is the unicode code points
that the byte encoding represents. And those unicode code points could be
in either NFC or NFD.
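
The point can be checked with a few lines of Python 3. This is an illustrative sketch using only the standard `unicodedata` module; the sample character is arbitrary.

```python
import unicodedata

text = '\u00e9'  # 'é' as a single precomposed code point

nfc = unicodedata.normalize('NFC', text)  # one code point: U+00E9
nfd = unicodedata.normalize('NFD', text)  # two code points: 'e' + U+0301

# Same text, different code points, hence different UTF-8 bytes.
nfc_bytes = nfc.encode('utf-8')  # b'\xc3\xa9'
nfd_bytes = nfd.encode('utf-8')  # b'e\xcc\x81'
same_bytes = (nfc_bytes == nfd_bytes)

# The byte encoding of the source does not matter: decoding the Latin-1
# byte for 'é' yields the same code point as decoding the UTF-8 bytes.
from_latin1 = b'\xe9'.decode('latin-1')
from_utf8 = b'\xc3\xa9'.decode('utf-8')

# Only after both sides are put in one normalization form do they compare equal.
equal_after_normalizing = (unicodedata.normalize('NFC', nfd) == nfc)
```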

Anyway, I am done posting on this bug report. I have other work to do :)

As I said, I will implement a proper solution for this when I am ready
to implement a VFS. Until then, use that patch and hope that all your
books use the NFC form.

Fixed in branch master. The fix will be in the next release. calibre is usually released every Friday.

 status fixreleased

Changed in calibre:
status: Won't Fix → Fix Released