[master] can't cope with NFD Unicode normalization on Mac OS X

Bug #172383 reported by n[ate]vw on 2007-11-27
This bug affects 15 people
Affects  Status     Importance  Assigned to
Bazaar   Confirmed  Medium      Unassigned
Breezy   Triaged    Low         Unassigned

Bug Description

The effect of this bug is that unicode filenames do not work (or at least don't generally work) with bzr on OS X, because the OS changes the unicode representation from what was passed in.

----

OS X 10.5.1 (HFS+, case-sensitive, journaled) / bzr 0.92.0, trying to 'bzr add' a folder named "süb" fails. This may be related to https://bugs.launchpad.net/bzr/+bug/102935.

Example:

stravinsky:test_bzr nathan$ bzr add süb
added "süb"
added "süb"
bzr: ERROR: exceptions.AttributeError: 'NoneType' object has no attribute 'file_id'

Traceback (most recent call last):
  File "/Users/nathan/lib/python/bzrlib/commands.py", line 802, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/Users/nathan/lib/python/bzrlib/commands.py", line 758, in run_bzr
    ret = run(*run_argv)
  File "/Users/nathan/lib/python/bzrlib/commands.py", line 492, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/Users/nathan/lib/python/bzrlib/builtins.py", line 384, in run
    no_recurse, action=action, save=not dry_run)
  File "/Users/nathan/lib/python/bzrlib/mutabletree.py", line 51, in tree_write_locked
    return unbound(self, *args, **kwargs)
  File "/Users/nathan/lib/python/bzrlib/mutabletree.py", line 384, in smart_add
    _add_one(self, inv, parent_ie, directory, kind, action)
  File "/Users/nathan/lib/python/bzrlib/mutabletree.py", line 526, in _add_one
    entry = inv.make_entry(kind, path.base_path, parent_ie.file_id,
AttributeError: 'NoneType' object has no attribute 'file_id'

bzr 0.92.0 on python 2.5.1.final.0 (darwin)
arguments: ['/Users/nathan/bin/bzr', 'add', 'su\xcc\x88b']
encoding: 'UTF-8', fsenc: 'utf-8', lang: 'en_US.UTF-8'
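
For reference, a minimal Python 3 sketch (not part of the original report) of the two representations involved. It assumes the name was typed in precomposed form (NFC) and came back from HFS+ in decomposed form (NFD), which is what the 'su\xcc\x88b' in the arguments line suggests.

```python
import unicodedata

# Precomposed form: "ü" is the single code point U+00FC.
composed = "s\u00fcb"
# Decomposed form: "u" followed by COMBINING DIAERESIS (U+0308).
decomposed = unicodedata.normalize("NFD", composed)

print(ascii(composed))             # 's\xfcb'
print(ascii(decomposed))           # 'su\u0308b'
print(decomposed.encode("utf-8"))  # b'su\xcc\x88b'
print(composed == decomposed)      # False
```

Both strings render as "süb", but they compare unequal and encode to different bytes.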

n[ate]vw (natevw) wrote :

Doing a recursive add on the enclosing directory results in a similar error, though the combining mark is printed using its Unicode address, instead of two escaped bytes, in the error messages:

stravinsky:test_bzr nathan$ bzr add
added "süb"
bzr: ERROR: exceptions.KeyError: u'su\u0308b'

Traceback (most recent call last):
  File "/Users/nathan/lib/python/bzrlib/commands.py", line 802, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/Users/nathan/lib/python/bzrlib/commands.py", line 758, in run_bzr
    ret = run(*run_argv)
  File "/Users/nathan/lib/python/bzrlib/commands.py", line 492, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/Users/nathan/lib/python/bzrlib/builtins.py", line 384, in run
    no_recurse, action=action, save=not dry_run)
  File "/Users/nathan/lib/python/bzrlib/mutabletree.py", line 51, in tree_write_locked
    return unbound(self, *args, **kwargs)
  File "/Users/nathan/lib/python/bzrlib/mutabletree.py", line 390, in smart_add
    this_ie = parent_ie.children[directory.base_path]
KeyError: u'su\u0308b'

bzr 0.92.0 on python 2.5.1.final.0 (darwin)
arguments: ['/Users/nathan/bin/bzr', 'add']
encoding: 'UTF-8', fsenc: 'utf-8', lang: 'en_US.UTF-8'
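
One plausible reading of the KeyError (an assumption, not confirmed in this report): the entry was recorded under one form of the name and then looked up under the form the filesystem returned. The dictionary below is a hypothetical stand-in for `parent_ie.children`, written for Python 3.

```python
import unicodedata

# Hypothetical stand-in for parent_ie.children, keyed by the NFC name.
children = {unicodedata.normalize("NFC", "su\u0308b"): "entry-for-sub"}

# Name as returned by a decomposing filesystem (NFD).
name_on_disk = "su\u0308b"

try:
    children[name_on_disk]
    lookup_failed = False
except KeyError:
    lookup_failed = True

# Normalizing the lookup key the same way as the stored key succeeds.
found = children[unicodedata.normalize("NFC", name_on_disk)]
```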

John A Meinel (jameinel) wrote :

This removes the last of the internal normalization checks; we need to update the test suite to match.
It leaves the filenames alone, which means we will track them in whatever form they have on disk.
=== modified file 'bzrlib/dirstate.py'
--- bzrlib/dirstate.py 2007-11-15 01:07:51 +0000
+++ bzrlib/dirstate.py 2007-11-27 18:54:08 +0000
@@ -373,14 +373,6 @@
         #------- copied from inventory.ensure_normalized_name - keep synced.
         # --- normalized_filename wants a unicode basename only, so get one.
         dirname, basename = osutils.split(path)
-        # we dont import normalized_filename directly because we want to be
-        # able to change the implementation at runtime for tests.
-        norm_name, can_access = osutils.normalized_filename(basename)
-        if norm_name != basename:
-            if can_access:
-                basename = norm_name
-            else:
-                raise errors.InvalidNormalization(path)
         # you should never have files called . or ..; just add the directory
         # in the parent, or according to the special treatment for the root
         if basename == '.' or basename == '..':

=== modified file 'bzrlib/inventory.py'
--- bzrlib/inventory.py 2007-10-24 20:38:50 +0000
+++ bzrlib/inventory.py 2007-11-27 18:55:28 +0000
@@ -1367,7 +1367,6 @@

         This does not move the working file.
         """
-        new_name = ensure_normalized_name(new_name)
         if not is_valid_name(new_name):
             raise BzrError("not an acceptable filename: %r" % new_name)

@@ -1412,7 +1411,6 @@
     """
     if file_id is None:
         file_id = generate_ids.gen_file_id(name)
-    name = ensure_normalized_name(name)
     try:
         factory = entry_factory[kind]
     except KeyError:

This finishes up bug 165071 (making it official, rather than in the current semi-broken state.)
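
For readers without the source at hand, the check removed from dirstate.py above can be paraphrased as the following standalone Python 3 sketch. The helper names `normalized_filename_sketch` and `ensure_name`, the meaning given to the `can_access` flag, and the choice of NFC as the normal form are assumptions for illustration; this is not bzrlib's code.

```python
import unicodedata


class InvalidNormalization(Exception):
    """Hypothetical stand-in for errors.InvalidNormalization."""


def normalized_filename_sketch(basename, fs_normalizes):
    """Return (norm_name, can_access), assuming NFC as the normal form.

    fs_normalizes is an assumed flag: True when the filesystem would find
    the file under either spelling (as on HFS+).
    """
    norm_name = unicodedata.normalize("NFC", basename)
    can_access = fs_normalizes or norm_name == basename
    return norm_name, can_access


def ensure_name(basename, fs_normalizes):
    # Mirrors the shape of the removed lines in the diff above.
    norm_name, can_access = normalized_filename_sketch(basename, fs_normalizes)
    if norm_name != basename:
        if can_access:
            return norm_name
        raise InvalidNormalization(basename)
    return basename
```

Under these assumptions, a decomposed name is silently replaced by its normalized form where the filesystem tolerates it, and rejected elsewhere.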

Changed in bzr:
importance: Undecided → Medium
status: New → Triaged
John A Meinel (jameinel) wrote :

Sorry, I meant bug #102935, not 165071.

n[ate]vw (natevw) wrote :

I applied the patch to 1.0rc1 (bzrlib/dirstate.py manually) and I can now add a directory named "süb" to my repository. I uploaded a tarball of the repo to a Debian box, and got no bad statuses.

I also created a repository on the Debian box (needed to use patched version, otherwise it complained that "Path XXX is not unicode normalized") and after copying it back to OS X got no bad statuses.

So the patch seems to be working for me, and also seems to be necessary on Debian as well.

John A Meinel (jameinel) wrote :

That sounds about like what I would expect, because the files will be created in expanded form on the Linux (and Windows) machines.

If you had created it on Linux, it would show up as missing on Mac (with a similarly named file showing up as unknown).
If you marked it as renamed or deleted one and added the other, then it would be expanded on the other side.
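
A rough Python 3 sketch of why that happens, assuming the tree records the name in one form and the filesystem lists it in another. The sets below are made-up stand-ins for the recorded names and a directory listing; no bzr API is involved.

```python
import unicodedata

recorded = {"s\u00fcb"}   # name as committed on Linux (precomposed)
listed = {"su\u0308b"}    # same name as listed by HFS+ (decomposed)

# Code-point-for-code-point comparison: one "missing", one "unknown".
missing = recorded - listed
unknown = listed - recorded


def nfc(names):
    """Normalize every name in a set to NFC."""
    return {unicodedata.normalize("NFC", n) for n in names}


# Comparing after normalizing both sides shows no difference.
missing_normalized = nfc(recorded) - nfc(listed)
unknown_normalized = nfc(listed) - nfc(recorded)
```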

Also, I wanted to include this link
http://drewthaler.blogspot.com/2007/12/case-against-insensitivity.html#comment-2279674399335241896

Just as something to reference later.

n[ate]vw (natevw) wrote :

According to http://developers.sun.com/global/products_platforms/solaris/reference/presentations/IUC29-FileSystems.pdf (<--- note extension) there don't seem to be any other notable filesystems that use a competing normal form. So it seems that, at least among the big three OSes nowadays, a pragmatic approach would be to switch to an OS X-normalized inventory upon detecting this issue. If bzr finds that a Unicode filename has been changed to the normalized form in the working copy, it would be reasonable to expect that all Unicode-named files will appear under normalized names for that repo.

Note the "per repo"-ness of the above approach. That'd probably be the easiest way to implement it anyway, but also should help on the couple of filesystems on OS X that behave more like Linux/Windows: "NFS volumes can be shared with non-Mac clients that create files with precomposed characters in their names, and the Mac OS X NFS client does not decompose them before returning them to applications." (http://developer.apple.com/qa/qa2001/qa1173.html)
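
The detection step could look something like the probe below: create a file with a precomposed name and see what the directory listing returns. This is a Python 3 sketch only; `probe_normalization` is a hypothetical helper, not something bzr provides, and the result depends on the filesystem holding the directory.

```python
import os
import tempfile
import unicodedata


def probe_normalization(directory):
    """Report which form a precomposed test name comes back in.

    Returns "NFC" if the name is listed unchanged, "NFD" if it comes back
    fully decomposed, and "other" for anything else.
    """
    wanted = "probe-\u00fc"  # precomposed u-with-diaeresis
    with tempfile.TemporaryDirectory(dir=directory) as tmp:
        with open(os.path.join(tmp, wanted), "w"):
            pass
        listed = os.listdir(tmp)[0]
    if listed == wanted:
        return "NFC"
    if listed == unicodedata.normalize("NFD", wanted):
        return "NFD"
    return "other"
```

A per-repository setting could then record the result once, so that NFS or MacFUSE volumes that do not decompose names are handled separately from HFS+ ones.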

n[ate]vw (natevw) wrote :

Looks like MacFUSE is pondering similar issues and currently does not do any normalization: http://code.google.com/p/macfuse/wiki/FILENAME_ENCODING_PROPOSAL

So I understand the circumstances on OS X to be as follows:
1. HFS+, the main filesystem, always decomposes most (http://developer.apple.com/technotes/tn/tn1150.html#UnicodeSubtleties) characters internally.
2. Other filesystems are encouraged to return decomposed filenames (see QA1173 link above). Whether this is the HFS+ decomposition or pure NFD I am not sure. The links to the HFS+ decomposition implementation seem to suggest the former.
3. A few filesystems (NFS, current MacFUSE) do not decompose filenames.

I still think the pragmatic approach outlined above would give good mileage in each of the above situations. If you'd like I can see how OS X's FAT32 implementation deals with normalization and report back.

Manfred Bergmann (mdbergmann) wrote :

Hi.

Sorry, I'm quite new to Launchpad and couldn't see if this bug has been fixed already.
However it doesn't seem to be fixed as I have the same issue as reported with version 1.7 of Bazaar and Mac OSX 10.5.5.

Regards,
Manfred

Martin Pool (mbp) on 2009-09-09
description: updated
tags: added: mac unicode
summary: - Cannot add NFD normalized Unicode file to repo
+ [master] can't cope with NFD Unicode normalization on Mac OS X
Changed in bzr:
status: Triaged → Confirmed
INADA Naoki (songofacandy) wrote :

I think bzr should handle filename as canonical (NFC or NFD) unicode.

On 25 March 2010 02:04, INADA Naoki <email address hidden> wrote:
> I think bzr should handle filename as canonical (NFC or NFD) unicode.

This seems reasonable.

It will mean we can't represent a directory of for example test data
containing filenames with different representations of the same
characters. You could have such a directory on Linux (and Windows?)
but you could not check it out on Mac OS. You can see this as
inconsistent with how we handle case-sensitivity: we don't fold
everything to lowercase because some systems are case-insensitive.
But perhaps they are not really analogous.

--
Martin <http://launchpad.net/~mbp/>
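
To make the limitation concrete, here is a small Python 3 sketch (an illustration with made-up names) of two filenames that are distinct on Linux but become the same name once everything is normalized to one form.

```python
import unicodedata

# Two distinct filenames that both render as "süb".
names = ["s\u00fcb", "su\u0308b"]

distinct_as_stored = len(set(names))
distinct_after_nfc = len({unicodedata.normalize("NFC", n) for n in names})
distinct_after_nfd = len({unicodedata.normalize("NFD", n) for n in names})
```

Whichever canonical form is chosen, the two entries collapse into one, so such a directory could not be represented faithfully.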

Mitsuhiro Koga (shiena-jp) wrote :

Hi.
I made a patch that normalizes filenames to NFC on OS X.
It appears to work normally with some commands
(add, delete, status, commit, uncommit, revert, etc.).
However, not all commands have been tested.
Could anyone try it?

INADA Naoki (songofacandy) wrote :

Is this bug still alive?

INADA Naoki wrote:
> Is this bug still alive?

AFAIK, yes.

Katsumi Honda (k-qox) wrote :

I tried normalized_unicode_filename.bundle in Japanese (ja_JP.UTF-8) on Snow Leopard(10.6.7).
It's fine :)

Shigenobu Hirose (shirose) wrote :

Could you tell us how to apply normalized_unicode_filename.bundle?
Thank you.

Katsumi Honda (k-qox) wrote :

OK, here is how to apply normalized_unicode_filename.bundle on Mac OS X
(probably the same on other platforms).

1. Open Terminal.app.

2. Download normalized_unicode_filename.bundle to your HOME directory:
  $ cd $HOME
  $ curl -O https://launchpadlibrarian.net/52428742/normalized_unicode_filename.bundle

3. Find the bzrlib directory:
  $ find / -type d -name bzrlib 2>/dev/null
  /Library/Python/2.6/site-packages/bzrlib

4. Change to the bzrlib directory:
  $ cd /Library/Python/2.6/site-packages/bzrlib

5. Apply the patch:
  $ sudo patch -p1 < $HOME/normalized_unicode_filename.bundle
  Password: (Enter your password)
  patching file mutabletree.py
  Hunk #1 succeeded at 682 (offset 14 lines).
  patching file osutils.py
  Hunk #1 succeeded at 1754 (offset 5 lines).
  patching file tests/test_osutils.py
  Hunk #1 succeeded at 1325 (offset -3 lines).

FINISH :)

Jelmer Vernooij (jelmer) on 2017-06-23
Changed in brz:
status: New → Triaged
importance: Undecided → Low