"combined" unicode characters are renamed on Mac

Bug #102935 reported by John A Meinel
This bug affects 3 people
Affects: Bazaar
Status: Confirmed
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Mac OS X stores filenames on the filesystem as normalized Unicode.
http://unicode.org/reports/tr15/

Specifically, it forces all filenames to be stored as the NFD variant.

As an example, the name u'B\xe5gfors' ("Bågfors") gets normalized to u'Ba\u030agfors'. (That is, 'a' followed by the combining ring above, u'\u030a'.)
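The two forms can be reproduced with Python's standard unicodedata module. This is only a minimal sketch in Python 3 syntax (the report itself uses Python 2 u'' literals), and the variable and helper names are illustrative:

```python
import unicodedata

# Precomposed (NFC) spelling: a single code point U+00E5 for the letter.
nfc_name = "B\xe5gfors"

# Decomposed (NFD) spelling: 'a' followed by U+030A COMBINING RING ABOVE.
nfd_name = unicodedata.normalize("NFD", nfc_name)


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]


print(code_points(nfc_name))
print(code_points(nfd_name))
# The strings render identically but do not compare equal.
print(nfc_name == nfd_name)
```

Normalizing the NFD string back with unicodedata.normalize("NFC", ...) yields the original precomposed name.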

The complication is that these files can be read and written to using their alternative name, but when you list the directory, you will only get one name.

Most systems prefer the NFC form (it is the standard for XML interchange, and it maps iso-8859-1 characters cleanly into utf-8).

For example, Unicode u'\xe5' == iso-8859-1 '\xe5'. But u'a\u030a' has no representation in iso-8859-1.
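The encoding difference can be checked directly. A small sketch, again in Python 3 syntax, where encodes_to_latin1 is a hypothetical helper written for this illustration:

```python
def encodes_to_latin1(text):
    """Return True if text can be encoded as iso-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


nfc = "\xe5"       # precomposed a-with-ring
nfd = "a\u030a"    # 'a' plus combining ring above

# The NFC form maps to the single iso-8859-1 byte 0xE5.
print(nfc.encode("iso-8859-1"))
# The combining mark U+030A is outside iso-8859-1, so NFD cannot be encoded.
print(encodes_to_latin1(nfd))
# Both forms are valid utf-8, but with different byte sequences.
print(nfc.encode("utf-8"), nfd.encode("utf-8"))
```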

In earlier versions of bzr, we tried to handle this by requiring all filenames to be NFC normalized. On platforms other than Mac, it was considered an error to try to version an NFD-normalized filename. On Mac, we normalized the filenames back to NFC.

This meant that checking in the utf-8 filename 'B\xc3\xa5gfors' on Linux would internally translate that to u'B\xe5gfors'. And then when doing a checkout on Mac, it gets opened as u'B\xe5gfors'. Mac changes the filename on disk to u'Ba\u030agfors' ('Ba\xcc\x8agfors' in utf-8). When we do _iter_changes, we would see a miss for u'Ba\u030agfors' and we would try to re-normalize it, and see if that matched. Since it does, we would know that the file is present, and not treat it as missing.
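The "renormalize only on a miss" lookup described above might look roughly like the following. This is a sketch of the idea, not bzr's actual code: find_versioned_name and the set of versioned names are assumptions made for illustration.

```python
import unicodedata


def find_versioned_name(disk_name, versioned_names):
    """Return the versioned name that matches disk_name, or None.

    Hypothetical helper: tries an exact match first and only pays for
    normalization when the exact lookup misses.
    """
    if disk_name in versioned_names:
        return disk_name
    # Miss: the filesystem may have handed back an NFD spelling.
    normalized = unicodedata.normalize("NFC", disk_name)
    if normalized != disk_name and normalized in versioned_names:
        return normalized
    return None


versioned = {"B\xe5gfors", "README"}
# Name as listed by a Mac directory listing (NFD).
print(find_versioned_name("Ba\u030agfors", versioned) == "B\xe5gfors")
```

With this approach the common case (an exact hit) costs one set lookup, and normalization runs only for names that would otherwise be reported as missing.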

Current versions (dirstate) do not do this normalization lookup. There are 3 reasons...

1) Performance. re-normalizing unicode names is a little bit expensive, especially when 99.9% of all filenames are ascii only (or lots of Unicode characters that aren't "combined" chars). We had a reasonable tradeoff by only doing a normalization when we had a 'miss'. This is a little bit more complicated with dirstate, because we use double iterators (so we generally only have the 'current' file being processed, rather than all filenames).

2) Sometimes NFD names are generated on non-Mac platforms. We found this on Japanese Windows, where Microsoft Office preferred to generate the wide-character '（' rather than the narrow '('. Requiring names to be NFC means changing other programs to produce them properly, which is not under our control. If we allow people to version non-NFC names, then we don't know the "proper" name for a file. (A filename could be produced with mixed normalization, so forcing it one way or the other would not be possible.)

3) No other system we have come across accounts for this. (svn, cvs, git, hg, etc). Mac is the only platform that seems to change your filenames automatically. (Other platforms seem to leave the names alone, even if they don't seem to properly display them). Because Mac breaks all these other systems, it seems that it is not a high priority to support them.
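Reasons 1 and 2 can be illustrated with a short sketch. The helper normalize_if_needed is hypothetical and is not how bzr is implemented; it assumes Python 3.7 or later for str.isascii().

```python
import unicodedata


def normalize_if_needed(name):
    """Return name in NFC form, skipping the work for pure-ASCII names."""
    if name.isascii():
        # ASCII text is already in every normalization form.
        return name
    return unicodedata.normalize("NFC", name)


# Reason 1: the overwhelmingly common ASCII case avoids normalization.
print(normalize_if_needed("README.txt"))

# Reason 2: a name with mixed normalization (one precomposed and one
# decomposed a-with-ring) collapses to a single spelling, so the
# original byte sequence cannot be recovered afterwards.
mixed = "\xe5" + "a\u030a"
print(normalize_if_needed(mixed) == "\xe5\xe5")
```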

John A Meinel (jameinel)
Changed in bzr:
importance: Undecided → Wishlist
status: Unconfirmed → Confirmed
Horst Gutmann (zerok) wrote :

Just some observations under MacOSX with a custom bash and coreutils (to get at least some UTF-8 support in there ;-))

---------------------------------------------------------------------
zerok@akira:~/tmp/test$ echo $SHELL
/opt/sw/bin/bash
zerok@akira:~/tmp/test$ $SHELL --version
GNU bash, version 3.2.0(1)-release (powerpc-apple-darwin8.8.0)
Copyright (C) 2005 Free Software Foundation, Inc.
zerok@akira:~/tmp/test$ echo $LC_ALL
en_US.UTF-8
zerok@akira:~/tmp/test$ gls -a
. ..
zerok@akira:~/tmp/test$ gtouch "æøÜÉ"
zerok@akira:~/tmp/test$ gls -a
. .. æøÜÉ
zerok@akira:~/tmp/test$ bzr init
zerok@akira:~/tmp/test$ bzr add
added "æøÜÉ"
zerok@akira:~/tmp/test$ gls -a
. .. .bzr æøÜÉ
zerok@akira:~/tmp/test$ bzr commit -m "test"
added æøÜÉ
Committed revision 1.
zerok@akira:~/tmp/test$ echo "test" >> "æøÜÉ"
zerok@akira:~/tmp/test$ gls -a
. .. .bzr æøÜÉ
zerok@akira:~/tmp/test$ bzr status
removed:
  æøÜÉ
unknown:
  æøÜÉ
zerok@akira:~/tmp/test$ bzr diff
=== removed file '\xc3\xa6\xc3\xb8\xc3\x9c\xc3\x89'
---------------------------------------------------------------------

The difference in the filenames (removed vs. unknown) is not visible in the terminal. They both look like the first one.
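One way to make the difference visible is to print the code points of each name. In the sketch below, the committed name is taken from the bytes shown in the bzr diff output above; the on-disk name is an assumption (that the filesystem returned the NFD form), and describe is a hypothetical helper.

```python
import unicodedata

# Bytes reported by `bzr diff` for the removed file, decoded as utf-8.
committed = b"\xc3\xa6\xc3\xb8\xc3\x9c\xc3\x89".decode("utf-8")

# Assumption: the directory listing returns the decomposed spelling.
on_disk = unicodedata.normalize("NFD", committed)


def describe(name):
    """Return the code points of name as a space-separated string."""
    return " ".join("U+%04X" % ord(ch) for ch in name)


print(describe(committed))
print(describe(on_disk))
print(committed == on_disk)
```

Only the last two letters differ: 'æ' and 'ø' have no canonical decomposition, while 'Ü' and 'É' each become a base letter plus a combining mark.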

hg somehow manages the same situation and also seems to stay workable on a more or less similarly configured Linux box.

---------------------------------------------------------------------
zerok@akira:~/tmp/test$ gtouch "æøÜÉ"
zerok@akira:~/tmp/test$ hg init
zerok@akira:~/tmp/test$ hg add
adding æøÜÉ
zerok@akira:~/tmp/test$ hg commit
zerok@akira:~/tmp/test$ gls -a
. .. .hg æøÜÉ
zerok@akira:~/tmp/test$ echo "test" >> "æøÜÉ"
zerok@akira:~/tmp/test$ hg status
M æøÜÉ
zerok@akira:~/tmp/test$ hg --version
Mercurial Distributed SCM (version 0.9.4)

Copyright (C) 2005-2007 Matt Mackall <email address hidden> and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
zerok@akira:~/tmp/test$
---------------------------------------------------------------------

---------------------------------------------------------------------
zerok@galaxy:~/tmp/test$ hg update
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
zerok@galaxy:~/tmp/test$ ls -a
. .. æøÜÉ .hg
---------------------------------------------------------------------

(Note: galaxy had version 0.9.3 of hg (with python 2.4.4) running while this test was conducted, akira was on hg 0.9.4 with python 2.5.1)

At least when I look at this output it seems like hg is uniformly expanding the characters (similar to what the coreutils do) while bzr operates on two different names.

I hope this is useful.

n[ate]vw (natevw) wrote :

Apple have a tech note regarding this behaviour:

http://developer.apple.com/qa/qa2001/qa1173.html

Especially:

"In Mac OS X's VFS API file names are, by definition, canonically decomposed Unicode, encoded using UTF-8."
-and-
"When returning names [from a VFS plugin] to higher layers (for example, from your VOP_READDIR entry point), you should always return decomposed names. If your underlying volume format uses precomposed names, you should convert any precomposed characters to their decomposed equivalents before returning them to the system."

Samuel Bronson (naesten)
tags: added: mac
Jelmer Vernooij (jelmer)
tags: added: check-for-breezy