"combined" unicode characters are renamed on Mac
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Bazaar | Confirmed | Wishlist | Unassigned |
Bug Description
Mac OS X stores filenames on its filesystem as normalized Unicode.
http://
Specifically, it forces all filenames to be stored as the NFD variant.
As an example, the name u'B\xe5gfors' ("Bågfors") gets normalized to u'Ba\u030agfors'. (That is, a plain 'a' followed by the combining circle u'\u030a'.)
The complication is that these files can be read and written to using their alternative name, but when you list the directory, you will only get one name.
Most systems prefer the NFC form (it is the standard XML interchange and it maps iso-8859-1 characters cleanly into utf-8.)
For example, Unicode u'\xe5' == iso-8859-1 '\xe5'. But u'a\u030a' has no representation in iso-8859-1.
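The two forms can be compared directly. The following is a minimal sketch using Python 3 and the standard `unicodedata` module (the report itself uses Python 2 `u''` literals); the variable names are only illustrative.

```python
import unicodedata

nfc_name = "B\u00e5gfors"  # precomposed a-with-ring, the NFC form
nfd_name = unicodedata.normalize("NFD", nfc_name)  # what Mac stores on disk

print(ascii(nfd_name))                 # 'Ba\u030agfors'
print(nfc_name == nfd_name)            # False: different code point sequences
print(nfc_name.encode("utf-8"))        # b'B\xc3\xa5gfors'
print(nfd_name.encode("utf-8"))        # b'Ba\xcc\x8agfors'
print(nfc_name.encode("iso-8859-1"))   # b'B\xe5gfors'

try:
    nfd_name.encode("iso-8859-1")
except UnicodeEncodeError:
    # The combining character u'\u030a' is outside iso-8859-1.
    print("NFD form has no iso-8859-1 encoding")
```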
In earlier versions of bzr, we tried to handle this by requiring all filenames to be NFC normalized. On platforms other than Mac, it was considered an error to try to version an NFD-normalized filename. On Mac, we normalized the filenames back to NFC.
This meant that checking in the utf-8 filename 'B\xc3\xa5gfors' on Linux would internally translate that to u'B\xe5gfors'. And then when doing a checkout on Mac, it gets opened as u'B\xe5gfors'. Mac changes the filename on disk to u'Ba\u030agfors' ('Ba\xcc\x8agfors' in utf-8). When we do _iter_changes, we would see a miss for u'Ba\u030agfors' and we would try to re-normalize it, and see if that matched. Since it does, we would know that the file is present, and not treat it as missing.
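The "normalize only on a miss" lookup described here could look roughly like the sketch below. It is not the actual bzr code: `match_disk_name` and its arguments are hypothetical names chosen for illustration, and the versioned names are assumed to be held in a plain set of NFC strings.

```python
import unicodedata

def match_disk_name(disk_name, versioned_names):
    """Return the versioned name matching disk_name, or None.

    Hypothetical helper: try the cheap exact lookup first, and only
    pay for NFC normalization when that lookup misses.
    """
    if disk_name in versioned_names:
        return disk_name
    normalized = unicodedata.normalize("NFC", disk_name)
    if normalized in versioned_names:
        return normalized
    return None

versioned = {"B\u00e5gfors", "README"}

# Mac reports the NFD spelling; the fallback maps it back to the NFC name.
print(match_disk_name("Ba\u030agfors", versioned) == "B\u00e5gfors")  # True
print(match_disk_name("README", versioned))                           # README
print(match_disk_name("other", versioned))                            # None
```

Because the exact lookup runs first, ASCII-only names never reach the normalization step.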
Current versions (dirstate) do not do this normalization lookup. There are 3 reasons...
1) Performance. Re-normalizing Unicode names is a little bit expensive, especially when 99.9% of all filenames are ASCII only (or contain lots of Unicode characters that aren't "combined" chars). We had a reasonable tradeoff by only doing a normalization when we had a 'miss'. This is a little bit more complicated with dirstate, because we use double iterators (so we generally only have the 'current' file being processed, rather than all filenames).
2) Sometimes NFD names are generated on non-Mac platforms. We found this with Japanese Windows, where Microsoft Office preferred to generate the wide-character '（' rather than the narrow '('. Requiring names to be NFC means changing other programs to properly produce them, which is not under our control. If we allow people to version non-NFC names, then we don't know the "proper" name for a file. (A filename could be produced with mixed normalization, so forcing it one way or the other would not be possible.)
3) No other system we have come across (svn, cvs, git, hg, etc.) accounts for this. Mac is the only platform that seems to change your filenames automatically. (Other platforms seem to leave the names alone, even if they don't seem to properly display them.) Because Mac breaks all these other systems, it seems that it is not a high priority to support them.
Changed in bzr: | |
importance: | Undecided → Wishlist |
status: | Unconfirmed → Confirmed |
tags: | added: mac |
tags: | added: check-for-breezy |
Just some observations under MacOSX with a custom bash and coreutils (to get at least some UTF-8 support in there ;-))
-------
zerok@akira:~/tmp/test$ echo $SHELL
/opt/sw/bin/bash
zerok@akira:~/tmp/test$ $SHELL --version
GNU bash, version 3.2.0(1)-release (powerpc-apple-darwin8.8.0)
Copyright (C) 2005 Free Software Foundation, Inc.
zerok@akira:~/tmp/test$ echo $LC_ALL
en_US.UTF-8
zerok@akira:~/tmp/test$ gls -a
. ..
zerok@akira:~/tmp/test$ gtouch "æøÜÉ"
zerok@akira:~/tmp/test$ gls -a
. .. æøÜÉ
zerok@akira:~/tmp/test$ bzr init
zerok@akira:~/tmp/test$ bzr add
added "æøÜÉ"
zerok@akira:~/tmp/test$ gls -a
. .. .bzr æøÜÉ
zerok@akira:~/tmp/test$ bzr commit -m "test"
added æøÜÉ
Committed revision 1.
zerok@akira:~/tmp/test$ echo "test" >> "æøÜÉ"
zerok@akira:~/tmp/test$ gls -a
. .. .bzr æøÜÉ
zerok@akira:~/tmp/test$ bzr status
removed:
æøÜÉ
unknown:
æøÜÉ
zerok@akira:~/tmp/test$ bzr diff
=== removed file '\xc3\xa6\xc3\xb8\xc3\x9c\xc3\x89'
-------
The difference in the filenames (removed vs. unknown) is not visible in the terminal. They both look like the first one.
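One way to make the difference visible is to print the names with non-ASCII characters escaped. This is only a sketch, assuming Python 3 and assuming that the two names are the NFC and NFD spellings of the same string; it does not read the names from the actual working tree.

```python
import unicodedata

typed_name = unicodedata.normalize("NFC", "æøÜÉ")    # assumed: the name as committed
disk_name = unicodedata.normalize("NFD", typed_name)  # assumed: the name Mac lists

# ascii() escapes every non-ASCII code point, so combining marks show up.
print(ascii(typed_name))               # '\xe6\xf8\xdc\xc9'
print(ascii(disk_name))                # '\xe6\xf8U\u0308E\u0301'
print(len(typed_name), len(disk_name)) # 4 6
```

Only Ü and É decompose; æ and ø have no canonical decomposition, so they are the same in both forms.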
hg somehow manages the same situation and also seems to stay workable on a more or less similarly configured Linux box.
-------
zerok@akira:~/tmp/test$ gtouch "æøÜÉ"
zerok@akira:~/tmp/test$ hg init
zerok@akira:~/tmp/test$ hg add
adding æøÜÉ
zerok@akira:~/tmp/test$ hg commit
zerok@akira:~/tmp/test$ gls -a
. .. .hg æøÜÉ
zerok@akira:~/tmp/test$ echo "test" >> "æøÜÉ"
zerok@akira:~/tmp/test$ hg status
M æøÜÉ
zerok@akira:~/tmp/test$ hg --version
Mercurial Distributed SCM (version 0.9.4)
Copyright (C) 2005-2007 Matt Mackall <email address hidden> and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
zerok@akira:~/tmp/test$
-------
-------
zerok@galaxy:~/tmp/test$ hg update
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
zerok@galaxy:~/tmp/test$ ls -a
. .. æøÜÉ .hg
-------
(Note: galaxy was running hg 0.9.3 with Python 2.4.4 while this test was conducted; akira was on hg 0.9.4 with Python 2.5.1.)
At least when I look at this output it seems like hg is uniformly expanding the characters (similar to what the coreutils do) while bzr operates on two different names.
I hope this is useful.