Handling of names in UTF-8 (Unicode)

Bug #238365 reported by Daniel Clemente on 2008-06-08
This bug affects 3 people
Affects: Bazaar Fast Import
Assigned to: Daniel Clemente

Bug Description

bzr fast-import fails while importing a repository where there are symbolic links pointing to non-ASCII names.

* How to reproduce it:

First make sure that your locale is UTF-8. The following command should print 2: echo -n é | wc -c
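
As a rough cross-check of why the expected answer is 2, the same byte count can be computed in Python. This is only a sketch (written for Python 3) and assumes the character is the precomposed form U+00E9; a decomposed "e" plus combining accent would give a different count.

```python
# "é" as a single precomposed code point (U+00E9).
e_acute = "\u00e9"

# UTF-8 encodes U+00E9 as the two bytes 0xC3 0xA9, which is what `wc -c` counts.
byte_count = len(e_acute.encode("utf-8"))
print(byte_count)
```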

Then do:

mkdir tres
cd tres
git init
touch més
ln -s més prova
git add més prova
git commit -a -m "link to a file with a name in utf-8"
git-fast-export --all >expo
mkdir enbzr
cd enbzr
bzr init
bzr fast-import ../expo

* The result:

Traceback (most recent call last):
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 846, in run_bzr_catch_errors
    return run_bzr(argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 797, in run_bzr
    ret = run(*run_argv)
  File "/usr/lib/python2.5/site-packages/bzrlib/commands.py", line 499, in run_argv_aliases
    return self.run(**all_cmd_args)
  File "/home/dc/.bazaar/plugins/fastimport_dev/__init__.py", line 166, in run
  File "/home/dc/.bazaar/plugins/fastimport_dev/__init__.py", line 50, in _run
  File "/home/dc/.bazaar/plugins/fastimport/processor.py", line 83, in process
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 251, in _process
    processor.ImportProcessor._process(self, command_iter)
  File "/home/dc/.bazaar/plugins/fastimport/processor.py", line 105, in _process
    handler(self, cmd)
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 413, in commit_handler
  File "/home/dc/.bazaar/plugins/fastimport/processor.py", line 170, in process
    handler(self, fc)
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 694, in modify_handler
    filecmd.is_executable, data)
  File "/home/dc/.bazaar/plugins/fastimport/processors/generic_processor.py", line 835, in _modify_inventory
    ie.symlink_target = data.encode('utf8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)

bzr 1.5 on python 2.5.2 (linux2)
arguments: ['/usr/bin/bzr', 'fast-import', '../expo']
encoding: 'UTF-8', fsenc: 'UTF-8', lang: 'es_ES.UTF-8'
  bzrtools /usr/lib/python2.5/site-packages/bzrlib/plugins/bzrtools [1.5.0]
  dbus /usr/lib/python2.5/site-packages/bzrlib/plugins/dbus [unknown]
  fastimport /home/dc/.bazaar/plugins/fastimport [unknown]
  gtk /usr/lib/python2.5/site-packages/bzrlib/plugins/gtk [0.94.0]
  launchpad /usr/lib/python2.5/site-packages/bzrlib/plugins/launchpad [unknown]
... Bazaar has encountered an internal error.
    Please report a bug at https://bugs.launchpad.net/bzr/+filebug
    including this traceback, and a description of what you
    were doing when the error occurred.

Tested with the stable version of fastimport and also with today's version from fastimport.dev.

Daniel Clemente (n142857) wrote :

This line:
  ie.symlink_target = data.encode('utf8')
could be changed to:
  ie.symlink_target = data
to avoid a failed conversion (from utf8 to utf8?) and to store the symlink target exactly as it is. However, Bazaar doesn't yet support symlinks to Unicode (in fact, any non-ASCII) filenames. This bug depends on bug 272444.
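
A sketch of the likely failure mode, for readers unfamiliar with Python 2 strings: calling .encode('utf8') on a byte string there first decodes it implicitly with the ASCII codec, and that implicit step is what raises UnicodeDecodeError. The code below is Python 3, which has no implicit step, so the ASCII decode is spelled out. The helper name py2_style_encode is hypothetical and is not part of the plugin.

```python
# Symlink target as it would arrive from the fast-import stream: raw UTF-8 bytes.
data = "m\u00e9s".encode("utf-8")  # b'm\xc3\xa9s'


def py2_style_encode(raw):
    """Mimic Python 2's str.encode('utf8'): ASCII decode first, then encode.

    Hypothetical helper, for illustration only.
    """
    return raw.decode("ascii").encode("utf8")


try:
    py2_style_encode(data)
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The offending byte (0xc3) and its position (1) match the traceback above.
print(error)
```

Storing the bytes untouched, as proposed in this comment, sidesteps the implicit decode entirely.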

Daniel Clemente (n142857) wrote :

Now that bug 272444 is fixed, the change in comment 1 can be applied. I attach it as a patch. With it I could correctly import a repository that didn't work before.

Daniel Clemente (n142857) wrote :

Could the patch please be checked in?

Daniel Clemente (n142857) wrote :

While the previous patch worked before, now a decode("utf8") is needed in order to pass CHKInventory._entry_to_bytes a unicode object, not a str.

I attach an updated patch ready to check in.

To test this bug you can use this line:

cd /tmp; rm -rf tres enbzr; mkdir tres; cd tres; git init; touch més; ln -s més prova; git add més prova; git commit -a -m "link to a file with a name in utf-8"; git fast-export --all >expo; mkdir enbzr; cd enbzr; bzr init; bzr fast-import ../expo
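
The direction of the updated fix can be sketched as below: decode the raw bytes into text rather than encoding them. This is a Python 3 illustration, not the attached patch; the function name is hypothetical, and it assumes the fast-import stream is UTF-8 encoded.

```python
def symlink_target_to_text(raw):
    """Decode a raw symlink target into text, assuming UTF-8 input.

    Hypothetical helper; the real change lives in the attached patch.
    """
    return raw.decode("utf8")


raw_target = b"m\xc3\xa9s"  # "més" as UTF-8 bytes (4 bytes)
target = symlink_target_to_text(raw_target)
print(repr(target))  # a 3-character text string
```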

Daniel Clemente (n142857) wrote :

With a larger repository I found another case that didn't work. Corrected in this new patch.

Daniel Clemente (n142857) wrote :

It seems there are more cases where a decode() may be needed, in particular with rename_item. I attach a new patch, but a new branch would be better since there may be many other changes.
I don't know if adding decode("utf-8") is the correct approach. With this patch v4 I could get further in the conversion of a large branch.

Daniel Clemente (n142857) wrote :

Sorry for so many patches -- I should use a branch. But with this one (v5) I got no Unicode errors exporting the biggest branch I have (3873 rev. with many error-prone names). It stopped in another error (bug #458260, also about file names).

Ian Clatworthy (ian-clatworthy) wrote :

Thanks Daniel. I'm looking forward to getting this sorted out. BTW, I tried this:

mkdir tres; cd tres; bzr init; touch més; ln -s més prova; bzr add més prova; bzr commit -m "link to a file with a name in utf-8"

and bzr appears to do the right thing. Running "bzr fast-export ." on the resulting branch falls over though. I suspect it needs a few tweaks like the ones you've made on the import side.

So altogether, we need to get several things working:

1. import with unicode filenames and symlinks
2. export with unicode filenames and symlinks
3. "bzr fast-import-filter in.fi > out.fi" needs to produce an out.fi equivalent to the in.fi.

Could you put together a branch, apply your patch and push it to Launchpad? We can then work through these issues, add some tests and merge your fixes.

Daniel Clemente (n142857) wrote :

I created a branch at lp:~n142857/bzr-fastimport/unicode-symlinks
It has the 3 points you asked for: import and export work (I tried simple test cases for both and a complex test case for import), and the filter is neutral. As a plus, the current fast-import tests still pass.

About the 3rd point, fast-import-filter, I should mention two differences which I don't think are relevant, but just in case:
1. git input file uses 100644 as a mode, but bzr exports 644
2. bzr produces one more blob than git in a "mv" operation. I can send a diff.
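
On the first difference: the git fast-import documentation treats 100644 and 644 as the same kind of entry (a normal, non-executable file). A minimal sketch of comparing two mode strings by permission bits only, assuming that is all that matters for the comparison; the function name is hypothetical and this is not how the filter itself works:

```python
def permission_bits(mode):
    """Return only the permission bits of an octal mode string.

    The file-type bits (the leading "100" in "100644") are masked off,
    so this must not be used to tell files from symlinks.
    """
    return int(mode, 8) & 0o777


same = permission_bits("100644") == permission_bits("644")
print(same)
```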

I used this script to do more complex testing with symlinks:
 cd /n; rm -rf quatre enbzr; mkdir quatre; cd quatre; git init; touch més; ln -s més prova; git add més prova; git commit -a -m "link to a file with a name in utf-8"; cp -l prova prova2; git add prova2; git commit -a -m "copied symlink"; git mv prova provab; git commit -a -m "moved symlink"; rm provab; touch més2; ln -s més2 provab; git commit -a -m "modified symlink destination"; git rm provab; git commit -a -m "deleted symlink"; git fast-export --all >expo; mkdir enbzr; cd enbzr; bzr init; bzr fast-import ../expo

But of course we need better tests.

Jelmer Vernooij (jelmer) wrote :

Thanks, merged (with some tweaks). Sorry it took so long!

Changed in bzr-fastimport:
status: New → Fix Committed
importance: Undecided → Medium
assignee: nobody → Daniel Clemente (n142857)
Jelmer Vernooij (jelmer) on 2010-12-11
summary: - Symbolic links to files with names in UTF-8 (Unicode)
+ Handling of names in UTF-8 (Unicode)
Jelmer Vernooij (jelmer) on 2011-03-11
Changed in bzr-fastimport:
status: Fix Committed → Fix Released
milestone: none → 0.10.0