exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u011f' in position 79: ordinal not in range(128)

Bug #1031679 reported by David Sveningsson
This bug affects 8 people
Affects: Bazaar Fast Import
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I am trying to use bzr fast-export to migrate a repository, but it always fails. I think the problem is files with Unicode characters in their names, but I could not reproduce it with a new, simple repo.

Revision history for this message
David Sveningsson (ext-launchpad-net) wrote :

I was able to produce a simple test case:

# bzr init
# touch ☃
# ln -s ☃ foo
# bzr add ☃ foo
# bzr commit
# bzr fast-export

ilia (ilia) wrote :

Also reported as "private" bug 963382, see
https://lists.ubuntu.com/archives/foundations-bugs/2012-March/075044.html

I am experiencing this bug as well, just as described in the link above.

Changed in python-fastimport:
status: New → Confirmed
Jelmer Vernooij (jelmer) wrote :

bzr-fastimport should encode this character before passing it on to python-fastimport.
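
A minimal sketch of what that means, written for Python 3 rather than the Python 2.7 that bzr runs on, so it only illustrates the principle. Forcing a non-ASCII path through the 'ascii' codec fails with the kind of error in the bug title, whereas encoding it explicitly as UTF-8 yields bytes that can be written to the stream. The helper name `encode_path` is made up for this example and is not part of bzr-fastimport or python-fastimport.

```python
def encode_path(path):
    """Return path as UTF-8 bytes; byte strings pass through unchanged.

    Hypothetical helper for illustration only.
    """
    if isinstance(path, str):
        return path.encode("utf-8")
    return path


# Part of the file name from this report, written with escapes.
name = "Sa\u011flay\u0131c\u0131s\u0131.pem"

# Roughly what the implicit str/unicode conversion in Python 2 attempts.
try:
    name.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc

# Encoding explicitly before handing the path on avoids the implicit step.
encoded = encode_path(name)

print(ascii_error)
print(encoded)
```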

affects: python-fastimport → bzr-fastimport
Oliver (oliver-assarbad) wrote :

I am also affected. Use case is the export of a bzr-based etckeeper repository into a fast-import file. I am using the following command line:

    bzr fast-export -v --rewrite-tag-names --plain /root/etckeeper.fi

Output is virtually the same, except for version and line numbers.

Oliver (oliver-assarbad) wrote :

Okay, I found something worth noting, and even though my Python and pdb fu may be weak compared to some of the pythonistas here, this looks like the root cause of the problem at hand.

Observe:

------------------------------
$ PYTHONIOENCODING=utf-8 python -m pdb $(which bzr) fast-export --plain ~/etc/
Breakpoint 1 at /usr/lib/python2.7/dist-packages/fastimport/commands.py:333
> /usr/bin/bzr(19)<module>()
-> from __future__ import absolute_import
(Pdb) import sys; print(sys.getdefaultencoding())
ascii
(Pdb) sys.setdefaultencoding('utf-8')
*** AttributeError: 'module' object has no attribute 'setdefaultencoding'
(Pdb) reload(sys)
<module 'sys' (built-in)>
(Pdb) sys.setdefaultencoding('utf-8')
(Pdb) print(sys.getdefaultencoding())
utf-8
------------------------------

So I tell Python explicitly to use utf-8 as the I/O encoding, which it obviously refuses. Worse still, when I try to set it, I get the AttributeError.

However, if I reload the sys module, suddenly the function sys.setdefaultencoding becomes available and after I call it, I can see with

   print(sys.getdefaultencoding())

that it was successful.

Obviously, since the PYTHONIOENCODING environment variable is ignored, running this outside the debugger yields the same error as can be seen above. I am investigating possibilities for a workaround, but I reckon the info so far may be valuable, even if I get distracted or bored ;)
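
One possible reading of the session above: PYTHONIOENCODING is documented as setting the encoding of stdin, stdout and stderr, which is a separate value from the one reported by sys.getdefaultencoding(). If so, the "ascii" output does not by itself show that the variable was ignored. A small sketch to check both values in a child interpreter, written for Python 3 (where the default encoding is always utf-8; under Python 2.7 the second value would be ascii, as in the session). The function name `report_encodings` is made up for this example.

```python
import os
import subprocess
import sys


def report_encodings(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return
    (sys.stdout.encoding, sys.getdefaultencoding()) as seen by the child."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    child_code = (
        "import sys; "
        "print(sys.stdout.encoding); "
        "print(sys.getdefaultencoding())"
    )
    out = subprocess.check_output([sys.executable, "-c", child_code], env=env)
    stdout_enc, default_enc = out.decode("ascii").split()
    return stdout_enc, default_enc


stdout_enc, default_enc = report_encodings("latin-1")
print("stdout encoding: ", stdout_enc)
print("default encoding:", default_enc)
```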

Oliver (oliver-assarbad) wrote :

Here's a workaround. Check if your shell has an alias for bzr:

   alias|grep bzr

If not, define one:

   alias bzr='BZR_PLUGIN_PATH=$HOME/.bazaar/plugins bzr'

Unpack the attached .tgz with

   tar -C "$HOME" -xzf workaround_1031679.tgz

This also ensures that the base folder ($HOME/.bazaar/plugins) exists. Note that this relies on the path in the alias above (BZR_PLUGIN_PATH) being identical to the one in the .tgz!

Also note: this will likely fail on Windows unless you have GNU tar and Bash (or another shell with aliases), or use something like Cygwin to begin with.

Comparison with and without the workaround below (traceback stripped and replaced by [...]):
--------------------------------------------------------------------------------
$ rm -rf $HOME/.bazaar/plugins
$ bzr fast-export --plain ~/etc/ > ~/file.fi
03:57:30 Calculating the revisions to include ...
03:57:30 Starting export of 278 revisions ...
bzr: ERROR: exceptions.UnicodeEncodeError: 'ascii' codec can't encode character u'\u011f' in position 79: ordinal not in range(128)
[...]
$ alias bzr='BZR_PLUGIN_PATH=$HOME/.bazaar/plugins bzr'
$ tar -C "$HOME" -xzvf workaround_1031679.tgz
.bazaar/plugins/workaround_1031679/
.bazaar/plugins/workaround_1031679/__init__.py
$ bzr fast-export --plain ~/etc/ > ~/file.fi
Default encoding before reload: ascii
Default encoding after reload: utf-8
03:59:18 Calculating the revisions to include ...
03:59:18 Starting export of 278 revisions ...
03:59:19 Exported 278 revisions in 0:00:01

Oliver (oliver-assarbad) wrote :

Apologies, the following line must be removed from the plugin source:

   print "In hook %s" % (repr(cmd))

otherwise it will appear in the output file. Fixed .tgz attached. Original has been removed.
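
The reason the stray print matters: with `bzr fast-export ... > ~/file.fi` the export stream goes to standard output, so anything else printed there lands in the redirected file. A sketch of keeping diagnostics on stderr instead, in Python 3 syntax. The function name `log_diagnostic` is hypothetical and this is not the plugin's code.

```python
import io
import sys


def log_diagnostic(message, stream=None):
    """Write a diagnostic line to stderr (or the given stream), never stdout."""
    if stream is None:
        stream = sys.stderr
    stream.write(message + "\n")


# In-memory stand-ins for the two streams, to show they stay separate.
fake_stdout = io.StringIO()
fake_stderr = io.StringIO()

fake_stdout.write("commit refs/heads/master\n")  # export data
log_diagnostic("Default encoding before reload: ascii", stream=fake_stderr)
```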

andrew bezella (abezella) wrote :

thank you for the attempted workaround. in my case (attempting to convert an etckeeper repo from bzr to git) it looks like the workaround_1031679 plugin was loaded but the conversion still fails:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u011f' in position 79: ordinal not in range(128)

Oliver (oliver-assarbad) wrote :

@Andrew: have you tried running bzr under pdb the way I outlined?

- Is sys loaded already?
- What does the attempt to call sys.setdefaultencoding('utf-8') give you?
- Does reload(sys) have an effect the way it had in my debug session?

If not, you're seeing a different problem. However, I'll see whether I can reproduce it on 12.04 which is what you seem to be running on.

Oliver (oliver-assarbad) wrote :

@Andrew: just wondering. I noticed that your output does not indicate that the plugin was loaded.

The first two lines (about the default encoding) after the command should appear. If they don't, the plugin is not active. These lines get printed on stderr.

$ bzr fast-export --plain ~/etc > ~/file.fi
Default encoding before reload: ascii
Default encoding after reload: utf-8

But anyway, it does not matter. The workaround only allows a successful export. The import still fails because of an encoding problem.

Oliver (oliver-assarbad) wrote :

Okay, got it. The problem is that even though it appears to get the encoding right afterwards, it gives the wrong number of bytes/characters in the 'data' command. Example:

    M 120000 inline ssl/certs/3b2716e5.0
    data 47
    EBG_Elektronik_Sertifika_Hizmet_Sağlayıcısı.pem

At first sight, as you'll hopefully agree, this suggests 47 characters. For a latinized version

    EBG_Elektronik_Sertifika_Hizmet_Saglayicisi.pem

this would even be true. But because each of the four non-ASCII characters takes two bytes in UTF-8, the name is actually 51 bytes long:

    0000000: 45 42 47 5f 45 6c 65 6b 74 72 6f 6e 69 6b 5f 53 EBG_Elektronik_S
    0000010: 65 72 74 69 66 69 6b 61 5f 48 69 7a 6d 65 74 5f ertifika_Hizmet_
    0000020: 53 61 c4 9f 6c 61 79 c4 b1 63 c4 b1 73 c4 b1 2e Sa..lay..c..s...
    0000030: 70 65 6d pem

This means that a code change is also needed down in the nethers of bzr-fastimport. I.e. my workaround fixes/d only part of the problem.
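
The mismatch is easy to check. A minimal Python 3 sketch comparing the character count with the UTF-8 byte count of the file name above (written with escapes so the code points are unambiguous):

```python
# The symlink target from the example, with the non-ASCII letters escaped.
name = "EBG_Elektronik_Sertifika_Hizmet_Sa\u011flay\u0131c\u0131s\u0131.pem"

char_count = len(name)                  # number of code points
byte_count = len(name.encode("utf-8"))  # number of bytes written to the stream

print("characters:", char_count)
print("utf-8 bytes:", byte_count)

# git fast-import reads the count after 'data' as a byte count,
# so the header has to be built from the encoded length.
header = "data %d\n" % byte_count
```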

Oliver (oliver-assarbad) wrote :

Okay, so several sources on the web claim that it's not such a brilliant idea to reload(sys) in order to get back the function. The docs even state explicitly:

This function is only intended to be used by the site module implementation and, where needed, by sitecustomize. Once used by the site module, it is removed from the sys module’s namespace.

Source: https://docs.python.org/2/library/sys.html#sys.setdefaultencoding

Back to the drawing board.

Oliver (oliver-assarbad) wrote :

Using pydb, I stepped through the code and ended up breaking here (.pydbrc contents):

b /home/oliver/.local/lib/python2.7/site-packages/fastimport/commands.py:348
condition 1 path == 'ssl/certs/3b2716e5.0'
continue

This way I break exactly when the condition occurs, although it's somewhat sluggish.

The line I am breaking on is:

348 return "M %s %s %s%s" % (self._format_mode(self.mode), dataref, path, datastr)

I am using the latest development code from launchpad for both bzr-fastimport and python-fastimport.

(Pydb) p datastr
u'\ndata 47\nEBG_Elektronik_...flay\u0131c\u0131s\u0131.pem'

Still a "wide" Unicode string (yeah, I know UTF-8 is also an encoding of Unicode). So I dug a bit further and Jelmer was perfectly right. The solution is to pass the path already encoded.

The patch is trivial, I am attaching it.
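
The attached patch is not reproduced here. Purely as an illustration of the idea (encode first, then format and count), here is a Python 3 sketch. `format_inline_modify` is a made-up helper, not the code in fastimport/commands.py, and it ignores the path quoting rules of the fast-import format.

```python
def format_inline_modify(mode, path, data):
    """Build an inline 'M' (filemodify) command as bytes.

    Hypothetical helper: text arguments are encoded to UTF-8 before
    formatting, so the 'data' length is a byte count.
    """
    if isinstance(path, str):
        path = path.encode("utf-8")
    if isinstance(data, str):
        data = data.encode("utf-8")
    header = b"M %o inline %s\n" % (mode, path)
    return header + b"data %d\n" % len(data) + data


cmd = format_inline_modify(
    0o120000,  # symlink mode, as in the example from this report
    "ssl/certs/3b2716e5.0",
    "EBG_Elektronik_Sertifika_Hizmet_Sa\u011flay\u0131c\u0131s\u0131.pem",
)
print(cmd)
```

Because everything is bytes by the time the string is assembled, no implicit codec is involved and the length matches what the importer reads.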

A test run on the etckeeper repo that has given me the headaches gives:

[2] oliver@yggdrasil:~/bug-1031679$ git init etckeeper-gitconv
Initialized empty Git repository in /home/oliver/bug-1031679/etckeeper-gitconv/.git/
[2] oliver@yggdrasil:~/bug-1031679$ (cd etckeeper-gitconv/ && bzr fast-export --plain ~/etc | git fast-import)
01:49:16 Calculating the revisions to include ...
01:49:16 Starting export of 16 revisions ...
01:49:16 Exported 16 revisions in 0:00:01
git-fast-import statistics:
---------------------------------------------------------------------
Alloc'd objects: 5000
Total objects: 1872 ( 331 duplicates )
      blobs : 1657 ( 320 duplicates 577 deltas of 1521 attempts)
      trees : 199 ( 11 duplicates 34 deltas of 198 attempts)
      commits: 16 ( 0 duplicates 0 deltas of 0 attempts)
      tags : 0 ( 0 duplicates 0 deltas of 0 attempts)
Total branches: 1 ( 1 loads )
      marks: 1024 ( 16 unique )
      atoms: 1812
Memory total: 2469 KiB
       pools: 2235 KiB
     objects: 234 KiB
---------------------------------------------------------------------
pack_report: getpagesize() = 4096
pack_report: core.packedGitWindowSize = 1073741824
pack_report: core.packedGitLimit = 8589934592
pack_report: pack_used_ctr = 60
pack_report: pack_mmap_calls = 31
pack_report: pack_open_windows = 1 / 1
pack_report: pack_mapped = 1293621 / 1293621
---------------------------------------------------------------------

Oliver (oliver-assarbad) wrote :

Also attaching the plain-text version of the patch

andrew bezella (abezella) wrote :

i applied the patch and was able to successfully export etckeeper's bzr repo. thank you!

Chris Peach (peachris+ubuntu) wrote :

Thanks, Oliver! Your little patch works well now. It let me convert my etckeeper repo that had been created with Bazaar by default.

Pawel Tecza (ptecza) wrote :

Oliver, thanks a lot for your patch! It was very helpful for me in order to convert Etckeeper bzr repo to git under Ubuntu Trusty.

Sathors (sathors) wrote :

For those who have this problem and for whom Oliver's patch did not solve it, maybe try this patch.

It resolves a bug that occurs when a committer has a Unicode character in their name (in my case Nuñez).
