Failure to import when decoding changelog authors

Bug #508251 reported by Andrew Starr-Bochicchio on 2010-01-16
This bug affects 3 people
Affects: Ubuntu Distributed Development
Importance: Medium
Assigned to: Unassigned

Bug Description

There are no bazaar source branches for the “evolution” package in Ubuntu on Launchpad. See:

https://code.edge.launchpad.net/ubuntu/+source/evolution/+branches

http://package-import.ubuntu.com/failures/evolution

Failed at 2010-01-12 21:57:53.778573

Traceback (most recent call last):
  File "./import_package.py", line 788, in <module>
    no_existing=options.no_existing))
  File "./import_package.py", line 713, in main
    import_package(temp_dir, importp, revid_db, bstore, possible_transports=possible_transports)
  File "./import_package.py", line 481, in import_package
    use_time_from_changelog=True)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1481, in import_package
    file_ids_from=file_ids_from)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1376, in _do_import_package
    timestamp=timestamp, file_ids_from=file_ids_from)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1261, in import_debian
    get_commit_info_from_changelog(changelog, self.branch)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/util.py", line 440, in get_commit_info_from_changelog
    authors += find_extra_authors(changes)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/util.py", line 388, in find_extra_authors
    match = extra_author_re.match(change.decode("utf-8"))
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 22-27: unsupported Unicode code range

Affects:
  evolution
  gnome-control-center
  totem (slightly different, see comment #1 below).
  gnome-panel

description: updated
John A Meinel (jameinel) on 2010-01-27
summary: - No source branches for “evolution” package in Ubuntu
+ Failure to import when decoding changelog authors
description: updated
Changed in udd:
importance: Undecided → Medium
status: New → Confirmed
John A Meinel (jameinel) on 2010-01-27
description: updated
Robert Collins (lifeless) wrote :

A related failure:
Traceback (most recent call last):
  File "./import_package.py", line 983, in <module>
    extra_debian=options.extra_debian))
  File "./import_package.py", line 941, in main
    import_package(temp_dir, importp, possible_transports=possible_transports)
  File "./import_package.py", line 563, in import_package
    use_time_from_changelog=True)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1541, in import_package
    timestamp=timestamp, author=author)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1428, in _do_import_package
    timestamp=timestamp)
  File "/srv/package-import.canonical.com/new/scripts/plugins/builddeb/import_dsc.py", line 1264, in import_debian
    revprops['authors'] = "\n".join(authors).decode("utf-8")
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf6' in position 12: ordinal not in range(128)

If authors is always made unicode and the handling is overhauled, this will go away.
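
A note on why a decode call can raise an encode error here: in Python 2, calling .decode() on a value that is already unicode first encodes it implicitly with the ascii codec, and joining a list that mixes str and unicode yields unicode. The following is only a Python 3 sketch that imitates that implicit step explicitly. The helper name py2_style_decode and the sample author names are invented for illustration and are not part of bzr-builddeb.

```python
def py2_style_decode(value, encoding="utf-8"):
    """Imitate Python 2: decoding a text string first encodes it as ASCII."""
    if isinstance(value, str):
        # This is the implicit step that fails on non-ASCII characters.
        value = value.encode("ascii")
    return value.decode(encoding)


# Invented sample data: one author name contains a non-ASCII character.
authors = ["Some Author", "Bj\xf6rn Example"]
joined = "\n".join(authors)

error = None
try:
    py2_style_decode(joined)
except UnicodeEncodeError as exc:
    # Same error class as in the traceback, raised from a "decode" call.
    error = exc
```

If the authors were decoded exactly once, where they are read from the changelog, the later decode call would be unnecessary and this failure mode would not arise.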

description: updated
John A Meinel (jameinel) on 2010-01-27
description: updated
John A Meinel (jameinel) wrote :

I tried to look into this a bit. As near as I can tell, the python-debian changelog parser doesn't decode anything. I would argue that all of the commit messages and authors should be Unicode strings as soon as we can reasonably make them, which means either in import_dsc.py or in python-debian itself, if we can hack that code.

James: I think I saw that you are one of the python-debian authors. Is it code that we can reasonably hack? Or is it sort of Debian-specific, so that we should be making the changes in bzr-builddeb?

John A Meinel (jameinel) wrote :

In the case of gnome-panel, this fails during "find_extra_authors". It fails because it iterates the changelog and tries to decode('utf-8') each line, looking for an author.

The gnome-panel changelog lists Translators as:

(Pdb) pp changes
['* New upstream version:',

...
 ' Docs Translators:',
 ' - Maxim Dziumanenko (uk)',
 ' Translators:',
 ' - Vital Khilko (be)',
 " - J\xe9r\xe9my Le Floc'h (br)",
 ' - Pema Geyleg (dz)',
 ' - Ivar Smolin (et)',
 ' - Beno\xeet Dejean (fr)',
...

Note that I'm pretty certain this is iso-8859-1 encoding, as '\xe9' => é and '\xee' => î. That said, iso-8859-2 and iso-8859-15 both decode these bytes to the same characters, so I guess it could be any of them...
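
The encoding guess can be checked with a short script. This is a Python 3 sketch that uses only the two non-ASCII lines quoted from the pdb output above; the variable names are made up for illustration.

```python
# The two non-ASCII changelog lines from the pdb session, as raw bytes.
raw_lines = [
    b" - J\xe9r\xe9my Le Floc'h (br)",
    b" - Beno\xeet Dejean (fr)",
]

candidates = ["iso-8859-1", "iso-8859-2", "iso-8859-15"]

# Decode every line with each candidate codec.
decoded = {}
for codec in candidates:
    decoded[codec] = [line.decode(codec) for line in raw_lines]

# Collect the lines that are not valid UTF-8.
utf8_failures = []
for line in raw_lines:
    try:
        line.decode("utf-8")
    except UnicodeDecodeError:
        utf8_failures.append(line)
```

All three candidate codecs give identical text for these lines, and both lines are rejected by the UTF-8 decoder, so the bytes alone cannot tell the three encodings apart.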

Anyway,

#1) These won't match the extra author information anyway, because they aren't in the form [Author Name]. So we could just wait to decode them until after the match is run. The current author regex is:
extra_author_re = re.compile(r"\s*\[([^\]]+)]\s*", re.UNICODE)

Which, IIRC, says "leading space, [, anything but ], ], trailing space".

However, if this sort of data is then brought into the commit log, etc, it is going to fail anyway, when we try to create a Unicode commit message.

#2) Allow the decode to fail, and just assume there isn't an author there.

#3) Fall back to iso-8859-1 as the decoder.
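
To make options #1 and #3 concrete, here is a minimal Python 3 sketch, not the actual bzr-builddeb code. It assumes the changelog lines arrive as raw bytes. The function names decode_with_fallback and find_extra_authors_sketch are hypothetical, and the regex is the one quoted above recompiled as a bytes pattern (without re.UNICODE, which is not allowed on bytes patterns).

```python
import re

# Same pattern as quoted above, compiled for bytes so matching needs no decoding.
extra_author_re = re.compile(rb"\s*\[([^\]]+)]\s*")


def decode_with_fallback(raw):
    """Option #3: try UTF-8 first, then fall back to iso-8859-1.

    iso-8859-1 maps every byte to a character, so the fallback cannot fail,
    although it may pick the wrong characters for other legacy encodings.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def find_extra_authors_sketch(changes):
    """Option #1: match on raw bytes and decode only the captured name."""
    authors = []
    for change in changes:
        match = extra_author_re.match(change)
        if match is not None:
            authors.append(decode_with_fallback(match.group(1)))
    return authors
```

With this shape, translator lines such as the gnome-panel ones are never decoded during author detection, because they do not match the [Author Name] form. Option #2 would instead amount to catching the UnicodeDecodeError and skipping the line. Either way, the commit message itself would still need a decoding policy, as noted under #1.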

James Westby (james-w) on 2010-04-22
Changed in udd:
status: Confirmed → Fix Released