Unicode Latin1 -> UTF-8 support

Bug #135921 reported by Adam Olsen
This bug affects 1 person
Affects   Status       Importance   Assigned to             Milestone
Exaile    Incomplete   Medium       Exaile Bug Day Events

Bug Description

I have not had the time to narrow down where exactly everything dies, but there needs to be some kind of wrapper system to ensure that everything that is written to sqlite and later retrieved is UTF-8. Currently, it handles some unicode but cannot convert Latin1 to UTF-8.

This occurs while scanning directories for files, while storing tag data, and when retrieving songs from the DB. It seems to be a real showstopper, because sqlite supports only UTF-8.

I'm afraid I won't have more time to play with this for now, so I'm posting a ticket in hopes that someone else will get around to it. I suggest either refactoring the database inserts a little, so you can add some latin1->utf8 decode/encode logic before inserts or maybe play with something similar to:

{{{
db.text_factory = lambda x: unicode(x, "utf-8", "replace")
}}}
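
For anyone trying this on a current Python, here is a minimal sketch of the text_factory idea, assuming Python 3's sqlite3 module (the snippet above is Python 2). The table name `paths` and the helper `read_back` are made up for illustration, and the CAST is only there to get non-UTF-8 bytes into a TEXT column for the demo. It is not a fix for Exaile, just a way to see the behaviour in isolation.

```python
import sqlite3


def read_back(raw: bytes, lenient: bool) -> str:
    """Store raw bytes as TEXT in an in-memory DB and read them back.

    `read_back` and the `paths` table are hypothetical names for this demo.
    """
    con = sqlite3.connect(":memory:")
    if lenient:
        # text_factory receives the stored bytes; undecodable bytes
        # become U+FFFD instead of raising an error.
        con.text_factory = lambda b: b.decode("utf-8", "replace")
    con.execute("CREATE TABLE paths (name TEXT)")
    # CAST makes SQLite treat the bytes as TEXT without validating them.
    con.execute("INSERT INTO paths VALUES (CAST(? AS TEXT))", (raw,))
    (name,) = con.execute("SELECT name FROM paths").fetchone()
    con.close()
    return name


latin1_name = b"R\xe4ndom.mp3"  # Latin-1 bytes, not valid UTF-8
print(repr(read_back(latin1_name, lenient=True)))
try:
    read_back(latin1_name, lenient=False)
except sqlite3.OperationalError as exc:
    print("default text_factory failed:", exc)
```

Note that "replace" is lossy: the string you get back can no longer be used to open the file, so the original bytes would have to be kept somewhere as well.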

Same goes for file scanning:
{{{
xl/tracks.py

     for file in to_scan:
         try:
+            file = unicode(os.path.join(dir, file), 'latin-1').encode("utf-8", 'ignore')
+            #file = unicode(os.path.join(dir, file))
         except UnicodeDecodeError:
             xlmisc.log("Error decoding filename %s" % file)
             continue
}}}
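
A caveat on the patch above, with a sketch in Python 3 (an assumption; the original code is Python 2): decoding as Latin-1 never raises, so the except branch would not fire, and names that are already valid UTF-8 would come out double-encoded. A slightly safer variant tries UTF-8 first and only then falls back. The helper name `decode_filename` is hypothetical, and the Latin-1 fallback is still only a guess about the real encoding.

```python
def decode_filename(raw: bytes, fallback: str = "latin-1") -> str:
    """Decode a filename, preferring UTF-8 and falling back to a guess.

    `decode_filename` is a hypothetical helper, not part of Exaile.
    """
    try:
        # Valid UTF-8 names pass through untouched.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Guess: treat anything else as the fallback encoding.
        # Latin-1 maps every byte to a character, so this cannot fail,
        # but the result is wrong if the name was in another encoding.
        return raw.decode(fallback)


print(decode_filename("Rändom.mp3".encode("utf-8")))
print(decode_filename(b"R\xe4ndom.mp3"))
```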

This ticket was migrated from the old trac: re #79

Adam Olsen (arolsen)
Changed in exaile:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Marc Poulhiès (marc-poulhies) wrote :

Any news on this?
I have filenames which have strange encodings. I guess something is wrong with them and I should fix this. But all the previous players I used can still use them. Exaile fails to import them and stops, which is very annoying.

Revision history for this message
thanatos7 (thanatos7-deactivatedaccount) wrote :

I'm not sure if this will help, but I've included the output that exaile 0.2.11 gives:

loading tracks...
done loading tracks...
loading songs
Clearing tracks cache
Last playlist loaded
Starting scan timer at 25
Running is False
File count: 448
/usr/lib/exaile/xl/xlmisc.py:703: GtkWarning: gtk_text_buffer_emit_insert: assertion `g_utf8_validate (text, len, NULL)' failed
  self.buf.insert(iter, text)
Couldn't read tags from file: /home/username/documents/music/Rammstein/Rosenrot/07 Zerst�en.mp3
-----------------------
 run ( /usr/lib/exaile/xl/library.py @ 616):
-----------------------
Traceback (most recent call last):
  File "/usr/lib/exaile/xl/library.py", line 654, in run
    self.do_function(loc)
  File "/usr/lib/exaile/xl/library.py", line 804, in do_function
    path_id = get_column_id(db, 'paths', 'name', unicode(loc, xlmisc.get_default_encoding()))
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-60: invalid data

Created db for thread Thread-6
{'Thread-6': <sqlite3.Connection object at 0xb55ac890>}
Closed db for thread Thread-6
Count is now: 448
loading tracks...
done loading tracks...
loading songs
Traceback (most recent call last):
  File "/usr/lib/exaile/xl/panels/collection.py", line 545, in load_tree
    songs = self.search_tracks(self.keyword, self.all)
  File "/usr/lib/exaile/xl/panels/collection.py", line 561, in search_tracks
    self.keyword, None, self.where)
  File "/usr/lib/exaile/xl/library.py", line 232, in search_tracks
    for row in cur.fetchall():
sqlite3.OperationalError: Could not decode to UTF-8 column 'name' with text '/home/username/documents/music/Rammstein/Rosenrot/07 Zerst�en.mp3'
-----------------------
 select ( /usr/lib/exaile/xl/db.py @ 178):
-----------------------
Traceback (most recent call last):
  File "/usr/lib/exaile/xl/db.py", line 191, in select
    row = cur.fetchone()
OperationalError: Could not decode to UTF-8 column 'name' with text '/home/username/documents/music/Rammstein/Rosenrot/07 Zerst�en.mp3'

-----------------------
 load_tracks ( /usr/lib/exaile/xl/library.py @ 262):
-----------------------
Traceback (most recent call last):
  File "/usr/lib/exaile/xl/library.py", line 321, in load_tracks
    row = cur.fetchone()
OperationalError: Could not decode to UTF-8 column 'name' with text '/home/username/documents/music/Rammstein/Rosenrot/07 Zerst�en.mp3'

Clearing tracks cache

Revision history for this message
era (era) wrote :

I had various legacy encodings in my media library. Exaile pretended to import it just fine, but from then on, using Exaile was near impossible -- the degradation was just horrible, and my .xsession-errors filled up with Python tracebacks every time I tried to start it.

Could the severity of this be raised? The user experience is pretty horrible.

(Ubuntu 8.10 / Exaile 0.2.13)

Revision history for this message
era (era) wrote :

To add to the previous, the tracebacks don't even indicate where the problematic string is, so you have to be good at guessing, or just have the patience to hunt it down (a somewhat daunting task with 18398 files in the collection).

There is no user-visible error when started from the Gnome menu, just a steady firehose of errors to .xsession-errors. As a minimal workaround of sorts, there should be a dialog box saying there is a problem, if this cannot be fixed easily.

Here are some excerpts from .xsession-errors:

-----------------------
 run ( /usr/share/exaile/xl/library.py @ 682):
-----------------------
Traceback (most recent call last):
  File "/usr/share/exaile/xl/library.py", line 720, in run
    self.do_function(loc)
  File "/usr/share/exaile/xl/library.py", line 764, in do_function
    tr = read_track_from_db(db, unicode(loc, xlmisc.get_default_encoding()))
  File "/usr/lib/python2.5/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 57-59: invalid data

Revision history for this message
reacocard (reacocard) wrote :

If someone experiencing this issue could grab a copy of exaile from trunk and test to see whether the issue still exists, that'd be great.

Changed in exaile:
status: Confirmed → Incomplete
Changed in exaile:
assignee: nobody → Exaile Bug Day Events (exaile-bugday)
Revision history for this message
era (era) wrote :

This is not hard to reproduce: just change the file name or id3 tag (depending on how you have Exaile configured) of a sample song to Latin-1, and import it.

A simple test case is to change an a to an accented ä, which is Latin-1 0xE4 and thus of course Unicode U+00E4, but encoded in UTF-8 it is the byte sequence 0xC3 0xA4.
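
To make the byte values concrete, a tiny Python 3 check (Python 3 is an assumption here, the Exaile code in this report is Python 2; the variable names are just for illustration):

```python
ch = "ä"

code_point = ord(ch)                # Unicode code point of the character
latin1_bytes = ch.encode("latin-1")  # single byte in Latin-1
utf8_bytes = ch.encode("utf-8")      # two bytes in UTF-8

print(hex(code_point), latin1_bytes, utf8_bytes)

# The lone Latin-1 byte is not valid UTF-8 on its own:
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    print("not UTF-8:", exc)
```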

Revision history for this message
Steve Dodier-Lazaro (sidi) wrote :

Thanks for the information. I just need to figure how to turn a filename into latin 1 now :]

Revision history for this message
reacocard (reacocard) wrote :

> Thanks for the information. I just need to figure how to turn a filename into latin 1 now :]

It's not quite so clear-cut, since you also have to be able to turn it back later when you want to access the file again, say for playback. There's currently zero infrastructure to support this sort of thing and encoding has been a very tricksy issue in the past, so we should proceed very carefully with this, probably in a separate branch.

Revision history for this message
era (era) wrote :

How to mess up your file system depends also somewhat on how your locales are set up etc. I would perhaps suggest you use something like Perl, where you can unambiguously and portably express stuff like

vnix$ perl -e 'rename("Random.mp3", "R\xe4ndom.mp3") || die "Could not rename: $!\n"'

This still depends on the underlying file system and what not, but works for me on ext3 on Ubuntu Linux. (If you ls this file it will show as "R?ndom" and tab completion in Bash will produce a Unicode "unknown/invalid" glyph -- Nautilus does that in spades, adding an (invalid encoding) after the file name --, but raw access to the file system etc will show you that the file name is exactly as you requested it, with a Latin-1 ä, as in "Rändom.mp3", only of course here in Launchpad it is in Unicode.)

More generally, you can use iconv to force silly round-trip conversion errors:

vnix$ echo rändom | iconv -f latin1 -t utf8
rÃ¤ndom

(... assuming your shell correctly allows you to enter a proper ä and that your locale is set up to use UTF8.)
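
The same conversion error can be reproduced without a shell. This is a rough Python 3 equivalent of the iconv call, under the same assumption that the input really is UTF-8; the function name `misread_as_latin1` is made up for the example.

```python
def misread_as_latin1(text: str) -> str:
    """Take UTF-8 bytes and wrongly interpret them as Latin-1.

    Hypothetical helper mirroring `iconv -f latin1 -t utf8` on UTF-8 input.
    """
    utf8_bytes = text.encode("utf-8")
    # Each UTF-8 byte is treated as one Latin-1 character.
    return utf8_bytes.decode("latin-1")


garbled = misread_as_latin1("rändom")
print(garbled)

# The damage is reversible as long as nothing else touched the string:
restored = garbled.encode("latin-1").decode("utf-8")
print(restored)
```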

I'm not saying I know at all how to solve this, but any developer should be able to reproduce, trivially.

Revision history for this message
era (era) wrote :

Also for the record the proper name of the Rammstein song featured in comment #2 is apparently Zerstören <http://en.wikipedia.org/wiki/Rosenrot> -- the mojibake you see here in Launchpad is a typical example of a Latin-1 encoding interpreted as an invalid UTF8 sequence (the ö is 0xF6 in Latin-1 and so U+00F6 in Unicode; but 0xF6 is a prefix in UTF-8, so the following r gets eaten by the UTF-8 decoder before it realizes this is an invalid sequence).
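
For reference, here is how the same file name behaves on Python 3 (an assumption; the tracebacks above are from Python 2.5). As far as I can tell the newer decoder rejects 0xF6 right away instead of consuming the bytes after it, and the exception object does say where the bad bytes are, which would help with the "where is the problematic string" complaint. The helper `find_bad_bytes` is a hypothetical name.

```python
def find_bad_bytes(raw: bytes):
    """Return (offset, offending bytes) for the first invalid UTF-8
    sequence in raw, or None if raw decodes cleanly.

    Hypothetical helper for illustration only.
    """
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start / exc.end delimit the bytes the decoder rejected.
        return exc.start, exc.object[exc.start:exc.end]
    return None


name = b"07 Zerst\xf6ren.mp3"  # Latin-1 bytes for the file name
print(find_bad_bytes(name))
print(name.decode("latin-1"))  # correct only because we know it is Latin-1
```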

Revision history for this message
era (era) wrote :

Another thing worth mentioning is that not all invalid Unicode is Latin1, of course. It could just as well be CP-850 or KOI-8R or ISO2022-JP or KSC-5601 or GB2312 or any of at least a hundred other legacy encodings.

Revision history for this message
reacocard (reacocard) wrote :

> Another thing worth mentioning is that not all invalid Unicode is Latin1, of course. It could just as well be CP-850 or KOI-8R or ISO2022-JP or KSC-5601 or GB2312 or any of at least a hundred other legacy encodings.

True, but it is essentially impossible for us to autodetect which encoding it is. That's the crux of the problem with legacy encodings.
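
A small illustration of why detection is so hard, sketched in Python 3 (an assumption) with three of the encodings mentioned above: the very same bytes decode without error under each of them, just to different text, so nothing in the bytes themselves says which reading is right. The codec names are the ones Python ships; the variable names are only for the example.

```python
raw = b"R\xe4ndom.mp3"

# Every one of these single-byte encodings accepts the byte 0xE4,
# so none of the decodes fails; they simply disagree on the character.
candidates = ["latin-1", "cp850", "koi8_r"]
readings = {enc: raw.decode(enc) for enc in candidates}

for enc, text in readings.items():
    print(enc, repr(text))
```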

Also, I still haven't heard whether you tried trunk to see if it has the same problem. In theory trunk should be avoiding touching the encoding as much as possible, which should allow it to work with any encoding. Theoretically, anyway.

Revision history for this message
era (era) wrote :

> > Another thing worth mentioning is that not all invalid Unicode
> > is Latin1, of course. It could just as well be CP-850 or KOI-8R
> > or ISO2022-JP or KSC-5601 or GB2312 or any of at least a
> > hundred other legacy encodings.
>
> True, but it is essentially impossible for us to autodetect which
> encoding it is. That's the crux of the problem with legacy encodings.

That is precisely the point I was trying to make. Some of the comments above allude to an automatic translation based on the assumption that any invalid UTF8 is Latin-1 which is not true, and should be avoided. The only robust approach to this problem is "I don't know what encoding you used in the file name, so I can't show it correctly, but I sure can play the music in that file". Anything else is bound to fall on its face in new and interesting ways (unless of course you manage to solve the very ambitious task to correctly guess encodings, which is hardly of central importance for a media player).
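
One possible reading of the "can't show it, but can still play it" approach, sketched in Python 3 (an assumption; nothing here is Exaile code and the function names are hypothetical): keep the raw bytes as the identity of the file, and derive a lossy string only for display. The `surrogateescape` error handler, which Python 3 itself uses for file names, gives a string form that converts back to exactly the original bytes.

```python
def display_name(raw: bytes) -> str:
    """Lossy string for the UI: bad bytes are shown as U+FFFD."""
    return raw.decode("utf-8", "replace")


def to_internal(raw: bytes) -> str:
    """Lossless string form: undecodable bytes become lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def to_raw(internal: str) -> bytes:
    """Recover the exact original bytes, e.g. to open the file."""
    return internal.encode("utf-8", "surrogateescape")


raw = b"R\xe4ndom.mp3"
internal = to_internal(raw)
print(repr(display_name(raw)))   # what the user would see
print(to_raw(internal) == raw)   # the path is still usable for playback
```

The catch is that the internal string contains lone surrogates, so it cannot be encoded as strict UTF-8; storing it in the database would still mean storing the raw bytes (for example in a BLOB column) or the display form alongside them.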

> Also, I still haven't heard whether you tried trunk to see
> if it has the same problem. In theory trunk should be
> avoiding touching the encoding as much as possible,
> which should allow it to work with any encoding.
> Theoretically, anyway.

I'm afraid I won't necessarily be able to make that kind of investment in this bug report. On my production system, I had to settle on a different music player because of this issue, and I'm not too familiar with Exaile, its development model, or Python. I'll be happy to offer input on how to reproduce this bug if somebody else would like to create test cases. Because this is a thorny issue, there should probably be several test cases for this in the test suite. And anyway, there are several other persons who have chimed in and reported that they too have this problem.

Still, if you could provide a pointer to a brief howto for running the trunk version in a virtual machine with Ubuntu 9.04 or 9.10 prerelease, I could try to find the time to do that.
