Media file content and filename encoding are not consistent

Bug #135985 reported by Adam Olsen
This bug affects 7 people
|| Affects || Status || Importance || Assigned to || Milestone ||
|| Exaile || Confirmed || High || Unassigned || ||

Bug Description

I see that there have been a number of bugs opened and fixed recently that deal with character encoding ([ticket:90], [ticket:165], [ticket:247], [ticket:292]). It's really a pain for a variety of historical reasons, but Exaile is going to have to deal with it. I'm going to try to explain the problems as I see them, and present the design for a solution. It's going to be more complicated than you want...

The two major problems are with '''MP3 tags''' and with '''filenames'''.

First, filenames. UNIX never defined an encoding for filenames. The only restriction was that a name couldn't contain a slash (/) or a NUL byte, which meant that people who wanted to save files with non-English names had to make something up. It's a long story, but fairly recently the GNOME/GTK+ folks decided to assume UTF-8 and allow users to override this if needed. Check out the [http://developer.gnome.org/doc/API/2.0/glib/glib-Character-Set-Conversion.html GLib docs] for a little explanation. Look especially at the '''Checklist for Application Writers''' section.

To handle '''filename encoding''' robustly, Exaile will have to allow a per-file option for which encoding to use. With nothing set, it should use UTF-8, but it would be possible for someone to have a French song named in Latin-1 and a Polish song named in Latin-2. It might be nice to have a per-directory setting. Then again, these files are non-standard, so perhaps all that is needed is an easy way to set the filename encoding for many files at once. Yes, both of those filenames should really be in UTF-8 (and a conversion option would be nice), but Exaile should be able to open the file no matter what. A simpler option would be to not care, but I think there are problems with SQLite and non-UTF-8 strings. Something else to consider is external album art.
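
A minimal Python 3 sketch of the per-file idea, assuming filenames are kept as raw bytes for I/O and decoded only for display. The names `display_name`, `io_name` and the `override` parameter are hypothetical and not part of Exaile.

```python
from typing import Optional


def display_name(raw_name: bytes, override: Optional[str] = None) -> str:
    """Decode a raw filename for display (hypothetical helper).

    Uses the per-file override if given, otherwise UTF-8. Bytes that do not
    fit are kept as surrogate escapes so the original name can be restored.
    """
    return raw_name.decode(override or "utf-8", errors="surrogateescape")


def io_name(shown: str, override: Optional[str] = None) -> bytes:
    """Recover the exact on-disk bytes from a displayed name."""
    return shown.encode(override or "utf-8", errors="surrogateescape")


latin1_name = "café.mp3".encode("iso-8859-1")  # a legacy Latin-1 filename

# Without an override the name is not valid UTF-8, but it still round-trips,
# so the file can be opened no matter what.
assert io_name(display_name(latin1_name)) == latin1_name

# With the per-file override the name also displays correctly.
assert display_name(latin1_name, "iso-8859-1") == "café.mp3"
```

The design choice here is that the override only affects what the user sees; the bytes used to open the file never change.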

For '''media tags''', the problem is more complicated. For ID3v1 and v1.1, the tags are "supposed" to be in ISO-8859-1 (Latin-1), but they are often not. ID3v2.2 and v2.3 text "should" be in ISO-8859-1 unless a frame explicitly declares UTF-16, while the less common ID3v2.4 also allows UTF-8. But lots of these text strings are not encoded correctly. Check out [http://en.wikipedia.org/wiki/ID3 the Wikipedia entry] for more. One option would be to disallow these files, but that wouldn't be very nice. The other would be to have an encoding field for each track like the one above, but it would have to be distinct from the filename field! It is quite likely that someone has renamed files to comply with GNOME standards but left the contents of the files alone.

APEv1 tags are rare (and ASCII only), while APEv2 and Vorbis comments (for Ogg, FLAC and Speex) are UTF-8 all the time. WMA and AAC have their own tagging standards (see the [http://www.id3.org/FAQ id3 FAQ]).

For '''radio streams''' the filename option is unneeded, but the tag option should be kept.

So, in short:
|| Type || Default Encoding || Overridable? ||
|| audio filenames || UTF-8 || per file, per directory(?) ||
|| coverart filenames || UTF-8 || per file, per directory(?) ||
|| mp3 id3v1 || ISO-8859-1 || per track, per directory, global for tag type(?) ||
|| mp3 id3v1.1 || ISO-8859-1 || per track, per directory, global for tag type(?) ||
|| mp3 id3v2 || ISO-8859-1 || per track, per directory, global for tag type(?) ||
|| mp3 id3v2.3 || ISO-8859-1 || per track, per directory, global for tag type(?) ||
|| mp3 id3v2.4 || UTF-8 || per track, per directory, global for tag type(?) ||
|| APE & Vorbis || UTF-8 || per track? ||
|| WMA || Unknown || per track? ||
|| AAC || Unknown || per track? ||
|| Radio Streams || From above tag type || per track, global for tag type(?) ||
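
As a rough Python 3 illustration, the table could be expressed as a lookup with overrides. The function name, the dictionary keys, and the precedence order (track, then directory, then tag type) are assumptions made for this sketch; the table itself marks several of these options with a question mark.

```python
from typing import Optional

# Defaults taken from the table; WMA and AAC are listed there as unknown.
DEFAULT_ENCODING = {
    "filename": "utf-8",
    "id3v1": "iso-8859-1",
    "id3v1.1": "iso-8859-1",
    "id3v2": "iso-8859-1",
    "id3v2.3": "iso-8859-1",
    "id3v2.4": "utf-8",
    "ape": "utf-8",
    "vorbis": "utf-8",
}


def resolve_encoding(kind: str,
                     per_track: Optional[str] = None,
                     per_directory: Optional[str] = None,
                     per_type: Optional[dict] = None) -> Optional[str]:
    """Pick the most specific setting (assumed order: track, directory, type)."""
    if per_track:
        return per_track
    if per_directory:
        return per_directory
    if per_type and kind in per_type:
        return per_type[kind]
    return DEFAULT_ENCODING.get(kind)  # None means "unknown, ask the user"


print(resolve_encoding("id3v1"))
print(resolve_encoding("id3v1", per_directory="cp1251"))
print(resolve_encoding("wma"))
```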

Yes, this is a pain. But either you have to make it easy for users to use their existing data, or make it easy to change it. I hope it's not too discouraging!

This ticket was migrated from the old trac: re #293

Adam Olsen (arolsen)
Changed in exaile:
importance: Undecided → High
status: New → Confirmed
Mathias Brodala (mathbr)
description: updated
CheolHan Yoon (mait) wrote :

@Adam

Thanks for the good, detailed report.

I also hope to see this problem fixed.

https://bugs.launchpad.net/exaile/+bug/135950 may contain the same suggestion.

Johannes Sasongko (sjohannes) wrote :

Are we still having problems with filepath encoding? AFAIK the separation between Track.get_loc (previously Track.loc, for display purposes) and Track.get_loc_for_io (previously Track.io_loc, for file operations) theoretically should have fixed this.

As for tag encoding, something similar in spirit to the patch in bug 223547 could be implemented (trying various different encodings on the tag until one is "valid"). To be honest, that's as far as I think we should go.

Steve Dodier-Lazaro (sidi) wrote :

What's the status of this bug? Maybe we should plan unit tests for file encoding / tag content for all common formats for 0.3.1?

reacocard (reacocard) wrote :

> What's the status of this bug? Maybe we should plan unit tests for file encoding / tag content for all common formats for 0.3.1?

Initial tests with gio.File indicate that it may be able to solve the filename encoding problems. As for in-tag encodings, nothing is implemented yet, nor are there any plans for it that I am aware of.

Jiahua Huang (huangjiahua) wrote :

Hi, here is my new patch:

=== modified file 'xl/metadata/_id3.py'
--- xl/metadata/_id3.py 2009-08-25 21:35:45 +0000
+++ xl/metadata/_id3.py 2009-10-09 11:33:58 +0000
@@ -31,6 +31,20 @@ from mutagen import id3
 import logging
 logger = logging.getLogger(__name__)

+import locale
+if str(locale.getdefaultlocale()[0]).startswith('zh'):
+    _unicode=unicode
+    def unicode(string, encoding='utf8',errors='strict'):
+        try:
+            string = string.decode('utf8').encode('iso8859-1')
+        except:
+            return _unicode(string)
+        for enc in ('utf8', 'gb2312', 'big5', 'gb18030', 'big5hkscs', 'euc-jp', 'euc_kr', 'cp1251', 'utf16'):
+            try:
+                return string.decode(enc)
+            except:
+                pass
+        return string

 class ID3Format(BaseFormat):
     MutagenType = id3.ID3

Jiahua Huang (huangjiahua) wrote :

I changed a few lines; use this version instead:

import locale
if str(locale.getdefaultlocale()[0]).startswith('zh'):
    _unicode=unicode
    def unicode(string, encoding='utf8',errors='strict'):
        try:
            string.decode('utf8').encode('iso8859-1')
        except:
            return _unicode(string)
        string = string.decode('utf8').encode('iso8859-1')
        for enc in ('utf8', 'gb2312', 'big5', 'gb18030', 'big5hkscs', 'euc-jp', 'euc_kr', 'cp1251', 'utf16'):
            try:
                return string.decode(enc)
            except:
                pass
        return string
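
For readers on Python 3, the core trick in the patch can be sketched as follows. It assumes the tag reader has already decoded the bytes as ISO-8859-1 (as the tag claims), so the text arrives as mojibake. `repair_mojibake` and its `real_encoding` parameter are hypothetical names, and unlike the patch this sketch tries a single encoding rather than a list.

```python
def repair_mojibake(text: str, real_encoding: str) -> str:
    """Undo a wrong ISO-8859-1 decode and redo it with real_encoding."""
    try:
        raw = text.encode("iso-8859-1")  # recover the original bytes
    except UnicodeEncodeError:
        # Characters outside Latin-1: the text was never misread, keep it.
        return text
    try:
        return raw.decode(real_encoding)
    except UnicodeDecodeError:
        return text  # not valid in that encoding: leave it untouched


original = "音乐"
# What a reader that trusts the ISO-8859-1 label would produce from GBK bytes:
misread = original.encode("gbk").decode("iso-8859-1")

assert misread != original
assert repair_mojibake(misread, "gbk") == original
assert repair_mojibake(original, "gbk") == original  # proper Unicode is kept
```

The round trip works because ISO-8859-1 maps every byte value to exactly one character, so no information is lost by the wrong decode.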

Aron Xu (happyaron) wrote :

Hi, please consider dealing with this problem. It is important for users who do not use only English, especially users in China, Japan and Korea (CJK).
Based on Jiahua Huang's work, here is a patch that I have verified to be in good shape and that applies to the latest bzr trunk.

Thanks,
Aron

reacocard (reacocard) wrote :

Isn't that basically the same as the previous comment? That one fails to work properly in all cases IIRC, which is why it didn't live long after commit before we reverted it. (Revisions 1754 and 1765, and bug 223547, if you're curious.)

Really, I don't think there is _any_ safe way to autodetect the encoding, and until that statement is proven wrong (by logic, not example), I am very much inclined not to accept any such patches. What I would like to do is instead offer an option in the tag editor to correct misencoded tags manually, as that is safe and will restore the files to a standards-compliant format, which they should be in in the first place.

It's also worth noting that your patch only works at scan time - it won't help users who already have misencoded tags in their databases unless they force it to re-read the tags. Offering a reencode option in the tag editor would work for that case as well.

Aron Xu (happyaron) wrote :

Sure, it is basically the same as the previous patch, as I stated before.
But why do you say it doesn't work in all cases? Here are two screenshots: one captured in version 0.3.0.2 (Karmic, PPA) and the other in trunk revision 2801 with the patch from my last comment. It is obvious that in 0.3.0.2 the last two songs are not shown correctly, while in the patched r2801 everything works.
So please consider it a second time. As far as I know, Amarok provides automatic encoding detection, which proves there is an acceptable way to do such a thing :)

Aron Xu (happyaron) wrote :

I agree with adding an option to convert the encodings, but I don't think it can solve the problem completely. Users who are just switching from MS Windows will have lots of media files in those legacy encodings, and most of them will need to keep using the files on Windows for quite a while during their switch. Because media players on Windows often don't support UTF-8 very well, people don't want to convert the tags right away, and they will be *very* happy to see everything work without any extra steps, and without anything that breaks things or even just looks like it does. Thus, adding support for legacy encodings is the best way to serve those users, and I think the check won't cost much and is really worth it.

reacocard (reacocard) wrote :

> As far as I know, Amarok provides auto encoding detection, which proves
> there is acceptable way on such thing, :)

Yes, but it _doesn't_ prove this implementation works fully. As I said, unless you can prove to me that this isn't going to misdetect any currently-working encodings, then I'm not going to commit it.

I understand that it will provide a very good user experience when this works, but if it causes regressions for users whose files ARE correct, then we can't do it. Period. I have yet to see any substantial proof that this approach will not cause regressions; screenshots showing that it does work in some cases are not enough.

One additional note - the locale check at the top is a monstrous hack. The same logic should be applied to ALL locales, since with the internet it is not uncommon for people to have files originating from many different countries and we need to make sure it works well in all of them.

Aron Xu (happyaron) wrote :

After a long discussion on IRC with Aren Olson, I am leaving a note here: Amarok2 can do the right thing on automatic encoding detection, so we will investigate this issue after Exaile 0.3.1 is out. :D

ChenXing (cxcxcxcx) wrote :

It seems that Amarok2 does not always do the "encoding guessing" job very well, either. However, I think Audacious provides an easier and acceptable solution: let the end user set a preferred tag encoding. For example, if I enter "gb2312" as the preferred encoding, the player will first try to decode the tags as gb2312, and then try Unicode if that attempt fails. A user who doesn't like this feature can simply disable it.

I know this means a lot of work, as the UI also needs to be modified, but the feature would greatly improve the experience for Asian (especially CJK) users. If anybody can take some time to implement it, we will be very grateful.
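
A minimal Python 3 sketch of this preferred-encoding behaviour, operating on the raw tag bytes. The function name `decode_tag` is hypothetical, "Unicode" is assumed to mean UTF-8, and the final ISO-8859-1 fallback is an addition of this sketch rather than something Audacious is known to do.

```python
from typing import Sequence


def decode_tag(raw: bytes, preferred: Sequence[str] = ()) -> str:
    """Try the user's preferred encodings first, then UTF-8.

    An empty `preferred` disables the feature. The last resort is
    ISO-8859-1, which accepts any byte sequence.
    """
    for encoding in (*preferred, "utf-8"):
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown codec name: try the next one
    return raw.decode("iso-8859-1")


# A gb2312 tag is read correctly once the user states a preference.
assert decode_tag("音乐".encode("gb2312"), ["gb2312"]) == "音乐"

# With the feature disabled, non-UTF-8 bytes fall back to ISO-8859-1.
assert decode_tag("café".encode("iso-8859-1")) == "café"
```

Because the preference is set explicitly by the user, files from users who never set it are decoded exactly as before.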


muzuiget (muzuiget) wrote :

I wrote a patch to solve this problem too. It also works for remote files (opened by URL).

Providing a text field in the GUI where the user can enter fallback encoding names is enough. There is no need to auto-convert, because that is difficult to do.

ChenXing (cxcxcxcx) wrote :

muzuiget's patch is similar in mechanism to the previous patch, so it will not work all the time. Someone has shown that a French or German word can be "misdecoded".

I think a better solution is to let the user set a preferred encoding, as Audacious does, but this does require some work.

Meanwhile, I recommend that people suffering from this problem try mp3tagiconv ( http://code.google.com/p/mp3tagiconv/ ). It's like mid3iconv, but the converted MP3 can also be recognized by Windows Media Player and some old MP3 players. As a user, you will not notice the change, except that your files can be recognized by more music players :)
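
To make the "misdecoded" risk concrete, here is a small Python 3 demonstration. The word chosen is only an illustration; the point is that some Latin-1 byte pairs are also valid two-byte GBK sequences, so a "try encodings until one is valid" loop can accept the wrong answer without any error.

```python
latin1_bytes = "schön".encode("iso-8859-1")

# The bytes for "ön" happen to form a valid two-byte GBK sequence,
# so this decode succeeds and silently yields the wrong text.
as_gbk = latin1_bytes.decode("gbk")

assert as_gbk != "schön"
assert as_gbk.startswith("sch")
assert len(as_gbk) == len("schön") - 1  # two letters collapsed into one character
```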

muzuiget (muzuiget) wrote :

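Yes, what I mean is exactly what Audacious does.

The code would look like:

encode_list = get_from_gui_text_field().split(',')
# e.g. encode_list = ['gbk', 'big5', 'shift_jis']
for encode in encode_list:
    try:
        text = raw_tag.decode(encode)  # do the conversion (raw_tag: the tag bytes)
        break
    except UnicodeDecodeError:
        pass  # not valid in this encoding, try the next one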

ChenXing (cxcxcxcx) wrote :

Yes, but then we need to change the GUI part and the profile. I'm not familiar with Exaile's UI code; if you'd like to spend time on it, I'll be very grateful. I believe they will adopt such a patch if there is a good one :)

era (era) wrote :

Proper UTF-8 should not be hard to identify: the high-bit sequences are required to follow a particular pattern (look up UTF-8 in Wikipedia to see a good illustration). What's hard is deciding how to interpret a legacy 8-bit encoding which is not valid UTF-8. You can probably figure out whether something is Latin or KOI etc. based on trigram frequencies, for example, but which Latin? (Latin-1 aka ISO-8859-1 and Latin-9 aka ISO-8859-15 differ in only eight code points, so you can probably just assume Latin-9 and never be too badly wrong.) If you interpret Latin-2 as Latin-9 you get many of the extended code points completely wrong (or should I say c$mpl^t^ly wr$ng ... look up "mojibake" in Wikipedia too). Anyway, the baseline should be to identify correct UTF-8 and handle it without the need for any preference or user interaction. How you handle the rest might need something like what is being specified in other comments above.
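
Both points can be sketched in a few lines of Python 3. The helper name `is_valid_utf8` and the sample word are illustrative choices, not taken from any existing patch.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Strict decoding fails unless the high-bit bytes follow the UTF-8 pattern."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


word = "Łódź"

# Baseline: correct UTF-8 is recognised without any user preference.
assert is_valid_utf8(word.encode("utf-8"))
assert not is_valid_utf8(word.encode("iso-8859-2"))

# The hard part: Latin-2 bytes read as Latin-9 decode without any error,
# but the result is mojibake, so validity alone cannot pick the encoding.
mojibake = word.encode("iso-8859-2").decode("iso-8859-15")
assert mojibake != word
```

Note that plain ASCII also passes the UTF-8 check, which is harmless because ASCII decodes identically in all of the encodings discussed here.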

muzuiget (muzuiget) wrote :

This bug is a few years old, and after each Exaile update I need to re-apply the encoding patch above.

I agree that auto-detecting string encoding is difficult, especially for short strings like audio file tags; there is no perfect solution. But that patch displays the tags of most files correctly, and it is simple.

To avoid hardcoding the fallback encodings list, this time I added a GUI option; you can look at the attached screenshot.

Here is my patch:

http://bazaar.launchpad.net/~muzuiget/exaile/exaile/revision/4357

Mathias Brodala (mathbr) wrote :

I’m not opposed to the functionality provided by this patch as a whole, but I think this option is clearly misplaced in the settings dialog. It is technically not related to the appearance of Exaile.

The setting should be a hidden one for now since none of the current setting categories match and it does not justify a new category.

muzuiget (muzuiget) wrote :

Where the option is placed and displayed in the UI doesn't matter; the point is to provide the option. At least I can manually edit ~/.config/exaile/settings.ini and add the option line "fallback_encodings = L: ['gbk', 'big5']".

So, what is your suggestion? Just hide it (set the "visible" property to false in the UI file), or move it to a new file (as a new tab)?
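
For illustration, a value in the form quoted above could be read like this in Python 3. This is a sketch that assumes the "L:" prefix marks a list-typed value, as the quoted line suggests; `parse_list_setting` is a hypothetical helper and not Exaile's actual settings parser.

```python
import ast


def parse_list_setting(value: str) -> list:
    """Parse a value such as "L: ['gbk', 'big5']" into a Python list."""
    prefix, _, payload = value.partition(":")
    if prefix.strip() != "L":
        raise ValueError("not a list-typed setting: %r" % value)
    parsed = ast.literal_eval(payload.strip())  # literals only, no code execution
    if not isinstance(parsed, list):
        raise ValueError("expected a list literal: %r" % value)
    return parsed


line = "fallback_encodings = L: ['gbk', 'big5']"
key, _, value = line.partition("=")
print(key.strip(), parse_list_setting(value))
```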

Mathias Brodala (mathbr) wrote :

My suggestion is exactly what you have written. Hidden options in Exaile require the user to edit them in his settings.ini manually.

muzuiget (muzuiget) wrote :

All right, I re-committed the patch:

http://bazaar.launchpad.net/~muzuiget/exaile/exaile/revision/4357

Note that I had to change one line in xlgui/preferences/__init__.py from "show_all()" to "show()". The PyGTK documentation says "The show_all() method recursively shows the widget, and any child widgets (if the widget is a container)." So "show_all()" would override the "visible" property set in the UI file.

Mathias Brodala (mathbr) wrote :

Is there a specific reason why you do the encoding conversion upon storing tag data and not upon reading it? This way loaded tag values will seem to be broken, even if we manage to store them in their original encoding.

Also, you should drop the GUI changes altogether because that’s what a hidden setting implies: no GUI for it.

muzuiget (muzuiget) wrote :

To the first question: that part of the code is just based on others' patches. I agree that converting upon reading is better.

Because I don't want to spend too much time hacking, this patch may still be a dirty workaround by the project's quality standards, but I think it works fine in most cases. So leave it as a third-party feature patch and let users apply it themselves.

I still think providing a GUI widget is better; editing settings.ini is no more convenient than hardcoding, so I reverted the patch to the first version.

http://bazaar.launchpad.net/~muzuiget/exaile/exaile/revision/4357
