Legacy encodings ID3 tags support

Bug #223547 reported by Jiahua Huang
Affects   Status      Importance   Assigned to   Milestone
Exaile    Confirmed   Medium       Unassigned

Bug Description

Currently ID3v1 tags are assumed to be encoded in UTF-8 (or Latin-1), but
many MP3s carry tags in a legacy charset such as gb18030 (or big5, euc-jp, euc-kr), which must be used in order to read them,
especially MP3s obtained from P2P programs.

So Exaile needs to guess the encoding and convert ID3 tags from legacy encodings to Unicode.
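
To illustrate the problem, here is a minimal Python 3 sketch (the sample title and variable names are made up for illustration and are not taken from Exaile). It shows a tag written as gb18030 bytes turning into mojibake when read as Latin-1, and being recovered by reversing that step:

```python
# Hypothetical sample: a Chinese title stored in a tag as gb18030 bytes.
title = "中文歌曲"
raw = title.encode("gb18030")

# A reader that assumes Latin-1 never fails, but produces mojibake.
misread = raw.decode("iso8859-1")

# Undo the wrong decode, then decode with the real charset.
recovered = misread.encode("iso8859-1").decode("gb18030")

print(repr(misread))
print(recovered)
```

The recovery only works here because the real charset is known in advance; a player has to guess it.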

Jiahua Huang (huangjiahua) wrote :
ChenXing (cxcxcxcx) wrote :

I am also suffering from this bug, and strongly suggest the patch provided by Jiahua Huang be adopted.

Most MP3 files (90% or more, without exaggeration) in mainland China still have tags encoded in gb18030, so this is a critical bug affecting all Chinese users. We Chinese have tried hard to find music players on Linux that support legacy encodings, but have failed. This is even the most important reason my friends give for saying Linux has poor Chinese language support.

Besides, Huang's patch doesn't affect correctly encoded UTF-8 tags; it only adds support for legacy encodings. The patch also works for Exaile 0.2.99 if you add the code block to xl/metadata/_id3.py before "class ID3Format", like this:

...
from mutagen import id3

_unicode=unicode
def unicode(string, encoding='utf8',errors='strict'):
    try:
        string = string.decode('utf8').encode('iso8859-1')
    except:
        return _unicode(string)
    for enc in ('utf8', 'gb2312', 'big5', 'gb18030', 'big5hkscs', 'euc-jp', 'euc_kr', 'cp1251', 'utf16'):
        try:
            return string.decode(enc)
        except:
            pass
    return string

class ID3Format(BaseFormat):
...

Chen Tao (pro711) wrote :

Yes, I agree with ChenXing. If the patch could be merged into Exaile, I think a lot of Chinese users would choose Exaile as their music player. Exaile is an excellent music player in many respects. If this problem can be fixed, Exaile will be one of the best music players on Linux for us Chinese users.

reacocard (reacocard) wrote :

I've committed the patch to trunk in r1754, please test and confirm that it resolves the issue.

Changed in exaile:
importance: Undecided → Medium
milestone: none → 0.3.0
status: New → Fix Committed
ljpsfree (caifen1985) wrote :

I have used the patch and it works for me. Thanks to ChenXing.

Jiahua Huang (huangjiahua) wrote :

Huge thanks, it works.

ChenXing (cxcxcxcx) wrote :

It works. Thanks very much! This is exciting.

jiu (jacques-charroy) wrote :

A few days ago I installed the bzr version (1760 or so...) and built a music database from scratch in it. Afterwards, a lot of collection elements with UTF-8 tags showed bizarre Chinese, Cyrillic, or hexadecimal-looking characters instead of accents or umlauts.
Also see a fuller description of the problem here: http://www.exaile.org/forum/viewtopic.php?f=4&t=485
Following the suggestion of sjohannes, I checked out the 1753 version of exaile from bzr, deleted then rebuilt the music collection in it. This solved the problem. That means there is something wrong with either commit 1754 or some other one between 1754 and the version I had downloaded earlier (can't remember which one, but around 1760).
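
One plausible reading of this regression can be sketched in Python 3. The function below is only an approximation of the patched `unicode()` helper, not the committed code: the name `guess_decode` is hypothetical, and it assumes the tag text has already been decoded to a string. Any text that fits in Latin-1 is pushed through the list of candidate encodings, so accented Western text gets reinterpreted as well:

```python
# Encodings tried by the patch, in the same order.
CANDIDATES = ("utf8", "gb2312", "big5", "gb18030", "big5hkscs",
              "euc-jp", "euc_kr", "cp1251", "utf16")

def guess_decode(text):
    """Python 3 approximation of the patched unicode() helper (hypothetical name)."""
    try:
        raw = text.encode("iso8859-1")
    except UnicodeEncodeError:
        return text  # contains characters outside Latin-1: leave untouched
    for enc in CANDIDATES:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            pass
    return text

# Intended case: gb18030 bytes that were misread as Latin-1 are repaired.
mojibake = "中文".encode("gb18030").decode("iso8859-1")
repaired = guess_decode(mojibake)

# Regression case: a correct tag with an umlaut also fits in Latin-1,
# so it is re-decoded with the first candidate that accepts its bytes.
accented = guess_decode("Björk")

print(repaired)
print(repr(accented))
```

In this sketch the umlaut is lost because some candidate always accepts the bytes, which would be consistent with the Chinese and Cyrillic characters described above.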

reacocard (reacocard) wrote :

I've reverted the patch, as we have a confirmed regression. We'll need to find a better way to solve this bug.

Changed in exaile:
status: Fix Committed → Confirmed
Jiahua Huang (huangjiahua) wrote :

I'm sorry about that.
Please try this patch, which adds iso8859-1 support (accents and umlauts).

- for enc in ('utf8', 'gb2312', 'big5', 'gb18030', 'big5hkscs', 'euc-jp',
+ for enc in ('iso8859-1', 'utf8', 'gb2312', 'big5', 'gb18030', 'big5hkscs', 'euc-jp',

Jiahua Huang (huangjiahua) wrote :

No, it does not work.
Please ignore the latest patch.

We'll need to find a better way to solve it.

ChenXing (cxcxcxcx) wrote :

Is the problem caused by "string = string.decode('utf8').encode('iso8859-1')"? Maybe removing "string =" will solve the problem. Could anybody give it a try?

from mutagen import id3

_unicode=unicode
def unicode(string, encoding='utf8',errors='strict'):
    try:
        # string = string.decode('utf8').encode('iso8859-1')
        string.decode('utf8').encode('iso8859-1')
    except:
        return _unicode(string)
    for enc in ('utf8', 'gb2312', 'big5', 'gb18030', 'big5hkscs', 'euc-jp', 'euc_kr', 'cp1251', 'utf16'):
        try:
            return string.decode(enc)
        except:
            pass
    return string

class ID3Format(BaseFormat):

Jiahua Huang (huangjiahua) wrote :

Thanks, it works.

I use this now:

import locale
if str(locale.getdefaultlocale()[0]).startswith('zh'):
    _unicode=unicode
    def unicode(string, encoding='utf8',errors='strict'):
        try:
            string.decode('utf8').encode('iso8859-1')
        except:
            return _unicode(string)
        string = string.decode('utf8').encode('iso8859-1')
        for enc in ('utf8', 'gb2312', 'big5', 'gb18030', 'big5hkscs', 'euc-jp', 'euc_kr', 'cp1251', 'utf16'):
            try:
                return string.decode(enc)
            except:
                pass
        return string
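
The locale gate in this version restricts the guessing to users whose language code starts with "zh", so other users keep the stock behaviour. A minimal Python 3 sketch of just that gate follows; the function name `guessing_enabled` is hypothetical, and `locale.getlocale()` is used here because `getdefaultlocale()` is deprecated in recent Python 3 releases:

```python
import locale

def guessing_enabled(language_code):
    """Mirror the patch's gate: only guess for Chinese ('zh*') locales."""
    # str() makes a missing locale (None) compare as the text 'None'.
    return str(language_code).startswith("zh")

# locale.getlocale() returns (language_code, encoding); either may be None.
current_language = locale.getlocale()[0]
print(guessing_enabled(current_language))
```

This keeps the regression away from non-Chinese locales, but it does not help a user in a "zh" locale whose collection also contains accented Western tags.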
