Comment 5 for bug 507132

Guillaume Millet (guimillet) wrote :

The problem appears with UserTextFrame and, when the option --strict is on, also with LyricsFrame and CommentFrame. I had a hard time finding the cause. Here is the explanation, taking CommentFrame as an example.
Actually, the error is not raised by encode() but by decode(), which seems to be called from sys.stdout.write (called by printMsg, a function whose usefulness compared to print I don't see) in eyeD3, line 995:
    printMsg("%s: [Description: %s] [Lang: %s]\n%s" %\
                     (boldText("Comment"), cDesc, cLang,
                      cText.encode(ENCODING,"replace")));
with printMsg(s) = sys.stdout.write(s + '\n').

The problem is linked to cDesc. The strings cDesc and cText are set as Unicode strings in frames.py, line 1076:
    self.description = unicode(d, id3EncodingToString(self.encoding));
    self.comment = unicode(c, id3EncodingToString(self.encoding));
but then,
    if not strictID3():
        self.description = cleanNulls(self.description)
        self.comment = cleanNulls(self.comment)
with cleanNulls(s) = "/".join([x for x in s.split('\x00') if x]), which does not necessarily return a Unicode string (for an empty result, "/".join([]) is the byte string ''). Therefore, with the option --strict, cleanNulls() is skipped and, when printing, cDesc is a Unicode string but cText.encode(ENCODING,"replace") is a byte string. A sample command showing the error is
    >>> print "%s %s" %(u'', (u'é').encode("utf-8","replace"))
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
whereas
    >>> print "%s %s" %('', (u'é').encode("utf-8","replace"))
     é
    >>> (u'é').encode("utf-8","replace") # returns a byte string
    '\xc3\xa9'

In Python 2.x (this may differ in 3.x with the new str type), if at least one of the arguments is a Unicode string, the string formatting apparently tries to convert all the byte strings, if any, to Unicode with decode(), which uses the 'ascii' encoding by default, hence the UnicodeDecodeError.
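
The implicit conversion described above can be reproduced by doing the decode() step by hand. This is only a sketch: it assumes the default codec is 'ascii', as in the traceback, and the variable names are made up for the illustration.

```python
# -*- coding: utf-8 -*-
# Explicit version of the decode() step that the formatting seems to do
# implicitly. Variable names are illustrative, not taken from eyeD3.

encoded = u"\xe9".encode("utf-8", "replace")  # byte string '\xc3\xa9' (é)

try:
    # Decoding with the 'ascii' codec fails on the first non-ASCII byte.
    encoded.decode("ascii")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Decoding with the codec that was actually used for encoding works.
decoded = encoded.decode("utf-8")

print(failed)
print(message)
```

The error message names the same byte (0xc3) and position as the traceback above, which is consistent with an implicit ASCII decode of the encoded comment text.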

I see two (explainable ;) ) ways out of the bug: either modify cleanNulls(s) so that it returns a Unicode string (maybe contrary to the purpose of cleanNulls(s), I don't know), or encode cDesc when printing with cDesc.encode(ENCODING,"replace"), which is what the attached patch does.
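
As a rough sketch of the first option, a variant of cleanNulls() that keeps the type of its argument could look like the following. The name clean_nulls_keep_type is hypothetical and not part of eyeD3, and this is not what the attached patch does; it only assumes that the separator and the null character should have the same string type as the input.

```python
# Hypothetical type-preserving variant of cleanNulls(); not eyeD3 code.

TEXT_TYPE = type(u"")  # unicode on Python 2, str on Python 3


def clean_nulls_keep_type(s):
    """Replace runs of null characters with '/' and keep the type of s."""
    is_text = isinstance(s, TEXT_TYPE)
    null = u"\x00" if is_text else b"\x00"
    sep = u"/" if is_text else b"/"
    # Joining with a separator of the same type keeps an empty result
    # in that type as well.
    return sep.join([part for part in s.split(null) if part])


print(repr(clean_nulls_keep_type(u"a\x00\x00b")))
print(repr(clean_nulls_keep_type(u"")))
```

With this variant an empty Unicode description stays a Unicode string, so the description would have the same type whether or not --strict is used. The printing code would then still need to encode it, as in the second option.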

For UserTextFrame, the bug always appears because description is never processed through cleanNulls(), regardless of --strict, which seems to be another inconsistency compared to the behavior chosen for LyricsFrame and CommentFrame.