media scanner does not handle files with incorrectly encoded tags (mojibake)

Bug #1384857 reported by James Henstridge
This bug affects 1 person
Affects: mediascanner2 (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

I don't really have enough information to know how prevalent this problem is, since it seems to be highly dependent on region. I was shown a Chinese user's phone where half the songs came up with garbage metadata. It seems that the problem is that the metadata in these files is tagged as ISO-8859-1, but is actually in the locale's legacy encoding (GBK in the case of these Chinese tracks).

It is not clear whether we can easily fix this in media scanner though, since GStreamer is providing tag data to us normalised to UTF-8. To unmangle the text, I needed to convert this UTF-8 to ISO-8859-1, and then convert that back to UTF-8 as if it was GBK.
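
For illustration, the round trip described above can be sketched in Python. This is only a sketch, not code from media scanner or GStreamer: the function name `unmangle` and the default of GBK are assumptions made for the example, and a real fix would need some way to choose the legacy encoding.

```python
def unmangle(text: str, legacy: str = "gbk") -> str:
    """Undo a Latin-1 misdecode of text that was really in a legacy encoding.

    `legacy` is an assumption supplied by the caller; nothing in the tag
    itself says which encoding the bytes were originally written in.
    """
    try:
        # Step 1: recover the original tag bytes. Latin-1 maps every byte
        # to one code point, so this is lossless for misdecoded text.
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # The string has characters outside Latin-1, so it cannot be the
        # result of a Latin-1 misdecode. Leave it alone.
        return text
    try:
        # Step 2: reinterpret those bytes in the legacy encoding.
        return raw.decode(legacy)
    except UnicodeDecodeError:
        # The bytes are not valid in the legacy encoding; keep the input.
        return text


if __name__ == "__main__":
    original = "周杰伦"
    # Simulate what happens to a GBK tag that is labelled ISO-8859-1.
    mangled = original.encode("gbk").decode("iso-8859-1")
    print(unmangle(mangled) == original)
```

Note that this is a heuristic at best: some genuine Latin-1 strings with accented characters also happen to form valid GBK byte sequences, so applying it unconditionally could corrupt correctly tagged files.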

GStreamer already includes some code to attempt to decode text according to the locale's encoding, but since we are using UTF-8 locales this doesn't do anything:

http://cgit.freedesktop.org/gstreamer/gst-plugins-base/tree/gst-libs/gst/tag/id3v2frames.c#n968

There is also an open upstream bug about guessing at a legacy encoding based on the locale, but it hasn't seen any activity in a year:

https://bugzilla.gnome.org/show_bug.cgi?id=688367

summary: media scanner does not handle files with incorrectly encoded tags
+ (mojibake)
affects: mediascanner2 → mediascanner2 (Ubuntu)
Michi Henning (michihenning) wrote :

I don't believe that this is a legitimate bug. The ID3 spec requires the encoding to be one of the following:

$00 – ISO-8859-1 (LATIN-1, Identical to ASCII for values smaller than 0x80).
$01 – UCS-2 (UTF-16 encoded Unicode with BOM), in ID3v2.2 and ID3v2.3.
$02 – UTF-16BE encoded Unicode without BOM, in ID3v2.4.
$03 – UTF-8 encoded Unicode, in ID3v2.4.

It's illegal to write GBK into ID3 tags, and I don't think we should make any attempt to perpetuate this error.
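
To make the encoding byte concrete, here is a rough Python sketch of how a reader that follows the list above would decode a text frame body. The names `ID3_TEXT_CODECS` and `decode_text_frame` are made up for this example, and it assumes the payload is just the encoding byte followed by the text, ignoring frame headers and multi-value frames.

```python
# Encoding byte -> Python codec name, following the list above.
ID3_TEXT_CODECS = {
    0x00: "iso-8859-1",
    0x01: "utf-16",     # the BOM in the data selects the byte order
    0x02: "utf-16-be",  # ID3v2.4 only, no BOM
    0x03: "utf-8",      # ID3v2.4 only
}


def decode_text_frame(payload: bytes) -> str:
    """Decode a text frame body: one encoding byte followed by the text."""
    if not payload:
        raise ValueError("empty text frame payload")
    codec = ID3_TEXT_CODECS.get(payload[0])
    if codec is None:
        raise ValueError(f"unknown ID3 text encoding byte: {payload[0]:#04x}")
    # Strip any trailing NUL terminator after decoding.
    return payload[1:].decode(codec).rstrip("\x00")


if __name__ == "__main__":
    # A spec-compliant UTF-8 frame decodes cleanly.
    print(decode_text_frame(b"\x03" + "周杰伦".encode("utf-8")))
    # GBK bytes labelled as ISO-8859-1 decode without error, but to mojibake.
    print(decode_text_frame(b"\x00" + "周杰伦".encode("gbk")))
```

The second call shows why the problem is hard to detect: every byte sequence is valid ISO-8859-1, so a compliant reader has no error to react to.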

Changed in mediascanner2 (Ubuntu):
status: New → Invalid