foreign subtitles broken

Bug #233528 reported by Benjamin Kampmann
This bug affects 1 person
Affects             Status     Importance  Assigned to  Milestone
Elisa Media Center  Invalid    Undecided   Unassigned
Moovida             Invalid    High        Unassigned
0.5                 Won't Fix  High        Unassigned

Bug Description

See [http://elisa.fluendo.com/forums/?forumaction=showposts&forum=2&thread=281&start=0 forum]:
"Elisa show my videos fine, but videos with subtitles in other languages (Hebrew) appear as gibberish"

Example data that fails:

00:00:00:www.napiprojekt.pl - nowa jakoœæ napisów.|Napisy zosta³y specjalnie dopasowane do Twojej wersji filmu.
00:00:12:{C:$aaccff}T³umaczenie: JediAdam|Korekta: Juri24, Omickal
00:00:21:{C:$aaccff}Kinomania SubGroup
00:00:47:POKUTA
00:01:43:/"PRÓBY ARABELLI"|AUTORSTWA BRIONY TALLIS
00:02:13:Skoñczy³am swoj¹ sztukê.
00:02:15:Dobra robota.
00:02:16:Widzia³aœ mamê?
00:02:17:Powinna byæ w pokoju rysowniczym.
00:02:19:Mam nadziejê, ¿e nie bêdziesz dziœ|pl¹taæ siê nam pod nogami, panienko Briony.
00:02:22:Musimy przygotowaæ obiad.
00:02:36:CzeϾ.
00:02:37:S³ysza³em, ¿e wystawiasz sztukê.
00:02:38:Kto ci powiedzia³?

Changed in elisa:
assignee: fboucault → nobody
importance: Undecided → High
milestone: none → 0.5.x
Cesar Sanchez (csanchez-fluendo) wrote :

Confirmed... but it seems to be 99% the same bug as 330845

Philippe Normand (philn) wrote :

It is not the same as Bug 330845. This one involves GStreamer whereas the other one seems more specific to Elisa

Cesar Sanchez (csanchez-fluendo) wrote :

Oops... video titles in Hebrew do appear as gibberish (in lists), not subtitles while playing. You are right.

Philippe Normand (philn) wrote :

Please attach a subtitles file. I can't really reproduce this bug with the given data, in which obviously there are some weird characters

Olivier Tilloy (osomon) wrote :

I managed to find an archive of the original forum post: http://web.archive.org/web/20080201105306/http://elisa.fluendo.com/forums/?forumaction=showposts&forum=2&thread=281&start=0, and a copy of the

I am attaching the file that was mentioned in the post.
It indeed displays some crap when used in Moovida, but the file itself seems to contain that crap: it is encoded in latin1 and trying to convert it to cp1250 (Polish) with iconv fails.

Can any expert on "exotic" file encodings shed some light on this situation? Saviq?
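
For what it's worth, the garbling in the example data is consistent with cp1250 bytes being shown through a Western European code page. A minimal Python sketch, assuming the display side used cp1252 (an assumption, not something confirmed in this report), reverses it for two of the sample lines. The helper name `undo_mojibake` is made up for illustration.

```python
# Assumption: the sample lines are cp1250 bytes that were displayed as cp1252.
# Re-encoding with cp1252 recovers the raw bytes; decoding those bytes as
# cp1250 then yields the intended Polish text.
def undo_mojibake(garbled, shown_as="cp1252", really="cp1250"):
    return garbled.encode(shown_as).decode(really)


# Two lines taken from the example data in the bug description.
print(undo_mojibake("Skoñczy³am swoj¹ sztukê."))
print(undo_mojibake("Widzia³aœ mamê?"))
```

If the round trip produces readable Polish, the file itself is probably fine and only the encoding passed to the subtitle renderer is wrong.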

Changed in elisa:
milestone: 0.5.x → none
Michał Sawicz (saviq) wrote :

That's an encoding issue. Most Polish subtitles are cp1250 (windoze encoding), because that's the default windoze encoding for text.

We could give translators the ability to provide (I'd use translations for that) a list of encodings to try in each locale. Then try each in order and fall back to UTF8.
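
The "try each in order and fall back to UTF8" idea could look roughly like the Python sketch below. The function name `decode_with_fallback` and the example list are hypothetical, and this is not the code that ended up in Elisa/Moovida.

```python
def decode_with_fallback(data, encodings):
    """Try each candidate encoding in order; fall back to UTF-8.

    Returns (text, encoding_used). `encodings` stands in for the
    per-locale list a translator would provide (hypothetical).
    """
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding or unknown codec name: try the next one.
            continue
    # Last resort: UTF-8, replacing undecodable bytes instead of failing.
    return data.decode("utf-8", errors="replace"), "utf-8"


raw = "Skończyłam swoją sztukę.".encode("cp1250")
text, used = decode_with_fallback(raw, ["ascii", "cp1250"])
print(used, text)
```

Order matters here: single-byte encodings such as iso-8859-2 accept every byte, so the first one in the list will "succeed" even when it is the wrong one. Stricter encodings should go first.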

Changed in elisa:
status: Incomplete → Confirmed
Olivier Tilloy (osomon) wrote :

Elisa 0.5.x series is not maintained any longer.
The bug is valid in Moovida though.

Rafal Zawadzki (bluszcz) wrote :

Yesterday I did some research on the biggest Polish subtitle sites, and it looks like 100% of Polish subtitles use cp1250.

So, please take a look at the attached patch: it checks your locale and, for pl_PL, sets the proper subtitle encoding.

Michał Sawicz (saviq) wrote :

I have some comments:
- the patch is somehow broken (garbage at the beginning of each line)
- use the i18n infrastructure to get encoding valid for a locale
- you need a list to try different encodings, in case they're in ISO-8859-2 or UTF-8, not CP1250 (all three valid for Polish subtitles)

There's one big problem, though. That will only work if you want to watch with subtitles for your locale. What if you want to watch with Russian subtitles?

I think we should try and use the chardet module (http://chardet.feedparser.org/) - that would solve these problems in most situations.

Olivier Tilloy (osomon) wrote :

chardet looks very interesting indeed. And it's packaged in Ubuntu (>= hardy).

Michał Sawicz (saviq) wrote :

I'll pick this one up.

Changed in elisa:
assignee: nobody → Michał Sawicz (saviq)
Rafal Zawadzki (bluszcz) wrote :

Sorry for "colorful garbage". Chardet + subtitle-encoding works perfectly :)

Michał Sawicz (saviq) wrote :

Well, it doesn't work well enough :|

It detects cp1250 as iso-8859-2 which is about half compatible :/

It does report 0.75 confidence, but I haven't yet found a way to make it try harder.

Michał Sawicz (saviq) wrote :

What's more, chardet is based on Mozilla's character detection, which does the same thing - it thinks cp1250 is latin2 :/

Rafal Zawadzki (bluszcz) wrote :

Michał, can you email me the problematic subtitles? In my two test cases it detects 'windows-1250'.

Michał Sawicz (saviq) wrote :

Well, I haven't found any that get detected as windows-1250... Try the one originally attached to this bug and I'll upload two more.

Michał Sawicz (saviq) wrote :
Michał Sawicz (saviq) wrote :
Rafal Zawadzki (bluszcz) wrote :

Confirmed. I am curious whether this is the only case where it fails. If so, it could be fixed with a simple monkeypatch.

Michał Sawicz (saviq) wrote :

Well, the main problem is that there's really no certain way to distinguish cp1250 from iso-8859-2... the same bytes are valid in both encodings, they just represent different characters.

On the other hand, cp1250 characters converted based on latin2 result in these characters:

šę󹜳żŸćń

while converting latin2 based on cp1250 results in:

±ę󱶳żĽćń

So encountering any of characters "ššœŸ" would mean it's cp1250, while encountering any of "±±¶Ľ" would mean it's latin2. Finding any of these characters should decrease the confidence of such a conversion.

I already contacted the author of chardet, hopefully we get it working.

Michał Sawicz (saviq) wrote :

OK, just to clarify, if the input string (valid in both cp1250 and latin2) is:

ęóąśłżźćńĘÓĄŚŁŻŹĆŃ

Then 'cross-conversion' of cp1250 data using the latin2 mapping would output:

ę󹜳żŸćńĘÓĽŒŁŻĆŃ

And 'cross-conversion' of latin2 using cp1250 mapping would output:

ę󱶳żĽćńĘÓˇ¦ŁŻ¬ĆŃ

So characters decreasing confidence of latin2 would be:

šœŸĽŒ

And those decreasing confidence of cp1250 would be:

±¶Ľˇ¦¬

While in both cases these characters would increase confidence:

ąśźĄŚŹ
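
The idea above can be sketched in Python at the byte level: the six letters ąśźĄŚŹ are exactly the Polish letters whose byte values differ between cp1250 and ISO-8859-2, so counting which set of bytes shows up gives a hint. This is an illustrative heuristic only, not the fix proposed to chardet; the name `guess_polish_encoding` is made up.

```python
# Bytes that encode the Polish letters ą ś ź Ą Ś Ź in each encoding.
# All other Polish letters share the same byte in both encodings.
CP1250_HINTS = frozenset("ąśźĄŚŹ".encode("cp1250"))
LATIN2_HINTS = frozenset("ąśźĄŚŹ".encode("iso-8859-2"))


def guess_polish_encoding(data):
    """Return 'cp1250', 'iso-8859-2', or None when there is no evidence.

    Assumes the text is Polish; other Central European languages use
    some of these bytes for their own letters.
    """
    cp1250_score = sum(1 for byte in data if byte in CP1250_HINTS)
    latin2_score = sum(1 for byte in data if byte in LATIN2_HINTS)
    if cp1250_score > latin2_score:
        return "cp1250"
    if latin2_score > cp1250_score:
        return "iso-8859-2"
    return None


sample = "Widziałaś mamę?"
print(guess_polish_encoding(sample.encode("cp1250")))
print(guess_polish_encoding(sample.encode("iso-8859-2")))
```

A file with none of these six letters gives no evidence either way, which is harmless: such a file decodes to the same Polish text under both encodings.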

Olivier Tilloy (osomon) wrote :

chardet looks like the way to go anyway. And if we can fix it along the way, even better!

Michał Sawicz (saviq) wrote :

I have a fix ready, but can't push it onto the merges-list, google rejects my e-mails for some reason.

Changed in elisa:
status: Confirmed → In Progress
Michał Sawicz (saviq) wrote :

OK, attaching the bundle here as the merges-list does not like me. Someone please push it to BB.

Michał Sawicz (saviq) wrote :

Forgot to paste comments for reviewer:

This bundle implements three ways of subtitle encoding detection, in
order:
a) user-defined list of encodings
b) chardet for automatic encoding detection
c) i18n-defined list of encodings

The user can define a list of encodings to try in the config file; the first
encoding that successfully loads the file will be used.

On initial install a) won't be used because the default encoding list is
empty. Automatic detection is done by python-chardet module [1].
Currently chardet is used in try ... except blocks, so it's not a hard
dependency, although it's much encouraged.

If a) is not used or fails and b) fails or is less confident of its
findings than 0.9 on a [0, 1] scale, c) is tried - a list of encodings
defined by the translator for the current locale is used just as in a). If
this fails, the encoding detected in b) is used.
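
For readers without the bundle at hand, the a) / b) / c) order described above might be sketched as follows. This is not the bundle's actual code: the function names and parameters are hypothetical, "loads the file" is approximated here by "decodes without error", and the `detect` parameter exists only so the logic can be exercised without chardet installed.

```python
try:
    import chardet  # optional dependency, as described above
except ImportError:
    chardet = None


def first_decodable(data, encodings):
    """Return the first encoding that decodes `data` cleanly, else None."""
    for enc in encodings:
        try:
            data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            continue
        return enc
    return None


def pick_subtitle_encoding(data, user_encodings=(), locale_encodings=(),
                           min_confidence=0.9, detect=None):
    # a) user-defined list from the config file
    enc = first_decodable(data, user_encodings)
    if enc:
        return enc
    # b) automatic detection, accepted only when confident enough
    if detect is None and chardet is not None:
        detect = chardet.detect
    guess = None
    if detect is not None:
        result = detect(data)
        guess = result.get("encoding")
        if guess and result.get("confidence", 0.0) >= min_confidence:
            return guess
    # c) translator-defined list for the current locale
    enc = first_decodable(data, locale_encodings)
    if enc:
        return enc
    # Fall back to the low-confidence guess from b); None means
    # "leave the player's default encoding alone".
    return guess
```

Returning None rather than raising keeps the behaviour described for a fresh install: with no lists configured and no chardet available, subtitles are simply loaded with the default encoding.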

The careful reviewer will see that this bundle does not introduce any
regressions - all subtitles are loaded as usual. The routines
implemented in this bundle can be tested as follows:

* on initial install without user- or i18n- defined encoding and no
python-chardet installed, subtitles will be loaded with default
gstreamer locale. Then config-file support should be tried, both with
correct list of encodings and one that will fail (e.g. ['ascii']). In
both cases the subtitles will load, but will be displayed correctly only
in the first case;
* after updating the translation template (setup.py pot_update) file and
catalog file for your preferred locale (setup.py update_catalog) and
subsequent build of the catalogs (setup.py build_po), the two previous
tests for user-defined encodings should be repeated. In this case the
failing example should be corrected by i18n support as long as the right
encoding was added in the language catalog;
* it's now time to install python-chardet (packaged for most major
distros) and run your tests again. Empty the config encodings list and
remove the compiled catalog files (*.mo) and the encoding should still
be detected properly and the subtitles displayed correctly. The remaining
issues are mostly with differentiating WINDOWS-1250 from ISO-8859-2 (Central
European).

IMPORTANT:
Applying this bundle should be followed by adding python-chardet [1] to
the windows build.

[1] http://chardet.feedparser.org/

Cheers

Michał Sawicz (saviq) wrote :

Rafał, could you maybe try my fixes? Anyone else care to try this with subtitles in other languages?

Contact me on IRC (<email address hidden>/#elisa), by e-mail or by JID (<email address hidden>) if you need any assistance.

Michał Sawicz (saviq) wrote :
Michał Sawicz (saviq) wrote :

OK, we got the bundle through, a fix is awaiting review at https://www.moovida.com/quality/review/request/%<email address hidden>%3E

tags: added: patch-available
Olivier Tilloy (osomon) wrote :

@Michał: we're using (or at least trying to use) the patch-available tag to mark a bug that has a patch attached but that no Moovida developer has had time to look into yet. In the case of this bug, the "In Progress" status, the fact that it's explicitly assigned to you and the link to the merge request are enough information.

tags: removed: patch-available
Michał Sawicz (saviq)
Changed in moovida:
assignee: Michał Sawicz (saviq) → nobody
Changed in elisa:
assignee: nobody → Michał Sawicz (saviq)
status: New → In Progress
Michał Sawicz (saviq)
Changed in elisa:
assignee: Michał Sawicz (saviq) → nobody
dino99 (9d9) wrote :

The latest free Moovida release, 1.09, has not been maintained for a while. moovidadb.com now supports Linux, and support can be found at: http://www.fluendo.com/faq/

Changed in moovida:
status: In Progress → Invalid
Changed in elisa:
status: In Progress → Invalid