Localised Ubuntu start pages (8.10) have corrupted UTF-8 text

Bug #290494 reported by David Planella on 2008-10-28
This bug affects 8 people
Affects: Ubuntu Website - OBSOLETE | Importance: High | Assigned to: Matthew Nuzum
Affects: ubuntu-docs (Ubuntu) | Importance: Undecided | Assigned to: Unassigned

Bug Description

The generator scripts for the Ubuntu 8.10 start page corrupt any text that uses non-ASCII characters.
To see the page for your language, visit
http://start.ubuntu.com/8.10/index.html.LL
where LL is the language code (for example, 'el', 'ru' and so on).

Although the PO files that the translators produced have the UTF-8 encoding, the scripts that create the HTML pages mistakenly assume that the source encoding is not UTF-8 (but rather iso-8859-1).
This corrupts the text of the pages.
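
The mismatch described above can be reproduced in a few lines of Python. This is only a sketch of the general failure mode, not the actual generator scripts; the function name `misdecode_utf8_as_latin1` is made up for illustration.

```python
def misdecode_utf8_as_latin1(text):
    """Encode text as UTF-8, then (wrongly) decode the bytes as ISO-8859-1."""
    utf8_bytes = text.encode("utf-8")
    # Wrong assumption about the source encoding: each byte of a
    # multi-byte UTF-8 sequence becomes a separate ISO-8859-1 character.
    return utf8_bytes.decode("iso-8859-1")


title = "Pàgina inicial de l'Ubuntu"
corrupted = misdecode_utf8_as_latin1(title)

# repr() makes the two stray characters visible, including the
# non-breaking space that replaces the second byte of "à".
print(repr(corrupted))
```

The single character "à" turns into two characters, which matches the kind of unreadable text seen on the published pages.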

The solution is to find where in the scripts the text gets corrupted. As soon as the problem is fixed, the Start pages will appear properly when you visit the page again.

Old description -----------

Binary package hint: ubuntu-docs

The submitted Catalan translation of the browser start page correctly spelled the title of the start page in UTF-8 format [1]:

"Pàgina inicial de l'Ubuntu"

However, the released start page went through some kind of conversion that turned the "à" character into an unreadable one, so the title is not displayed correctly (see attached screenshot).

Note: I am reporting this against ubuntu-docs because the browser start page translation used to be here. With the last-minute changes to the browser start page I do not know where it resides anymore. Please reassign if necessary.

[1] https://lists.ubuntu.com/archives/ubuntu-translators/2008-October/001837.html

David Planella (dpm) wrote :

Please find attached the original translation as submitted to the ubuntu-translators list.

Artem Popov (artfwo) wrote :

Confirming: the title (and the contents) are broken for the Russian page as well.

Changed in ubuntu-docs:
status: New → Confirmed
Felipe Gil Castiñeira (xil) wrote :

The problem is in the "localize.sh" script. It seems that po2html does not handle the accents in the po file correctly. A workaround is to use HTML codes for special characters [1] in the .po file instead of the UTF-8 encoded characters (e.g. "P&aacute;xina" instead of "Páxina").

[1] http://webdesign.about.com/library/bl_htmlcodes.htm
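
The entity workaround could be automated roughly as follows. This is a sketch under assumptions: `to_html_entities` is a hypothetical helper, not part of localize.sh or po2html, and it would have to be applied only to the translated strings, not to the PO file syntax itself.

```python
from html.entities import codepoint2name


def to_html_entities(text):
    """Replace every non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        elif cp in codepoint2name:
            # Named entity where HTML 4 defines one, e.g. &aacute;
            out.append("&%s;" % codepoint2name[cp])
        else:
            # Numeric character reference for everything else.
            out.append("&#%d;" % cp)
    return "".join(out)


print(to_html_entities("Páxina"))
```

The result is pure ASCII, so it survives any later step that guesses the encoding wrongly, at the cost of making the PO files harder for translators to read.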

David Planella (dpm) wrote :

In our case the script generated a correct page, though. It is only the published page which does not seem to be encoded in UTF-8

Artem Popov (artfwo) wrote :

Maybe HTML-Tidy produces this output? I ran it locally, and it looks like it does not detect UTF-8 automatically and converts international characters into unreadable text...

Matthew East (mdke) wrote :

This page is an online page and is not part of the ubuntu-docs package.

Changed in ubuntu-docs:
status: Confirmed → Invalid
Changed in ubuntu-website:
assignee: nobody → newz
importance: Undecided → High
Dávid Gábor Bodor (drag0nfi) wrote :

Affecting the Hungarian startpage too, but I guess you already know.

Fumihito YOSHIDA (hito) wrote :

Affecting the Japanese startpage too.

David Henningsson (diwic) wrote :

Confirmed for the Swedish start page. It looks like text that is already UTF-8-encoded is being treated as some other encoding and converted to UTF-8 once more, as every character > 127 takes up four bytes (checked with a hex editor).
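
The four-byte observation can be checked with a short Python sketch. It assumes the intermediate encoding is ISO-8859-1, as the bug description suggests; `double_encode` and `repair_double_encoded` are illustrative names, not functions from any of the tools involved. The four-byte count applies to characters whose correct UTF-8 form is two bytes, such as "ö".

```python
def double_encode(text):
    """Encode to UTF-8, misread the bytes as ISO-8859-1, encode to UTF-8 again."""
    once = text.encode("utf-8")
    return once.decode("iso-8859-1").encode("utf-8")


def repair_double_encoded(data):
    """Undo one round of double encoding (assumes ISO-8859-1 was the wrong guess)."""
    return data.decode("utf-8").encode("iso-8859-1").decode("utf-8")


word = "Sök"
broken = double_encode(word)

# Hex dump of the broken bytes, comparable to what a hex editor shows.
print(broken.hex())
# Number of bytes now used by the single character "ö".
print(len(double_encode("ö")))
# The original text can be recovered as long as no bytes were lost.
print(repair_double_encoded(broken))
```

Reversing the process like this only repairs pages that are already broken; the underlying fix is still to tell the tool the correct input encoding.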

Matthew Nuzum (newz) wrote :

Working on a solution. Seems to be po2html causing the problem.

Changed in ubuntu-website:
status: New → Confirmed
vista killer (vistakiller) wrote :

Affecting the Greek startpage too

description: updated

Also the Spanish home page.

Gabor Kelemen (kelemeng) wrote :

Possible solution/workaround/whatever: https://lists.ubuntu.com/archives/ubuntu-translators/2008-October/001886.html
Could somebody confirm whether this is a viable way? No reply on the list yet :(. Is it just me, or is that problem really _that_ difficult?

I've just sent a possible fix to the ubuntu-translators mailing list. HTH

Here's the patch to apply to po2html.py

--- translate-toolkit-1.1.1/translate/convert/po2html.py.old 2008-11-05 17:18:17.000000000 +0100
+++ translate-toolkit-1.1.1/translate/convert/po2html.py 2008-11-05 17:18:50.000000000 +0100
@@ -81,7 +81,7 @@
                 htmlresult = htmlresult.replace(msgid, msgstr, 1)
         htmlresult = htmlresult.encode('utf-8')
         if tidy:
-            htmlresult = str(tidy.parseString(htmlresult))
+            htmlresult = str(tidy.parseString(htmlresult, **{'char_encoding': "utf8"}))
         return htmlresult

 def converthtml(inputfile, outputfile, templatefile, wrap=None, includefuzzy=False):

Felipe Gil Castiñeira (xil) wrote :

I can confirm that this patch works correctly (at least for the languages I speak).

Matthew Nuzum (newz) wrote :

A fix has been implemented, but a more long-term solution is needed. I will be working on this in the context of the ubuntu-website team; anyone interested in contributing to the solution is welcome and encouraged to join.

Changed in ubuntu-website:
status: Confirmed → Fix Released
Gabor Kelemen (kelemeng) wrote :

Strange, lots of languages (ja, ka, cs, bn) look fine now, but Hungarian and Russian do not, see: http://start.ubuntu.com/8.10/index.html.ru, http://start.ubuntu.com/8.10/index.html.hu.

Daniel Nylander (yeager) wrote :

Confirmed for Swedish.
Still waiting for the Search string fix (which is "Sök" in Swedish)

Gabor Kelemen (kelemeng) wrote :

Forget my previous comment, they are fine. Perhaps it was just my browser cache :(.
