A program for proofreading po and podiff files.

Cannot use a file that is not UTF-8 encoded

Reported by Byrial Jensen on 2011-12-23
This bug affects 1 person
Affects: PoProofRead
Importance: High
Assigned to: TLE

Bug Description

I tried to open a file which uses latin1 (ISO-8859-1) encoding. It gave this warning message to the console:

/usr/lib/pymodules/python2.7/poproofread/poproofread_gtk.py:261: GtkWarning: gtk_text_buffer_emit_insert: assertion `g_utf8_validate (text, len, NULL)' failed
  textbuffer.insert(startiter, text)

I can go forward and back in the file with the PageUp and PageDown keys, but the diff and comment windows are all empty, except for the last one:

=============================================================================
 Number of messages: 3
 =============================================================================

which happens to be the only chunk with only ASCII chars.

Ask Hjorth Larsen (askhl) wrote :

Maybe the best way to fix this is to always use Python unicode objects internally in poproofread, since it has to be handled correctly by gtk. The decode() method in gtparse can be used to do this easily (if it works right now; otherwise we should fix it first).
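
A minimal sketch of the "unicode internally" idea, written for Python 3 (where `str` is the unicode type; under the Python 2.7 from the report the decoded value would be a `unicode` object). The helper name `to_text` and the sample word are made up for illustration and are not taken from poproofread or gtparse:

```python
def to_text(raw, encoding="utf-8"):
    """Decode raw bytes read from a file into a text (unicode) string."""
    return raw.decode(encoding)


# A Danish word with non-ASCII letters, stored as ISO-8859-1 bytes.
latin1_bytes = "bl\u00e5b\u00e6rgr\u00f8d".encode("iso-8859-1")

# These bytes are not valid UTF-8, which is the kind of input that
# makes a UTF-8-only consumer such as a GTK text buffer reject the text.
try:
    latin1_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Decode once at the boundary, work with text internally, and encode
# to UTF-8 only when a consumer needs bytes.
text = to_text(latin1_bytes, "iso-8859-1")
utf8_bytes = text.encode("utf-8")
```

The point of decoding at the boundary is that everything after the parser deals with one text type, no matter which encoding the file used.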

TLE (k-nielsen81) on 2011-12-23
Changed in poproofread:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → TLE (k-nielsen81)
milestone: none → 0.1.8

TLE (k-nielsen81) wrote :

Hello Byrial

Thanks for reporting this. Character encoding was one of those things that I had deliberately not done yet because it is tricky and not much fun :| But it definitely needs to be done, so now is as good a time as any.

@Ask: I agree that the best way to handle this is to go all Python unicode internally. So we'll decode at parse time and possibly encode back at export time. As for how to determine the character encoding, I'll give that a little more thought. I would like it to remain independent of pyg3t for essential functions, so my initial thought is to:
1: Look for the magic character encoding words from the po-files in the first chunk.
2: If that fails, fall back to a character encoding guessing lib (I think I have read that one exists).

In both cases, use the parser's trick of reading the file and then re-reading it with the correct encoding.

But actually, since we have just determined that podiffs will always contain a header and that the program is designed to work on podiffs and po-files, step 1 really should cover it.
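
Step 1 could be sketched roughly as below, in Python 3. This is an illustration under assumptions, not the code that went into poproofread: the names `detect_charset` and `decode_po` are hypothetical, the search covers the whole file instead of only the first chunk, and the guessing fallback from step 2 is replaced by a plain UTF-8 default.

```python
import codecs
import re

# Matches the charset declaration in a PO header line such as
# "Content-Type: text/plain; charset=ISO-8859-1\n"
CHARSET_RE = re.compile(rb"charset=([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def detect_charset(raw, default="utf-8"):
    """Return the charset named in the header of raw, or default."""
    match = CHARSET_RE.search(raw)
    if match is None:
        return default
    name = match.group(1).decode("ascii")
    try:
        codecs.lookup(name)  # reject names Python has no codec for
    except LookupError:
        return default
    return name


def decode_po(raw):
    """Read the bytes once to find the charset, then decode with it."""
    charset = detect_charset(raw)
    return raw.decode(charset), charset


# Hypothetical sample: a tiny PO file stored as ISO-8859-1.
sample = (
    'msgid ""\n'
    'msgstr ""\n'
    '"Content-Type: text/plain; charset=ISO-8859-1\\n"\n'
    '\n'
    'msgid "Save changes"\n'
    'msgstr "Gem \u00e6ndringer"\n'
).encode("iso-8859-1")

text, charset = decode_po(sample)
```

Searching the raw bytes works because the header line itself is plain ASCII in the encodings one is likely to meet, so it can be found before the real encoding is known. A placeholder such as `charset=CHARSET` from an unfilled template fails the codec lookup and falls through to the default.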

Regards Kenneth

TLE (k-nielsen81) wrote :

Byrial, can you provide a test case file for this (preferably a podiff)? It has been some time since I have encountered a file in an encoding other than UTF-8.

Byrial Jensen (byrial-t) wrote :

I attach a diff file produced by podiff with ISO8859-1 encoding.

TLE (k-nielsen81) wrote :

Thanks.

TLE (k-nielsen81) wrote :

Note to self. Missing:
Handle char set warnings on save in poproofread_gtk and uncomment return statements in __detect_character_encoding

Changed in poproofread:
status: Confirmed → In Progress

TLE (k-nielsen81) wrote :

Note to self. Missing:
Trim down the codec list with invalid codecs, add comment about trying to save in the dialog and test.

TLE (k-nielsen81) on 2012-04-28
Changed in poproofread:
status: In Progress → Fix Committed

TLE (k-nielsen81) wrote :

Fixed with revision 92
