utf8 initial support

Bug #262660 reported by Alex Samorukov
Affects: Cuneiform for Linux
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

This patch adds UTF-8 support to the engine. Currently, I have enabled UTF-8 support only for the TXT and HTML formats, because the RTF format requires some additional work (and it also stores codepage information internally, so I don't think we really need this).
Description of the patch:
1) It defines ROUT_CODE_UTF8 and PUMA_CODE_UTF8
2) It enables UTF-8 support for the HTML and TXT output formats
3) It adds an int getUTF8Char() function, which currently depends on iconv(). This is the only place that calls iconv(). It should not be a problem to rewrite it without iconv(), because the only source code pages used are:
windows-1250
windows-1251
windows-1254
windows-1257
4) It modifies the OneChar function to recode its output when ROUT_CODE_UTF8 is selected
5) It defines PUMA_CODE_UTF8 in the CLI

I tested with different languages and files and saw no regressions from this patch. It is also 100% backward compatible, because the API is unchanged.

Revision history for this message
Alex Samorukov (samm-os2) wrote :
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

Good work. It makes life easy.

Jussi Pakkanen (jpakkane) wrote :

Looking good. Depending on how busy I am, this might only be polished and integrated after release 0.4. But then it will have a very high priority.

Changed in cuneiform-linux:
status: New → Confirmed
Jussi Pakkanen (jpakkane) wrote :

I created conversion tables from these codepages to UTF-8. They are attached.

Basically when you index

win1250_to_utf8[8_bit_char_code]

you get a null-terminated string with the corresponding UTF-8 character. Byte values that have no defined character in the code page are given the Unicode replacement character �.
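
A minimal sketch of this kind of lookup is shown here. It is not the attached table: `demo1251_to_utf8` is a hypothetical, mostly empty stand-in that fills in only three windows-1251 entries, and `demo_lookup` is an invented helper name.

```c
/* Hypothetical miniature of a code page table: one null-terminated
 * UTF-8 string per 8-bit code. Only three entries are filled in here;
 * a real table would cover all 256 codes. */
static const char demo1251_to_utf8[256][4] = {
    [0x41] = "A",             /* ASCII 'A' is unchanged */
    [0x98] = "\xEF\xBF\xBD",  /* 0x98 is undefined in windows-1251: U+FFFD */
    [0xE0] = "\xD0\xB0",      /* Cyrillic small letter a, U+0430 */
};

/* Returns the UTF-8 string for one windows-1251 code.
 * Taking unsigned char keeps the index in the range 0..255. */
static const char *demo_lookup(unsigned char code) {
    return demo1251_to_utf8[code];
}
```

Because each entry is a complete null-terminated string, the caller can write it to the output with an ordinary string function and never needs to know how many bytes the UTF-8 sequence has.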

Jussi Pakkanen (jpakkane) wrote :

Here is an updated patch which does not use Iconv. Please check if it works for you.

A few comments on the things I fixed.

Global variables should not be used in newly written, self-contained code. In this case it means getting rid of the GetCodePage function call and passing code page as a parameter.

Removed malloc/free, because it is way too easy to forget the free. :)

Don't pass the character to be converted as a pointer. Only pass pointers if you intend to change the value pointed to or are dealing with structs or classes.
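
The three points above can be illustrated with a small sketch. The function name `to_utf8_sketch`, the numeric code page values, and the two one-entry tables are hypothetical stand-ins, not the code from the patch.

```c
/* Hypothetical one-entry tables standing in for full 256-entry ones. */
static const char sketch1250[256][4] = { [0xE4] = "\xC3\xA4" }; /* U+00E4 */
static const char sketch1251[256][4] = { [0xE4] = "\xD0\xB4" }; /* U+0434 */

/* The character is passed by value and the code page is an explicit
 * parameter, so no global state is consulted. The result points into
 * static storage, so there is nothing to malloc or free. Returns a
 * null pointer for a code page this sketch does not know. */
static const char *to_utf8_sketch(unsigned char ch, int codepage) {
    switch (codepage) {
    case 1250: return sketch1250[ch];
    case 1251: return sketch1251[ch];
    default:   return 0;
    }
}
```

A caller would write something like `to_utf8_sketch(byte, 1251)` and use the returned string directly, without having to release it afterwards.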

Alex Samorukov (samm-os2) wrote :

Sorry for the long delay. I have been ill for the last few days; I will try to test everything now.

Alex Samorukov (samm-os2) wrote :

Thanks for adapting the initial version of the patch to the project. I tested it. There is only one minor error which broke everything, but it is VERY easy to fix. Please replace all references to "char" in getUTF8Str with "unsigned char". Without this, it tries to index the table with negative values and, of course, produces something strange on output. After this change everything seems to work.
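
The effect of the signedness can be seen in a short sketch. The two helper names are invented for illustration, and the negative result assumes a typical two's complement platform where an out-of-range value converts to signed char by wrapping.

```c
/* Table index as computed from a signed 8-bit value: for codes of
 * 0x80 and above this is negative, so it would point before the
 * start of a 256-entry table. */
static int index_from_signed(signed char c) {
    return c;
}

/* Table index as computed after the fix: always in 0..255. */
static int index_from_unsigned(unsigned char c) {
    return c;
}
```

For the byte 0xE0, the first helper yields -32 on common compilers while the second yields 224, which is why only codes below 0x80 looked correct before the change.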

Jussi Pakkanen (jpakkane) wrote :

Here is an updated version of the patch. It has your fix suggestion and it also adds the proper character set header to the HTML file. It assumes that HTML output is always UTF-8.

Alex Samorukov (samm-os2) wrote :

Yes, this patch works fine. I think it is a bad idea to always assume that HTML output is in Unicode. I made a corrected version of html.cpp which also addresses compatibility with the W3C HTML standard.
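
One way to avoid hard-coding the charset is sketched here. This is not the attached html.cpp: the enum, its values, and the function name are hypothetical, since the real ROUT_CODE_* constants and the HTML writer are not shown in this thread.

```c
#include <stdio.h>

/* Hypothetical output-code values for this sketch only. */
enum sketch_code { SKETCH_CODE_ANSI, SKETCH_CODE_UTF8 };

/* Builds the charset declaration from the selected output code and
 * code page instead of always emitting UTF-8. Returns whatever
 * snprintf returns. */
static int sketch_meta_charset(char *buf, size_t size,
                               enum sketch_code code, int codepage) {
    if (code == SKETCH_CODE_UTF8)
        return snprintf(buf, size,
            "<meta http-equiv=\"Content-Type\" "
            "content=\"text/html; charset=utf-8\">");
    return snprintf(buf, size,
        "<meta http-equiv=\"Content-Type\" "
        "content=\"text/html; charset=windows-%d\">", codepage);
}
```

With this shape, the same writer can serve both the recoded UTF-8 output and the original 8-bit output, as long as the code page passed in matches the bytes actually written.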

Jussi Pakkanen (jpakkane) wrote :

This is not as simple as it seems. If the parameter is "unsigned char", then Russian works but German special characters such as ä do not. If it is "char", then German works but Russian does not.

Jussi Pakkanen (jpakkane) wrote :

It seems that GetCodepage returns 1251 (a Russian encoding) because of this line:

return cp_ansi[gLanguage];

And according to the cp_ansi table, German is encoded with windows-1251 even though in reality it seems to use windows-1250 (or maybe even windows-1252?).
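
The consequence of a wrong table entry can be sketched as follows. The language ids, both tables, and the helper are hypothetical (the real gLanguage values and cp_ansi contents are not shown here), and 1252 for German is only one of the guesses mentioned above.

```c
/* Hypothetical language ids for this sketch only. */
enum sketch_lang { SK_RUSSIAN, SK_GERMAN, SK_LANG_COUNT };

/* As described above: German ends up on code page 1251. */
static const int sk_cp_reported[SK_LANG_COUNT] = { 1251, 1251 };
/* One possible correction; 1252 for German is a guess. */
static const int sk_cp_guess[SK_LANG_COUNT] = { 1251, 1252 };

/* Maps a byte in 0xC0..0xFF to a Unicode code point. In windows-1251
 * this range holds U+0410..U+044F in order; in windows-1252 the code
 * point equals the byte value. Anything else yields U+FFFD here. */
static unsigned sk_codepoint(unsigned char b, int codepage) {
    if (b < 0xC0) return 0xFFFDu;  /* outside the range this sketch covers */
    if (codepage == 1251) return 0x0410u + (b - 0xC0u);
    if (codepage == 1252) return b;
    return 0xFFFDu;
}
```

The byte 0xE4 decodes to U+0434 (д) with `sk_cp_reported[SK_GERMAN]` but to U+00E4 (ä) with `sk_cp_guess[SK_GERMAN]`, so a German umlaut turns into a Cyrillic letter when the language is mapped to the wrong code page.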

Jussi Pakkanen (jpakkane) wrote :

Attached is a patch that seems to work for me on Russian and German. However, I'm a bit reluctant to apply it, since GetCodepage is called from several locations and I haven't gone through them all (maybe it breaks RTF or something similar). Also, I'd really like to know what the true internal codepages are rather than just guessing wildly.

Jussi Pakkanen (jpakkane) wrote :

A different table with effectively the same data can be found in fon/src/dist_bou.c. It has pretty much the same codepages that I had in my patch. Dist_bou calls the western codepage CSTR_ANSI_CHARSET. I'm guessing that this means Windows-1252 (which is very similar to Windows-1250).

Alex Samorukov (samm-os2) wrote :

I will dig into this bug today. My tests with the iconv() implementation were OK with all supported languages, including German, of course.

Jussi Pakkanen (jpakkane) wrote :

Since there have been no issues, I merged this.

Changed in cuneiform-linux:
status: Confirmed → Fix Committed
Changed in cuneiform-linux:
status: Fix Committed → Fix Released