RTF charset ansicpg0 handling

Bug #1163572 reported by Ladislav Lenčucha
This bug affects 1 person
Affects: calibre
Status: Fix Released
Importance: Undecided
Assigned to: sengian

Bug Description

Some RTF editors do not specify a valid charset. They use charset 0:
{\rtf1\ansi\ansicpg1250\uc1\deff0\deflang0\deflangfe0

This is correctly handled by plenty of RTF editors, e.g. MS Word, Wordpad, Total Command Preview etc.
I suspect the charset 0 is handled as "use the system default" or "use the charset corresponding to language" (deflang).

It would be nice either to:
- be able to override the codepage for RTF files with an ansicpg0 header (overriding the charset is currently not possible for RTF documents; the "input character encoding" value is ignored)
- use the system default charset instead of ansi (which in my case replaces accented characters with unaccented versions)

Word adds the charset (ansicpg with correct charset) when the document is re-saved, Wordpad removes it completely. Nevertheless, it would be nice to be able to solve this within Calibre conversion.

Thanks!

Calibre version: Windows, 32-bit, 0.9.25
Sorry, I can't attach the document here and I am not able to create a one-page sample with the problem described. Maybe email will do?
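
For illustration only (not part of the original report): a minimal Python sketch of why the assumed code page matters. It assumes the text was saved as cp1250, as the corrected header later in this thread suggests, and that the converter falls back to cp1252. The byte values are chosen for the example; the substitution the reporter actually observed may differ.

```python
# Two single-byte characters, as they might appear in a cp1250-encoded RTF file.
raw = bytes([0xE8, 0xF8])

# Decoded with the code page the author intended (Central European).
as_cp1250 = raw.decode("cp1250")

# Decoded with the fallback code page (Western European).
as_cp1252 = raw.decode("cp1252")

print(as_cp1250)  # Central European letters
print(as_cp1252)  # different, Western European letters
```

The bytes are identical in both cases; only the assumed code page changes which characters come out.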

Tags: rtf-input
Kovid Goyal (kovid) wrote : Re: calibre bug 1163572

Changing the component for this bug.

 assignee sengian
 tag rtf-input
 status triaged

Kovid Goyal (kovid) wrote :

I suggest you simply save your RTF as HTML using Word or whatever and
convert that, it will give you much better results.

Changed in calibre:
assignee: nobody → sengian (sengian)
status: New → Triaged
Ladislav Lenčucha (lacike) wrote :

Hmm,

I accidentally mentioned the correct header:
{\rtf1\ansi\ansicpg1250\uc1\deff0\deflang0\deflangfe0

The header I meant looks like:
{\rtf1\ansi\ansicpg0\uc1\deff0\deflang0\deflangfe0

In this case the charset presence flag should be cleared and the input source encoding should be used. If I wanted to convert to HTML, I wouldn't have to use Calibre at all.

Ladislav Lenčucha (lacike) wrote :

Ok,

so I checked the Calibre sources and there are the following lines that bother me:
1. calibre\ebooks\rtf2xml\default_encoding.py:
line 134, if not int(cp) - this returns false if cp is not 0
2. calibre\ebooks\rtf2xml\get_char_map.py:
line 37, if map == 'ansicpg0' - this processes the input in case of ansicpg0 as if it was cp1250
3. calibre\ebooks\metadata\rtf.py
line 63, if num == '0' - this processes the input in case of ansicpg0 as if it was cp1250

(1) doesn't make sense: this would mean that everything except 0 is ignored as a codepage, and I am not even speaking about what would happen if ansicpg is not found. Despite this, it seems the codepage is resolved correctly (what the h...)
(2), (3) these conditions do exactly what I need, but it does not seem to be working (cp1252 is used)

Did I check the wrong directory, or is there something I don't get?

Kovid Goyal (kovid) wrote :

I don't maintain RTF Input, so you will either have to figure it out
yourself, or wait for sengian to chime in (which might be a long wait,
since he hasn't been active of late).

sengian (sengian) wrote :

> (1) doesn't make sense, this would mean that everything except 0 is ignored as codepage and I am not speaking about what would happen if ansicpg is not found. Despite this, it seems the codepage is resolved correctly (what the h...)

Wrong. It just means that the default defined in init is cp1252; if \ansicpg is different from 0, the encoding is overwritten.
If it is 0, then this is not valid RTF, since 0 is not a correct value per RTF specification 1.9.1, so the value is ignored and the default encoding is used.
What could be discussed is that, in this case and under Windows, there is no check of the (\ansi | \mac | \pc | \pca) charset. But that is usually meaningless, since the true charset is defined by \ansicpg.
> (2), (3) these conditions do exactly what I need, but it does not seem to be working (cp1252 is used)

So this is working as it should.
What could be done is to put an option in rtfinput options allowing the user to choose a default encoding for the case where no encoding is defined, possibly with an overwrite function.
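
The behaviour described in this comment can be sketched as follows. This is a minimal illustration, not calibre's actual code: the function name `codepage_from_header` and the regular expression are assumptions made for the example; only the cp1252 default and the treatment of 0 as "ignore and use the default" come from the comment.

```python
import re

DEFAULT_ENCODING = "cp1252"  # default named in this thread


def codepage_from_header(header, default=DEFAULT_ENCODING):
    """Pick an encoding name from the ansicpg control word, or fall back."""
    # Look for a backslash followed by 'ansicpg' and a decimal number.
    match = re.search(r"\\ansicpg(\d+)", header)
    if match is None:
        # No ansicpg control word at all: use the default.
        return default
    cp = int(match.group(1))
    if cp == 0:
        # 0 is not a valid code page: ignore it and use the default.
        return default
    # Assumption: the code page maps to a Python codec called 'cpNNNN'.
    # This holds for the common Windows code pages but not for every value.
    return "cp%d" % cp


print(codepage_from_header(r"{\rtf1\ansi\ansicpg1250\uc1\deff0\deflang0\deflangfe0"))
print(codepage_from_header(r"{\rtf1\ansi\ansicpg0\uc1\deff0\deflang0\deflangfe0"))
```

Making `default` a parameter is what would allow a user-chosen encoding to replace cp1252 in the ansicpg0 case.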

Ladislav Lenčucha (lacike) wrote :

Thanks sengian.

To (1). Let's take the following examples:
if not int(0): self.__code_page= cp
if not int(1250): self.__code_page = cp
When would the default cp1252 be overwritten? "not int(0)" evaluates to True, "not int(1250)" evaluates to False.

To (2), (3) - The default values in these 2 files are cp1250, not cp1252. When are these default values used then?

To your suggestion - well, yes, that would be very nice, but let us discuss (2) and (3) first.
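
The evaluation in (1) can be checked directly in Python. This is a small sketch: `condition_as_quoted` mirrors the `if not int(cp)` test cited above, while `condition_without_not` is a hypothetical variant shown only for contrast. Neither function is taken from the calibre sources.

```python
def condition_as_quoted(cp):
    # Mirrors 'if not int(cp)': true only when cp is 0.
    return not int(cp)


def condition_without_not(cp):
    # Hypothetical contrast: true for every non-zero code page.
    return bool(int(cp))


for cp in ("0", "1250", "1252"):
    print(cp, condition_as_quoted(cp), condition_without_not(cp))
```

With the condition as quoted, the assignment runs only for code page 0, which is the opposite of overwriting the default when a real code page is given.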

sengian (sengian) wrote :

> To (1). Let's take the following examples:
> if not int(0): self.__code_page= cp
> if not int(1250): self.__code_page = cp
> When would the default cp1252 be overwritten? "not int(0)" evaluates to True, "not int(1250)" evaluates to False.
>
> To (2), (3) - The default values in these 2 files are cp1250, not cp1252. When are these default values used then?

(1) This is also incorrect: the 'not' should be removed. But since this is only checked for a decision and is not used in the actual processing, it is not such a problem. It depends on when this function is first called, and I think we are lucky.
(2), (3) My mistake, I didn't read correctly. You are right, it should be 'cp1252'.
In fact, in (3) it shouldn't even be needed, except for badly created RTF files, as normally RTF should be an ANSI file.

I linked a first patch.

Kovid Goyal (kovid) wrote :

@sengian: Do you want your patch merged?

sengian (sengian) wrote : Re: [Bug 1163572] Re: RTF charset ansicpg0 handling

@Kovid: I will wait for an answer concerning the option. This bug is not
major anyway.
On 5 Apr 2013 05:00, "Kovid Goyal" <email address hidden> wrote:

> @sengian: Do you want your patch merged?

Ladislav Lenčucha (lacike) wrote :

@sengian: Who are you waiting for? If that's me I am all for it, let's add the option to override the codepage if ansicpg is not present. Or is there a discussion with somebody if it should be added?

Kovid Goyal (kovid) wrote :

I don't see the need for a separate RTF-input-specific option. Just use the existing input encoding option to override the detected encoding when it is specified.

sengian (sengian) wrote :

OK I will do this. Is there an example somewhere?
On 5 Apr 2013 08:30, "Kovid Goyal" <email address hidden> wrote:

> I don't see the need for a separate rtf input specific option, just use
> the existing input encoding option to override the detected encoding,
> when it is specified.

Kovid Goyal (kovid) wrote : Re: calibre bug 1163572

You just do something like this, in the convert() method of the RTF
Input plugin.

default_encoding = opts.input_encoding or 'cp1252'
... do the conversion to rtf ...
opts.input_encoding = 'utf-8'

The last line is needed because the output from the rtf plugin is in
utf-8, so the rest of the conversion pipeline should use utf-8 to decode
the html from the rtf input plugin.
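
A self-contained sketch of the precedence suggested above. The helper name `choose_rtf_encoding` and the use of `SimpleNamespace` as a stand-in for the conversion options object are assumptions made for this example; the conversion step itself is left elided, as in the snippet above.

```python
from types import SimpleNamespace


def choose_rtf_encoding(opts, fallback="cp1252"):
    # A user-supplied input encoding wins; otherwise use the fallback.
    return opts.input_encoding or fallback


# Stand-in for the options object passed to convert() (assumption).
opts = SimpleNamespace(input_encoding="cp1250")

default_encoding = choose_rtf_encoding(opts)
# ... the RTF conversion would run here, using default_encoding ...

# The converted output is utf-8, so later pipeline stages must decode it as such.
opts.input_encoding = "utf-8"

print(default_encoding, opts.input_encoding)
```

Reading the user's value before resetting `opts.input_encoding` matters: once it is set to utf-8, the original override is no longer available.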

sengian (sengian) wrote :

Fixed in my Git branch

Changed in calibre:
status: Triaged → Fix Committed
Kovid Goyal (kovid) wrote : Fixed in master

Fixed in branch master. The fix will be in the next release. calibre is usually released every Friday.

 status fixreleased

Changed in calibre:
status: Fix Committed → Fix Released