Comment 5 for bug 1010936

Colin Benner (yzhs) wrote :

The bug still exists in the latest master commit on Linux and, presumably, other platforms.

The problem is that rtf2xml (both the version distributed with Calibre and the latest upstream version) does not handle multi-byte encodings correctly. In particular, the problematic single and double quote characters in Michael's file are encoded using code page 932/Shift JIS as two-byte characters: 0x81 0x66 and 0x81 0x68. In the single-byte encoding used by rtf2xml, 0x81 is not assigned, so rtf2xml drops that byte and 0x66 and 0x68 are interpreted as ASCII letters f and h, respectively.
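The byte-level effect described above can be reproduced with a short Python sketch. This is an illustration only, not rtf2xml code: the helper names are made up, and cp1252 is an assumed stand-in for the single-byte encoding, since it also leaves 0x81 unassigned. Passing errors="ignore" imitates the unassigned byte being dropped.

```python
# Illustration only; none of this is taken from rtf2xml.
# The two byte pairs come from the description above.
SINGLE_QUOTE = b"\x81\x66"
DOUBLE_QUOTE = b"\x81\x68"


def decode_multibyte(raw: bytes) -> str:
    """Decode the bytes as code page 932 / Shift JIS (two bytes per quote)."""
    return raw.decode("cp932")


def decode_single_byte(raw: bytes) -> str:
    """Decode byte by byte with an assumed single-byte code page.

    Assumption: cp1252 stands in for the single-byte encoding; 0x81 is
    unassigned there, and errors="ignore" drops it instead of raising.
    """
    return raw.decode("cp1252", errors="ignore")


if __name__ == "__main__":
    for raw in (SINGLE_QUOTE, DOUBLE_QUOTE):
        print(raw.hex(), repr(decode_multibyte(raw)), repr(decode_single_byte(raw)))
```

Decoded as cp932, each pair yields a single quote character; decoded byte by byte, only the trailing ASCII letter survives.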

In case anyone else wants to convert a file exhibiting this problem, here is a workaround:
I successfully converted the RTF file after opening and re-saving it in LibreOffice. The re-saved file contains UTF-8 encoded quote characters, which Calibre handles properly. Another way to get such a UTF-8 encoded RTF document is to run "unrtf --rtf original.rtf > converted.rtf". (You might have to use a UTF-8 locale for this.)