Unrtf does not handle UTF-8 correctly. The version is rather old

Bug #290503 reported by Ganton
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
Gentoo Linux     Fix Released   Wishlist
unrtf (Debian)   Fix Released   Unknown
unrtf (Ubuntu)   New            Undecided    Unassigned

Bug Description

Binary package hint: unrtf

Unrtf, at least with the parameter "--text", does not handle accented characters correctly (it is not adapted to UTF-8).

http://bugs.gentoo.org/show_bug.cgi?id=96376 reports the same problem and says it is solved by updating to a modern Unrtf version (the one that we have in Ubuntu 8.10 is rather old).

Revision history for this message
In , World-root (world-root) wrote :

I've had a few RTF documents to text, and I noticed that unrtf outputs an exclamation mark instead of accents.

Here's a patch that makes it produce valid UTF-8 text for any ANSI RTF input file. Please test :-)
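The attachment itself is not reproduced on this page. As a rough illustration of the general idea only (not the attached patch), the sketch below encodes a single byte as UTF-8, assuming the input byte is ISO-8859-1. The function name `latin1_to_utf8` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from the attached patch.
 * Encodes one ISO-8859-1 byte as a NUL-terminated UTF-8 sequence.
 * `out` must have room for 3 bytes. Returns the number of bytes
 * written, not counting the terminating NUL. */
size_t latin1_to_utf8(unsigned char c, char *out)
{
    assert(out != NULL);

    if (c < 0x80) {
        /* ASCII is encoded as itself. */
        out[0] = (char)c;
        out[1] = '\0';
        return 1;
    }

    /* U+0080..U+00FF needs two bytes: 110xxxxx 10xxxxxx. */
    out[0] = (char)(0xC0 | (c >> 6));
    out[1] = (char)(0x80 | (c & 0x3F));
    out[2] = '\0';
    return 2;
}
```

For example, the byte 0xE9 (e with acute accent in ISO-8859-1) becomes the two bytes 0xC3 0xA9.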

Revision history for this message
In , World-root (world-root) wrote :

Created attachment 61385
Patch to output ANSI RTF characters correctly

Revision history for this message
In , World-root (world-root) wrote :

Created attachment 61386
Patch for the ebuild

Revision history for this message
In , Tove (tove) wrote :

Robin, do you want to take this bug?

Joël, did you send the patch to the upstream developers?

Revision history for this message
In , World-root (world-root) wrote :

No, not yet. Should I send it ?

(I suppose unrtf was written before a common encoding, UTF-8, was created. So now
that many people use UTF-8, I guess it's nice to put the extended characters to
good use)

Revision history for this message
In , Tove (tove) wrote :

Let's wait for robbat2's comment. He's travelling for the next 2 weeks.

Revision history for this message
In , Robin H. Johnson (robbat2) wrote :

please send this to upstream.
if they are unresponsive, then i'll just patch our ebuild, but i'd prefer it if
they took it first.

Revision history for this message
In , World-root (world-root) wrote :

Robin,

Thanks for your response ! I'm trying to do it.

Three remarks though:
- I've just found a newer version: http://ftp.gnu.org/gnu/unrtf/0.19.7/
- <email address hidden> does not work
- there is a patch (text_french.patch) in the 0.19.7 package, which is similar
to mine, but only handles a few accents. I'll try to contact its author.

I'll let you know when I get something !

Revision history for this message
In , Gentoo-bugger (gentoo-bugger) wrote :

Any news on this? I'm just trying the 3rd party kat ebuilds and they contain an ebuild with this patch. Would be cool if I needed one ebuild less in my overlay :)

Revision history for this message
In , Gentoo-bugger (gentoo-bugger) wrote :

I just saw that there's a new version 0.19.9 from last week, from the changelog:
| 0.19.4: added unicode support
| 0.19.5: removed defective PS support and non-free text files
| more unicode support
| improved symbol font support - no longer puts entities in latex output
| Bug#266020 concerning double slashes fixed
| Bug#269054 concerning Doctype fixed
| Bug#287038 security breach fixed
| (thanks to Joey Hess <email address hidden>)
| 0.19.6: fix some latex problems
| 0.19.7: updated FSF address
| 0.19.8: minor fixes
| 0.19.9: included verbose mode

So it might be fixed in that version...

Revision history for this message
In , World-root (world-root) wrote :

Hi,

Actually (before I made the patch) the authors did put an _unused_ "text_french.patch" file in unrtf 0.19.7 -- but their patch is incomplete (see comment #7).

I sent an email containing the information, as well as a link to this bugzilla page, to the upstream developers on 3rd July 2005:
TO: <email address hidden>, <email address hidden>
CC: <email address hidden>

I've had no response so far.

I haven't looked at (or tried) unrtf 0.19.9 -- could you have a quick look at the text.c file, to see what characters they added in the tables ?

Best Regards

Revision history for this message
In , Gentoo-bugger (gentoo-bugger) wrote :

unrtf has a project page at savannah, here [1]. There's both a bug and a patch tracker, maybe you'll have more luck there.

[1] http://savannah.gnu.org/projects/unrtf/

It seems like they added a few but not all characters, and in a different way from your solution:
mss@otherland ~/tmp $ diff -u unrtf-0.19.3/text.c unrtf_0.19.9/text.c
--- unrtf-0.19.3/text.c 2004-02-19 00:35:04.000000000 +0100
+++ unrtf_0.19.9/text.c 2006-01-06 22:56:06.000000000 +0100
@@ -1,7 +1,6 @@
-
 /*=============================================================================
    GNU UnRTF, a command-line program to convert RTF documents to other formats.
- Copyright (C) 2000,2001 Zachary Thayer Smith
+ Copyright (C) 2000,2001,2004 by Zachary Smith

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
@@ -15,20 +14,25 @@

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
- Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

- The author is reachable by electronic mail at <email address hidden>.
+ The maintainer is reachable by electronic mail at <email address hidden>
 =============================================================================*/

 /*----------------------------------------------------------------------
  * Module name: text
- * Author name: Zach Smith
+ * Author name: Zachary Smith
  * Create date: 19 Sep 01
  * Purpose: Plain text output module
  *----------------------------------------------------------------------
  * Changes:
  * 22 Sep 01, <email address hidden>: added function-level comment blocks
+ * 29 Mar 05, <email address hidden>: changes requested by ZT Smith
+ * 14 Jun 05, <email address hidden>: higher Iso-Latin-1 characters
+ * added - thanks to <email address hidden> and
+ * <email address hidden>
+ * 23 Jul 05, <email address hidden>: added endash, emdash and bullet
  *--------------------------------------------------------------------*/

@@ -59,22 +63,24 @@

 static char*
 upper_translation_table [128] = {
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
- "?", "?", "?", "?", "?", "?", "?", "?",
+/* 0 1 2 3 4 5 6 7 */
+/* 80 */ "?", "?", "?", "?", "?", "?", "?", "?",
+/* 88 */ "?", "?", "?", "?", "?", "?", "?", "?",
+...


Revision history for this message
In , World-root (world-root) wrote :

Ah, this new patch looks good :-)

It handles everything, excluding values 0x80..0x9F. That may be because this range of values is forbidden/reserved and cannot be found in ANSI RTF anyway (I have no idea what's the deal with these 0x80..0x9F values).

My only concern: filling the array in a C file with characters (instead of hex value) could be a bit dangerous, depending on the compiler's character set support (?)
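Two points above can be made concrete. In ISO-8859-1 the range 0x80..0x9F holds C1 control codes, but in Windows-1252 (the code page usually meant by "ANSI") most of those bytes are printable: curly quotes, dashes, the bullet, the euro sign. The sketch below assumes Windows-1252 input and covers only a handful of those bytes. It writes the UTF-8 output as hex escapes so the source file stays pure ASCII and does not depend on the compiler's character set. The table and function names are hypothetical and are not unrtf's.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table, not unrtf's: UTF-8 text for a few Windows-1252
 * bytes in 0x80..0x9F, indexed by (byte - 0x80). Entries that are not
 * listed stay NULL. Hex escapes keep this source file plain ASCII. */
static const char *const cp1252_high[32] = {
    [0x80 - 0x80] = "\xE2\x82\xAC", /* U+20AC euro sign */
    [0x85 - 0x80] = "\xE2\x80\xA6", /* U+2026 horizontal ellipsis */
    [0x91 - 0x80] = "\xE2\x80\x98", /* U+2018 left single quote */
    [0x92 - 0x80] = "\xE2\x80\x99", /* U+2019 right single quote */
    [0x93 - 0x80] = "\xE2\x80\x9C", /* U+201C left double quote */
    [0x94 - 0x80] = "\xE2\x80\x9D", /* U+201D right double quote */
    [0x95 - 0x80] = "\xE2\x80\xA2", /* U+2022 bullet */
    [0x96 - 0x80] = "\xE2\x80\x93", /* U+2013 en dash */
    [0x97 - 0x80] = "\xE2\x80\x94", /* U+2014 em dash */
};

/* Returns UTF-8 text for a byte in 0x80..0x9F, or "?" if the byte is
 * not covered by the partial table above. */
const char *cp1252_high_to_utf8(unsigned char c)
{
    assert(c >= 0x80 && c <= 0x9F);
    const char *s = cp1252_high[c - 0x80];
    return s != NULL ? s : "?";
}
```

A complete table would list every assigned Windows-1252 byte in that range; the "?" fallback here mirrors the placeholder used in the original `upper_translation_table`.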

Revision history for this message
In , Robin H. Johnson (robbat2) wrote :

I've just committed 0.19.9 to the tree, is the patch from this bug still needed?

Revision history for this message
In , World-root (world-root) wrote :

I've just tried the 0.19.9 version.

Indeed, the patch I posted is not needed anymore, *but* please note that unrtf will always output ISO-8859-1 text, regardless of the user's $LANG setting. Not very good for pure UTF-8 users IMHO.

Ideal workaround: unrtf should iconv() the whole text at runtime, so the output obeys the user's preferred encoding.
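A minimal sketch of what such runtime conversion could look like with the POSIX `iconv` API. This is not code from unrtf. The helper names `convert_text` and `user_charset` are hypothetical, and the sketch assumes the text to convert is a NUL-terminated string in a known source encoding.

```c
#include <assert.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert NUL-terminated `in` from charset
 * `from_cs` to charset `to_cs`, writing a NUL-terminated result into
 * `out` (capacity `outsz`). Returns 0 on success, -1 on any failure
 * (unknown charset, invalid input, or output buffer too small). */
int convert_text(const char *from_cs, const char *to_cs,
                 const char *in, char *out, size_t outsz)
{
    assert(in != NULL && out != NULL && outsz > 0);

    iconv_t cd = iconv_open(to_cs, from_cs);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;        /* iconv() takes a non-const pointer */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsz - 1;    /* reserve room for the final NUL */

    size_t rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    if (rc != (size_t)-1)
        rc = iconv(cd, NULL, NULL, &outp, &outleft); /* flush shift state */
    iconv_close(cd);

    if (rc == (size_t)-1)
        return -1;
    *outp = '\0';
    return 0;
}

/* Hypothetical helper: name of the charset of the user's locale,
 * e.g. "UTF-8". A real program would call setlocale() once at startup. */
const char *user_charset(void)
{
    setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

With these helpers, a call such as `convert_text("ISO-8859-1", user_charset(), text, buf, sizeof buf)` would pick the target encoding at run time for whichever user runs the program, rather than fixing it at build time.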

In the meantime, I suggest adding this as a first line in src_compile():

src_compile() {
    iconv -f ISO-8859-15 text.c >text.c.new && mv text.c.new text.c

This would detect the user's encoding at emerge time, which is better than ignoring it completely. With this line added, unrtf outputs proper UTF-8 text for me.

Since iconv is called without '-t' (target encoding) argument, it *should* convert to the user's preferred encoding. It works for UTF-8 -- can someone please test with an ISO-8859 $LANG/$LC_ALL ? I have userlocales and only UTF-8 locales built.

Thanks

Revision history for this message
In , Robin H. Johnson (robbat2) wrote :

I don't agree with using iconv like that.
My root user runs in a different $LANG than my regular user.
unrtf really must be made encoding-aware.

I'm going to close this for now, and I'd ask you to take it to upstream again. If you diff the old release with the new one, you'll see there is a new maintainer, and hopefully he can be more responsive.

Revision history for this message
In , World-root (world-root) wrote :

He's from Australia, right ?

Ok, e-mail is sent (including of course, a link to this page) :-)

When something happens I'll report it here.

Revision history for this message
AsstZD (eskaer-spamsink) wrote :

It's been two damned years and the included version still doesn't support Unicode! Believe it or not, but some people actually need working software...

Changed in unrtf (Debian):
status: Unknown → New
Revision history for this message
neuromancer (neuromancer) wrote :

Gentoo has fixed this bug. See gentoo bug tracker: http://bugs.gentoo.org/show_bug.cgi?id=96376
There is also a patch.

Changed in gentoo:
importance: Unknown → Wishlist
status: Unknown → Fix Released
Changed in unrtf (Debian):
status: New → Fix Released