Strange symbols instead '-' in smart text output

Bug #324256 reported by Polevoy Dmitry
This bug affects 4 people
Affects: Cuneiform for Linux
Status: New
Importance: High
Assigned to: Unassigned

Bug Description

Strange symbols appear at lines 1, 33 and 55 of the smarttext output (see attachment). It looks like a bad substitution (0xE2, 0x80, 0x94) for a dash.

I use 'smarttext' output format.

Tags: win32
Revision history for this message
Polevoy Dmitry (openocr-polevoy) wrote :
Revision history for this message
Polevoy Dmitry (openocr-polevoy) wrote :

This blocks quality control, which is done by comparing smarttext output across builds.

Changed in cuneiform-linux:
importance: Undecided → Critical
Revision history for this message
Jussi Pakkanen (jpakkane) wrote :

The text you attached has only 36 lines in total, so it can't have line 55. There are also no dashes on lines 1 and 33. Did you attach the correct image?

Changed in cuneiform-linux:
importance: Critical → High
Revision history for this message
Polevoy Dmitry (openocr-polevoy) wrote :
Revision history for this message
Yury V. Zaytsev (zyv) wrote :

On Ubuntu 8.04 I also get 0xE2 0x80 0x94 in all of these cases. What's wrong with this?
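For reference, the three bytes form a well-formed UTF-8 sequence. The following is a minimal Python check (an illustration only, not Cuneiform code) of what they decode to:

```python
import unicodedata

# The three bytes reported in this bug.
raw = bytes([0xE2, 0x80, 0x94])

# Strict UTF-8 decoding; this would raise UnicodeDecodeError on malformed input.
decoded = raw.decode("utf-8")

code_point = f"U+{ord(decoded):04X}"
name = unicodedata.name(decoded)
print(len(decoded), code_point, name)  # 1 U+2014 EM DASH
```

So the bytes are an em dash in UTF-8. A viewer or comparison tool that assumes a single-byte encoding will show three symbols instead.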

Revision history for this message
Polevoy Dmitry (openocr-polevoy) wrote : Re: [Bug 324256] Re: Strange symbols instead '-' in smart text output

These are not valid symbols for plain text files.


Revision history for this message
adriverhoef (a3) wrote :

I'm running Cuneiform 1.0 and, as others have already mentioned, I also see a dash translated into three characters: â€”
(that's U+00E2, U+20AC, U+201D).
This happens not only with the "smarttext" format but also with "html", "hocr" and "text".
With Czech, Dutch, English, French, German, etc., Cuneiform produces â€”.
With Bulgarian, Russian, etc., however, Cuneiform produces вЂ” (that's U+0432, U+0402, U+201D).
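Both sets of code points are what you get when the UTF-8 bytes of an em dash are read one byte at a time through a Windows code page. The sketch below reproduces them in Python. That Windows-1252 and Windows-1251 are the code pages involved is an assumption inferred from the reported code points, not something confirmed from the Cuneiform source; `misread` and `code_points` are hypothetical helper names.

```python
EM_DASH_UTF8 = "\u2014".encode("utf-8")  # b'\xe2\x80\x94'

def misread(data, codec):
    """Decode bytes with a single-byte codec, one character per byte."""
    return data.decode(codec)

def code_points(text):
    """Format each character as a U+XXXX string."""
    return [f"U+{ord(ch):04X}" for ch in text]

# Assumed code pages: cp1252 for Western languages, cp1251 for Cyrillic.
western = misread(EM_DASH_UTF8, "cp1252")
cyrillic = misread(EM_DASH_UTF8, "cp1251")

print(code_points(western))   # ['U+00E2', 'U+20AC', 'U+201D']
print(code_points(cyrillic))  # ['U+0432', 'U+0402', 'U+201D']
```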

Revision history for this message
adriverhoef (a3) wrote :
Revision history for this message
adriverhoef (a3) wrote :

The previous output results were for English; the character encoding is UTF-8.
And here are the Cuneiform output results for Bulgarian.
Note that the character encoding is also UTF-8, as it was for English. (You'll probably have to adjust your browser settings.)

Revision history for this message
Jussi Pakkanen (jpakkane) wrote :

The final expansion to UTF-8 happens in text.cpp:OneChar. However, debugging reveals that by that point the dash (value 151 or 0x97) has already been expanded to the specified characters. Those individual characters are then UTF-8 encoded as expected. The question becomes: where does the first expansion/substitution happen, and why?
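The difference between the expected path and the observed one can be sketched in Python. `to_utf8` is a hypothetical stand-in for the final expansion step, not the real OneChar code, and Windows-1252 as the internal single-byte encoding is an assumption.

```python
INTERNAL_DASH = bytes([0x97])  # dash value 151 in the single-byte encoding

def to_utf8(internal, codec="cp1252"):
    """Hypothetical final step: treat every internal byte as one
    character of `codec` and UTF-8 encode the result."""
    return internal.decode(codec).encode("utf-8")

# Expected: one internal byte becomes one em dash.
expected = to_utf8(INTERNAL_DASH)

# Observed: the dash has already been expanded to three bytes earlier,
# so each of those bytes is encoded again as a separate character.
already_expanded = "\u2014".encode("utf-8")
observed = to_utf8(already_expanded)

print(expected.hex())  # e28094
print(observed.hex())  # c3a2e282ace2809d
```

Under these assumptions the output holds eight bytes for what should be a three-byte em dash, which a UTF-8 viewer renders as three separate characters.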

Markus (smstuff)
Changed in cuneiform-linux:
assignee: nobody → Markus (smstuff)
Revision history for this message
daniel (daniel-schmid-engineering) wrote : Re: [Cuneiform] [Bug 324256] Re: Strange symbols instead '-' in smart text output

Daniel Wildermuth is out of the office until October 15th, 2010. In urgent
cases, please contact Marco Schmid (<email address hidden>).

Markus (smstuff)
Changed in cuneiform-linux:
assignee: Markus (smstuff) → nobody
Revision history for this message
Alexander Kandaurov (w0lfx) wrote :

The problem is caused by the em dash appearing in rcm.c as a Unicode character, while the program uses a one-byte encoding internally and converts to Unicode only when producing the output. The included patch solves the issue.
Other suspicious places in the source that may cause similar issues are line 550 of Kern/rstr/src/rcm.c and line 98 of Kern/rling/sources/c/speldict.c.
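One way to look for further candidates is to scan the source tree for bytes outside the ASCII range. This is a rough sketch in Python, not part of the patch; `find_non_ascii` and `scan_tree` are hypothetical helpers, and the file patterns are assumptions. Comments written in a one-byte encoding will also be reported, so the hits need manual review.

```python
from pathlib import Path

def find_non_ascii(source):
    """Yield (line_number, byte_values) for each line of `source`
    (a bytes object) that contains bytes >= 0x80."""
    for number, line in enumerate(source.splitlines(), start=1):
        high = [b for b in line if b >= 0x80]
        if high:
            yield number, high

def scan_tree(root, patterns=("*.c", "*.cpp", "*.h")):
    """Print every line with non-ASCII bytes under `root`."""
    for pattern in patterns:
        for path in sorted(Path(root).rglob(pattern)):
            for number, high in find_non_ascii(path.read_bytes()):
                shown = " ".join(f"0x{b:02X}" for b in high)
                print(f"{path}:{number}: {shown}")

# Small in-memory sample: line 2 holds a UTF-8 em dash in a literal.
sample = b"int a = 1;\nchar dash = '\xe2\x80\x94';\n"
hits = list(find_non_ascii(sample))
print(hits)  # [(2, [226, 128, 148])]
```

Calling `scan_tree` on a checkout directory would list each file and line to inspect.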

Revision history for this message
Jussi Pakkanen (jpakkane) wrote :

Thank you for your patch.

However, Cuneiform is currently unmaintained, so no new patches will be applied to trunk. Sorry.

Revision history for this message
julien (julien-aubert) wrote : Re: [Bug 324256] Re: Strange symbols instead '-' in smart text output

I am one of those still using Cuneiform; unfortunately, it has
no maintainer as of now.
Thanks for posting the patch.
Regards,
Julien

2011/12/13 Alexander Kandaurov <email address hidden>:
> The problem is caused by the em dash appearing in rcm.c as a unicode character while the program uses one-byte encoding in its internals and converts it to unicode while producing the output. The included patch solves the issue.
> Other suspicious places in the source that may cause issues similar to this are line 550 of Kern/rstr/src/rcm.c and line 98 of Kern/rling/sources/c/speldict.c.
>
> ** Patch added: "cuneiform-linux-1.1.0-emdash.patch"
>   https://bugs.launchpad.net/cuneiform-linux/+bug/324256/+attachment/2631040/+files/cuneiform-linux-1.1.0-emdash.patch
