Cuneiform for Linux

can produce hOCR with illegal UTF-8 sequences

Bug #585418 reported by Jakub Wilk on 2010-05-25

This bug affects 1 person

Affects: Cuneiform for Linux
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Cuneiform can produce hOCR that contains illegal UTF-8 sequences:

$ cuneiform -l ruseng -f hocr -o test.html test.png
Cuneiform for Linux 0.9.0

$ grep -i utf-8 test.html
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >

$ iconv -f UTF-8 -t UTF-8 < test.html > /dev/null
iconv: illegal input sequence at position 401
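
The iconv check above reports only a byte offset. A minimal Python sketch that also prints the offending byte, assuming the hOCR output is available as raw bytes; the helper name `first_invalid_utf8` and the sample bytes are illustrative and are not taken from Cuneiform or from the attached test file:

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte
        return err.start, data[err.start]
    return None


# Illustrative input only: a stray 0xA9 byte embedded in ASCII markup.
sample = b"<p>" + b"\xa9" + b"example</p>"
result = first_invalid_utf8(sample)
if result is not None:
    offset, value = result
    print(f"illegal input sequence at position {offset}: 0x{value:02X}")
```

To check a real file, the same helper could be called on `open("test.html", "rb").read()`.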


Jakub Wilk (jwilk) wrote on 2010-05-25:

test image (8.2 KiB, image/png)


Jussi Pakkanen (jpakkane) wrote on 2010-05-26:

Since I am not familiar with Cyrillic (which you'll probably get because you are using ruseng), could you please specify:

- which recognized character is the issue
- what UTF-8 sequence it produces
- what is the correct UTF-8 sequence for that character


Kyrill Detinov (lazy-kent) wrote on 2010-05-26:

The bug is reproducible with '-l eng'. We use '-l ruseng' because part of the full page is in Russian.
The problem character is '@'.


Jakub Wilk (jwilk) wrote on 2010-05-26:

(In fact the test image contains a few dozen characters, all of which
are covered by ASCII…) The issue is the leading "@", which is output
as byte 0xA9.

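
To put the three questions asked earlier in concrete terms, here is a short Python sketch. It relies only on what is reported in this thread (the "@" comes out as the single byte 0xA9); the helper `is_valid_utf8` is a hypothetical name, and nothing here says where in Cuneiform the wrong byte originates:

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


expected = "@".encode("utf-8")  # U+0040 is ASCII, so UTF-8 uses one byte: 0x40
observed = b"\xa9"              # the byte reported in this thread

print("expected bytes:", expected.hex())
print("expected is valid UTF-8:", is_valid_utf8(expected))
# 0xA9 has the bit pattern 10xxxxxx, a continuation byte, so it cannot
# stand alone in a UTF-8 stream.
print("observed is valid UTF-8:", is_valid_utf8(observed))
```

The correct sequence for the problem character is therefore the single byte 0x40, while a lone 0xA9 is rejected by any strict UTF-8 decoder, which matches the iconv error in the description.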

Jussi Pakkanen (jpakkane) on 2010-05-26

Changed in cuneiform-linux:
status:	New → Fix Committed

Jussi Pakkanen (jpakkane) on 2010-06-30

Changed in cuneiform-linux:
status:	Fix Committed → Fix Released
