can produce hOCR with illegal UTF-8 sequences

Bug #585418 reported by Jakub Wilk
This bug affects 1 person
Affects: Cuneiform for Linux
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: none listed

Bug Description

Cuneiform can produce hOCR that contains illegal UTF-8 sequences:

$ cuneiform -l ruseng -f hocr -o test.html test.png
Cuneiform for Linux 0.9.0

$ grep -i utf-8 test.html
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >

$ iconv -f UTF-8 -t UTF-8 < test.html > /dev/null
iconv: illegal input sequence at position 401
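
For anyone without iconv at hand, the same check can be sketched in Python. This is only an illustration: the helper name `first_invalid_utf8_offset` and the sample bytes are made up for the example and are not taken from the attached test.html.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        return exc.start
    return None


# Hypothetical sample: ASCII markup with a stray 0xA9 byte at offset 3.
sample = b"<p>" + b"\xa9" + b"test</p>"
print(first_invalid_utf8_offset(sample))  # 3
```

To check a real file, read it in binary mode (`open("test.html", "rb").read()`) and pass the bytes to the helper.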

Jakub Wilk (jwilk) wrote :
Jussi Pakkanen (jpakkane) wrote :

Since I am not familiar with Cyrillic (which you'll probably get because you are using ruseng), could you please specify:

- which recognized character is the issue
- what UTF-8 sequence it produces
- what is the correct UTF-8 sequence for that character

Kyrill Detinov (lazy-kent) wrote :

The bug is reproducible with '-l eng'. We use '-l ruseng' because part of the full page is in Russian.
The problem character is '@'.

Jakub Wilk (jwilk) wrote : Re: [Bug 585418] Re: can produce hOCR with illegal UTF-8 sequences

(In fact the test image contains a few dozen characters, all of which
are covered by ASCII…) The issue is the leading "@", which is output
as byte 0xA9.

--
Jakub Wilk
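
To illustrate why a lone 0xA9 byte trips iconv, here is a small Python sketch. It relies only on general UTF-8 and ISO-8859-1 facts; nothing in it is taken from Cuneiform's code. The last two lines show that 0xA9 is the copyright sign in ISO-8859-1, whose UTF-8 form is the two bytes C2 A9. Whether the stray byte actually came from a single-byte encoding is an assumption that this thread does not confirm.

```python
byte = 0xA9

# UTF-8 continuation bytes have the bit pattern 10xxxxxx, so they can
# never start a character on their own.
is_continuation = (byte & 0xC0) == 0x80
print(is_continuation)  # True

# "@" is plain ASCII, so its correct UTF-8 encoding is the single byte 0x40.
print("@".encode("utf-8"))  # b'@'

# A lone 0xA9 is rejected by a strict UTF-8 decoder.
try:
    bytes([byte]).decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False
print(lone_byte_is_valid)  # False

# In ISO-8859-1 the byte 0xA9 is the copyright sign; its UTF-8 form is C2 A9.
copyright_sign = bytes([byte]).decode("latin-1")
print(copyright_sign.encode("utf-8"))  # b'\xc2\xa9'
```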

Changed in cuneiform-linux:
status: New → Fix Committed
Changed in cuneiform-linux:
status: Fix Committed → Fix Released