can produce hOCR with illegal UTF-8 sequences

Bug #585418 reported by Jakub Wilk
This bug affects 1 person
Affects: Cuneiform for Linux
Importance: Undecided
Assigned to: Unassigned

Bug Description

Cuneiform can produce hOCR that contains illegal UTF-8 sequences:

$ cuneiform -l ruseng -f hocr -o test.html test.png
Cuneiform for Linux 0.9.0

$ grep -i utf-8 test.html
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" >

$ iconv -f UTF-8 -t UTF-8 < test.html > /dev/null
iconv: illegal input sequence at position 401
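
The iconv check above only reports a position. As a rough sketch (not part of Cuneiform or of the original report), the same check can be done in Python so that the offending byte value is printed too. The helper name `first_invalid_utf8` and the sample bytes are made up for illustration; in practice you would pass in the contents of test.html read in binary mode.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first undecodable byte, or None if data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded
        return err.start, data[err.start]
    return None


# Hypothetical sample: ASCII markup with a stray 0xA9 byte in it
sample = b"<p>" + bytes([0xA9]) + b"example.org</p>"
print(first_invalid_utf8(sample))
```

For a real file, something like `first_invalid_utf8(open("test.html", "rb").read())` should give an offset comparable to the one iconv reports.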

Jakub Wilk (jwilk) wrote :
Jussi Pakkanen (jpakkane) wrote :

Since I am not familiar with Cyrillic (which you'll probably get because you are using ruseng), could you please specify:

- which recognized character is the issue
- what UTF-8 sequence it produces
- what is the correct UTF-8 sequence for that character

Kyrill Detinov (lazy-kent) wrote :

The bug is reproducible with '-l eng'. We use '-l ruseng' because part of the full page is in Russian.
The problem character is '@'.

Jakub Wilk (jwilk) wrote : Re: [Bug 585418] Re: can produce hOCR with illegal UTF-8 sequences

(In fact the test image contains a few dozen characters, all of which
are covered by ASCII…) The issue is the leading "@", which is output
as byte 0xA9.

--
Jakub Wilk
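
To illustrate why a lone 0xA9 byte trips iconv: in UTF-8, bytes in the range 0x80 to 0xBF are continuation bytes and are only legal after a lead byte. The sketch below classifies a byte by its UTF-8 role; the helper `utf8_role` is hypothetical and is not taken from Cuneiform. The last line shows how U+00A9 would be encoded in UTF-8. Whether Cuneiform actually meant that code point is an assumption, since the thread does not say.

```python
def utf8_role(byte: int) -> str:
    """Classify a single byte value by the role it can play in a UTF-8 stream."""
    if byte < 0x80:
        return "ascii"          # stands alone as one character
    if byte < 0xC0:
        return "continuation"   # 0x80..0xBF, only valid after a lead byte
    if 0xC2 <= byte <= 0xF4:
        return "lead"           # starts a 2-, 3- or 4-byte sequence
    return "invalid"            # 0xC0, 0xC1 and 0xF5..0xFF never occur in UTF-8


print(utf8_role(ord("@")))        # the character in the test image
print(utf8_role(0xA9))            # the byte found in the hOCR output
print("\u00a9".encode("utf-8"))   # assumption: U+00A9 was the intended code point
```

So the reported byte cannot start a character on its own, which matches the "illegal input sequence" error.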

Changed in cuneiform-linux:
status: New → Fix Committed
Changed in cuneiform-linux:
status: Fix Committed → Fix Released