can produce hOCR with illegal UTF-8 sequences
Bug #585418 reported by Jakub Wilk
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Cuneiform for Linux | Fix Released | Undecided | Unassigned | |
Bug Description
Cuneiform can produce hOCR that contains illegal UTF-8 sequences:
$ cuneiform -l ruseng -f hocr -o test.html test.png
Cuneiform for Linux 0.9.0
$ grep -i utf-8 test.html
<meta http-equiv=
$ iconv -f UTF-8 -t UTF-8 < test.html > /dev/null
iconv: illegal input sequence at position 401
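The iconv round-trip above stops at the first bad byte. As a rough complement, the following Python sketch lists every undecodable span in a byte string, together with its offset. The function name `find_invalid_utf8` and the sample bytes are illustrative assumptions, not part of Cuneiform or of the original report.

```python
def find_invalid_utf8(data: bytes) -> list[tuple[int, bytes]]:
    """Return (offset, offending_bytes) for every span that is not valid UTF-8."""
    problems = []
    pos = 0
    while pos < len(data):
        try:
            # Decoding succeeds only if the rest of the buffer is valid UTF-8.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice, so shift by pos.
            start, end = pos + err.start, pos + err.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning right after the bad span
    return problems


if __name__ == "__main__":
    # Hypothetical sample: valid Cyrillic text followed by a stray 0xFF byte.
    sample = "Привет".encode("utf-8") + b"\xff"
    print(find_invalid_utf8(sample))
```

To check a real file such as test.html, read it in binary mode (`open("test.html", "rb").read()`) and pass the bytes to the function. The offsets should correspond to the byte positions that iconv reports, assuming iconv counts from zero.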
Changed in cuneiform-linux:
status: New → Fix Committed
Changed in cuneiform-linux:
status: Fix Committed → Fix Released
Since I am not familiar with Cyrillic (which is probably what the output contains, because you are using ruseng), could you please specify:
- which recognized character is the issue
- what UTF-8 sequence it produces
- what is the correct UTF-8 sequence for that character
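One way to gather the details requested above is to dump the bytes around the reported offset and compare them with the expected encoding of a candidate character. The Python sketch below assumes the hOCR output is available as bytes. The helper names `context_hex` and `utf8_hex` are hypothetical, and the character "Ж" is an arbitrary example, since the report does not say which character is affected.

```python
def context_hex(data: bytes, offset: int, radius: int = 8) -> str:
    """Hex dump of the bytes from offset - radius up to (not including) offset + radius."""
    start = max(0, offset - radius)
    return data[start:offset + radius].hex(" ")


def utf8_hex(char: str) -> str:
    """Correct UTF-8 encoding of a character, as space-separated hex bytes."""
    return char.encode("utf-8").hex(" ")


if __name__ == "__main__":
    # Arbitrary example character; U+0416 encodes to the two bytes d0 96.
    print(utf8_hex("Ж"))
```

For the file in this report, something like `context_hex(open("test.html", "rb").read(), 401)` would show the bytes near the position iconv complained about.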