Bounding boxes not handled correctly
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
exactimage (Ubuntu) |
New
|
Undecided
|
Unassigned |
Bug Description
Binary package hint: exactimage
hocr2pdf has problems processing hocr output from CuneiForm 1.0.0.
The effect is visible when selecting text in the resulting text which is far too big using current versions of cuneiform + hocr2pdf.
CuneiForm has changed it's output for version 1.0.0 providing two types of bounding boxes - for characters and for lines. Previously there was only output for characters.
I have filed a bug for CuneiForm - see Bug #623438 .
They commented that their output is hocr conform and the problem is related to hocr2pdf.
Sample output from cuneiform v1.0.0:
<p><span class='ocr_line' id='line_1' title="bbox 36 93 580 123">This is a lot of 12 point text to test the <span class='ocr_cinfo' title="x_bboxes 36 93 55 117 57 93 71 117 ... [and so on and so on]
The line is in the first span and the letters in the second.
It seems that hocr2pdf cannot handle both bouding boxes.
ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: exactimage 0.7.4-3ubuntu2
ProcVersionSign
Uname: Linux 2.6.32-24-generic i686
Architecture: i386
Date: Tue Sep 7 18:02:43 2010
InstallationMedia: Ubuntu 10.04 LTS "Lucid Lynx" - Release i386 (20100429)
ProcEnviron:
LANG=en_US.utf8
SHELL=/bin/bash
SourcePackage: exactimage
Attached you find a screenshot example of the effect that happens (selected text so big, that selecting the part of text I want, is quite impossible).