Bounding boxes not handled correctly

Bug #632524 reported by Martin Wildam
This bug affects 2 people
Affects: exactimage (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Binary package hint: exactimage

hocr2pdf has problems processing the hOCR output from CuneiForm 1.0.0.
The effect is visible when selecting text in the resulting PDF: with current versions of cuneiform + hocr2pdf, the selected text is far too big.

CuneiForm changed its output in version 1.0.0 and now provides two types of bounding boxes: one for characters and one for lines. Previously the output contained bounding boxes for characters only.

I have filed a bug against CuneiForm; see Bug #623438.

They commented that their output conforms to hOCR and that the problem lies in hocr2pdf.

Sample output from cuneiform v1.0.0:
<p><span class='ocr_line' id='line_1' title="bbox 36 93 580 123">This is a lot of 12 point text to test the <span class='ocr_cinfo' title="x_bboxes 36 93 55 117 57 93 71 117 ... [and so on and so on]

The line bounding box is in the first span and the character bounding boxes are in the second.

It seems that hocr2pdf cannot handle both kinds of bounding boxes.
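
To illustrate the two kinds of bounding boxes, here is a minimal Python sketch that pulls both out of the title attributes. This is not how hocr2pdf parses hOCR; the function name parse_boxes and the SAMPLE string are made up for illustration, and SAMPLE is shortened to the two character boxes visible in the output above.

```python
import re

# Matches title="bbox ..." (line box) and title="x_bboxes ..." (character boxes).
TITLE_RE = re.compile(r'title="(bbox|x_bboxes)\s+([-\d\s]+)"')

def parse_boxes(hocr_fragment):
    """Return (line_boxes, char_boxes) as lists of (x0, y0, x1, y1) tuples."""
    line_boxes, char_boxes = [], []
    for kind, numbers in TITLE_RE.findall(hocr_fragment):
        values = [int(n) for n in numbers.split()]
        # Group the flat list of numbers into 4-tuples, ignoring any leftover.
        boxes = [tuple(values[i:i + 4]) for i in range(0, len(values) - 3, 4)]
        if kind == "bbox":
            line_boxes.extend(boxes)
        else:
            char_boxes.extend(boxes)
    return line_boxes, char_boxes

# Hypothetical, shortened sample: only the first two character boxes are kept.
SAMPLE = (
    "<span class='ocr_line' id='line_1' title=\"bbox 36 93 580 123\">Th"
    "<span class='ocr_cinfo' title=\"x_bboxes 36 93 55 117 57 93 71 117\">"
    "</span></span>"
)

lines, chars = parse_boxes(SAMPLE)
print("line boxes:", lines)
print("char boxes:", chars)
```

A converter that expects only one of the two forms would need to decide which to use for positioning the text; the sketch keeps them in separate lists so that choice stays explicit.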

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: exactimage 0.7.4-3ubuntu2
ProcVersionSignature: Ubuntu 2.6.32-24.42-generic 2.6.32.15+drm33.5
Uname: Linux 2.6.32-24-generic i686
Architecture: i386
Date: Tue Sep 7 18:02:43 2010
InstallationMedia: Ubuntu 10.04 LTS "Lucid Lynx" - Release i386 (20100429)
ProcEnviron:
 LANG=en_US.utf8
 SHELL=/bin/bash
SourcePackage: exactimage

Martin Wildam (mwildam) wrote :

Attached is a screenshot showing the effect: the selected text is so big that selecting just the part of the text I want is nearly impossible.

Martin Wildam (mwildam) wrote :

It looks like the font size is not calculated correctly from the bounding boxes and the contained text in the HTML. I am not an expert, but this link might help you:
http://www.emdpi.com/fontsize.html

I have discussed this with somebody who is an expert in PDF. My current understanding is that, when creating the PDF, the text layer behind the displayed image needs font size, spacing and similar information so that the viewer can display it correctly.
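
As a rough illustration of how a font size could be derived from a bounding box, here is a minimal Python sketch. It assumes the scan resolution is known and that the height of the line box is a usable stand-in for the font size, which is only an approximation. The function name and the 300 dpi value are hypothetical; I do not know what hocr2pdf actually does.

```python
POINTS_PER_INCH = 72.0  # PDF user space uses 72 points per inch by default

def estimate_font_size_pt(bbox, dpi):
    """Estimate a font size in points from a pixel bounding box.

    bbox is (x0, y0, x1, y1) in pixels; dpi is the scan resolution.
    Assumption: the box height approximates the font size.
    """
    x0, y0, x1, y1 = bbox
    height_px = y1 - y0
    return height_px * POINTS_PER_INCH / dpi

# Line box from the sample above; 300 dpi is an assumed resolution.
size = estimate_font_size_pt((36, 93, 580, 123), dpi=300)
print(f"estimated font size: {size:.1f} pt")
```

If the converter uses a wrong resolution, or uses the wrong box as the height, the text layer comes out at the wrong size, which would match the oversized selection I am seeing.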

I noticed that selection is not the only thing that fails in the viewer. Many words are also not found by the viewers' built-in search (tested with Evince and Adobe Acrobat Reader).

Side note: If I extract the full text using a PDF library, I get correct-looking text (words separated by spaces, no spaces within words).

I think that creating a correct sandwich PDF is crucial, and I wonder why more people are not interested in this. But I also think that it is not easy. It would probably be necessary to get experts in OCR, PDF and fonts together to solve this. The key missing thing, IMHO, is getting font metric information (font name, size, spacing, ...) when only the bounding boxes and the contained text are available. That is why I posted the link above, which I find important.
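
To show what spacing information the character boxes could provide, here is a small Python sketch that computes per-character widths and the gaps between neighbouring characters. The function name char_metrics is made up, and the two boxes are the ones visible in the sample output above. It says nothing about the font name, which cannot be read off the boxes this way.

```python
def char_metrics(char_boxes):
    """Return (widths, gaps) in pixels for a list of (x0, y0, x1, y1) boxes.

    widths[i] is the width of character i; gaps[i] is the horizontal
    distance between character i and character i + 1.
    """
    widths = [x1 - x0 for x0, _, x1, _ in char_boxes]
    gaps = [nxt[0] - cur[2] for cur, nxt in zip(char_boxes, char_boxes[1:])]
    return widths, gaps

widths, gaps = char_metrics([(36, 93, 55, 117), (57, 93, 71, 117)])
print("widths:", widths)
print("gaps:", gaps)
```

Comparing such widths with the advance widths of a candidate font at a given size would be one way to check whether the chosen font size and spacing fit the scanned text.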
