Ubuntu
exactimage package

Bounding boxes not handled correctly

Bug #632524 reported by Martin Wildam on 2010-09-07

This bug report is a duplicate of: Bug #623438: Font size not correct in merged sandwich PDF.

This bug affects 2 people

Affects		Status	Importance	Assigned to	Milestone
	exactimage (Ubuntu)	New	Undecided	Unassigned

Bug Description

Binary package hint: exactimage

hocr2pdf has problems processing hOCR output from CuneiForm 1.0.0.
The effect is visible when selecting text in the resulting PDF: with current versions of cuneiform + hocr2pdf, the selected text is far too big.

CuneiForm has changed its output in version 1.0.0: it now provides two types of bounding boxes, one for characters and one for lines. Previously there was only output for characters.

I have filed a bug for CuneiForm - see Bug #623438 .

They commented that their output conforms to hOCR and that the problem lies in hocr2pdf.

Sample output from cuneiform v1.0.0:
<p><span class='ocr_line' id='line_1' title="bbox 36 93 580 123">This is a lot of 12 point text to test the <span class='ocr_cinfo' title="x_bboxes 36 93 55 117 57 93 71 117 ... [and so on and so on]

The line is in the first span and the letters in the second.

It seems that hocr2pdf cannot handle both bounding boxes.
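
To make the two kinds of boxes concrete, here is a minimal Python sketch that separates the line box (`bbox`) from the per-character boxes (`x_bboxes`). This is only an illustration, not how hocr2pdf works internally. The `SAMPLE` string is a shortened, artificially closed version of the output above (the real output continues with more character boxes), and `parse_boxes` is a hypothetical helper name.

```python
import re

# Shortened version of the sample above, closed by hand so it is self-contained.
# Only the two character boxes visible in the report are kept.
SAMPLE = (
    "<p><span class='ocr_line' id='line_1' title=\"bbox 36 93 580 123\">"
    "This is a lot of 12 point text to test the "
    "<span class='ocr_cinfo' title=\"x_bboxes 36 93 55 117 57 93 71 117\">"
    "</span></span></p>"
)


def parse_boxes(hocr):
    """Return (line_boxes, char_boxes) as lists of (x0, y0, x1, y1) tuples."""
    line_boxes = []
    char_boxes = []
    # Inspect every title attribute and branch on its property name.
    for title in re.findall(r'title="([^"]*)"', hocr):
        name, _, rest = title.partition(" ")
        nums = [int(n) for n in rest.split()]
        if name == "bbox":
            # One box for the whole line.
            line_boxes.append(tuple(nums[:4]))
        elif name == "x_bboxes":
            # Flat list of numbers: four per character.
            for i in range(0, len(nums) - 3, 4):
                char_boxes.append(tuple(nums[i:i + 4]))
    return line_boxes, char_boxes


line_boxes, char_boxes = parse_boxes(SAMPLE)
print("line boxes:", line_boxes)
print("char boxes:", char_boxes)
```

A converter that only expects one of these two properties could easily pick up the wrong numbers when both are present, which would fit the symptom described here.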

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: exactimage 0.7.4-3ubuntu2
ProcVersionSignature: Ubuntu 2.6.32-24.42-generic 2.6.32.15+drm33.5
Uname: Linux 2.6.32-24-generic i686
Architecture: i386
Date: Tue Sep 7 18:02:43 2010
InstallationMedia: Ubuntu 10.04 LTS "Lucid Lynx" - Release i386 (20100429)
ProcEnviron:
LANG=en_US.utf8
SHELL=/bin/bash
SourcePackage: exactimage

Tags:

Martin Wildam (mwildam) wrote on 2010-09-07:

Cuneiform output v0.8.0 vs v1.0.0 (1.1 MiB, application/zip)
Dependencies.txt (804 bytes, text/plain; charset="utf-8")

Martin Wildam (mwildam) wrote on 2010-09-07:

screen097.png (92.7 KiB, image/png)

Attached is a screenshot of the effect: the selected text is so big that selecting just the part I want is nearly impossible.

Martin Wildam (mwildam) wrote on 2010-09-10:

It looks like the font size is not calculated correctly from the bounding boxes and the text they contain in the HTML. I am not an expert, but this link might help you:
http://www.emdpi.com/fontsize.html

I have discussed this with somebody who is an expert in PDF. My current understanding is that, when creating the PDF, the hidden text behind the displayed image needs font size, spacing and similar information so that the viewer can place it correctly.
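
As a rough illustration of how a font size could be derived from a bounding box, here is a small Python sketch. It relies only on the fact that one point is 1/72 inch. The scan resolution is not stated in this report, so the `dpi` values below are assumptions, and `bbox_height_to_points` is a hypothetical helper, not a function of hocr2pdf.

```python
def bbox_height_to_points(bbox, dpi):
    """Convert the pixel height of a box (x0, y0, x1, y1) to points."""
    x0, y0, x1, y1 = bbox
    height_px = y1 - y0
    # 1 inch = 72 points, and dpi pixels make up one inch.
    return height_px * 72.0 / dpi


# Line box from the CuneiForm sample in the description.
line_bbox = (36, 93, 580, 123)

# The real scan resolution is unknown; these values are only examples.
for dpi in (150, 200, 300):
    print(dpi, "dpi ->", round(bbox_height_to_points(line_bbox, dpi), 1), "pt")
```

The box height is only an approximation of the font size, since it depends on which ascenders and descenders actually occur in the line. Still, it shows that the result depends strongly on the resolution and on which box (line or character) is used.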

I noticed that it is not only the selection in the viewer that fails. Many words are also not found by the viewers' built-in search (tested with Evince and Adobe Acrobat Reader).

Side note: If I extract the full text using a PDF library, I get correct-looking text (words separated by spaces, no stray spaces inside words).

I think that creating a correct sandwich PDF is crucial, and I wonder why more people are not interested in this. But I also think that it is not easy. It would probably be necessary to get experts in OCR, PDF and fonts together to solve this. The key missing piece, IMHO, is deriving font metric information (font name, size, spacing, ...) when only the bounding boxes and the contained text are available. That is why I also posted the link above, which I find important.
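
To illustrate the spacing part of the problem, here is a hedged Python sketch of one possible approach: stretching or squeezing the hidden text horizontally so that it spans the width of its bounding box. It assumes a monospaced Courier font, whose glyphs are all 0.6 em wide, which avoids needing real font metrics. The word box, the resolution and the helper name `horizontal_scale_percent` are all made up for illustration, and nothing here is taken from hocr2pdf.

```python
# Assumption: Courier, where every glyph advances 600/1000 of the font size.
COURIER_ADVANCE_EM = 0.6


def horizontal_scale_percent(text, bbox, font_size_pt, dpi):
    """Percentage by which text must be scaled horizontally to fill bbox."""
    x0, _, x1, _ = bbox
    # Width the text should occupy on the page, in points.
    target_width_pt = (x1 - x0) * 72.0 / dpi
    # Width the text would have unscaled at the given font size.
    natural_width_pt = len(text) * COURIER_ADVANCE_EM * font_size_pt
    return 100.0 * target_width_pt / natural_width_pt


# Hypothetical word box and resolution, for illustration only.
scale = horizontal_scale_percent("text", (100, 93, 160, 123), 12.0, 150)
print("horizontal scale: %.1f%%" % scale)
```

With a proportional font, the natural width would have to come from that font's actual glyph widths, which is exactly the font metric information that is missing here.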
