Comment 32 for bug 623438

Don Marang (speedychair) wrote : Re: [Cuneiform] [Bug 623438] Re: Font size not correct in merged sandvich PDF

I am not an expert in PDF internal formats at this point, so I may need to
start learning. I also have an application, actually a long bash script,
whose capabilities I want to extend so that it can take several scanned pages
that have had OCR performed and merge the text with the original images in a
PDF. The package is called speedy-ocr.

Does having a sandwiched PDF mean that the text is then editable in Adobe,
as opposed to just attached as a searchable, structured note? I am writing
this script to simplify scanning and OCR functionality for the blind and
visually impaired community. Screen readers, Orca in this case, will need
structured text so that the text can be read in the appropriate order, if
possible. I do not know yet how much of the structure can be retrieved from
Cuneiform, if any. For our purposes, font information is not necessary for
most users. They just need to be able to retrieve and store fairly accurate
text, in the correct reading order, for each page. Is this type of merge
different from a sandwiched PDF? Is it simply attached searchable text?

We have a distribution based on Ubuntu 10.04 Lucid that configures several
accessibility systems, and a group of developers worldwide is attempting to
fix GNOME applications for accessibility. Most of the fixes get sent
upstream and incorporated into Ubuntu, partly because Luke is now using the
distribution as a testbed. The distribution is called Vinux, and its
home page is vinux.org.uk. Our repositories are also on Launchpad.net.

Don Marang

There is just so much stuff in the world that, to me, is devoid of any real
substance, value, and content that I just try to make sure that I am working
on things that matter.
Dean Kamen

--------------------------------------------------
From: "Martin Wildam" <email address hidden>
Sent: Friday, September 10, 2010 4:05 AM
To: <email address hidden>
Subject: [Cuneiform] [Bug 623438] Re: Font size not correct in merged sandvich PDF

> I have discussed this with somebody who is an expert in PDF and my
> current understanding is that for creating the PDF the underlying text
> behind the image displayed needs font size, spacing etc information to
> be correctly displayed in the viewer.
>
> I noticed that it is not only the selection in the viewer that does not
> work correctly. A lot of words are also not found using the internal search
> functionality of viewers (tested with Evince and Adobe Acrobat Reader).
>
> Side note: If I extract the full text using a PDF library I get a
> correct looking text (words separated by space, no spaces between
> words).
>
> I think that creating a correct sandvich PDF is crucial and wonder why
> more people are not interested in this. But I also think that it is not
> easy. I think it would be necessary to get experts in OCR, experts in
> PDF and experts in fonts together to solve this. The key missing thing,
> IMHO, is how to get font metric (font name, size, spacing, ...) information
> when only having the bounding boxes and contained text. That is why I
> also posted the link above, which I find important.
>
> --
> Font size not correct in merged sandvich PDF
> https://bugs.launchpad.net/bugs/623438
> You received this bug notification because you are a member of Cuneiform
> Linux, which is the registrant for Cuneiform for Linux.
>
> Status in Linux port of Cuneiform: Invalid
>
> Bug description:
> After processing with Cuneiform for Linux 1.0.0 and hOCR to PDF converter,
> version 0.7.4 (should be the most current version) I get a sandvich pdf
> that looks nice until I select text.
>
> See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf.
> The effect is shown in screen087.png
>
> For another file (Test10pages.pdf) the effect is even worse - basically
> I cannot really select any text to copy because I can only guess
> where to move the mouse.
>
> It looks like the font size in the HTML is somehow not correct - I am
> not an expert, but this link might help you:
> http://www.emdpi.com/fontsize.html
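
On the font size question in the report above, here is a rough sketch of how
a size in points might be estimated when only a bounding box and the scan
resolution are known. It relies on the fact that one point is 1/72 inch. The
function names and the height_ratio parameter are hypothetical, made up for
illustration; this is not taken from Cuneiform or the hOCR to PDF converter,
and a real fix would also need ascender/descender and spacing metrics.

```python
import re


def bbox_from_title(title):
    """Parse 'bbox x0 y0 x1 y1' (pixel coordinates) from an hOCR title attribute."""
    match = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", title)
    if match is None:
        raise ValueError("no bbox found in title: %r" % title)
    return tuple(int(group) for group in match.groups())


def estimate_font_size_pt(bbox, dpi, height_ratio=1.0):
    """Estimate a font size in points from a bounding box height.

    height_ratio is an assumed fudge factor (hypothetical): the box around
    a word rarely matches the nominal font size exactly, so callers may
    want to scale the result.
    """
    x0, y0, x1, y1 = bbox
    height_px = y1 - y0
    # pixels -> inches (divide by dpi) -> points (multiply by 72)
    return height_px * 72.0 / dpi * height_ratio


if __name__ == "__main__":
    box = bbox_from_title("bbox 100 200 400 250")
    print(estimate_font_size_pt(box, dpi=300))
```

For a box 50 pixels tall scanned at 300 dpi this gives 12 points. If the
resolution assumed by the converter differs from the real scan resolution,
the text layer will be scaled wrongly, which would be one possible cause of
the selection problem described in the report.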