Comment 35 for bug 623438

Igor Filippov (igor-v-filippov) wrote : Re: [Cuneiform] [Bug 623438] Re: Font size not correct in merged sandvich PDF

Martin,

I'm not using this functionality myself, so you most likely know best,
but OCRAD produces ORF output with the "-x" command-line option.
According to the README, the ORF file contains bounding boxes for the
OCRed characters and lines.
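
A minimal sketch of calling OCRAD this way from Python, assuming the
"ocrad" binary is on the PATH and that "-x" takes the ORF output file
name as its argument. The function names and the file names used with
them are hypothetical, and parsing the ORF file is left out.

```python
import subprocess


def ocrad_export_command(image_path, orf_path):
    """Build the command line that asks OCRAD to export an ORF file.

    Assumption: "-x <file>" writes the OCR results file to <file>.
    """
    return ["ocrad", "-x", orf_path, image_path]


def run_ocrad(image_path, orf_path):
    """Run OCRAD; raises CalledProcessError if it exits with an error."""
    subprocess.run(ocrad_export_command(image_path, orf_path), check=True)
```

For example, run_ocrad("page.pbm", "page.orf") would leave the bounding
box data in page.orf, if the assumptions above hold.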

Igor

On Fri, 2010-09-10 at 17:52 +0000, Martin Wildam wrote:
> @Igor: I searched for quite a while. I don't remember ocrad explicitly
> now, but I am quite sure I came across it. I also found in other places
> (blog posts) that cuneiform seems to be the only one producing hocr output.
>
> I would be glad if there were more choices. I have written a common
> file converter that currently has a plugin using ABBYY to produce OCRed
> PDFs, and I am also writing a plugin for cuneiform. If there were other
> options, I would immediately start another plugin for them.
>
> @Don: Thanks, I know VLinux - I have a visually impaired friend, and
> VLinux was also mentioned on the goinglinux podcast.
>
> Back to topic: Regarding the sandwich PDF: AFAIK a sandwich PDF has the
> text below the image, so that each piece of text is linked to the
> position on the page where it belongs. This is more than just having the
> text as one long string (as you usually get when you take the OCR result
> as text from a TIFF without producing a PDF). In theory you could then
> group text columns so that a screenreader reads them as required for the
> visually impaired (I know the issues you are talking about). But as far
> as I know, cuneiform cannot build such groups. The hocr output positions
> either each single character or a whole line. I think ABBYY FineReader
> is currently the best out there, producing really good results (but it
> costs money).
>
> @Yury: What he is basically asking is: using cuneiform + hocr2pdf,
> would he have a chance of getting a PDF that a screenreader (for
> visually impaired people) would read in the correct order? E.g. if you
> have a page with a left and a right column of text, it should read the
> left column first and then the right column, and not the first line of
> the left column, then the first line of the right column, then the
> second line of the left column, and so on.
>
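
The reading-order question quoted above can be illustrated with a small
sketch. It is not what cuneiform or hocr2pdf does; it only shows one
simple way to turn per-line bounding boxes into column-wise order. The
Line type and reading_order function are hypothetical, and it assumes
image-style coordinates (origin at the top left, y growing downwards)
and columns that do not overlap horizontally.

```python
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge
    y1: int  # bottom edge


def reading_order(lines):
    """Return lines column by column, top to bottom within each column."""
    columns = []  # each entry: {"x0": ..., "x1": ..., "lines": [...]}
    for line in sorted(lines, key=lambda l: l.x0):
        for col in columns:
            # A line joins a column if their horizontal ranges overlap.
            if line.x0 < col["x1"] and line.x1 > col["x0"]:
                col["lines"].append(line)
                col["x0"] = min(col["x0"], line.x0)
                col["x1"] = max(col["x1"], line.x1)
                break
        else:
            # No overlapping column found: start a new one.
            columns.append({"x0": line.x0, "x1": line.x1, "lines": [line]})

    ordered = []
    for col in sorted(columns, key=lambda c: c["x0"]):
        ordered.extend(sorted(col["lines"], key=lambda l: l.y0))
    return ordered


# Lines as a naive top-to-bottom scan would deliver them.
page = [
    Line("left 1", 50, 100, 300, 120),
    Line("right 1", 350, 100, 600, 120),
    Line("left 2", 50, 130, 290, 150),
    Line("right 2", 350, 130, 590, 150),
]
print([l.text for l in reading_order(page)])
# ['left 1', 'left 2', 'right 1', 'right 2']
```

A heading that spans both columns would overlap with both and end up in
whichever column it is compared against first, so real layouts need
more than this.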