Comment 34 for bug 623438

Martin Wildam (mwildam) wrote : Re: Font size not correct in merged sandvich PDF

@Igor: I searched for quite a while. I don't remember ocrad explicitly now, but I am quite sure I came across it. I also read in other places (blog posts) that cuneiform seems to be the only one producing hOCR output.

I would be glad if there were more choices. I have written a general-purpose file converter that currently has a plugin using ABBYY to produce OCRed PDFs, and I am also writing a plugin for cuneiform. If another option turned up, I would immediately start a plugin for that one too.

@Don: Thanks, I know VLinux - I have a visually impaired friend and VLinux was also mentioned on the goinglinux podcast.

Back to topic: Regarding the sandwich PDF: AFAIK a sandwich PDF has the text below the image, so that each piece of text is linked to the position on the page where it belongs. This is more than just having the text as one long string (which is what you usually get if you take the OCR result as plain text from a TIFF without producing a PDF). In theory you could then group the text into columns so that a screen reader reads them as visually impaired users need (I know the issues you are talking about). But as far as I know cuneiform cannot build such groups. The hOCR output positions either each single character or a whole line. I think ABBYY FineReader is currently the best out there, producing really good results (but it costs money).
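
To make "text linked to a position" a bit more concrete, here is a minimal sketch of pulling line positions out of hOCR. In the hOCR format an element carries its bounding box in the title attribute as "bbox x0 y0 x1 y1", and ocr_line is the class used for a text line. The sample markup below is hand-written for illustration and is not real cuneiform output; the names LineBoxParser and extract_lines are made up for this example.

```python
import re
from html.parser import HTMLParser

# hOCR stores the bounding box in the title attribute: "bbox x0 y0 x1 y1"
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class LineBoxParser(HTMLParser):
    """Collect (bbox, text) for every element whose class includes ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []       # list of ((x0, y0, x1, y1), text)
        self._bbox = None     # bbox of the line currently being read
        self._chunks = []     # text fragments of the current line
        self._depth = 0       # nesting depth inside the current line

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return  # void elements have no end tag, so they do not nest
        attrs = dict(attrs)
        if self._bbox is not None:
            self._depth += 1  # nested element inside a line (e.g. a word span)
            return
        if "ocr_line" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._chunks = []
                self._depth = 1

    def handle_endtag(self, tag):
        if self._bbox is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:  # the ocr_line element itself was closed
            self.lines.append((self._bbox, "".join(self._chunks).strip()))
            self._bbox = None

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)


def extract_lines(hocr):
    """Return a list of (bbox, text) pairs in document order."""
    parser = LineBoxParser()
    parser.feed(hocr)
    return parser.lines


# Hand-written sample, NOT actual cuneiform output.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 1000 1400">
  <span class="ocr_line" title="bbox 50 100 450 130">Left column, line 1</span>
  <span class="ocr_line" title="bbox 550 100 950 130">Right column, line 1</span>
</div>
"""

for bbox, text in extract_lines(SAMPLE):
    print(bbox, text)
```

Each line comes out with its own rectangle but with no information about which column or block it belongs to, which is the missing grouping described above.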

@Yury: What he is basically asking is: using cuneiform + hocr2pdf, would he have a chance of getting a PDF that a screen reader (for visually impaired people) would read in the correct order? For example, on a page with a left and a right column of text, it should read the whole left column first and then the right column, and not the first line of the left column, then the first line of the right column, then the second line of the left column, and so on.
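
The difference between the two reading orders can be shown with a small sketch. It assumes the lines are already available as (bbox, text) pairs and that the page has exactly two columns split at the page centre. That is a deliberate simplification for illustration, not something cuneiform or hocr2pdf is known to do, and the function name reading_order is made up here.

```python
def reading_order(lines, page_width):
    """Sort (bbox, text) lines column by column.

    Assumption: two columns, separated at the horizontal page centre.
    Real layouts need proper block detection; this is only a sketch.
    """
    mid = page_width / 2

    def key(item):
        (x0, y0, x1, y1), _text = item
        column = 0 if (x0 + x1) / 2 < mid else 1  # 0 = left, 1 = right
        return (column, y0)                       # top to bottom within a column

    return sorted(lines, key=key)


# Lines in the row-by-row order that a naive top-to-bottom pass produces.
lines = [
    ((50, 100, 450, 130), "left 1"),
    ((550, 100, 950, 130), "right 1"),
    ((50, 140, 450, 170), "left 2"),
    ((550, 140, 950, 170), "right 2"),
]

ordered = [text for _bbox, text in reading_order(lines, page_width=1000)]
print(ordered)  # ['left 1', 'left 2', 'right 1', 'right 2']
```

The input list is in the "wrong" order (left 1, right 1, left 2, right 2); the sorted result is what a screen reader user would expect to hear.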