Comment 40 for bug 623438

Martin Wildam (mwildam) wrote : Re: Font size not correct in merged sandwich PDF

I got in touch with the developer. He has a lot to do, but I sent a donation and he looked at the issue (I exchanged a few emails with him). Here is his final response so far:

On Mon, Sep 13, 2010 at 10:28, Rene Rebe <email address hidden> wrote:

Dear Martin,

the problem is that the latest cuneiform version completely changed the way the bounding box information is written, and in a way that makes no sense to me. Before, each glyph had a bounding box, which is exactly what we need to write a proper PDF. Now there is a bounding box per line (which we do not need at all) and then an additional array of x start positions. However, this can easily get out of sync with regard to multi-byte UTF-8 sequences, and also with regard to whitespace. It would also be particularly ugly to adapt the hocr2pdf HTML parser to cope with these x-position spans written out after the actual text. I doubt this is valid hOCR, and even if it is, it makes no sense to first write out the <span> with the text, and then another <span> just for the x coordinates. And for proper font size estimation we need the real y-height of the single glyphs in any case (information not present in the new format).

I suggest reverting the change that mangled the hOCR annotation in cuneiform, ... That would approximately be these:

revno: 415
committer: julien <email address hidden>
branch nick: cuneiform-linux
timestamp: Wed 2009-10-07 10:10:13 +0200
message:
 moved some tags around, now follows html spec and hocr spec. fixed russian comments that were destroyed during encoding
------------------------------------------------------------
revno: 414
committer: julien <email address hidden>
branch nick: cuneiform-linux
timestamp: Fri 2009-10-02 21:48:45 +0200
message:
 separated ocr_line and character bboxes. now follows the hocr standard using the ocr_cinfo tag for char bboxes
------------------------------------------------------------
revno: 413
author: Dmitry Polevoy
committer: julien <email address hidden>
branch nick: cuneiform-linux
timestamp: Thu 2009-10-01 17:07:51 +0200
message:
 hocr format now supports ocr_line. Replaced cuneiform_src/Kern/rout/src/html.cpp to the patch submitted in the cuneiform mailing list the 24th of February by Dmitry Polevoy. Changed %d to %l in a few sprintf statements in html.cpp
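
To illustrate the sync problem Rene describes, here is a minimal Python sketch. It is not the hocr2pdf parser and does not use cuneiform's actual output. The sample line, the x values, and the helper names (glyph_units, align) are all made up for illustration, and it assumes the x array holds one entry per non-whitespace glyph.

```python
def glyph_units(text, count_whitespace=True):
    """Return the characters of a line that are expected to own an x position."""
    return [ch for ch in text if count_whitespace or not ch.isspace()]


def align(text, x_starts, count_whitespace=True):
    """Pair each glyph with its x start; fail loudly when the counts disagree."""
    glyphs = glyph_units(text, count_whitespace)
    if len(glyphs) != len(x_starts):
        raise ValueError(
            f"{len(glyphs)} glyphs but {len(x_starts)} x positions"
        )
    return list(zip(glyphs, x_starts))


# Hypothetical OCR line ("groesse ok" written with o-umlaut and sharp s).
line = "gr\u00f6\u00dfe ok"
# Hypothetical x start positions: one per non-whitespace glyph (assumption).
x_starts = [10, 22, 34, 46, 58, 82, 94]

n_bytes = len(line.encode("utf-8"))   # what a byte-oriented parser would count
n_chars = len(line)                   # code points, including the space
pairs = align(line, x_starts, count_whitespace=False)

print(n_bytes, n_chars, len(x_starts))
print(pairs)
```

The three counts differ for the same line, so a parser that walks bytes, or that counts the space as a glyph, would attach x positions to the wrong characters. With one bounding box stored per glyph, as in the older format, no such alignment step is needed.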