I will have to change the ocr_cinfo span anyway, both to fix the whitespace bbox and because I have noticed that cuneiform occasionally emits control codes as part of the text. I am not sure when I will have time to make the changes, but in any case we could agree on what the format should be and then someone could implement it.
I had a look at tesseract 3.0: it outputs a bbox at the word level, although it uses an "ocr_word" tag, which does not exist in the specification.
What about defining an "ocrx_word" (an engine-specific element) as a run of characters that each have a positive-area bbox?
Then, rather than placing the "ocr_cinfo" at the "ocr_line" level, we would place it at the "ocrx_word" level.
This way both the word bbox and the character bboxes are given and, by definition, only for characters with valid bboxes.
Example:
<span class='ocr_line' id='line_1' title="bbox 0 0 45 20"><span class='ocrx_word' id='xword_1' title="bbox 0 0 20 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">hello</span></span><span> </span><span class='ocrx_word' id='xword_2' title="bbox 25 0 45 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">world</span></span></span>
(Note the whitespace, which is not part of any ocrx_word because cuneiform would produce an incorrect bbox for it.)
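To make the proposal concrete, here is a rough Python sketch of how an engine wrapper might produce such a line from per-character results. It is only an illustration under assumptions: the function names (`render_line`, `has_positive_area`, `union`), the input shape (a list of `(character, (x0, y0, x1, y1))` pairs), the "0 0 0 0" fallback for a line with no valid characters, and the choice to silently drop control codes are all mine, not something cuneiform or the spec prescribes. The class name follows the "ocrx_word" spelling used in the proposal.

```python
import unicodedata
from html import escape


def has_positive_area(bbox):
    """True if the box has non-zero width and height."""
    x0, y0, x1, y1 = bbox
    return x1 > x0 and y1 > y0


def union(bboxes):
    """Smallest box enclosing all the given boxes."""
    return (min(b[0] for b in bboxes), min(b[1] for b in bboxes),
            max(b[2] for b in bboxes), max(b[3] for b in bboxes))


def render_line(chars, line_id="line_1"):
    """Render (char, bbox) pairs as one ocr_line span (hypothetical helper)."""
    parts, word, line_boxes = [], [], []
    n_words = 0

    def flush_word():
        # Close the word collected so far, if any, and emit its spans.
        nonlocal n_words
        if not word:
            return
        n_words += 1
        boxes = [bbox for _, bbox in word]
        line_boxes.extend(boxes)
        text = escape("".join(ch for ch, _ in word))
        word_bbox = " ".join(str(v) for v in union(boxes))
        char_bboxes = " ".join(str(v) for bbox in boxes for v in bbox)
        parts.append(
            f"<span class='ocrx_word' id='xword_{n_words}' title=\"bbox {word_bbox}\">"
            f"<span class='ocr_cinfo' title=\"x_bboxes {char_bboxes}\">{text}</span></span>"
        )
        word.clear()

    for ch, bbox in chars:
        if unicodedata.category(ch) == "Cc" and not ch.isspace():
            continue  # assumption: non-whitespace control codes are dropped
        if ch.isspace() or not has_positive_area(bbox):
            # Whitespace and invalid boxes end the word and stay outside it.
            flush_word()
            parts.append(f"<span>{escape(ch)}</span>")
        else:
            word.append((ch, bbox))
    flush_word()

    # Assumption: the line bbox is the union of the valid character boxes.
    line_bbox = " ".join(str(v) for v in union(line_boxes)) if line_boxes else "0 0 0 0"
    return (f"<span class='ocr_line' id='{line_id}' title=\"bbox {line_bbox}\">"
            + "".join(parts) + "</span>")


# Made-up input: two characters, a space with a bogus bbox, two more characters.
sample = [("h", (0, 0, 10, 20)), ("i", (10, 0, 20, 20)), (" ", (0, 0, 0, 0)),
          ("y", (25, 0, 35, 20)), ("o", (35, 0, 45, 20))]
print(render_line(sample))
```

Since the word bbox is derived from the character boxes, a word can never end up with a box that disagrees with its characters, and the whitespace bbox problem is sidestepped rather than fixed.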
Does this sound OK, or do you have suggestions?