Cuneiform for Linux

Bug #623438
Comment #47

julien (julien-aubert) wrote on 2010-10-17: Re: Font size not correct in merged sandvich PDF

Let me first summarize the cuneiform-specific issues and proposed changes from Martin Wildam's conversation with Rene Rebe.

1) Revisions 413 to 415 completely changed the way bounding box info is written: there is now one bbox per line plus an additional array of x start positions, and the y height needed for proper font size estimation is missing.
2) The per-character bboxes can easily get out of sync with the text, both with regard to multi-byte UTF-8 sequences and with regard to whitespace.
3) Rene doubts that writing out x positions after the actual text is valid hOCR output.
4) Rene argues that it makes no sense to first write out one element with the text and then another just for the x coordinates.
(Let me know if other cuneiform-specific issues were mentioned.)

Item 2 is a genuine issue. I will create a separate bug for it, as it is internal to cuneiform.

In my view, items 1, 3 and 4 are not errors in cuneiform but a matter of interpreting the hOCR spec.
Official hOCR spec: https://docs.google.com/View?docid=dfxcv4vc_67g844kf
I believe cuneiform does the right thing with the ocr_line, ocr_cinfo and the x_bboxes. More details below.

Perhaps Rene could have a look here for help with parsing the hOCR output from cuneiform:
http://bazaar.launchpad.net/~hocr-parsers/hocr-parsers/main/files

Unless a violation of the hOCR spec regarding this topic is found, I think this bug should be closed.

Details
1) Incorrect: the y height is available.
Output from rev412:
B
Y
G
G
N
A
D
E
R
Output in cuneiform 1.0 (or after rev415):

BYGGNADER 
<span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">

The assumption that the x_bboxes hold only x positions is incorrect: each character contributes four values, so the vertical extent is included as well. The official specification for the hOCR format can be found here:
https://docs.google.com/View?docid=dfxcv4vc_67g844kf
My understanding is that the above is the correct way for hOCR output.
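To make the point about y height concrete, here is a minimal Python sketch that splits the x_bboxes value quoted above into per-character boxes and derives a height for each. It assumes the values come in groups of four in the order x0 y0 x1 y1, and that a group of -1 values means "no box for this character" (that second part is only my reading of the sample output). The function names are made up for illustration and are not part of cuneiform.

```python
def parse_x_bboxes(title):
    """Split an 'x_bboxes ...' title value into (x0, y0, x1, y1) tuples.

    Assumes four integers per character, in the order x0 y0 x1 y1.
    """
    fields = title.split()
    if not fields or fields[0] != "x_bboxes":
        raise ValueError("not an x_bboxes property")
    numbers = [int(n) for n in fields[1:]]
    if len(numbers) % 4 != 0:
        raise ValueError("expected four coordinates per character")
    return [tuple(numbers[i:i + 4]) for i in range(0, len(numbers), 4)]


def glyph_heights(boxes):
    """Return y1 - y0 for every box that has real (non-negative) coordinates."""
    return [y1 - y0 for (x0, y0, x1, y1) in boxes
            if min(x0, y0, x1, y1) >= 0]


# The title value from the cuneiform 1.0 output quoted above.
title = ("x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 "
         "434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 "
         "514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ")

boxes = parse_x_bboxes(title)
heights = glyph_heights(boxes)
print(len(boxes), max(heights))
```

A consumer could take the maximum or median of these heights as a starting point for font size estimation; which statistic works best is left open here.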

2) I do not understand the comment that "it can easily get out of sync": there is exactly one bbox per character on the line.
However, I can confirm that there is an issue with whitespace and control characters: they are counted among the characters on the line, but their bounding boxes are not correct. I will open this as a separate bug. It needs to be checked whether this should be special-cased in the hOCR output or whether it is an issue upstream in cuneiform (no bounding box being provided for whitespace, and control characters being produced in the recognized text).
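As an illustration of where the "out of sync" worry may come from, the following Python sketch pairs characters with boxes by code point and shows that a byte count would differ for non-ASCII text. The sample text and coordinates are invented for the example, and pair_chars_with_boxes is a hypothetical helper, not something cuneiform or any parser provides.

```python
def pair_chars_with_boxes(text, boxes):
    """Pair each character (code point) of the line with its bounding box.

    Raises ValueError when the counts differ, which is the symptom of the
    text and the box list having drifted out of sync.
    """
    chars = list(text)  # a Python str iterates by code point, not by byte
    if len(chars) != len(boxes):
        raise ValueError(
            "%d characters but %d boxes" % (len(chars), len(boxes)))
    return list(zip(chars, boxes))


# Invented sample: two letters plus a trailing space without a real box.
text = "\u00d6l "
boxes = [(10, 5, 20, 30), (22, 5, 28, 30), (-1, -1, -1, -1)]

pairs = pair_chars_with_boxes(text, boxes)

# The same line is four bytes long in UTF-8, so a consumer that indexes
# boxes by byte offset would run past the end of the box list.
byte_count = len(text.encode("utf-8"))
print(len(pairs), byte_count)
```

With one box per code point the pairing stays aligned; the mismatch only appears if a consumer counts bytes, or if whitespace is dropped from the text but not from the box list.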

3 and 4)
I find the specification somewhat difficult to interpret at times, but it is my understanding that character bbox info goes within the ocr_line element. Whether it goes before or after the text is irrelevant. E.g.

 BYGGNADER 
 <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">

and

 <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">
 BYGGNADER 

are equally correct; what matters is the association with the correct line.
So unless it can be shown that the hOCR output breaks the hOCR spec, I would not change it in cuneiform.
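A parser does not need to care about the order either. The sketch below uses Python's standard html.parser to collect, for each ocr_line span, its text and the title of any ocr_cinfo span nested inside it. The two markup strings are simplified, hypothetical examples with closing tags added and a shortened x_bboxes value; they are not verbatim cuneiform output, and LineCollector is just an illustrative name.

```python
from html.parser import HTMLParser


class LineCollector(HTMLParser):
    """Collect text and x_bboxes per ocr_line, regardless of their order."""

    def __init__(self):
        super().__init__()
        self.lines = []  # one dict per ocr_line span
        self.stack = []  # class attribute of every currently open <span>

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class", "")
        if cls == "ocr_line":
            self.lines.append({"text": "", "x_bboxes": None})
        elif cls == "ocr_cinfo" and "ocr_line" in self.stack:
            # Associate the box list with the enclosing line.
            self.lines[-1]["x_bboxes"] = attrs.get("title")
        self.stack.append(cls)

    def handle_endtag(self, tag):
        if tag == "span" and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Only text directly inside an ocr_line span counts as line text.
        if self.stack and self.stack[-1] == "ocr_line":
            self.lines[-1]["text"] += data


def collect(markup):
    parser = LineCollector()
    parser.feed(markup)
    parser.close()
    return parser.lines


cinfo = "<span class='ocr_cinfo' title='x_bboxes 363 1253 382 1279'></span>"
text_first = "<span class='ocr_line'>BYGGNADER " + cinfo + "</span>"
boxes_first = "<span class='ocr_line'>" + cinfo + "BYGGNADER </span>"

print(collect(text_first) == collect(boxes_first))
```

Both inputs yield the same line record, because the association is made through nesting inside the ocr_line span and not through position relative to the text.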
