Font size not correct in merged sandvich PDF

Bug #623438 reported by Martin Wildam on 2010-08-24
58
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Cuneiform for Linux
Undecided
Unassigned
exactimage (Ubuntu)
Undecided
Unassigned

Bug Description

After processing with Cuneiform for Linux 1.0.0 and hOCR to PDF converter, version 0.7.4 (should be the most current version) I get a sandvich pdf that looks nice until I select text.

See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf.
The effect is shown in screen087.png

For another file (Test10pages.pdf) the effect is either worse - basically I cannot really select any more text to copy because I only can guess where to move with the mouse.

It looks like that the font size in the HTML is somehow not correct - I am not an expert, but this link might help you:
http://www.emdpi.com/fontsize.html

Martin Wildam (mwildam) wrote :
Martin Wildam (mwildam) wrote :

I am using Ubuntu 10.04.1 with kernel 2.6.32-24-generic #41-Ubuntu SMP Thu Aug 19 01:12:52 UTC 2010 i686 GNU/Linux
although same happened with earlier kernel also (well, I don't assume it has anything to do with the kernel).

I have also installed most current versions of these packages:
bzr cmake build-essential imagemagick libmagick++2 libmagick++-dev pdftk subversion build-essential libtiff-tools

Serj Poltavskiy (serge-uliss) wrote :

what scipt do you use to convert from hOCR to PDF?
in hOCR output produced by cuneiform there's no font size information, only bounding boxes of lines and chars.

Martin Wildam (mwildam) wrote :

I don't use a script, I call it from a Java program.

Here's are the commandlines I am using:
cuneiform -l en -f hocr -o htmlfilename imagefilename
hocr2pdf -s -i imagefilename -o pdffilename

Martin Wildam (mwildam) wrote :

It was my assumption that there must be something with the font size - sorry, if I have irritated you.

Martin Wildam (mwildam) wrote :

I looked at the HTML - indeed there is no font height or size information there. So I assume, the coordinates of the boxes are simply inaccurate. - Or hocr2pdf is doing something wrong when merging the HTML with the image...

When I select Text in the result PDF it looks like the box is a little too small (missing a piece above), but for the Test10pages.pdf the effect is far more extreme. See here: http://www.youtube.com/watch?v=0d8_T-vV_Ak
In that case it selects in reality the line above the line I really want to select (the "für" is recognized as "ii" which is a different story).

Martin Wildam (mwildam) wrote :

I think (one of) the question(s) is: What does a PDF viewer rely on, when selecting text? Do they take the bounding boxes or do they display selected text using the given font information (which is missing in your HTML output)?

It could be that the viewer applications would need the font information and you don't provide that in the html.
And another question is if hocr2pdf would be able to interpret the font information if it would be given.

Basically without getting this correct, I don't gain much of having a sandvich PDF, so I find this problem quite essential.

Martin Wildam (mwildam) wrote :

If the font information is needed in the HTML then I guess my originally posted link would be important to calculate proper font size. In addition there is not only font size and height, but also spacing between letters and words...

Jussi Pakkanen (jpakkane) wrote :

AFAIK the bounding box of every single letter is written to the HOCR file, so generating proper info from that is the PDF generator's job.

Martin Wildam (mwildam) wrote :

But how can the hocr2pdf build the sandvich PDF correctly, if no font information is given?
For sure this can be provided only by the OCR because you know the font face and and other parameters. If you do not provide that information, the hocr2pdf can't find it out just of the given bounding boxes.

Jussi Pakkanen (jpakkane) wrote :

What is the specific piece of information hocr2pdf uses to generate these letter locations?

Martin Wildam (mwildam) wrote :

I have contacted exactcode (the developers of hocr2pdf - this is a german company and I speak german) and asked them for explanation (also sent them a link to this bug). So I hope for more information.

MAndree (hapm) wrote :

Same problem here on ubuntu 10.04 with cuneiform 1.0.0 and hocr2pdf 0.7.4. I compared the information in the hocr file with the position of the text in the pdf, and whatever hocr2pdf does, the text in the pdf doesn't match the boundingboxes defined in the .hocr file. So i think this is a problem of hocr2pdf. Not sure if this is related to how the hocr output of cuneiform is formatted, as i have read that there are many ways to attach the boundingboxes to the text (using own tags, using attributes in the tag enclosing the text directly, ...). Would be nice to know if hocr2pdf can handle hocr output from other ocr engines atm, and if so, where their hocr files are different to the cuneiform output.

MAndree (hapm) wrote :

Compiled hocr2pdf 0.8.2 from source now and having the same results as with 0.7.4.

Jussi Pakkanen (jpakkane) wrote :

I'm closing this as the symptoms seem to point to hocr2pdf. If new information is discovered, feel free to reopen.

Changed in cuneiform-linux:
status: New → Invalid
Martin Wildam (mwildam) wrote :

@MAndree: You mean after compiling from source you have the same wrong results as with 0.7.4? - What was the last version it worked with?

@JussiP: For me it is not yet clear how you can be sure it is an issue of hocr2pdf?
How should they now the correct font for selection to use when the OCR is not giving it?

Jussi Pakkanen (jpakkane) wrote :

I'm not sure, that's why I instructed people to reopen if new information arises.

The main problem is that AFAIK Cuneiform does not know "font information" that people seem to want at the time the output text is generated. It only knows each letter, its bounding box and maybe some other stuff. Adding document-wide font info is probably quite a lot of work.

Martin Wildam (mwildam) wrote :

The strange thing is: That already worked with an earlier version of CuneiForm and/or hocr2pdf.

Yury V. Zaytsev (zyv) wrote :

Why wouldn't you try to bisect to find out which was the latest working combination and then attempt to quantify the changes between the output produced by Cuneiform? If the output turns out to be identical it's an hocr2pdf fault.

Martin Wildam (mwildam) wrote :

Good idea - that because I have an old image of that machine still around - will restore that into a test machine and post the results.

Martin Wildam (mwildam) wrote :

OK, I retried on the old image (Ubuntu 9.10). I have there the hocr2pdf v.07.6 (compiled from source) where on my 10.04 I had installed the "exactimage" package which contains v0.7.4 of hocr2pdf and I have the cuneiform 0.8.0 there (also compiled from source).

I removed the package from the distribution and I tried compiling different versions of the hocr2pdf from source - even the 0.7.6. Result on my Ubuntu 10.04 workstation is always the same result as originally complained. See attachment screenshot.

Martin Wildam (mwildam) wrote :

Attached you find the html output with cuneiform v0.8.0 and cuneiform v1.0.0.
Not only that the v0.8.0 containes line breaks so that the html is more human-readable, the v1.0.0 looks quite totally different.

So I would say, that this is a cuneiform issue!

MAndree (hapm) wrote :

Well, looks like hocr2pdf needs the boundingbox informations per character where cuneiforms 1.0.0 output has changed to boundingboxes per line. This is an allowed format as described by hocr specification. So its not cuneiforms fault, but it would be nice to get a switch that forces cuneiform to give boundingboxes per character instead of per line. That would be a feature request so and no issue.

Jussi Pakkanen (jpakkane) wrote :

Cuneiform outputs both boundingboxes:

<p><span class='ocr_line' id='line_1' title="bbox 36 93 580 123">This is a lot of 12 point text to test the <span class='ocr_cinfo' title="x_bboxes 36 93 55 117 57 93 71 117 ... [and so on and so on]

The line is in the first span and the letters in the second.

Martin Wildam (mwildam) wrote :

Bug filed for exactimage as Bug #632524 - I tried to change the package for this bug but was not possible.

Martin Wildam (mwildam) wrote :

I have discussed this with somebody who is an expert in PDF and my current understanding is that for creating the PDF the underlying text behind the image displayed needs font size, spacing etc information to be correctly displayed in the viewer.

I noticed that not only the selection in the viewer does not work correctly. Also a lot of words are not found using the internal search functionality of viewers (tested with Evince and Adobe Acrobat Reader).

Side note: If I extract the full text using a PDF library I get a correct looking text (words separated by space, no spaces between words).

I think that creating a correct sandvich PDF is crucial and wonder why not more people are interested in this. But I also think, that it is not easy. I think it would be necessary to get experts in OCR, experts in PDF and experts in fonts together to solve this. - The key missing thing IMHO is to get font metric (font name, size, spacing, ...) information when only having the bounding boxes and contained text. Therefore I posted also the link above which I find important.

Yury V. Zaytsev (zyv) wrote :

What I can not understand is why you wouldn't file a bug against hocr2pdf.

As you discovered, Cuneiform exports bboxes for both lines and characters, so it shouldn't be its fault. So now what we can do for you? You are not going to get the font metrics. It's very ambiguous and lots of work. bboxes are by far more than enough to approximately fit the characters.

And as surprising as it might sound, not everybody is interested in creating sandwich PDFs. E.g. I don't care. So you have to push it, if you want to get it solved.

Martin Wildam (mwildam) wrote :

I have filed a bug there at exactimage (which is the package containing hocr2pdf) - see comment #25

With my last comment I wanted to point out that for creating a proper final sandvich PDF more information might be necessary - maybe bounding boxes for words also - and maybe you could identify Descent and Ascent information for better font choosing.

Anyway, I am still waiting for any response from exactimage developer(s) to tell their view of the things...

I currently cannot do much more than filing bugs and testing. And I even tried to reach them by phone.
It it has been determined what needs to be done, maybe there is an option to pay for implementation/fix, but currently I don't have an idea if and how the problem can be solved (and approximately what amount of work).

Yury V. Zaytsev (zyv) wrote :

The bug against exactimage is not going to be processed, as this package is autosynced from Debian, so the way it will work is as follows: one day someone from Ubuntu will report it against Debian, and few years later a Debian Developer will try to report it to upstream.

It is possible to change package, I just did it and marked your other bug as a duplicate of this one.

If I were you I would come here :

http://www.exactcode.de/site/open_source/exactimage/feedback/

And try to bring this to the attention of the mailing list members.

It's quite rude to call individual developers unless you used the company number (hopefully). If you are willing to pay for the fix maybe pesting the company is not such a bad idea.

Martin Wildam (mwildam) wrote :

Yes, I used the company number. And I already sent them an email. So far now response.
I followed now your advice to subscribe to the mailing list and will report the issue there - we will see if this works.
Thank you for your assistance.

Martin,

Have you tried other OCR engines which can generate hOCR output?
I'm not sure all of them can but here are a few free and open source OCR
engines I've run on Linux:
GOCR
OCRAD
Tesseract

Does this issue affect them as well?

Best,
Igor

On Fri, 2010-09-10 at 11:45 +0000, Martin Wildam wrote:
> Yes, I used the company number. And I already sent them an email. So far now response.
> I followed now your advice to subscribe to the mailing list and will report the issue there - we will see if this works.
> Thank you for your assistance.
>

Download full text (4.2 KiB)

I am not an expert in PDF internal formats at this point. I may need to
start learning. I also have an application, actually a long bash script,
that I want to extend it's capabilities to output several scanned pages that
have had OCR performed and merge the text with the original image in a PDF.
The package is called speedy-ocr.

Does having a sandwhiched PDF mean that the text is then editable in Adobe
as opposed to just attached as a searchable, structured note? I am writing
this script to simplify scanning and OCR functionality for the blind and
visually impaired community. Screen readers, Orca in this case, will need
structured text so that the text can be read in the appropriate order, if
possible. I do not know yet how much of the structure can be retreived from
cuneiform, if any. For our purposes, having the font information is not
necessary for most users. They just need to be able to retreive and store
fairly accurate text, in the correct reading order, for each page. Is this
type of merge different than a sandwhiched PDF? Is this simply attached
searchable text?

We have a distribution of Ubuntu 10.0.4 Lucid that configures several
accessibility systems and a group of developers world wide are attempting to
fix gnome applications for accessibility. Most of the fixes get sent
upstream and incorporated into Ubuntu, partly because Luke is now using the
Vinux distribution as a testbed. The distribution is called Vinux, and it's
home page is vinux.org.uk. Our repositories are also on LaunchPad.net.

Don Marang

There is just so much stuff in the world that, to me, is devoid of any real
substance, value, and content that I just try to make sure that I am working
on things that matter.
Dean Kamen

--------------------------------------------------
From: "Martin Wildam" <email address hidden>
Sent: Friday, September 10, 2010 4:05 AM
To: <email address hidden>
Subject: [Cuneiform] [Bug 623438] Re: Font size not correct in
mergedsandvich PDF

> I have discussed this with somebody who is an expert in PDF and my
> current understanding is that for creating the PDF the underlying text
> behind the image displayed needs font size, spacing etc information to
> be correctly displayed in the viewer.
>
> I noticed that not only the selection in the viewer does not work
> correctly. Also a lot of words are not found using the internal search
> functionality of viewers (tested with Evince and Adobe Acrobat Reader).
>
> Side note: If I extract the full text using a PDF library I get a
> correct looking text (words separated by space, no spaces between
> words).
>
> I think that creating a correct sandvich PDF is crucial and wonder why
> not more people are interested in this. But I also think, that it is not
> easy. I think it would be necessary to get experts in OCR, experts in
> PDF and experts in fonts together to solve this. - The key missing thing
> IMHO is to get font metric (font name, size, spacing, ...) information
> when only having the bounding boxes and contained text. Therefore I
> posted also the link above which I find important.
>
> --
> Font size not correct in merged sandvich PDF
> htt...

Read more...

Yury V. Zaytsev (zyv) wrote :

I don't understand your question. Can you formulate it using no more than 75 words?

Martin Wildam (mwildam) wrote :

@Igor: I searched quite a while - don't remember ocrad explicitely now but I am quite sure I came across it. I also found at other places (blog posts) that cuneiform seems to be the only one producing hocr output.

I would be glad if there would be more choices. I have written a common file converter with currently plugin using ABBYY to produce ocred pdf and also writing a plugin for cuneiform. I would be glad if there would be other options - I would immediately start another plugin for that one.

@Don: Thanks, I know VLinux - I have a visually impaired friend and VLinux was also mentioned on the goinglinux podcast.
Back to topic: Regarding the sandvich PDF: ASFAIK sandvich PDF means to have the text below the image so that the text is linked to the position on the page where it belongs. This is more than just having the text as just a long string (as usually delivered if you get the OCR result as text from a TIFF without producing a PDF). In theory you could then group text columns for being read by a screenreader as required for the impaired (I know of these issues you are talking about). But as far as I know cuneiform cannot build such groups. The hocr output is positioning each single character or a whole line. I think ABBYY Finereader is currently the best out there producing really good results (but it costs money).

@Yury: What he is asking basically is: Using cuneiform + hocr2pdf - would he have a chance to get a PDF output that using a screenreader (for visually impaired people) would read everything in the correct order (e.g. if you have a page with left and right column of text it should result in reading first the left column and then the right column and not first line of left column then first line of right column, second line of left column and so on...)

Martin,

I'm not using this functionality myself, so you most likely know best,
but OCRAD is producing ORF output with "-x" command-line option.
According to the README ORF file will contain bounding boxes for OCRed
characters and lines.

Igor

On Fri, 2010-09-10 at 17:52 +0000, Martin Wildam wrote:
> @Igor: I searched quite a while - don't remember ocrad explicitely now
> but I am quite sure I came across it. I also found at other places (blog
> posts) that cuneiform seems to be the only one producing hocr output.
>
> I would be glad if there would be more choices. I have written a common
> file converter with currently plugin using ABBYY to produce ocred pdf
> and also writing a plugin for cuneiform. I would be glad if there would
> be other options - I would immediately start another plugin for that
> one.
>
> @Don: Thanks, I know VLinux - I have a visually impaired friend and VLinux was also mentioned on the goinglinux podcast.
> Back to topic: Regarding the sandvich PDF: ASFAIK sandvich PDF means to have the text below the image so that the text is linked to the position on the page where it belongs. This is more than just having the text as just a long string (as usually delivered if you get the OCR result as text from a TIFF without producing a PDF). In theory you could then group text columns for being read by a screenreader as required for the impaired (I know of these issues you are talking about). But as far as I know cuneiform cannot build such groups. The hocr output is positioning each single character or a whole line. I think ABBYY Finereader is currently the best out there producing really good results (but it costs money).
>
> @Yury: What he is asking basically is: Using cuneiform + hocr2pdf -
> would he have a chance to get a PDF output that using a screenreader
> (for visually impaired people) would read everything in the correct
> order (e.g. if you have a page with left and right column of text it
> should result in reading first the left column and then the right column
> and not first line of left column then first line of right column,
> second line of left column and so on...)
>

On Fri, 10 Sep 2010 Martin Wildam <email address hidden> wrote:

> @Igor: I searched quite a while - don't remember ocrad explicitely now
> but I am quite sure I came across it. I also found at other places (blog
> posts) that cuneiform seems to be the only one producing hocr output.

This was never true.

For the present status cf. e.g.

    http://groups.google.com/group/hocr

Regards

JSB

--
                     ,
dr hab. Janusz S. Bien, prof. UW - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - Warsaw University (Department of Formal Linguistics)
<email address hidden>, <email address hidden>, http://fleksem.klf.uw.edu.pl/~jsbien/

Martin Wildam (mwildam) wrote :

I could not find any documentation about how to get the hocr output back when I tested those OCR engines and after looking back now I can't find any documentation for ocropus or tesseract on how to produce the hocr html files.

Yury V. Zaytsev (zyv) wrote :

I am not aware of any open source OCR software that is doing multi-column document recognition. It's more of a segmentation task, rather than recognition itself, so it should be rather implemented in a front-end, such as OCRopus. If you have a linear text flow, sandwich PDFs can be read by a screen reader smart enough in a reasonable way.

Apart from already mentioned Finereader, old Cuneiform Windows freeware seems to be able to do multi-column.

Thanks for the response and info. The main issue with most front ends, Open
Source or commercial, is that they all tend to be very graphical and not
accessible to screen readers. Even reading well formed PDF documents
directly has been an issue until lately. I think the latest evince in gnome
finally starts to address this problem.

Don Marang

There is just so much stuff in the world that, to me, is devoid of any real
substance, value, and content that I just try to make sure that I am working
on things that matter.
Dean Kamen

--------------------------------------------------
From: "Yury V. Zaytsev" <email address hidden>
Sent: Saturday, September 11, 2010 9:41 AM
To: <email address hidden>
Subject: [Cuneiform] [Bug 623438] Re: Font size not correct in
mergedsandvich PDF

> I am not aware of any open source OCR software that is doing multi-
> column document recognition. It's more of a segmentation task, rather
> than recognition itself, so it should be rather implemented in a front-
> end, such as OCRopus. If you have a linear text flow, sandwich PDFs can
> be read by a screen reader smart enough in a reasonable way.
>
> Apart from already mentioned Finereader, old Cuneiform Windows freeware
> seems to be able to do multi-column.
>
> --
> Font size not correct in merged sandvich PDF
> https://bugs.launchpad.net/bugs/623438
> You received this bug notification because you are a member of Cuneiform
> Linux, which is the registrant for Cuneiform for Linux.
>
> Status in Linux port of Cuneiform: Invalid
> Status in “exactimage” package in Ubuntu: New
>
> Bug description:
> After processing with Cuneiform for Linux 1.0.0 and hOCR to PDF converter,
> version 0.7.4 (should be the most current version) I get a sandvich pdf
> that looks nice until I select text.
>
> See the sample 5AADFEE1-0000.* files in the attachment and the result.pdf.
> The effect is shown in screen087.png
>
> For another file (Test10pages.pdf) the effect is either worse - basically
> I cannot really select any more text to copy because I only can guess
> where to move with the mouse.
>
> It looks like that the font size in the HTML is somehow not correct - I am
> not an expert, but this link might help you:
> http://www.emdpi.com/fontsize.html
>
>
>
> _______________________________________________
> Mailing list: https://launchpad.net/~cuneiform
> Post to : <email address hidden>
> Unsubscribe : https://launchpad.net/~cuneiform
> More help : https://help.launchpad.net/ListHelp
>

Martin Wildam (mwildam) wrote :

I have got in touch with the developer - he has very much todo, but I sent a donation and he looked at the issue (I exchanged a few emails with him) - here is his final response so far:

On Mon, Sep 13, 2010 at 10:28, Rene Rebe <email address hidden> wrote:

Dear Martin,

the problem is that the latest cuneiform version completely changed the way the bounding box information is written. Actually in a way that makes no sense to me. Before each glyph had a bounding box, which is exactly what we need to write a proper PDF. Now they have a bounding box per line (we we do not need at all) and then an additional array of x start position. However, this can easily get out of sync in regard to multi-byte utf-8 sequences, and also in regards to whitespace. It would also be particularly ugly to adapt the horc2pdf HTML parser to cope with this x position spans written out after the actual text. I doubt this is valid hOCR, and even if it is, it makes no sense to first write out the <span> with the text, and then another <span> just for the x coordinates. And for proper font size estimation we even need the real y-height of the single glyphs in any case (information not present in the new format).

I suggest to revert the change that mangled the hOCR annotation in cuneiform, ... That would approximately be these:

revno: 415
committer: julien <email address hidden>
branch nick: cuneiform-linux
timestamp: Wed 2009-10-07 10:10:13 +0200
message:
 moved some tags around, now follows html spec and hocr spec. fixed russian comments that were destroyed during encoding
------------------------------------------------------------
revno: 414
committer: julien <email address hidden>
branch nick: cuneiform-linux
timestamp: Fri 2009-10-02 21:48:45 +0200
message:
 separated ocr_line and character bboxes. now follows the hocr standard using the ocr_cinfo tag for char bboxes
------------------------------------------------------------
revno: 413
author: Dmitry Polevoy
committer: julien <email address hidden>
branch nick: cuneiform-linux
timestamp: Thu 2009-10-01 17:07:51 +0200
message:
 hocr format now supports ocr_line. Replaced cuneiform_src/Kern/rout/src/html.cpp to the patch submitted in the cuneiform mailing list the 24th of February by Dmitry Polevoy. Cha
nged %d to %l in a few sprintf statements in html.cpp

Yury V. Zaytsev (zyv) on 2010-09-13
Changed in cuneiform-linux:
status: Invalid → Confirmed
Yury V. Zaytsev (zyv) wrote :

I am not entirely convinced about his arguments about UTF-8 and whitespace (sounds like just being lazy to adopt the parser to hOCR specs), but the loss of information about y-coordinates, which used to be present in the output of the previous versions sounds very much like a bug (if it's indeed the case).

I think that hOCR specification has to be studied in order to find out what are the actual requirements and if they can be interpreted liberally to a certain extent, maybe this could be put to advantage of hOCR developer.

Martin Wildam (mwildam) wrote :

How will you proceed now regarding this issue?

Yury V. Zaytsev (zyv) wrote :

I reopened the bug and maybe Jussi or someone who cares will have a look.

Martin Wildam (mwildam) wrote :

Here is another note from René:

On Mon, Sep 13, 2010 at 11:53, Rene Rebe <email address hidden> wrote:

Note that I wrote the initial hOCR annotation in cuneiform, ... :-)

If they desperately want to keep this new format, one could add 2 different hOCR formats, like hocr and hocr-detailed or so to cuneiform.

However, one could also use my initial style bbox per character annocation and their new whole span bbox. However, I see no reason for thiis imprecise and hard to parse span after span x_bboxes, per glyph bbox's would retain compatibility with hocr2pdf and also get higher precision as the x_bboxes do not contain the individual heights, ...

       René

Ahmad Jagot (ahmadjagot) wrote :

This bug also affects me. Would it be possible to add a command-line switch which allows reverting to the older bounding box format?

Have downgraded to 0.8 for the time being...

jswinner (jswinner) wrote :

Similar problems when using Ocropus

julien (julien-aubert) wrote :
Download full text (4.0 KiB)

Let me first summarize the cuneiform specific issues / proposed changes from Martin Wildam's conversation with Rene Rebe.

1) rev 413 to 415 completely changed the way bounding box info is written, now bbox per line and additional array of x start position, missing y height for proper font size estimation
2) bbox per char can easily get out of sync in regard to multi-byte utf-8 sequences and also in regards to whitespace
3) Rene doubts that writing out x positions after actual text is valid hOCR output
4) Rene propose it makes no sense to first write out <span> with the text and then another <span> for just the x coordinates
(Let me know if there were other specific cuneiform issues mentioned)

nr2 is an issue - will create a separate bug for this as it is cuneiform internal.

In my view, 1,3,4 are not an error of cuneiform but an interpretation issue of the hOCR spec.
Official hOCR spec: https://docs.google.com/View?docid=dfxcv4vc_67g844kf
I believe cuneiform does the right thing with the ocr_line, ocr_cinfo and the x_bboxes. More details below.

Perhaps Rene could have a look here for help on parsing the hocr output from cuneiform:
http://bazaar.launchpad.net/~hocr-parsers/hocr-parsers/main/files

Unless a violation of the hOCR spec regarding this topic is found, I think this bug should be closed.

Details
1) incorrect - y height is available:
Output from rev412:
 <span title="bbox 363 1253 382 1279">B</span>
 <span title="bbox 383 1254 407 1281">Y</span>
 <span title="bbox 409 1255 431 1283">G</span>
 <span title="bbox 434 1256 458 1284">G</span>
 <span title="bbox 460 1258 485 1285">N</span>
 <span title="bbox 486 1260 511 1286">A</span>
 <span title="bbox 514 1261 538 1287">D</span>
 <span title="bbox 541 1260 560 1289">E</span>
 <span title="bbox 561 1261 581 1289">R</span>
Output in cuneiform 1.0 (or after rev415):
 <span class='ocr_line' id='line_18' title="bbox 363 1253 581 1289">
  <b>BYGGNADER </b>
  <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">
 </span>

It is an incorrect assumption that the x_bboxes are only x positions. The official specification for the hOCR format can be found here:
https://docs.google.com/View?docid=dfxcv4vc_67g844kf
My understanding is that the above is the correct way for hOCR output.

2) I do not understand the comment that "it can easily get out of sync", there is exactly one bbox per character on the line.
however, I confirm that there is an issue with whitespace and control characters being part of the characters on the line and for which the bounding boxes are not correct. I will open this as a separate bug, needs to be checked whether this needs to be special-case treated in the hocr output or if it is an issue upstream in cuneiform (an issue of not providing a bounding box for whitespace and of producing control characters in the recognized text)

3 and 4)
I find the specification somewhat difficult to interpret at times but it is my understanding that character bbox info goes within the ocr_line tag element. whether it goes befo...

Read more...

>I find the specification somewhat difficult to interpret at times but
>it is my understanding that character bbox info goes within the
>ocr_line tag element. whether it goes before or after the textual
>elements is irrelevant. E.g.
> <span class='ocr_line' id='line_18' title="bbox 363 1253 581 1289">
> <b>BYGGNADER </b>
> <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">
> </span>

Apart from not being valid HTML, this doesn't make sense. And this was
already pointed out a year ago(!):
https://lists.launchpad.net/cuneiform/msg00450.html

>and
> <span class='ocr_line' id='line_18' title="bbox 363 1253 581 1289">
> <span class='ocr_cinfo' title="x_bboxes 363 1253 382 1279 383 1254 407 1281 409 1255 431 1283 434 1256 458 1284 460 1258 485 1285 486 1260 511 1286 514 1261 538 1287 541 1260 560 1289 561 1261 581 1289 -1 -1 -1 -1 ">
> <b>BYGGNADER </b>
> </span>
>are equally correct, it is the association to the correct line which matters.

If you don't close <span> it's not even a valid HTML...

--
Jakub Wilk

julien (julien-aubert) wrote :

Jakub Wilk, as you can see in any hocr output, the <span> is closed, I was sloppy when I copy pasted to the post. I have run the produced hocr output from cuneiform through
http://validator.w3.org/check
and it validates just fine.

As for the
<span class='ocr_line'...>Some text<span class='ocr_cinfo'...></span></span>
vs
<span class='ocr_line'...><span class='ocr_cinfo'...>Some text</span></span>
I agree that the latter makes more sense, if you or any other feel this is important to change you are free to change the code and propose a merge?

julien (julien-aubert) wrote :

I will have to change the ocr_cinfo span anyway.. to fix the whitespace bbox and also, I have noted that cuneiform occasionally gives control codes as part of the text. Not sure when I will have time to make the changes, but in any case, we could agree on what the format should be and then someone could implement this.

I had a look at tesseract 3.0, they output bbox per word level, although they are using a "ocr_word" tag which does not exist in the specification.

What about defining an "ocrx_word" (specific to the ocr engine) as characters with a positive-area-bbox?
And rather than placing the "ocr_cinfo" at the "ocr_line" level, it will be placed at the "ocrx_word" level.
This way, both word bbox is given and character bboxes, and by definition, only for valid bboxes.

Example:
<span class='ocr_line' id='line_1' title="bbox 0 0 45 20"><span class='ocr_xword' id='xword_1' title="bbox 0 0 20 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">hello</span></span><span> </span><span class='ocr_xword' id='xword_2' title="bbox 25 0 45 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">world</span></span>
(note the whitespace which is not part of any ocr_xword as cuneiform will produce an incorrect bbox for it)

sounds OK or you have suggestions?

Jakub Wilk (jwilk) wrote :

>Example:
><span class='ocr_line' id='line_1' title="bbox 0 0 45 20"><span class='ocr_xword' id='xword_1' title="bbox 0 0 20 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">hello</span></span><span> </span><span class='ocr_xword' id='xword_2' title="bbox 25 0 45 20"><span class='ocr_cinfo' title="x_bboxes b1x0 b1y0 b1x1 b1y1 b2x0 ...">world</span></span>
>(note the whitespace which is not part of any ocr_xword as cuneiform will produce an incorrect bbox for it)

That looks much better that the current output or pre-0.9 output.
However, I'm not sure if/why we need ocr_cinfo at all here. AFAIU,
"x_bboxes" is analogous to "cuts" and "nlp" properties, which could be
applied to any element (e.g. directly to an ocr_xword).

Anyway, if there are any doubts on the interpretation of the hOCR
specification (which is admittedly vague), it's better to ask at
<email address hidden> than to guess.

--
Jakub Wilk

Stuart Whitman (stu-groupw) wrote :

I just recently discovered this issue and wonder what is the final disposition? I read all the comments, but I am still unsure what is going to happen. Has it been determined that cuneiform is producing hocr standard compliant output and the issue is with hocr2pdf? Based on what I have read in the comments the status of "confirmed" confuses me. Does a status of "confirmed" mean the issue is recognized as a cuneiform bug?

I'm having similar issue. I can confirm that it is not related to Cuneiform. I'm using ocropus (ocroscript recognize) (which uses Tesseract) and I have check the resulting .html (hocr) which seems valid and pixel perfect.

However, hocr2pdf misalign the text with their related bounding boxes. I've tried ocroscript recognize with and without the --charboxes options and the result is always wrong (the text has an offset on the Y axis).

This is with exactimage 0.8.1-3build1 on Ubuntu Natty.

Luis Mendes (luis-p-mendes) wrote :

I've installed exactimage 0.8.6 from source and verified that it still can't cope with new cuneiform hocr file format.
Latest cuneiform version that still outputs old format is the 0.8.0. I had to revert to that version to get usable results.

Since both cuneiform and hocr2pdf are needed to get the work done, and it seems that it will take some time for exactimage to support the new cuneiform hocr file format, I think it makes sense to add that command line switch to generate old hocr file format, as. this impedes users to migrate from version 0.8.0 to newer versions.

Jussi Pakkanen (jpakkane) wrote :

I'd like to remind everyone that Cuneiform is currently unmaintained. No-one is working on this or any other bug.

Martin Wildam (mwildam) wrote :

On Mon, Aug 8, 2011 at 09:40, Jussi Pakkanen
<email address hidden> wrote:
> I'd like to remind everyone that Cuneiform is currently unmaintained.
> No-one is working on this or any other bug.

Sad, but I had such an impression already.
As far as I can see the one and only OCR option for Linux and Ubuntu
that runs stable is ABBYY and costs money. Unfortunately they lack
behind in version for the Linux variant.

--
Martin Wildam

http://www.google.com/profiles/mwildam

To be fair there are also OCRAD, GOCR, and Tesseract.

Igor

On Wed, 2011-08-10 at 08:53 +0000, Martin Wildam wrote:
> On Mon, Aug 8, 2011 at 09:40, Jussi Pakkanen
> <email address hidden> wrote:
> > I'd like to remind everyone that Cuneiform is currently unmaintained.
> > No-one is working on this or any other bug.
>
> Sad, but I had such an impression already.
> As far as I can see the one and only OCR option for Linux and Ubuntu
> that runs stable is ABBYY and costs money. Unfortunately they lack
> behind in version for the Linux variant.
>
> --
> Martin Wildam
>
> http://www.google.com/profiles/mwildam
>

jswinner (jswinner) on 2013-07-01
Changed in exactimage (Ubuntu):
status: New → Confirmed
George Chriss (gschriss) wrote :

Treating Comment #1 as "works as intended" (with a character precision limitation) and Bug #632524 as "broken" (font size/placement has no correlation to underlying text + out-of-bounds/missing/"dog-piled" text), I'm happy to report the following:

While developing a new Inkscape extension to export hand-drawn/typed text boxes as hOCR I came across the same issues reported in Bug #632524. The hOCR file generated by the extension does not use 'ocr_word' nor 'ocr_cinfo' elements, just plain text within 'ocr_line' parent elements (with corresponding unique 'id' and 'bbox' attributes).

I believe hocr2pdf was mis-parsing the file expecting that each character was contained within its own bbox. As a stop-gap measure adding either matching <p></p> elements around each plain text line, or, alternatively, a <br> at the end of each plain text line resulted in 'proper' text placement. The <title> element also needs to be escaped in this way.

Tested with exact-image 0.8.8. I wasn't able to complete the build due to relocation errors but '/objdir/frontends/hocr2pdf' was usable as-is. 'lib/hocr.cc' is the file in need of patches.

George Chriss (gschriss) wrote :

Link to Inkscape Extension 'Export Image Overlay Text as hOCR' mentioned in Comment #58: https://bugs.launchpad.net/inkscape/+bug/1069248

Rudolf (rk-com) wrote :

Many thanks to George Chriss! (see above)

My workaround based on his description:
Modify the created hocr by XSLT (see below). Then using hocr2pdf 0.8.9 - and the textboxes are placed (almost) correctly.

$ tesseract image.tif ocr_file hocr
$ xsltproc -html -nonet -novalid -o ocr_fixed.hocr fix-hocr.xsl ocr_file.hocr
$ hocr2pdf -i image.tif -o searchable.pdf <ocr_fixed.hocr

See attached file fix-hocr.xsl.

Merlin (merlin-skinner) wrote :

I can confirm that Rudolf (rk-com)'s and George Chriss (gschriss)'s fix works. Thanks!

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers