Add --nopictures, --tables=n to cli

Bug #395351 reported by Ben Jackson on 2009-07-03
Affects: Cuneiform for Linux
Importance: Undecided
Assigned to: Unassigned

Bug Description

I found that cuneiform would not OCR anything inside an outline (any kind of box); it would treat the area as either a picture or a table. With '--nopictures' it seems to just ignore those same areas. If you also add '--tables=n' it will successfully OCR inside the table.

I've never used bzr before this so I've probably botched the patch somehow, but it is fairly trivial.

Ben Jackson (ben.jackson) wrote :

I see now in puma.h there are constants related to the values I set:

        # define PUMA_TABLE_NONE 0
        # define PUMA_TABLE_DEFAULT 1
        # define PUMA_TABLE_ONLY_LINE 2
        # define PUMA_TABLE_ONLY_TEXT 3
        # define PUMA_TABLE_LINE_TEXT 4

        # define PUMA_PICTURE_NONE 0
        # define PUMA_PICTURE_ALL 1

I tried all the table settings but didn't really get any different output. Setting it to anything other than 0 got it to OCR things inside a box outline (as opposed to turning the box into a picture).
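
For illustration only, here is a minimal sketch of how the two options could be mapped onto those constants. It is not the attached patch: `struct layout_opts` and `parse_layout_arg` are hypothetical names, and the range check simply assumes that `--tables=n` takes one of the `PUMA_TABLE_*` values quoted above.

```c
#include <stdlib.h>
#include <string.h>

/* Constants as quoted from puma.h in the comment above. */
#define PUMA_TABLE_NONE      0
#define PUMA_TABLE_DEFAULT   1
#define PUMA_TABLE_ONLY_LINE 2
#define PUMA_TABLE_ONLY_TEXT 3
#define PUMA_TABLE_LINE_TEXT 4

#define PUMA_PICTURE_NONE 0
#define PUMA_PICTURE_ALL  1

/* Hypothetical container for the two settings; not a Cuneiform type. */
struct layout_opts {
    int pictures; /* one of PUMA_PICTURE_* */
    int tables;   /* one of PUMA_TABLE_*   */
};

/*
 * Hypothetical helper: apply a single command-line argument to opts.
 * Returns 1 if the argument was consumed, 0 if it is not one of the
 * two options handled here, and -1 if --tables= has an invalid value.
 */
static int parse_layout_arg(const char *arg, struct layout_opts *opts)
{
    static const char tables_prefix[] = "--tables=";
    const size_t plen = sizeof tables_prefix - 1;

    if (strcmp(arg, "--nopictures") == 0) {
        opts->pictures = PUMA_PICTURE_NONE;
        return 1;
    }

    if (strncmp(arg, tables_prefix, plen) == 0) {
        char *end;
        long v = strtol(arg + plen, &end, 10);

        /* Reject an empty value, trailing junk, or an out-of-range mode. */
        if (end == arg + plen || *end != '\0' ||
            v < PUMA_TABLE_NONE || v > PUMA_TABLE_LINE_TEXT)
            return -1;

        opts->tables = (int)v;
        return 1;
    }

    return 0;
}
```

A caller would run this over each argv entry and pass the resulting values on to whatever setter the library exposes for picture and table handling; that part is left out because it depends on the real cli code.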

Jussi Pakkanen (jpakkane) wrote :

This probably has to do with the table recognition code, which has not been open sourced yet. I'd like to get some comments from Cognitive people before committing this.

Ben Jackson (ben.jackson) wrote :

I saw the earlier discussion about this. I believe the missing code is for table *output* (e.g. as a spreadsheet). Enabling tables definitely allows the OCR code to look inside outlined areas, which it otherwise will not do. In fact, if you OCR a page with a border, it will not recognize anything without these new options.

Also, assuming bzr send did something reasonable, the two patches are separate, so you could apply --nopictures on its own.
