OCR quality drops

Bug #344790 reported by Polevoy Dmitry
This bug affects 1 person
Affects: Cuneiform for Linux
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

OCR quality dropped during porting.
See the recognition result for stdj4.tif, line 7, smart text format:

Before (stdj4.txt.initial):
mli i f r nin. Ithas

After (stdj4.txt.puma):
m li i f r nin Ithas
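
The two lines above differ in only a couple of characters, which is easy to miss by eye. A minimal sketch of how such a difference could be listed automatically, using Python's standard difflib module (the helper name char_diff is made up for this illustration and is not part of Cuneiform or its tests):

```python
import difflib


def char_diff(before: str, after: str) -> list[str]:
    """Describe the character-level edits that turn `before` into `after`."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Show what was removed from the old line and what replaced it.
            ops.append(f"{tag} {before[i1:i2]!r} -> {after[j1:j2]!r}")
    return ops


# The two recognition results quoted in the bug description.
initial = "mli i f r nin. Ithas"
ported = "m li i f r nin Ithas"

for op in char_diff(initial, ported):
    print(op)
```

Running the same comparison over whole output files, line by line, would give a quick list of places where the ported version diverges from the initial one.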

Revision history for this message
Polevoy Dmitry (openocr-polevoy) wrote :

I am checking with my local (special) backward-compatible Win32 binary build. The OCR results change when the new version of puma.dll is placed into the initial Cuneiform version. (IMHO the puma.dll changes are not the cause of the differences, but they bring other errors to light.)

Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

I made a small bash script for use with the ISRI OCR Performance Toolkit (http://www.isri.unlv.edu/ISRI/OCRtk).

The script merges the text zone files into a single file, runs recognition with and without the dictionary, and produces an accuracy report for each recognized file plus a final summary report.
To use the script, copy into one directory the contents of both directories of a test package from http://www.isri.unlv.edu/ISRI/OCRtk, together with the "accsum" and "accuracy" programs from http://www.isri.unlv.edu/downloads/ftk-1.0.tgz, then run:
 test Z 3B
where Z is the first letter of the zone file extension and 3B is the extension of the image files.

After the script runs, you get a text report.
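
The first step described above (merging the zone files of a page into one ground-truth file) could be sketched as follows. This is not the original bash script. It assumes zone files are named <page>.<Z><nn> (for example page.Z01, page.Z02), that name order equals reading order, and that the text is Latin-1 encoded; all three are assumptions made for illustration.

```python
from pathlib import Path


def merge_zone_files(directory: Path, page: str, zone_letter: str = "Z") -> str:
    """Concatenate all zone text files of one page, sorted by file name.

    Assumption: a zone file shares the page's base name and has an
    extension starting with `zone_letter`, e.g. "page.Z01".
    """
    prefix = "." + zone_letter.upper()
    zone_paths = sorted(
        p for p in directory.iterdir()
        if p.stem == page and p.suffix.upper().startswith(prefix)
    )
    # Latin-1 is an assumed encoding; it never fails to decode a byte.
    return "".join(p.read_text(encoding="latin-1") for p in zone_paths)
```

A call such as merge_zone_files(Path("."), "1871_016") would then return the text to compare against the recognizer output for that page.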

I ran this script on 3b.tgz and got many cuneiform errors such as

Unknown DIB format
CTIImageList::AddImage: invalid image info

and

Assertion failed: 0 file /home/mgraf/refactoring/src/lns32/rbambuk.cpp, line 173

Press <Space> to continue execution, <Esc> to abort

Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

First conclusion:
cuneiform segments two-column pages incorrectly.

Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :
Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

Second conclusion:
cuneiform (I tried the refactoring branch) produces error messages like "Unknown DIB format
CTIImageList::AddImage: invalid image info" if the page contains an image object.

Example

cuneiform -v 1871_016.3B
1871_016.3B=> DIB 2544x3300 2544x3300+0+0 1-bit Bilevel DirectClass 24.02MiB 0.210u 0:00.209
############################
CuneiForm Recognize options:
  Language: 0
  Fax: false
  Use speller: false
Layout options:
  One Column: false
  Dot Matrix: false
  Auto Rotate: false
  Tables number: 0
  Geometry: Rect(Point(0,0), Point(2544,3300)) width:2544; height:3300
FormatOptions:
  SerifName: Times New Roman
  SansSerifName: Arial
  Monospace Name: Courier New
  Use bold: false
  Use Italic: false
  Use font size: false
  Unrecognized char: '~'
  Line breaks: false
############################
The image depth is 24 at this point.
falseWarning: RSL said that the lines don't need to be erased from the picture.
VSL: before table search - 0, after -13
VSL: Нужных изменений не найдено (No required changes found)
Container CPAGE contains:
 name : size
TYPE_IMAGE : 12032
TYPE_IMAGE : 12032
TYPE_TEXT : 12032
TYPE_TEXT : 12032 ...

Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

This happens only with the refactoring branch.

summary: - OCR quality drops (mising point)
+ OCR quality drops
Revision history for this message
Serge Poltavsky (serge-uliss) wrote :

To Kuzemko Aleksandr: the "no output" error is fixed in the latest revision.

Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

To Serj Poltavskiy:
1. Now cuneiform (refactoring branch) produces good output text, but cuneiform (main branch) produces output with the image.
2. With a two-column text image, "refactoring" produces the wrong column order; main gets it right.

Revision history for this message
Serge Poltavsky (serge-uliss) wrote :

Alexandr, thank you for pointing out these bugs.
I know that image saving is broken; it was a quick fix on my part.
I also think that the CED module and the export to other formats (ROUT) should be completely rewritten. If you can find the mistake I would be glad, but it's not my main goal.

Revision history for this message
Serge Poltavsky (serge-uliss) wrote :

I found the problem: it's in the RFRMT module code, changed in rev 750 during the migration from tagRect16(32) to CIF::Rect.
Now I'm going to write test cases and begin to backport the changes.

Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

I tested 3 versions of cuneiform: the official free version (from the openocr.org site), cuneiform, and the refactoring branch, using the bash script from https://bugs.launchpad.net/cuneiform-linux/+bug/344790/comments/5.
For the test I used 106 files from the 3b.tgz archive containing English text without images or tables. If you want, I can share the list of files.

The results:
1.a. Official with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979 Characters
   11151 Errors
   95.19% Accuracy

     461 Reject Characters
       0 Suspect Markers
       0 False Marks
    0.20% Characters Marked
   95.83% Accuracy After Correction

     Ins Subst Del Errors
      91 239 1143 1473 Marked
    3052 3511 3115 9678 Unmarked
    3143 3750 4258 11151 Total

   Count Missed %Right
   35888 740 97.94 ASCII Spacing Characters
    8668 662 92.36 ASCII Special Symbols
    6038 904 85.03 ASCII Digits
   11546 551 95.23 ASCII Uppercase Letters
  169839 4036 97.62 ASCII Lowercase Letters
  231979 6893 97.03 Total

1.b. Official without dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979 Characters
   10879 Errors
   95.31% Accuracy

     483 Reject Characters
       0 Suspect Markers
       0 False Marks
    0.21% Characters Marked
   95.89% Accuracy After Correction

     Ins Subst Del Errors
     130 196 1022 1348 Marked
    3217 3304 3010 9531 Unmarked
    3347 3500 4032 10879 Total

   Count Missed %Right
   35888 751 97.91 ASCII Spacing Characters
    8668 662 92.36 ASCII Special Symbols
    6038 895 85.18 ASCII Digits
   11546 542 95.31 ASCII Uppercase Letters
  169839 3997 97.65 ASCII Lowercase Letters
  231979 6847 97.05 Total

2.a. refactoring with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979 Characters
   13086 Errors
   94.36% Accuracy

     456 Reject Characters
       0 Suspect Markers
       0 False Marks
    0.20% Characters Marked
   95.03% Accuracy After Correction

     Ins Subst Del Errors
      99 227 1242 1568 Marked
    4592 2830 4096 11518 Unmarked
    4691 3057 5338 13086 Total

   Count Missed %Right
   35888 944 97.37 ASCII Spacing Characters
    8668 733 91.54 ASCII Special Symbols
    6038 846 85.99 ASCII Digits
   11546 629 94.55 ASCI...
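
The headline figures in these reports are consistent with accuracy = (characters - errors) / characters, and "Accuracy After Correction" is consistent with counting only the unmarked errors. The sketch below recomputes them from the counts quoted above; the formulas are inferred from these numbers rather than taken from the toolkit documentation, and the function names are made up for this illustration.

```python
def accuracy(characters: int, errors: int) -> float:
    """Percentage of characters that were not counted as errors."""
    return round(100.0 * (characters - errors) / characters, 2)


def accuracy_after_correction(characters: int, errors: int, marked: int) -> float:
    """Accuracy when errors flagged by the recognizer are treated as fixed."""
    return accuracy(characters, errors - marked)


# (characters, total errors, marked errors) as quoted in the three reports.
reports = {
    "official, with dictionary": (231979, 11151, 1473),
    "official, without dictionary": (231979, 10879, 1348),
    "refactoring, with dictionary": (231979, 13086, 1568),
}

for name, (chars, errors, marked) in reports.items():
    print(name, accuracy(chars, errors),
          accuracy_after_correction(chars, errors, marked))
```

With the same ground truth, the refactoring branch makes 13086 - 11151 = 1935 more errors than the official version with dictionary, which is where the drop from 95.19% to 94.36% comes from.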


Revision history for this message
Serge Poltavsky (serge-uliss) wrote :

Alexander, thank you for your report!
I found that recognition quality changed somewhat when I changed the linking options of cuneiform.
In the modules EXC, RSTR, DIF and RBLOCK there are a lot of duplicated functions and variables, and extern int bla-bla-bla is widely used.
When I removed -fvisibility=hidden, some tests showed better recognition quality, but others did not.

When I set the compiler flag -fvisibility again on some modules, the quality results of my branch became closer to the original.
I think the way to better recognition results lies in isolating the global variables in modules like EXC, LOC, DIF, RSTR and RBLOCK and removing the duplicated code.
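
One possible way to locate global symbols that are defined in more than one module is to compare their symbol tables. The sketch below parses text in the usual "address type name" layout printed by the binutils nm tool. It assumes that output has already been captured per module; the symbol names in the sample are invented and are not taken from the Cuneiform sources.

```python
from collections import defaultdict

# Uppercase nm type letters for defined global symbols:
# T = text (code), D = data, B = bss, R = read-only data, C = common.
DEFINED_GLOBAL_TYPES = set("TDBRC")


def duplicated_symbols(nm_output_by_module: dict[str, str]) -> dict[str, list[str]]:
    """Map each global symbol defined in several modules to those modules."""
    owners = defaultdict(set)
    for module, text in nm_output_by_module.items():
        for line in text.splitlines():
            fields = line.split()
            # Defined symbols have three fields; undefined ones ("U name") have two.
            if len(fields) == 3 and fields[1] in DEFINED_GLOBAL_TYPES:
                owners[fields[2]].add(module)
    return {name: sorted(mods) for name, mods in owners.items() if len(mods) > 1}


# Hypothetical sample data, for illustration only.
sample = {
    "EXC": "0000000000000010 T init_buffers\n0000000000000004 C line_count\n",
    "RSTR": "0000000000000040 T recognize\n0000000000000004 C line_count\n",
    "DIF": "                 U line_count\n0000000000000000 T init_buffers\n",
}
print(duplicated_symbols(sample))
```

Symbols that show up in such a list are the ones whose meaning can change with the visibility and linking options, since the linker has to pick one definition.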

I'm also planning to set up automatic regression tests, like the guys from openocr have done, so that regressions can be detected after every small change.
But my main goals for now are:
1. Compiling with MSVC
2. Fixing crashes under FreeBSD and NetBSD during recognition with the Russian, Bulgarian and Ukrainian languages
3. Increasing test coverage for the code I have written
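
The automatic regression test mentioned above could, in its simplest form, compare the accuracy of each test set against a stored baseline and report anything that got worse. This is only a sketch under that assumption; the function name, the tolerance value and the dictionary layout are hypothetical, and the two sample figures are the "with dictionary" accuracies reported earlier in this thread.

```python
def find_regressions(baseline: dict[str, float],
                     current: dict[str, float],
                     tolerance: float = 0.05) -> list[str]:
    """List test sets whose accuracy fell by more than `tolerance` points."""
    regressions = []
    for name, old in sorted(baseline.items()):
        new = current.get(name)
        if new is None:
            regressions.append(f"{name}: no result")
        elif new < old - tolerance:
            regressions.append(f"{name}: {old:.2f}% -> {new:.2f}%")
    return regressions


# Baseline: official version; current: refactoring branch (both with dictionary).
baseline = {"3b with dictionary": 95.19}
current = {"3b with dictionary": 94.36}
print(find_regressions(baseline, current))
```

Run after every change, a check like this turns a quality drop of the kind reported in this bug into an immediate test failure.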

Revision history for this message
Kuzemko Aleksandr (kuzemkoa-rambler) wrote :

I think that http://www.redhillconsulting.com.au/products/simian/ may help find duplicated code.

Revision history for this message
Serge Poltavsky (serge-uliss) wrote :

Thanks, Alexander! I'll try this tool.
