Bug #344790 “OCR quality drops” : Bugs : Cuneiform for Linux


Polevoy Dmitry (openocr-polevoy) wrote on 2009-03-18:

#1

STDJ4.TIF (19.1 KiB, image/tiff)


Polevoy Dmitry (openocr-polevoy) wrote on 2009-03-18:

#2

Initial Cuneiform version text output (1.1 KiB, text/plain)


Polevoy Dmitry (openocr-polevoy) wrote on 2009-03-18:

#3

Current version text output (1.1 KiB, text/plain)


Polevoy Dmitry (openocr-polevoy) wrote on 2009-03-18:

#4

I am checking my local (special) backward-compatible Win32 binary build, and the OCR results change when the new version of puma.dll is placed into the initial Cuneiform version (IMHO the puma.dll changes are not the cause of the differences, but they bring another error to light).


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-10:

#5

bash script (2.1 KiB, text/plain)

I made a small bash script for use with the ISRI OCR Performance Toolkit (http://www.isri.unlv.edu/ISRI/OCRtk).

The script merges the text zone files into one file, runs recognition with and without the dictionary, and produces an accuracy report for each recognized file plus a final report.
To use the script, copy into one directory the contents of the two directories of one test package from http://www.isri.unlv.edu/ISRI/OCRtk, together with the programs "accsum" and "accuracy" from http://www.isri.unlv.edu/downloads/ftk-1.0.tgz, then run:
test Z 3B
where Z is the first letter of the zone file extension and 3B is the extension of the image files.

After the script runs you get a text report.
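
For readers without the attachment, here is a minimal Python sketch of the same workflow. It is not the attached bash script. The file names `.correct.txt`, `.ocr.txt` and `.acc` are made up for illustration, and how the dictionary is switched on or off is not stated in this thread, so any such option is left to the caller through the hypothetical `ocr_args` parameter. The `accuracy correctfile generatedfile [report]` argument order is how I understand the toolkit; check it against your copy.

```python
import glob


def merge_zone_files(stem, zone_letter, out_path):
    """Concatenate a page's zone text files (e.g. page.Z01, page.Z02) into out_path."""
    # Assumption: zone files share the page's base name and their extension
    # starts with zone_letter; sorting by name gives the zone order.
    pattern = glob.escape(stem) + "." + zone_letter + "*"
    zone_paths = sorted(glob.glob(pattern))
    with open(out_path, "w", encoding="latin-1") as out:
        for path in zone_paths:
            with open(path, encoding="latin-1") as zone:
                out.write(zone.read())
    return zone_paths


def plan_commands(stem, image_ext, ocr_args=()):
    """Build (but do not run) the OCR and scoring commands for one page."""
    truth = stem + ".correct.txt"   # hypothetical name for the merged ground truth
    ocr_out = stem + ".ocr.txt"     # hypothetical name for cuneiform's output
    report = stem + ".acc"          # hypothetical name for the per-page report
    return [
        ["cuneiform", *ocr_args, "-o", ocr_out, stem + "." + image_ext],
        ["accuracy", truth, ocr_out, report],
    ]
```

Each command list can be handed to `subprocess.run`; the per-page reports would then be combined into the final report with `accsum`.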

I ran this script on 3b.tgz and got many cuneiform errors such as

Unknown DIB format
CTIImageList::AddImage: invalid image info

and

Assertion failed: 0 file /home/mgraf/refactoring/src/lns32/rbambuk.cpp, line 173

Press <Space> to continue execution, <Esc> to abort


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-10:

#6

test image (98.6 KiB, application/octet-stream)

First conclusion:
cuneiform segments two-column pages incorrectly.


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-10:

#7

correct text file (5.1 KiB, text/plain)


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-10:

#8

text output by cuneiform (5.0 KiB, text/plain)


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-10:

#9

Second conclusion:
cuneiform (I tried the refactoring branch) prints error messages like "Unknown DIB format" and "CTIImageList::AddImage: invalid image info" if the page contains an image object.

Example

cuneiform -v 1871_016.3B 
1871_016.3B=> DIB 2544x3300 2544x3300+0+0 1-bit Bilevel DirectClass 24.02MiB 0.210u 0:00.209
############################                                                                
CuneiForm Recognize options:                                                                
  Language:      0                                                                          
  Fax:           false                                                                      
  Use speller:   false                                                                      
Layout options:                                                                             
  One Column:    false                                                                      
  Dot Matrix:    false                                                                      
  Auto Rotate:   false                                                                      
  Tables number: 0                                                                          
  Geometry:      Rect(Point(0,0), Point(2544,3300)) width:2544; height:3300                 
FormatOptions:                                                                              
  SerifName:         Times New Roman                                                        
  SansSerifName:     Arial                                                                  
  Monospace Name:    Courier New                                                            
  Use bold:          false                                                                  
  Use Italic:        false                                                                  
  Use font size:     false                                                                  
  Unrecognized char: '~'                                                                    
  Line breaks:       false                                                                  
############################                                                                
The image depth is 24 at this point.                                                        
falseWarning: RSL said that the lines don't need to be erased from the picture.             
VSL: before table search - 0, after -13                                                     
VSL: Нужных изменений не найдено                                                            
Container CPAGE contains:                                                                   
 name : size                                                                                
TYPE_IMAGE : 12032                                                                          
TYPE_IMAGE : 12032                                                                          
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
TYPE_TEXT : 12032                                                                           
Fragment 1 Line  1: <1000>                                                                  
Fragment 2 Line  2: <J. E. COSTA>                                                           
Fragment 3 Line  3: <reconstruction from rainfall data. Reconstructed flood peaks based>    
Fragment 3 Line  4: <on rainfall analyses of' the 1976 Big Thompson storm for the two>      
Fragment 3 Line  5: <basins are 153 and 110 m~/s (Miller and others, 1978) (Table 6).>      
Fragment 3 Line  6: <Th>                                                                    
Fragment 3 Line  7: <ese values are also closer to the paleohydraulic values than are>      
Fragment 3 Line  8: <the slope-area estimates. Thus, two independent indirect discharge>    
Fragment 3 Line  9: <estimates give peak discharge estimates similar to those calculated>   
Fragment 3 Line 10: <by the paleohydraulic technique developed here, but significantl>      
Fragment 3 Line 11: <icny>                                                                  
Fragment 3 Line 12: <less than the published conventional slope-area discharge estimates>   
Fragment 3 Line 13: <(Table 6). These results support the suggestion that excessive chan->  
Fragment 3 Line 14: <nel scour in small mountain tributaries could cause slope-area dis->   
Fragment 3 Line 15: <charge estimates to be too large.>                                     
Fragment 3 Line 16: <A second explanation for why most paleohydraulic recon->               
Fragment 3 Line 17: <structed discharges on small streams are lower than those estimated>   
Fragment 3 Line 18: <by slope-area techniques may be that slope-area inethods require>      
Fragment 3 Line 19: <the estimation of a roughness coefficient (n). Typical values consis-> 
Fragment 3 Line 20: <tently selected for large floods in small mountain channels are n =>   
Fragment 3 Line 21: <0.035 to 0.06. Research on verification of roughness coefficients for> 
Fragment 3 Line 22: <steep mountain channels recently completed (R. D. Jarrett, unpub.>     
Fragment 3 Line 23: <data) indicates that these n values may be too low by factors of 1.5>  
Fragment 3 Line 24: <to 2.0. Higher values would reduce velocities and result in lower>     
Fragment 3 Line 25: <slope-area discharge estimates.>                                       
Fragment 3 Line 26: <Finally, the fundamental assumptions that particles of all sizes>      
Fragment 3 Line 27: <are available for transport in small, steep mountain valleys and that> 
Fragment 3 Line 28: <flood velocity and depth (actually depth-slope product) are>           
Fragment 3 Line 29: <reflected in the size of boulders in flood deposits must be examined.> 
Fragment 3 Line 30: <Large floods may have been able to move boulders larger than>          
Fragment 3 Line 31: <those that were available. This may be the case in Sawmill Gulch>      
Fragment 3 Line 32: <(Table 6, site 9), which follows a major shear zone along which>       
Fragment 3 Line 33: <uranium enrichrnent occurs (Sims and Sheridan, 1964). Fault>           
Fragment 3 Line 34: <movements have crushed and broken the bedrock along closely>           
Fragment 3 Line 35: <spaced joints and fractures, and consequently there are no very>       
Fragment 3 Line 36: <large boulders available to be moved during a large flood. The>        
Fragment 3 Line 37: <second assumption, that average velocity and depth can be recon->      
Fragment 3 Line 38: <structed from particle size with reasonable accuracy, requires that>   
Fragment 3 Line 39: <the methods selected, premises, and numerical values estimated rea->   
Fragment 3 Line 40: <sonably approximate processes and conditions in the field. Unfor->     
Fragment 3 Line 41: <tunately, the hazards entailed in compiling actual measurements>       
Fragment 3 Line 42: <and observations during catastrophic floods preclude any substan->     
Fragment 3 Line 43: <tial direct empirical corroboration.>                                  
Fragment 4 Line 44: <macroturbulent effects in large rivers during flash fl>                
Fragment 4 Line 45: <'�rna>                                                                 
Fragment 4 Line 46: <explanation. Lifting forces induced by macrot>                         
Fragment 4 Line 47: <o ur ulent ">                                                          
Fragment 4 Line 48: <play an essential role in the entrainment of coarse>                   
Fragment 4 Line 49: <rse Particles ii>                                                      
Fragment 4 Line 50: <deep flows (Matthes, 1947; Baker, 1973; Jackson, 1976>                 
Fragment 4 Line 51: <upward forces promote the entrainment of parti.>                       
Fragment 4 Line 52: <ac son, 19761>                                                         
Fragment 4 Line 53: <coarsr>                                                                
Fragment 4 Line 54: <those that tractive force and velocity alone can .:>                   
Fragment 4 Line 55: <iphsh>                                                                 
Fragment 4 Line 56: <particles coarser than about 2 m may be moved;>                        
Fragment 4 Line 57: <iws le,>                                                               
Fragment 4 Line 58: <and more shallow than would be predicted by exi;>                      
Fragment 4 Line 59: <exu ~polatrni>                                                         
Fragment 4 Line 60: <incipient motion velocity and depth values for small>                  
Fragment 4 Line 61: <e" pariie>                                                             
Fragment 5 Line 62: <APPLICATION OF PALEOHYDRAULIC>                                         
Fragment 5 Line 63: <DISCHARGE COMPUTATIONS>                                                
Fragment 6 Line 64: <The application of the paleohydraulic flood dis h>                     
Fragment 6 Line 65: <isc arge (>                                                            
Fragment 6 Line 66: <struction technique developed here can be demonstr t d>                
Fragment 6 Line 67: <s rate usin>                                                           
Fragment 6 Line 68: <streams in the Colorado Front Range with sediment 1>                   
Fragment 6 Line 69: <mento ogrea>                                                           
Fragment 6 Line 70: <dence of large flash floods, but without conventional indiree>         
Fragment 6 Line 71: <charge estimates. The two examples are a small tributa>                
Fragment 6 Line 72: <u arytog>                                                              
Fragment 6 Line 73: <Gulch in the Big Thompson River basin, and a I:irge,t,>                
Fragment 6 Line 74: <Boulder Creek at Boulder, Colorado, where the s, -entolo>              
Fragment 6 Line 75: <flood record previously has been investigated by B. and q>             
Fragment 6 Line 76: <(1980).>                                                               
Fragment 7 Line 77: <Rabbit Gulch Tributary>                                                
Fragment 8 Line 78: <Figure 1 I shows a pile of large boulders deposited at the>            
Fragment 8 Line 79: <2>                                                                     
Fragment 8 Line 80: <of a small (1.8 km ) tributary to Rabbit Gulch from a catastrol>       
Fragment 8 Line 81: <flash flood in 1976 in the Big Thompson River basin. The averag>       
Fragment 8 Line 82: <the 5 largest boulders is 1,150 mm, and the channel slope measo>       
Fragment 8 Line 83: <from 1:24,000 scale topographic maps is 0.091. Using equatioo>         
Fragment 8 Line 84: <the estimated average flood velocity is 5.57 m/s, and from Figor>      
Fragment 8 Line 85: <the estimated average flood depth is 1.35 m. Two valley cros~ s>       
Fragment 8 Line 86: <tions are shown in Figure 12, along with the appropriate top flo>      
Fragment 8 Line 87: <width (dashed lines) for the estimated average depth., he averi>       
Fragment 8 Line 88: <discharge for the two cross sections is 57 ms/s. This ts to lx>
Fragment 8 Line 89: <reasonable value for the flood peak for two reasons thc or>
Fragment 8 Line 90: <discharge is approximately 32 m-/s/km2, which r.~ . rnilar>
Fragment 8 Line 91: <unit discharges for other small tributaries in the bra,ompsr>
Fragment 9 Line 92: <I.arge Streams>
Fragment10 Line 93: <When dealing with particles coarser than about 2 m, paleohy->
Fragment10 Line 94: <draulic reconstructions of' average velocity and depth are less accu->
Fragment10 Line 95: <rate than f' or smaller boulders. For the Big Thompson River, the>
Fragment10 Line 96: <only large stream in the Colorado Front Range used to verify>
Fragment10 Line 97: <paleohydraulic reconstructions, postflood slope-area surveys (Groz->
Fragment10 Line 98: <ier and others, 1976) indicated the average flood depth for two cross>
Fragment10 Line 99: <sections was 3.23 m. The reconstructed depth from Figure 7 using>
Fragment10 Line100: <the average of the five largest boulders moved through the cross>
Fragment10 Line101: <section and deposited below the mouth of the canyon (2.76 m)>
Fragment10 Line102: <( .. radley, 1982, personal commun.) is 4.80 m. The calculated>
Fragment10 Line103: <(W. C. Br>
Fragment10 Line104: <average velocity from slope-area measurements (rated "poor") is>
Fragment10 Line105: <7.92 m/s compared to 8.53 m/s computed from equation 10. The>
Fragment10 Line106: <paleohydraulic overestimation of depth and velocity results in the>
Fragment10 Line107: <reconstructed peak discharge at the mouth of the canyon exceedin>
Fragment10 Line108: <g>
Fragment10 Line109: <the slope-area estimate by 76eZ<.>
Fragment10 Line110: <This is a greater diff'erence than occurs in any of the smaller>
Fragment10 Line111: <streams with smaller-sized flood boulders. Possibly, additional>
Fragment11 Line112: <Figure II. Photograph of large boulders deposited at '"'>
Fragment11 Line113: <mouth of an unnamed tributary to Rabbit Gulch from a farg�'>
Fragment11 Line114: <flood in 1976. No previous discharge estimates e ist for this��>
Unknown DIB format
CTIImageList::AddImage: invalid image info

The text is recognized correctly, but no output file is produced.


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-10:

#10

Test page with image (110.3 KiB, application/octet-stream)


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-10:

#11

This happens only with the refactoring branch.

Kuzemko Aleksandr (kuzemkoa-rambler) on 2009-11-10

summary:

- OCR quality drops (mising point)
+ OCR quality drops


Serge Poltavsky (serge-uliss) wrote on 2009-11-11:

#12

To Kuzemko Aleksandr: the missing-output error is fixed in the latest revision.


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2009-11-11:

#13

Two column image (98.6 KiB, application/octet-stream)

To Serj Poltavskiy
1. Now cuneiform (refactoring branch) produces good output text, but cuneiform (main branch) produces output with the image.
2. With a two-column text image, "refactoring" puts the columns in the wrong order; main gets the order right.


Serge Poltavsky (serge-uliss) wrote on 2009-11-11:

#14

Alexandr, thank you for pointing out these bugs.
I know that image saving is broken; it was a quick fix from me.
Also, I think that the CED module and the exports to other formats (ROUT) should be completely rewritten, so if you can find the mistake I would be glad, but it's not my main goal.


Serge Poltavsky (serge-uliss) wrote on 2009-11-18:

#15

I found the problem: it's the RFRMT module code, changed in rev 750 during the migration from tagRect16(32) to CIF::Rect.
Now I'm going to write test cases and begin to backport the changes.


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2010-02-04:

#16

I tested 3 versions of cuneiform: the official free version (from the openocr.org site), the cuneiform branch, and the refactoring branch, using the bash script from https://bugs.launchpad.net/cuneiform-linux/+bug/344790/comments/5.
For the test I used 106 files with English text, without images or tables, from the 3b.tgz archive. If you want, I can share the list of files.

The results:
1.a. Official with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979   Characters
   11151   Errors
   95.19%  Accuracy

     461   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.20%  Characters Marked
   95.83%  Accuracy After Correction

     Ins    Subst      Del   Errors
      91      239     1143     1473   Marked
    3052     3511     3115     9678   Unmarked
    3143     3750     4258    11151   Total

   Count   Missed   %Right
   35888      740    97.94   ASCII Spacing Characters
    8668      662    92.36   ASCII Special Symbols
    6038      904    85.03   ASCII Digits
   11546      551    95.23   ASCII Uppercase Letters
  169839     4036    97.62   ASCII Lowercase Letters
  231979     6893    97.03   Total

1.b. Official without dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979   Characters
   10879   Errors
   95.31%  Accuracy

     483   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.21%  Characters Marked
   95.89%  Accuracy After Correction

     Ins    Subst      Del   Errors
     130      196     1022     1348   Marked
    3217     3304     3010     9531   Unmarked
    3347     3500     4032    10879   Total

   Count   Missed   %Right
   35888      751    97.91   ASCII Spacing Characters
    8668      662    92.36   ASCII Special Symbols
    6038      895    85.18   ASCII Digits
   11546      542    95.31   ASCII Uppercase Letters
  169839     3997    97.65   ASCII Lowercase Letters
  231979     6847    97.05   Total

2.a. refactoring with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979   Characters
   13086   Errors
   94.36%  Accuracy

     456   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.20%  Characters Marked
   95.03%  Accuracy After Correction

     Ins    Subst      Del   Errors
      99      227     1242     1568   Marked
    4592     2830     4096    11518   Unmarked
    4691     3057     5338    13086   Total

   Count   Missed   %Right
   35888      944    97.37   ASCII Spacing Characters
    8668      733    91.54   ASCII Special Symbols
    6038      846    85.99   ASCII Digits
   11546      629    94.55   ASCII Uppercase Letters
  169839     4596    97.29   ASCII Lowercase Letters
  231979     7748    96.66   Total

2.b. refactoring without dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979   Characters
   13333   Errors
   94.25%  Accuracy

     478   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.21%  Characters Marked
   94.97%  Accuracy After Correction

     Ins    Subst      Del   Errors
     133      247     1293     1673   Marked
    4805     2930     3925    11660   Unmarked
    4938     3177     5218    13333   Total

   Count   Missed   %Right
   35888      994    97.23   ASCII Spacing Characters
    8668      747    91.38   ASCII Special Symbols
    6038      853    85.87   ASCII Digits
   11546      669    94.21   ASCII Uppercase Letters
  169839     4852    97.14   ASCII Lowercase Letters
  231979     8115    96.50   Total

3.a. cuneiform branch with dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979   Characters
   13837   Errors
   94.04%  Accuracy

     451   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.19%  Characters Marked
   94.77%  Accuracy After Correction

     Ins    Subst      Del   Errors
      94      216     1395     1705   Marked
    4445     3766     3921    12132   Unmarked
    4539     3982     5316    13837   Total

   Count   Missed   %Right
   35888     2483    93.08   ASCII Spacing Characters
    8668      965    88.87   ASCII Special Symbols
    6038      800    86.75   ASCII Digits
   11546      583    94.95   ASCII Uppercase Letters
  169839     3690    97.83   ASCII Lowercase Letters
  231979     8521    96.33   Total

3.b. cuneiform branch without dictionary

UNLV-ISRI OCR Accuracy Report Version 5.1
-----------------------------------------
  231979   Characters
   14426   Errors
   93.78%  Accuracy

     451   Reject Characters
       0   Suspect Markers
       0   False Marks
    0.19%  Characters Marked
   94.52%  Accuracy After Correction

     Ins    Subst      Del   Errors
      94      216     1395     1705   Marked
    4739     3766     4216    12721   Unmarked
    4833     3982     5611    14426   Total

   Count   Missed   %Right
   35888     2525    92.96   ASCII Spacing Characters
    8668      969    88.82   ASCII Special Symbols
    6038      802    86.72   ASCII Digits
   11546      595    94.85   ASCII Uppercase Letters
  169839     3924    97.69   ASCII Lowercase Letters
  231979     8815    96.20   Total
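
For a side-by-side view, the headline figures can be recomputed from the counts. In all six reports the Accuracy line matches (Characters - Errors) / Characters, and Accuracy After Correction matches the same formula using only the Unmarked errors. That relation is inferred from the numbers in this comment, not taken from the toolkit's documentation. A small Python sketch with the counts copied from the reports:

```python
TOTAL_CHARS = 231979  # "Characters" line, identical in every report

# (label, total errors, unmarked errors), copied from the six reports
RUNS = [
    ("official, with dictionary", 11151, 9678),
    ("official, without dictionary", 10879, 9531),
    ("refactoring, with dictionary", 13086, 11518),
    ("refactoring, without dictionary", 13333, 11660),
    ("cuneiform branch, with dictionary", 13837, 12132),
    ("cuneiform branch, without dictionary", 14426, 12721),
]


def accuracy(chars, errors):
    """Character accuracy in percent: share of characters that are not errors."""
    return 100.0 * (chars - errors) / chars


def summarize(runs=RUNS, chars=TOTAL_CHARS):
    """Return (label, accuracy, accuracy after correction), best accuracy first."""
    rows = [
        (label, round(accuracy(chars, errors), 2), round(accuracy(chars, unmarked), 2))
        for label, errors, unmarked in runs
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for label, acc, corrected in summarize():
        print(f"{label:38s} {acc:6.2f}%  {corrected:6.2f}% after correction")
```

Sorted this way, both official runs come first, then the two refactoring runs, then the two cuneiform branch runs.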


Serge Poltavsky (serge-uliss) wrote on 2010-02-05:

#17

Alexander, thank you for your report!
I found that recognition quality changes within some range when I change cuneiform's linking options.
In the modules EXC, RSTR, DIF and RBLOCK there are a lot of duplicated functions and variables, and extern int bla-bla-bla is widely used.
When I removed -fvisibility=hidden, some tests showed better recognition quality, but others did not.

When I set the -fvisibility compiler flag again on some modules, the quality of my branch's results became closer to the original.
I think the way to better recognition results lies in isolating the global variables in modules like EXC, LOC, DIF, RSTR and RBLOCK and removing the duplicated code.
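
One way such duplicated global names could be listed is by comparing the symbol tables of the built modules. The Python sketch below works on the text output of `nm` (lines of the form `<address> <type> <name>`), passed in as strings so that nothing is executed. The module labels and symbol names in the usage example are invented for illustration and are not taken from the cuneiform sources.

```python
from collections import defaultdict

# nm type letters for externally visible definitions:
# T = code, D = initialized data, B = zero-initialized data,
# R = read-only data, C = common (tentative) definition.
DEFINED_GLOBAL_TYPES = set("TDBRC")


def parse_nm(nm_output):
    """Return the set of globally defined symbol names in one nm listing."""
    symbols = set()
    for line in nm_output.splitlines():
        parts = line.split()
        # Undefined symbols ("U name") have only two fields and are skipped;
        # lowercase type letters mark local symbols and are skipped as well.
        if len(parts) == 3 and parts[1] in DEFINED_GLOBAL_TYPES:
            symbols.add(parts[2])
    return symbols


def find_duplicates(listings):
    """Map each symbol defined in more than one module to those modules."""
    owners = defaultdict(set)
    for module, nm_output in listings.items():
        for name in parse_nm(nm_output):
            owners[name].add(module)
    return {name: sorted(mods) for name, mods in owners.items() if len(mods) > 1}


if __name__ == "__main__":
    sample = {  # made-up listings, for illustration only
        "exc": "0000000000000000 T init_tables\n0000000000000000 B line_count\n",
        "rstr": "0000000000000040 T init_tables\n0000000000000020 t helper\n",
    }
    print(find_duplicates(sample))
```

A name reported here is only a candidate: two modules exporting the same symbol may or may not end up sharing it at run time, depending on how they are linked.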

Also, I'm planning to set up automatic regression tests, like the guys from openocr did, so it will be possible to detect regressions after every small change.
But my main goals for now are:
1. Compiling with MSVC
2. Fixing crashes under FreeBSD and NetBSD during recognition with the Russian, Bulgarian and Ukrainian languages
3. Increasing test coverage for the code I have written
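
The automatic regression check mentioned above could start as a simple comparison of per-test accuracy against a stored baseline. This Python sketch is only an illustration: the function name, the tolerance value and the test labels are assumptions, and the figures in the usage example are the official and refactoring accuracies reported earlier in this thread.

```python
def find_regressions(baseline, current, tolerance=0.05):
    """Return {test: (old, new)} for tests whose accuracy dropped.

    Accuracies are percentages; a drop counts as a regression only if it
    exceeds `tolerance` percentage points. A test missing from `current`
    is reported with new = None.
    """
    regressions = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is None or old - new > tolerance:
            regressions[name] = (old, new)
    return regressions


if __name__ == "__main__":
    baseline = {"3b with dictionary": 95.19, "3b without dictionary": 95.31}
    current = {"3b with dictionary": 94.36, "3b without dictionary": 94.25}
    for name, (old, new) in find_regressions(baseline, current).items():
        print(f"REGRESSION {name}: {old} -> {new}")
```

Running the accuracy tools after every change and feeding the results into a check like this would flag a drop as soon as it appears.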


Kuzemko Aleksandr (kuzemkoa-rambler) wrote on 2010-02-05:

#18

I think that http://www.redhillconsulting.com.au/products/simian/ may help to find the duplicated code.


Serge Poltavsky (serge-uliss) wrote on 2010-02-07:

#19

Thanks, Alexander! I'll try this tool.
