X 1.0rc3: Region.text() -- known problems and needed improvements

Reported by RaiMan on 2011-01-31
This bug affects 33 people
Affects: Sikuli
Importance: Low
Assigned to: RaiMan

Bug Description

******* this report is a summary of known problems and feature requests

*** for recent status information after the release of rc3, see comment #6

The text recognition feature (OCR - Region.text()), together with the ability to find text in an image, is still experimental and under development.

These are the currently reported bugs:
bug 777660: text recognition errors with some fonts
bug 783082: [request] want font parameters for text recognition
bug 735434: Text extraction from Images fails in some cases on colored backgrounds
bug 695616: Inconsistency in text recognition and matching, especially with integers-as-text!
bug 695650: find(text).text() does not return same text
bug 701005: text() always returns text with trailing x'200A20'
bug 701012: text() does not return all intervening blanks, add's others
bug 795391: [request] OCR/tesseract: allow new training sets for other languages and more tesseract features
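
Until the whitespace problems (bug 701005, bug 701012) are fixed, a script can normalize what text() returns before comparing it. The following is only a sketch in plain Python: the helper names are hypothetical and not part of Sikuli, and it assumes that the trailing x'200A20' consists of characters Python treats as whitespace.

```python
import re

def normalize_ocr_text(raw):
    """Collapse every run of whitespace to one blank and trim both ends."""
    # \s covers spaces, tabs and newlines, so a trailing blank/newline
    # sequence is removed by the final strip().
    return re.sub(r"\s+", " ", raw).strip()

def same_text_ignoring_blanks(a, b):
    """Compare two OCR results while ignoring all whitespace.

    Useful when blanks are dropped or added by the OCR engine.
    """
    return re.sub(r"\s+", "", a) == re.sub(r"\s+", "", b)
```

For example, normalize_ocr_text("Total:  42 \n ") gives "Total: 42". Ignoring blanks completely is a blunt tool: it also treats "no table" and "notable" as equal.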

Other experienced oddities
-- there are problems with text that is not in English
-- very small and very large fonts may not work
-- multiline text causes problems
-- intervening/preceding/trailing graphics and symbols are interpreted as text

Tip when using Region.text():
Currently you get the best results when the region covers only one line of text and contains only text (no graphics/symbols) in English. If you can influence it: make the text as large as possible.

-- additional information:
Internally, the Tesseract OCR engine (http://code.google.com/p/tesseract-ocr/) is used.
So its restrictions apply (e.g. minimum font size, ...).
More information can be found on the Tesseract wiki.

RaiMan (raimund-hocke) on 2011-02-02
description: updated
Changed in sikuli:
status: New → In Progress
RaiMan (raimund-hocke) on 2011-02-25
summary: - X 1.0rc1: Region.text() -- known problems and needed improvements
+ X 1.0rc2: Region.text() -- known problems and needed improvements
Changed in sikuli:
importance: Undecided → High
RaiMan (raimund-hocke) on 2011-04-06
description: updated
RaiMan (raimund-hocke) on 2011-05-05
description: updated
RaiMan (raimund-hocke) on 2011-05-16
description: updated
description: updated
RaiMan (raimund-hocke) on 2011-06-10
description: updated

Is there any plan to integrate Tesseract version 3.00?
What issues would be involved?

RaiMan (raimund-hocke) wrote :

Did you make tests with version 3.0?

I made some trials and did not get any better results than with version 2.

Yes, I did some tests and didn't get better results either, but there are a lot of new languages available. I also trained Tesseract for the "OCR-A extended" font, which gives good results on smaller text (14-16 px).

RaiMan (raimund-hocke) on 2011-09-15
summary: - X 1.0rc2: Region.text() -- known problems and needed improvements
+ X 1.0rc3: Region.text() -- known problems and needed improvements
RaiMan (raimund-hocke) on 2011-09-27
Changed in sikuli:
milestone: none → x1.0
RaiMan (raimund-hocke) wrote :

***** from a post on the mailing list sikuli-dev by macs

Has the latest Sikuli been migrated to Tesseract 3? I see a branch named tesseract3 on GitHub. I also see many OCR issues being discussed on Launchpad.

In my understanding, OCR results can be improved by preprocessing the images:
1. Convert the image to grayscale.
2. Improve contrast or apply edge detection filters.
3. Invert the colors (negative).
4. Reduce the color depth.
5. Apply image smoothing filters.

Not every filter is applicable to every type of image. A user might want to tune a filter or a combination of filters to achieve better results. Can we give the user this option?
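
To illustrate what a user-selectable filter combination could look like, here is a sketch in plain Python that works on an image given as rows of (r, g, b) tuples. It is not Sikuli code and says nothing about how Sikuli preprocesses images; all function names are hypothetical, and edge detection and smoothing are left out to keep it short.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 gray values."""
    # Integer approximation of the common 0.299/0.587/0.114 luma weights.
    return [[(299 * r + 587 * g + 114 * b) // 1000 for (r, g, b) in row]
            for row in pixels]

def invert(gray):
    """Negative: dark text on light background becomes light on dark."""
    return [[255 - v for v in row] for row in gray]

def stretch_contrast(gray):
    """Linearly rescale so the darkest value is 0 and the brightest 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image, nothing to stretch
        return [row[:] for row in gray]
    return [[(v - lo) * 255 // (hi - lo) for v in row] for row in gray]

def threshold(gray, cutoff=128):
    """Reduce the color depth to two levels (black and white)."""
    return [[255 if v >= cutoff else 0 for v in row] for row in gray]

def apply_filters(image, filters):
    """Run the chosen filters in order; each takes and returns an image."""
    for f in filters:
        image = f(image)
    return image
```

A user could then try different combinations, for example apply_filters(img, [to_gray, stretch_contrast, threshold]) for one screen and apply_filters(img, [to_gray, invert]) for another, and keep whichever gives better OCR results.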

I was not sure whether any preprocessing was done in the RC2 release. I tried to modify the function "doFind(PSC ptn)" in region.java to convert the image to grayscale before OCR processing, but I could not see any improvement in OCR. I did not try further because my Eclipse environment is not completely set up. Does Sikuli do any preprocessing of the image before calling the OCR?

It would be nice to have the following OCR support in Sikuli:

1. An option for the user to select the language (already requested).
2. Tesseract supports training and the creation of box files. We should have an option to select user-trained files.
3. There are many commercial OCR tools that have higher accuracy and better support for other languages. If the Sikuli OCR design can be made modular (as defined in the blueprint), the user should be able to use another OCR engine.

Other observations in the current OCR

1. The OCR can recognize the text, but the click fails.
    If a screen contains the text "Search" and I try click("Search"), the click fails. But when I get the text on the screen using the text() API and print it, the output contains all the strings, including "Search".
    I think we may need some improvement in how the text returned by OCR is searched.
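
One way to make such a search more tolerant is approximate matching of the words returned by OCR. The sketch below uses difflib from the Python standard library; the function find_word and the 0.8 threshold are assumptions chosen for illustration, and this is not how Sikuli searches text internally.

```python
import difflib

def find_word(ocr_text, target, min_ratio=0.8):
    """Return the OCR word most similar to target, or None if none is close."""
    best_word, best_ratio = None, 0.0
    for word in ocr_text.split():
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the strings share fewer characters in order.
        ratio = difflib.SequenceMatcher(
            None, word.lower(), target.lower()).ratio()
        if ratio > best_ratio:
            best_word, best_ratio = word, ratio
    return best_word if best_ratio >= min_ratio else None
```

With this, a misread such as "Searcn" would still be accepted as a match for "Search", while unrelated words are rejected. Mapping the matched word back to a screen position for the click is a separate problem that this sketch does not address.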

RaiMan (raimund-hocke) wrote :

**** comment #4 was answered by the developers

--- tesseract 3
The tesseract 3 branch is still under development, not merged back to the main develop branch yet.

--- user option to apply image filtering individually
We hope to do these things automatically and keep the user interface as simple as possible.

--- where is the preprocessing done?
Yes, Sikuli does a lot of preprocessing. The code is in sikuli-script/src/main/native, not at the Java level.

--- additional features/options
select language and option to select user trained files: These two would be possible in the tesseract 3 branch.

--- user should be able to use other OCR.
No plan to do this yet.

--- Other observations in the current OCR
This would be improved in the tesseract 3 branch as well.

Tsung-Hsiang Chang (vgod) wrote :

Let me briefly summarize the progress on the OCR research we are doing for Sikuli.

1. Recently I've implemented a new OCR algorithm designed for small screen text (from the paper "Recognition of Screen-Rendered Text", ICPR '06). However, it turns out that this algorithm doesn't perform as well as the authors claimed in the paper. It's even worse than Tesseract OCR, so for now we will continue using Tesseract as Sikuli's OCR engine.

2. We are migrating from Tesseract 2 to Tesseract 3. One significant advantage of Tesseract 3 is that it supports many more languages such as Chinese and Japanese. We are also working on making a simple OCR trainer so Sikuli users can train the OCR engine using the fonts installed on their systems.

3. Improving OCR performance is very tricky. There are many parameters to tune and many preprocessing steps that could improve it. We put a collection of screenshots with labeled ground truth in our source repo, so everyone can try to improve the OCR algorithm and simply run the tests to see whether it really gets better or worse. You are welcome to fork our code and try any possible improvements, or even to provide more labeled screenshots to make our data set more diverse.
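
To show how "better or worse" could be measured against labeled ground truth, here is a sketch in plain Python that computes a character error rate from the edit distance. The function names and the choice of metric are assumptions for illustration; the tests in the Sikuli repo may score results differently.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def char_error_rate(recognized, truth):
    """Edit distance divided by the length of the ground truth."""
    if not truth:
        return 0.0 if not recognized else 1.0
    return edit_distance(recognized, truth) / float(len(truth))

def mean_error_rate(samples):
    """samples: non-empty list of (recognized, truth) pairs."""
    return sum(char_error_rate(r, t) for r, t in samples) / len(samples)
```

Running mean_error_rate over the same labeled screenshots before and after a change gives a single number to compare; lower is better.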

RaiMan (raimund-hocke) on 2011-10-05
description: updated
yogesh joshi (yjoshi) wrote :

Dear Sebastien Pinel (pinel-sebastien),
   I am training Tesseract for the font "OCR A extended". The training process completes without errors and eng.traineddata is created, but I am getting wrong results. Please send me your training files (.tif, .box, .tr, eng.traineddata) as soon as possible.

RaiMan (raimund-hocke) wrote :

@ yogesh

I have added both of you (you and Sebastien) to the subscription list.

So both of you have a chance to read this and be notified.

RaiMan (raimund-hocke) on 2012-11-02
Changed in sikuli:
milestone: x1.0 → none
assignee: nobody → RaiMan (raimund-hocke)
RaiMan (raimund-hocke) on 2012-11-02
tags: added: ocr
RaiMan (raimund-hocke) on 2013-02-21
tags: added: fkt-text
removed: ocr
RaiMan (raimund-hocke) on 2013-02-21
Changed in sikuli:
importance: High → Low