******* this report is a summary of known problems and feature requests
*** for recent status information after the release of rc3, see comment #6
The text recognition feature (OCR - Region.text()), together with the ability to find text in an image, is still experimental and under development.
These are the currently reported bugs:
bug 777660: text recognition errors with some fonts
bug 783082: [request] want font parameters for text recognition
bug 735434: Text extraction from Images fails in some cases on colored backgrounds
bug 695616: Inconsistency in text recognition and matching, especially with integers-as-text!
bug 695650: find(text).text() does not return same text
bug 701005: text() always returns text with trailing x'200A20'
bug 701012: text() does not return all intervening blanks, add's others
bug 795391: [request] OCR/tesseract: allow new training sets for other languages and more tesseract features
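As a possible workaround for the whitespace-related bugs above (701005, 701012, 695650), the string returned by Region.text() can be normalized before it is compared. This is only a minimal sketch: the helper names normalize_ocr_text and same_text are hypothetical (not part of Sikuli), and it assumes that x'200A20' denotes trailing whitespace characters.

```python
import re

def normalize_ocr_text(raw):
    """Collapse every run of whitespace to one blank and strip both ends.

    Hypothetical helper, not part of Sikuli. It assumes the trailing
    x'200A20' reported in bug 701005 consists of whitespace characters.
    """
    return re.sub(r"\s+", " ", raw).strip()

def same_text(a, b):
    """Compare two OCR results while ignoring whitespace differences."""
    return normalize_ocr_text(a) == normalize_ocr_text(b)
```

In a script, the result of Region.text() would be passed through normalize_ocr_text() before matching. This does not help with characters that were recognized wrongly, only with missing or additional blanks.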
Other observed oddities:
-- there are problems with text that is not in English
-- very small and very large fonts may not work
-- multiline text causes problems
-- intervening/preceding/trailing graphics and symbols are interpreted as text
Tip when using Region.text():
Currently you get the best results when the region contains exactly one line of text and nothing but text (no graphics/symbols), in English. If you can influence it, make the text as large as possible.
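If the region cannot be restricted to exactly one line, a script can at least keep only the first non-empty line of the result. The following is a rough sketch under assumptions: read_single_line is a hypothetical helper, and FakeRegion is a stand-in that only exists so the example runs outside Sikuli. In a real script you would pass a Region instead.

```python
def read_single_line(region):
    """Return the first non-empty line of region.text(), stripped.

    Hypothetical helper: 'region' is any object with a text() method,
    such as a Sikuli Region.
    """
    for line in region.text().splitlines():
        stripped = line.strip()
        if stripped:
            return stripped
    return ""

class FakeRegion(object):
    """Stand-in for a Sikuli Region, for illustration only."""
    def __init__(self, canned_text):
        self._canned_text = canned_text

    def text(self):
        return self._canned_text
```

Usage with the stand-in: read_single_line(FakeRegion("\n  Hello \nWorld\n")) yields "Hello". This only discards additional lines; it does not improve the recognition itself, so a tightly cropped region is still preferable.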
-- additional information:
Internally the tesseract OCR engine (http://code.google.com/p/tesseract-ocr/) is used.
So its restrictions apply (e.g. minimum font size, ...).
More information can be found on the project's wiki.
Is there any plan to integrate tesseract version 3.00?
What issues would be involved?