Installing tesseract-ocr should also install tesseract-ocr-eng

Bug #224264 reported by Yesudeep J Mangalapilly
This bug affects 9 people
Affects             Status         Importance   Assigned to   Milestone
tesseract (Debian)  Fix Released   Unknown
tesseract (Ubuntu)  Fix Released   Undecided    Unassigned

Bug Description

Problem Description:
---------------------------
$ tesseract foo.tiff foo.text
Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/eng.unicharset

When tesseract is called without specifying the language parameter, it defaults to using English.
The tesseract-ocr package does not install the English language data by default, which causes
tesseract-ocr to output this error message.

The file in question does not exist at this particular location.
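
The failure described above can be detected before tesseract is invoked. The following is a minimal sketch, not part of the tesseract-ocr package: `has_lang_data` is a hypothetical helper, and the default directory is simply the one shown in the error message. It assumes the per-language `<lang>.unicharset` layout seen in this report; other tesseract releases may lay out their data files differently.

```shell
# has_lang_data LANG [DIR]
# Hypothetical helper: succeeds if DIR contains LANG.unicharset.
# DIR defaults to the path printed in the error message above.
has_lang_data() {
    lang="$1"
    dir="${2:-/usr/share/tesseract-ocr/tessdata}"
    [ -f "$dir/$lang.unicharset" ]
}
```

A caller could then print a hint instead of letting tesseract fail, for example: `has_lang_data eng || echo "install tesseract-ocr-eng" >&2`.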

Suggested Solution:
--------------------------
The tesseract-ocr package should include tesseract-ocr-eng as a dependency.

Changed in tesseract:
assignee: nobody → dcordero
status: New → In Progress
Changed in tesseract:
assignee: dcordero → nobody
status: In Progress → Confirmed
komputes (komputes) wrote :

I can confirm this bug as well. It seems that the English unicharset was not included in the package.

I am using Ubuntu 8.04.1 and tesseract-ocr 2.01-3. The workaround is to install the package manually. Open a terminal and run:
$ sudo apt-get install tesseract-ocr-eng

I found that you should use a high-quality image when converting to text through OCR, or you are likely to run into spelling errors. Please make English part of the default package (instead of German) or make it a dependency when packaging.

CSkau (clementskau-gmail) wrote :

This is also a problem on Ubuntu 9.10 beta (fully updated 2009-10-18)

Changed in tesseract (Debian):
status: Unknown → New
SabreWolfy (sabrewolfy) wrote :

Confirmed in fully patched Karmic.

SabreWolfy (sabrewolfy) wrote :

Also, the filename extension MUST be "tif", not "tiff".
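
If the installed version does insist on the "tif" extension as described in the comment above, the input name can be derived mechanically. This is only a sketch: `tif_name` is a hypothetical helper that rewrites the name, and whether a given tesseract release actually requires it is not verified here.

```shell
# tif_name PATH
# Hypothetical helper: prints PATH with a trailing ".tiff" changed to ".tif";
# any other name is printed unchanged.
tif_name() {
    case "$1" in
        *.tiff) printf '%s\n' "${1%.tiff}.tif" ;;
        *)      printf '%s\n' "$1" ;;
    esac
}
```

For example, `cp foo.tiff "$(tif_name foo.tiff)"` would leave a copy named foo.tif next to the original.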

Changed in tesseract (Debian):
status: New → Fix Released
Changed in tesseract (Debian):
status: Fix Released → New
Damiön la Bagh (kat-amsterdam) wrote :

This also happens with the Dutch version

kat@tab:~/Bureaublad$ tesseract DOC178.tif sint.txt -l nl
Unable to load unicharset file /usr/share/tesseract-ocr/tessdata/nl.unicharset of tesseract.

neuromancer (neuromancer) wrote :

In Ubuntu 10.10 (Maverick Meerkat), the tesseract-ocr-eng package is correctly installed when installing tesseract-ocr, so this bug is FIXED.

@Kat Amsterdam: I think your problem is a different one and not related to this bug.
However, check whether the directory named in your error message contains the right file:
cd /usr/share/tesseract-ocr/tessdata/
ls -al

In my case, running tesseract file.tif file.txt -l it
gave me the same error.
So I checked whether it.unicharset was present in the /usr/share/tesseract-ocr/tessdata/ folder and found that the file is named ita.unicharset.
Running tesseract file.tif file.txt -l ita works great :)
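
The check described above (looking in the data directory to find the code that -l expects) can be sketched as a small function. `list_langs` is a hypothetical helper, not a tesseract command, and it assumes the `<lang>.unicharset` naming seen in this report, with the directory from the error messages as its default.

```shell
# list_langs [DIR]
# Hypothetical helper: prints one language code per line, derived from the
# *.unicharset files found in DIR.
list_langs() {
    dir="${1:-/usr/share/tesseract-ocr/tessdata}"
    for f in "$dir"/*.unicharset; do
        [ -e "$f" ] || continue          # the glob matched nothing
        basename "$f" .unicharset
    done
}
```

Any code it prints is a candidate value for -l; a code that is not listed (such as "it" or "nl" in the comments above) has no data file under that name in the directory.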

Changed in tesseract (Ubuntu):
status: Confirmed → Fix Released
Changed in tesseract (Debian):
status: New → Fix Committed
Changed in tesseract (Debian):
status: Fix Committed → Fix Released