German text detection works very badly

Bug #677608 reported by benste
This bug affects 2 people
Affects: ocrfeeder (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: ocrfeeder

Here are some examples of letters that are replaced incorrectly when running OCR on the attached file:

Should be - detected as:

h as x
o as 0 or c
ö, ä, ü never appear in the output
ur as m
ü as ii
ä as é
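
To make pairs like these easier to collect, one option is to diff a known-good transcription against the OCR output. The following is only a minimal sketch: the function name `substitutions` and the sample strings are made up for illustration and are not taken from the attached file.

```python
from difflib import SequenceMatcher


def substitutions(expected, detected):
    """Return (expected_fragment, detected_fragment) pairs where the texts differ."""
    # autojunk=False keeps difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, expected, detected, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            pairs.append((expected[i1:i2], detected[j1:j2]))
    return pairs


# Hypothetical sample: the word "für" read back with the umlaut split in two.
for should_be, got in substitutions("für", "fiir"):
    print(repr(should_be), "as", repr(got))
```

Running this over a whole page and counting the pairs would show which confusions dominate, which may help when comparing engines or language settings.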

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: ocrfeeder 0.6.6-3
ProcVersionSignature: Ubuntu 2.6.35-22.35-generic 2.6.35.4
Uname: Linux 2.6.35-22-generic x86_64
NonfreeKernelModules: nvidia
Architecture: amd64
Date: Fri Nov 19 20:08:03 2010
EcryptfsInUse: Yes
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101007)
PackageArchitecture: all
ProcEnviron:
 LANGUAGE=de_DE:de:en_GB:en
 LANG=de_DE.utf8
 SHELL=/bin/bash
SourcePackage: ocrfeeder

benste (benste) wrote :
Daniel Koć (kocio) wrote :

Which OCR engine did you use for this? Did you add the -l option to the settings (for German it should be "-l deu" for Tesseract and "-l grm" for CuneiForm)? I think this was your problem.

I've just made extensive changes to the OCR help page; read more about language handling there:

https://help.ubuntu.com/community/OCR#OCRFeeder

benste (benste) wrote :

Hi, I was using the default engine shipped in Ubuntu. According to the settings, it is:

Tesseract
/usr/bin/tesseract
$IMAGE $FILE; cat $FILE.txt

There is nothing there to set a language :-)
Should I add it under "arguments"?

+ What about adding a GUI element to choose a language within OCRFeeder?

Daniel Koć (kocio) wrote :

@benste: try changing the arguments like this:

-l deu $IMAGE $FILE; cat $FILE.txt

I'm also looking for a GUI language chooser.
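
To see what command line such an arguments field might turn into, here is a rough sketch. It assumes that OCRFeeder simply substitutes $IMAGE and $FILE into the string, which this thread does not confirm; `expand_arguments` and the file names are hypothetical. Tesseract's usage line reads `tesseract imagename outputbase [-l lang]`, so the sketch places `-l deu` after the two file arguments. Whether the position matters for the version in this report is also an assumption.

```python
import shlex
from string import Template


def expand_arguments(template, image, out_base):
    """Substitute $IMAGE and $FILE in an engine argument template (hypothetical helper)."""
    return Template(template).substitute(
        IMAGE=shlex.quote(image),   # quote paths so spaces survive the shell
        FILE=shlex.quote(out_base),
    )


# Language option placed after image and output base, following Tesseract's usage line.
template = "$IMAGE $FILE -l deu; cat $FILE.txt"
print("/usr/bin/tesseract " + expand_arguments(template, "page.png", "/tmp/page"))
```

Pasting the printed line into a terminal is a quick way to check whether Tesseract itself produces text, independent of OCRFeeder.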

benste (benste) wrote :

Regardless of where you add this, the result is an empty text after running OCR.

Daniel Koć (kocio) wrote :

So that's a different story: no output means that you probably need to install additional language data. For German it's in the tesseract-ocr-deu package (or tesseract-ocr-deu-f for Fraktur script).
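
A quick way to check whether the language data is actually present is to look for files named after the language code in Tesseract's data directory. This is only a sketch: the directory below is an assumption about where the Ubuntu packages install their data, and `has_language_data` is a hypothetical helper, not part of Tesseract or OCRFeeder.

```python
from pathlib import Path


def has_language_data(tessdata_dir, lang):
    """Return True if any file named '<lang>.<something>' exists in tessdata_dir."""
    directory = Path(tessdata_dir)
    if not directory.is_dir():
        return False
    return any(entry.name.startswith(lang + ".") for entry in directory.iterdir())


# ASSUMPTION: the data directory used by the Ubuntu packages; adjust if yours differs.
print(has_language_data("/usr/share/tesseract-ocr/tessdata", "deu"))
```

If this prints False after installing the package, the data probably lives somewhere else, or the package did not install what the engine expects.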

I don't know if the 2.x series is good enough for German, but you may also consider this PPA, which contains fresh 3.x packages and a few other fresh OCR tools, including another good engine, CuneiForm:

https://launchpad.net/~alex-p/+archive/notesalexp/

The developer I talked with about new official packages said there is a problem with Debian's "free" constraints on language data and that he needs a few days to work it out, but those PPA files work well for Polish.

benste (benste) wrote :

I installed the packages, but still nothing is detected.
+ Shouldn't the tesseract-ocr-deu package be installed automatically by the language manager?

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ocrfeeder (Ubuntu):
status: New → Confirmed