German text detection works very badly

Bug #677608 reported by benste on 2010-11-19

This bug affects 2 people

Affects             Status     Importance  Assigned to  Milestone
ocrfeeder (Ubuntu)  Confirmed  Undecided   Unassigned

Bug Description

Binary package hint: ocrfeeder

Here are some examples of letters that are replaced incorrectly when running OCR on the attached file.

Should be - detected as:

h as x
o as 0 or c
ö, ä, ü do not appear at all
ur as m
ü as ii
ä as é

ProblemType: Bug
DistroRelease: Ubuntu 10.10
Package: ocrfeeder 0.6.6-3
ProcVersionSignature: Ubuntu 2.6.35-22.35-generic 2.6.35.4
Uname: Linux 2.6.35-22-generic x86_64
NonfreeKernelModules: nvidia
Architecture: amd64
Date: Fri Nov 19 20:08:03 2010
EcryptfsInUse: Yes
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Release amd64 (20101007)
PackageArchitecture: all
ProcEnviron:
 LANGUAGE=de_DE:de:en_GB:en
 LANG=de_DE.utf8
 SHELL=/bin/bash
SourcePackage: ocrfeeder

benste (benste) wrote on 2010-11-19:

#1

bpb geschi_ s1.JPG (240.2 KiB, image/jpeg)
Dependencies.txt (5.3 KiB, text/plain; charset="utf-8")

Daniel Koć (kocio) wrote on 2011-02-05:

#2

Which OCR engine did you use for this? Did you add the -l option to the settings (for German it should be "-l deu" for Tesseract and "-l ger" for CuneiForm)? I think this was your problem.

I've just made extensive changes to the OCR help page; read more about language handling there:

https://help.ubuntu.com/community/OCR#OCRFeeder

benste (benste) wrote on 2011-02-05:

#3

Hi, I was using the default engine shipped in Ubuntu. Looking at the settings, it is:

Tesseract
/usr/bin/tesseract
$IMAGE $FILE; cat $FILE.txt

There is nothing there to set a language :-)
Should I add it under "arguments"?

+ What about adding a GUI element to choose a language within OCRFeeder?

Daniel Koć (kocio) wrote on 2011-02-07:

#4

@benste: try changing the arguments like this:

-l deu $IMAGE $FILE; cat $FILE.txt

I am also looking for a GUI language chooser.
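
To see what the arguments suggested above turn into, here is a minimal sketch that expands them by hand. It assumes OCRFeeder replaces $IMAGE and $FILE textually before running the engine; the two paths are hypothetical stand-ins, and nothing is actually executed. Tesseract's own usage message lists the language option after the output base (tesseract imagename outputbase [-l lang]), so that ordering is printed as well.

```shell
#!/bin/sh
# Hypothetical stand-ins for the values OCRFeeder would substitute.
IMAGE="/tmp/page.jpg"
FILE="/tmp/ocr_out"
ENGINE="/usr/bin/tesseract"   # engine path from the settings quoted in comment #3

# Print (do not run) the command with "-l LANG" placed before the image.
command_lang_first() {
    printf '%s -l %s %s %s; cat %s.txt\n' "$ENGINE" "$1" "$IMAGE" "$FILE" "$FILE"
}

# Print the variant with "-l LANG" after the output base, as in
# Tesseract's usage message.
command_lang_last() {
    printf '%s %s %s -l %s; cat %s.txt\n' "$ENGINE" "$IMAGE" "$FILE" "$1" "$FILE"
}

command_lang_first deu
command_lang_last deu
```

Running one of the printed lines in a terminal, with a real image path, shows Tesseract's own messages directly, which may help when OCRFeeder only reports empty text.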

benste (benste) wrote on 2011-02-07:

#5

Regardless of where I add this, the result is empty text after running OCR.

Daniel Koć (kocio) wrote on 2011-02-07:

#6

So that's a different story: no output means that you probably need to install additional language data. For German it's in the tesseract-ocr-deu package (or tesseract-ocr-deu-f for Fraktur script).
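
One way to check whether the German data is actually in place is to look for deu.* files in Tesseract's data directory. The sketch below assumes the directory is /usr/share/tesseract-ocr/tessdata, which may differ between releases; has_lang_data is a hypothetical helper name, not part of any package.

```shell
#!/bin/sh
# Succeed if the directory $1 holds at least one data file for language $2
# (for example deu.traineddata, or several deu.* files in older releases).
has_lang_data() {
    dir=$1
    lang_code=$2
    for f in "$dir/$lang_code".*; do
        # If the pattern matched nothing, $f is the literal pattern
        # and the existence test fails.
        if [ -e "$f" ]; then
            return 0
        fi
    done
    return 1
}

# Assumed default location; override TESSDATA_DIR for the local installation.
TESSDATA_DIR="${TESSDATA_DIR:-/usr/share/tesseract-ocr/tessdata}"

if has_lang_data "$TESSDATA_DIR" deu; then
    echo "German language data found in $TESSDATA_DIR"
else
    echo "No German language data in $TESSDATA_DIR"
fi
```

If nothing is found even though the package is installed, the data directory is probably somewhere else on that system.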

I don't know if the 2.x series is good enough for German, but you may also consider this PPA, which contains fresh 3.x packages and a few other fresh OCR tools, including another good engine, CuneiForm:

https://launchpad.net/~alex-p/+archive/notesalexp/

The developer I talked with about new official packages said there is a problem with Debian's "free" constraints on language data and that he needs a few days to work it out, but those PPA files work well for Polish.

benste (benste) wrote on 2011-02-07:

#7

I installed the packages, but still nothing is detected.
+ Shouldn't the tesseract-ocr-deu package be installed automatically by the language manager?

Launchpad Janitor (janitor) wrote on 2012-04-17:

#8

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in ocrfeeder (Ubuntu):
status:	New → Confirmed
