Extract text using optical character recognition (OCR)

Bug #483391 reported by Robert Ancell
This bug affects 29 people
Affects: Simple Scan
Status: In Progress
Importance: Wishlist
Assigned to: Unassigned

Bug Description

Simple Scan should offer a workflow to do optical character recognition (OCR) on the scanned text.
It is to be decided what this workflow should look like, but we should do it in two steps:

Milestone 1: some-ocr-at-all:
Get a minimum viable product: Add a button to the interface that reads "Recognize Text", and when it is clicked, the current page is saved (in an appropriate format) to /tmp/$something and the most mature OCR tool is invoked with that file as input.

Milestone 2: integrated-ocr:
Make the whole thing more integrated, so that Simple Scan does the scanning with settings optimized for OCR, automatically applies relevant image preprocessing, lets the user select the area to work on from within the application, and probably allows exporting to PDF with searchable text and neat stuff like that.

List of OCR engines / software that might be evaluated:

ocropus: http://code.google.com/p/ocropus/source/list
Cuneiform: https://launchpad.net/cuneiform-linux
tesseract-ocr: http://code.google.com/p/tesseract-ocr/source/list
Ocrad: http://www.gnu.org/software/ocrad/
OCRFeeder: https://live.gnome.org/OCRFeeder

Original Description:
Add a "Text" profile that automatically runs the scan through OCR and saves in .txt format

Tags: patch
Changed in simple-scan:
status: New → Triaged
importance: Undecided → Wishlist
Robert Ancell (robert-ancell) wrote :

Probably better to have an option in page menu -> "Convert to text"

summary: - Add text mode which uses OCR
+ Extract text using optical character recognition (OCR)
Robert Ancell (robert-ancell) wrote :

Tesseract seems the appropriate OCR engine to use:
http://code.google.com/p/tesseract-ocr/

Robert Ancell (robert-ancell) wrote :

See bug #519618 for barcode support

Rui Batista (ruiandrebatista) wrote :

Hi,

There is also a Linux port of the CuneiForm OCR engine here:
https://launchpad.net/cuneiform-linux

For Portuguese texts I got better results than with tesseract. My suggestion is to make the OCR program configurable: since most OCR engines on Linux are simple CLI programs, making a common interface to them, in Python for example, doesn't seem very difficult. But for a start I do think tesseract is a good choice.
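
A minimal sketch of such a common interface, written in shell to match the helper scripts later in this thread. The function name run_ocr and the OCR_RUNNER variable are made up for illustration; the tesseract and cuneiform invocations follow their usual command-line forms, but check each tool's --help before relying on them. Note that language codes are not guaranteed to be the same across engines.

```sh
#!/bin/sh
# Hypothetical wrapper (not part of Simple Scan): one entry point for
# several command-line OCR engines.
# Usage: run_ocr ENGINE LANG IMAGE OUTBASE   (text ends up in OUTBASE.txt)
# Set OCR_RUNNER=echo to print the command instead of executing it.
run_ocr() {
    engine=$1; lang=$2; image=$3; outbase=$4
    case "$engine" in
        tesseract)
            # tesseract appends ".txt" to the output base by itself
            set -- tesseract "$image" "$outbase" -l "$lang" ;;
        cuneiform)
            # cuneiform takes the full output file name via -o
            set -- cuneiform -l "$lang" -o "$outbase.txt" "$image" ;;
        *)
            echo "unsupported OCR engine: $engine" >&2
            return 1 ;;
    esac
    # With OCR_RUNNER unset this simply runs the assembled command.
    ${OCR_RUNNER:-} "$@"
}
```

For example, run_ocr cuneiform por page.tif page would write page.txt, and switching engines only means changing the first argument.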

Milan Bouchet-Valat (nalimilan) wrote :

IMHO the best way of integrating OCR would indeed be a menu or toolbar icon, which would open the default text processor for ODT files. People are unlikely to edit their text files as .txt in gedit... :-p

Changed in simple-scan:
assignee: nobody → Liel Fridman (lielft)
Robert Ancell (robert-ancell) wrote :

Liel, let me know if you need any help or modifications to simple-scan to make this easier

Liel Fridman (lielft-deactivatedaccount) wrote :

I think it should use HOCR for Hebrew OCR actions. What do you think?

Changed in simple-scan:
status: Triaged → In Progress
toobuntu (toobuntu) wrote :

If this helps... For English OCR, I have some helper scripts. I don't remember where I found them, prob. Ubuntu Forums.

/ocr$ ls
ocr.sh pdf2tif usage.txt

/ocr$ cat usage.txt
to generate a txt from a pdf # uses pdf2tif as a helper script
./ocr.sh <filename>.pdf

to produce only a tif from a pdf
./pdf2tif <filename>.pdf

/ocr$ cat ocr.sh
#! /bin/sh -e

# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.

./pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=/usr/bin

for j in *.tif
do
    x=$( basename "$j" .tif )
    "${progdir}/tesseract" "$j" "$x"
    # -f: depending on the tesseract version these files may not exist,
    # and a failing rm would abort the script under 'sh -e'
    rm -f "${x}.raw" "${x}.map"

    # un-comment next line if you want to remove the .tif files when done.
    # rm "$j"
done

/ocr$ cat pdf2tif
#! /bin/sh -e
# $Id: pdf2ps 6300 2005-12-28 19:56:24Z giles $
# Convert PDF to TIFF (adapted from Ghostscript's pdf2ps script).

# This definition is changed on install to match the
# executable name set in the makefile
GS_EXECUTABLE=gs

OPTIONS=""
while true
do
 case "$1" in
 -?*) OPTIONS="$OPTIONS $1" ;;
 *) break ;;
 esac
 shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    outfile=$( basename "$1" \.pdf ).tif
else
    echo "Usage: $( basename "$0" ) [gs-options] input.pdf [output.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec $GS_EXECUTABLE $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"

Liel Fridman (lielft-deactivatedaccount) wrote :

Well, I've attached a patch to the simple-scan.ui file.

Richard Laager (rlaager) wrote :

+1 to making the OCR engines configurable (again, since they're just command line calls). But I'd like to see an option when saving PDFs to OCR them for text (which should default *on*). The idea is to create a PDF that looks exactly like the scanned image, but is text searchable. The copy machine at work does this and it's quite nice.

Pablo Quirós (polmac1985) wrote :

You could also take a look at Ocropus as an OCR engine: http://code.google.com/p/ocropus/

311005901 (nitrousinacan-gmail) wrote :

I agree with Richard Laager (post #10). I would really like to see simple-scan insert the OCR results into the PDF file automatically. This feature would be comparable to Adobe Acrobat's OCR feature.

311005901 (nitrousinacan-gmail) wrote :

Comment #9, what is this patch patching?

Liel Fridman (lielft-deactivatedaccount) wrote :

Sorry, I do not understand how to implement this :(. Please assign somebody else.

Changed in simple-scan:
assignee: Liel Fridman (lielft) → nobody
Manish Sinha (मनीष सिन्हा) (manishsinha) wrote :

If a bug is not assigned, how can it be In Progress?

Changed in simple-scan:
status: In Progress → New
papukaija (papukaija) wrote :

Please mark this bug as triaged again. Thanks in advance.

Changed in simple-scan:
status: New → Confirmed
Kip Warner (kip) wrote :

The GNU Ocrad library might be well suited for this task. Ideally, regardless of whatever OCR library is used, there should be a means to have the extracted text automatically appended to a generated PDF as part of the selectable / searchable text layer.

This is the home page for the GNU Ocrad library:
http://www.gnu.org/software/ocrad

Andres Gomez (Tanty) (tanty) wrote :

OCRFeeder already provides this feature:
https://live.gnome.org/OCRFeeder

I don't know if it makes sense to add this to simple-scan, rather than invoking OCRFeeder with the document as parameter.

Another possibility is incorporating simple-scan features into OCRFeeder, or the other way around, but it doesn't sound too wise to reinvent the wheel once again.

Michael Nagel (nailor) wrote :

Info from a related blueprint:

Actually, it is Tesseract software. And there is already a GUI for it: VietOCR. Please check this: http://vietunicode.sourceforge.net/download/vietocr/readme.html

Michael Nagel (nailor) wrote :

I think this could be split into two milestones:

milestone 1: some-ocr-at-all:
Get a minimum viable product: Add a button to the interface that reads "Recognize Text", and when it is clicked, the current page is saved (in an appropriate format) to /tmp/$something and the most mature OCR tool is invoked with that file as input.

milestone 2: integrated-ocr:
Make the whole thing more integrated, so that Simple Scan does the scanning with settings optimized for OCR, automatically applies relevant image preprocessing, lets the user select the area to work on from within the application, and probably allows exporting to PDF with searchable text and neat stuff like that.

Michael Nagel (nailor)
description: updated
Changed in simple-scan:
status: Confirmed → Triaged
Damien Bally (dbally) wrote :

The attached patch builds against version 2.32.0.2. It overrides the "show_page_cb" function in ui.c.
Instead of calling the image viewer, the function invokes tesseract, then launches gedit to show the result.

I'm aware this patch is an ugly hack and that it is probably badly written and buggy.
I just made it for my own use. Maybe some people will find it useful: just change the tesseract command line option "-l fra" to the one corresponding to your language.

Sorry, I'm not sufficiently experienced in C and GTK programming to provide a better interface for now.

papukaija (papukaija)
tags: added: patch
Bernhard Reiter (ockham-razor) wrote :

Tesseract 3.0 has finally landed in Precise, and it has layout recognition, which can produce hOCR files that can in turn be used by tools such as hocr2pdf to add a (properly positioned) layer of searchable text to a PDF (as suggested for Milestone 2).

In order to fit in nicely with Simple Scan's ease-of-use, I'd suggest not adding an extra OCR button to the toolbar, but just performing OCR whenever text documents are scanned, possibly with an option (checkbox) to deactivate that feature in the settings dialog.
The settings dialog should also contain a combobox that lists installed tesseract languages, with the user's language pre-selected. Note that tesseract has somewhat unusual abbreviations for languages.
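
To illustrate the point about abbreviations: tesseract language packs use three-letter names (for example "deu" for German), while locale names start with a two-letter code. Below is a rough sketch of how the pre-selection could be derived from the user's locale. The function name tess_lang is hypothetical, and the table covers only a handful of languages.

```sh
#!/bin/sh
# Hypothetical helper: map a locale name such as "pt_BR.UTF-8" to the
# name of the matching tesseract language pack. Unknown locales fall
# back to English.
tess_lang() {
    loc=${1:-${LANG:-en_US.UTF-8}}
    # strip territory, encoding and modifier: "pt_BR.UTF-8" -> "pt"
    case "${loc%%[_.@]*}" in
        de) echo deu ;;
        es) echo spa ;;
        fr) echo fra ;;
        he) echo heb ;;
        it) echo ita ;;
        nl) echo nld ;;
        pt) echo por ;;
        *)  echo eng ;;
    esac
}
```

Called without an argument, tess_lang uses $LANG; the result would still need to be checked against the language packs that are actually installed.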

Then, when scanning a text document, the necessary steps for producing a searchable PDF are about as follows:
- Preprocess the image by running unpaper on it.
- Run tesseract on the image (for the language selected in the settings dialog), and tell it to produce an hOCR file (instructions -- in German, but easy enough to grasp: http://adnanvatandas.wordpress.com/2010/10/28/update-tesseract-3/ )
- Run hocr2pdf to add info from the hOCR file to the PDF.

Note that I don't know if these tools must really be executed, or if there are libraries that are shipped with them and that can be invoked instead.
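
Assuming the tools are simply run as external commands, the three steps above could be sketched in shell as follows. The names searchable_pdf, run, hocr_to_pdf and DRY_RUN are made up for this example. The input is assumed to be a PNM file, since older unpaper releases only read PNM, and the name of the hOCR file that tesseract writes depends on the tesseract version as far as I know.

```sh
#!/bin/sh
# Sketch: scanned image -> searchable PDF via unpaper, tesseract, hocr2pdf.
# Set DRY_RUN=1 to print the steps instead of executing them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# hocr2pdf reads the hOCR data from standard input
hocr_to_pdf() {
    hocr2pdf -i "$1" -o "$2" < "$3"
}

# Usage: searchable_pdf IMAGE.pnm LANG OUTPUT.pdf
searchable_pdf() {
    image=$1; lang=$2; pdf=$3
    base=${image%.*}
    clean=$base-clean.pnm

    # 1. preprocess the scan (deskewing, border and noise removal)
    run unpaper "$image" "$clean" || return 1

    # 2. OCR; the trailing "hocr" config name requests hOCR output
    run tesseract "$clean" "$base" -l "$lang" hocr || return 1

    # tesseract 3.0x writes BASE.html, later releases write BASE.hocr
    hocr=$base.hocr
    [ -f "$hocr" ] || hocr=$base.html

    # 3. combine the cleaned image and the text layer into a PDF
    run hocr_to_pdf "$clean" "$pdf" "$hocr"
}
```

With DRY_RUN=1 set, searchable_pdf page.pnm deu page.pdf only lists the three steps, which is handy for checking the file names before anything is run.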

I don't have much experience with Vala, so I'm afraid I can't implement this, but I hope this draft is still somewhat helpful.

Robert Ancell (robert-ancell) wrote :

Bernhard, that seems like the right experience to me.

Kip Warner (kip)
Changed in simple-scan:
assignee: nobody → Kip Warner (kip)
Kip Warner (kip)
Changed in simple-scan:
status: Triaged → In Progress
Kip Warner (kip) wrote :

You'll all be happy to know I am making good headway with implementing this feature, taking into account everything described above. Screen shots very soon.

Bob Jonkman (bjonkman) wrote :

I've been using YAGF for OCR (mostly with PDFs that have scanned images of MS-Word docs. sigh.)

 http://symmetrica.net/cuneiform-linux/yagf-en.html

 So, Milestone 1 in the bug report to launch something like YAGF would suit my purposes just fine!

Martin Pitt (pitti) wrote :

A friend of mine uses the attached script to post-process a couple of PDFs into ones with OCR'ed text (via tesseract). I haven't used it myself, but it might serve as inspiration.

silverballer47 (silverballer47) wrote :

In my lab, this would definitely be very useful as it is next-to-impossible to search for relevant text in scientific papers and journals that have been scanned in. Much appreciated, Kip!

Robert Ancell (robert-ancell) wrote :
Dave Chiluk (chiluk)
Changed in simple-scan:
assignee: Kip Warner (kip) → nobody