Simple Scan

Extract text using optical character recognition (OCR)

Bug #483391 reported by Robert Ancell on 2009-11-16

This bug affects 29 people

Affects	Status	Importance	Assigned to	Milestone
Simple Scan	In Progress	Wishlist	Unassigned

Bug Description

Simple Scan should offer a workflow to do optical character recognition (OCR) on the scanned text.
It is to be decided what this workflow should look like, but we should do it in two steps:

Milestone 1: some-ocr-at-all:
Get a minimum viable product: Add a button to the interface that reads "Recognize Text", and when it is clicked, the current page is saved (in an appropriate format) to /tmp/$something and the most mature OCR tool is invoked with that file as input.
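
A minimal sketch of what the Milestone 1 button handler might do, written in Python for brevity rather than in Simple Scan's own language. The function names (save_page_to_temp, tesseract_command, recognize_text) are hypothetical, and it assumes a Tesseract build that reads PNG and accepts the usual "tesseract <image> <outputbase> -l <lang>" invocation.

```python
import os
import subprocess
import tempfile


def save_page_to_temp(page_bytes, suffix=".png"):
    """Write the scanned page to a fresh temporary file and return its path."""
    fd, path = tempfile.mkstemp(prefix="simple-scan-", suffix=suffix)
    with os.fdopen(fd, "wb") as handle:
        handle.write(page_bytes)
    return path


def tesseract_command(image_path, lang="eng"):
    """Build the argv for: tesseract <image> <outputbase> -l <lang>."""
    output_base = os.path.splitext(image_path)[0]
    return ["tesseract", image_path, output_base, "-l", lang]


def recognize_text(page_bytes, lang="eng"):
    """Save the page, run the OCR tool on it, and return the recognized text."""
    image_path = save_page_to_temp(page_bytes)
    command = tesseract_command(image_path, lang)
    text_path = command[2] + ".txt"  # tesseract appends .txt to the output base
    try:
        subprocess.run(command, check=True)
        with open(text_path, encoding="utf-8") as handle:
            return handle.read()
    finally:
        # Remove the temporary image and text file whether or not OCR worked.
        for path in (image_path, text_path):
            if os.path.exists(path):
                os.remove(path)
```

Using mkstemp instead of a fixed /tmp name avoids clashes between two running instances.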

Milestone 2: integrated-ocr:
Make the whole thing more integrated, so that Simple Scan scans with settings optimized for OCR, automatically applies relevant image preprocessing, allows selecting the area to work on from within the application, and probably allows exporting to PDF with searchable text and neat stuff like that.

List of OCR engines / software that might be evaluated:

ocropus: http://code.google.com/p/ocropus/source/list
Cuneiform: https://launchpad.net/cuneiform-linux
tesseract-ocr: http://code.google.com/p/tesseract-ocr/source/list
Ocrad: http://www.gnu.org/software/ocrad/
OCRFeeder: https://live.gnome.org/OCRFeeder

Original Description:
Add a "Text" profile that automatically runs the scan through OCR and saves in .txt format

Robert Ancell (robert-ancell) on 2009-11-16

Changed in simple-scan:
status:	New → Triaged
importance:	Undecided → Wishlist

Robert Ancell (robert-ancell) wrote on 2009-12-10:

Probably better to have an option in page menu -> "Convert to text"

summary:

- Add text mode which uses OCR
+ Extract text using optical character recognition (OCR)

Robert Ancell (robert-ancell) wrote on 2010-01-19:

Tesseract seems the appropriate OCR engine to use:
http://code.google.com/p/tesseract-ocr/

Robert Ancell (robert-ancell) wrote on 2010-02-13:

See bug #519618 for barcode support

Rui Batista (ruiandrebatista) wrote on 2010-02-13:

Hi,

There is also a linux port of the cuneiForm OCR here:
https://launchpad.net/cuneiform-linux

For Portuguese texts I got better results than with tesseract. My suggestion is to make the OCR program configurable: since OCR engines on Linux are mostly simple CLI programs, making a common interface to them, in Python for example, doesn't seem very difficult. But for starting I do think tesseract is a good choice.
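
A rough sketch of such a common interface, assuming each engine is driven purely through its command line. The names ENGINES, build_command and run_ocr are made up for illustration. The command shapes follow the usual invocations of tesseract, cuneiform and ocrad as far as I know them; language codes and accepted image formats differ per engine (ocrad, for example, has no language option), so treat those details as assumptions to check.

```python
import subprocess

# Each engine maps to a function that builds its argv.
# Every builder is expected to leave the result in <output_base>.txt.
ENGINES = {
    # tesseract appends ".txt" to the output base on its own
    "tesseract": lambda image, base, lang: ["tesseract", image, base, "-l", lang],
    "cuneiform": lambda image, base, lang: ["cuneiform", "-l", lang, "-o", base + ".txt", image],
    # ocrad takes no language argument, so lang is ignored here
    "ocrad": lambda image, base, lang: ["ocrad", "-o", base + ".txt", image],
}


def build_command(engine, image, output_base, lang="eng"):
    """Return the command line for the chosen engine, or raise ValueError."""
    try:
        builder = ENGINES[engine]
    except KeyError:
        raise ValueError("unknown OCR engine: %s" % engine)
    return builder(image, output_base, lang)


def run_ocr(engine, image, output_base, lang="eng"):
    """Run the chosen engine and return the recognized text."""
    subprocess.run(build_command(engine, image, output_base, lang), check=True)
    with open(output_base + ".txt", encoding="utf-8") as handle:
        return handle.read()
```

Adding another engine then only means adding one entry to ENGINES; the caller picks the engine name from a setting.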

Milan Bouchet-Valat (nalimilan) wrote on 2010-02-13:

IMHO the best way of integrating OCR would indeed be a menu or toolbar icon, which would open the default text processor for ODT files. People are unlikely to edit their text files as .txt in gedit... :-p

Liel Fridman (lielft-deactivatedaccount) on 2010-02-27

Changed in simple-scan:
assignee:	nobody → Liel Fridman (lielft)

Robert Ancell (robert-ancell) wrote on 2010-03-01:

Liel, let me know if you need any help or modifications to simple-scan to make this easier

Liel Fridman (lielft-deactivatedaccount) wrote on 2010-03-02:

I think it should use HOCR for Hebrew OCR actions. What do you think?

Changed in simple-scan:
status:	Triaged → In Progress

toobuntu (toobuntu) wrote on 2010-03-02:

If this helps... For English OCR, I have some helper scripts. I don't remember where I found them, prob. Ubuntu Forums.

/ocr$ ls
ocr.sh pdf2tif usage.txt

/ocr$ cat usage.txt
to generate a txt from a pdf # uses pdf2tif as a helper script
./ocr.sh <filename>.pdf

to produce only a tif from a pdf
./pdf2tif <filename>.pdf

/ocr$ cat ocr.sh
#! /bin/sh -e

# takes one parameter, the path to a pdf file to be processed.
# uses custom script 'pdf2tif' to generate the tif files,
# generates them at 300x300 dpi.
# drops them in our current directory
# then runs $progdir/tesseract on them, deleting the .raw
# and .map files that tesseract drops.

./pdf2tif "$1"

# edit this to point to wherever you've got your tesseract binary
progdir=/usr/bin

for j in *.tif
do
    x=$( basename "$j" .tif )
    "${progdir}/tesseract" "$j" "$x"
    # -f: not every tesseract version writes these files, and a failing
    # rm would abort the script because of 'sh -e'
    rm -f "${x}.raw" "${x}.map"

    # un-comment next line if you want to remove the .tif files when done.
    # rm "$j"
done

/ocr$ cat pdf2tif
#! /bin/sh -e
# $Id: pdf2ps 6300 2005-12-28 19:56:24Z giles $
# Convert PDF to TIFF (adapted from Ghostscript's pdf2ps script).

# This definition is changed on install to match the
# executable name set in the makefile
GS_EXECUTABLE=gs

OPTIONS=""
while true
do
case "$1" in
-?*) OPTIONS="$OPTIONS $1" ;;
*) break ;;
esac
shift
done

if [ $# -eq 2 ]
then
    outfile=$2
elif [ $# -eq 1 ]
then
    outfile=$( basename "$1" \.pdf ).tif
else
    echo "Usage: $( basename "$0" ) [gs options] input.pdf [output.tif]" 1>&2
    exit 1
fi

# Doing an initial 'save' helps keep fonts from being flushed between pages.
# We have to include the options twice because -I only takes effect if it
# appears before other options.
exec $GS_EXECUTABLE $OPTIONS -q -dNOPAUSE -dBATCH -dSAFER -r300x300 -sDEVICE=tiffg3 "-sOutputFile=$outfile" $OPTIONS -c save pop -f "$1"

Liel Fridman (lielft-deactivatedaccount) wrote on 2010-03-06:

simple-scan.ui.patch (2.5 KiB, text/plain)

Well, I've attached a patch to the simple-scan.ui file.

Richard Laager (rlaager) wrote on 2010-03-09:

#10

+1 to making the OCR engines configurable (again, since they're just command line calls). But I'd like to see an option when saving PDFs to OCR them for text (which should default *on*). The idea is to create a PDF that looks exactly like the scanned image, but is text searchable. The copy machine at work does this and it's quite nice.

Pablo Quirós (polmac1985) wrote on 2010-04-14:

#11

You could also take a look at Ocropus as an OCR engine: http://code.google.com/p/ocropus/

311005901 (nitrousinacan-gmail) wrote on 2010-06-23:

#12

I agree with Richard Laager (post #10). I would really like to see simple-scan have the OCR recognition automatically inserted into the pdf file. This feature could be comparable to Adobe Acrobat's OCR feature.

311005901 (nitrousinacan-gmail) wrote on 2010-09-29:

#13

Comment #9, what is this patch patching?

Liel Fridman (lielft-deactivatedaccount) wrote on 2010-10-16:

#14

Sorry, I do not understand how to implement this :(. Please assign somebody else.

Changed in simple-scan:
assignee:	Liel Fridman (lielft) → nobody

Manish Sinha (मनीष सिन्हा) (manishsinha) wrote on 2011-05-12:

#15

If a bug is not assigned, how can it be In Progress?

Changed in simple-scan:
status:	In Progress → New

papukaija (papukaija) wrote on 2011-05-13:

#16

Please mark this bug as triaged again. Thanks in advance.

Changed in simple-scan:
status:	New → Confirmed

Kip Warner (kip) wrote on 2011-07-31:

#17

The GNU Ocrad library might be well suited for this task. Ideally, regardless of whatever OCR library is used, there should be a means to have the extracted text automatically appended to a generated PDF as part of the selectable / searchable text layer.

This is the home page for the GNU Ocrad library:
http://www.gnu.org/software/ocrad

Andres Gomez (Tanty) (tanty) wrote on 2011-09-23:

#18

OCRFeeder is already accomplishing this feature:
https://live.gnome.org/OCRFeeder

I don't know if it makes sense to add this to simple-scan, rather than invoking OCRFeeder with the document as parameter.

Another possibility is to incorporate simple-scan's features into OCRFeeder, or the other way around, but it doesn't sound too wise to reinvent the wheel once again.

Michael Nagel (nailor) wrote on 2011-11-28:

#19

Info from a related blueprint:

Actually, it is Tesseract software. And there is already GUI for it: VietOCR. Please check this: http://vietunicode.sourceforge.net/download/vietocr/readme.html

Michael Nagel (nailor) wrote on 2011-11-29:

#20

I think this could be split into two milestones:

milestone 1: some-ocr-at-all:
Get a minimum viable product: Add a button to the interface that reads "Recognize Text", and when it is clicked, the current page is saved (in an appropriate format) to /tmp/$something and the most mature OCR tool is invoked with that file as input.

milestone 2: integrated-ocr:
Make the whole thing more integrated, so that Simple Scan scans with settings optimized for OCR, automatically applies relevant image preprocessing, allows selecting the area to work on from within the application, and probably allows exporting to PDF with searchable text and neat stuff like that.

Michael Nagel (nailor) on 2011-12-03

description:

updated

Robert Ancell (robert-ancell) on 2012-01-03

Changed in simple-scan:
status:	Confirmed → Triaged

Damien Bally (dbally) wrote on 2012-02-25:

#21

ui.c.diff (1.7 KiB, text/plain)

The attached patch builds against version 2.32.0.2. It overrides the "show_page_cb" function in ui.c.
Instead of calling the image viewer, the function invokes tesseract, then launches gedit to show the result.

I'm aware this patch is an ugly hack and that it is probably badly written and buggy.
I just made it for my own use. Maybe some people will find it useful: just change the tesseract command line option "-l fra" to something corresponding to your language.

Sorry, I'm not sufficiently experienced in C and GTK programming to provide a better interface for now.

papukaija (papukaija) on 2012-02-25

tags:

added: patch

Bernhard Reiter (ockham-razor) wrote on 2012-02-26:

#22

Tesseract 3.0 has finally landed in Precise, and it has layout recognition, which can produce hOCR files that can in turn be used by tools such as hocr2pdf to add a (properly positioned) layer of searchable text to a PDF (as suggested for Milestone 2).

In order to fit in nicely with Simple Scan's ease-of-use, I'd suggest not adding an extra OCR button to the toolbar, but to just perform OCR whenever scanning text documents, possibly with an option (checkbox) to deactivate that feature in the settings dialog.
The settings dialog should also contain a combobox that lists installed tesseract languages, with the user's language pre-selected. Note that tesseract has somewhat unusual abbreviations for languages.
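
To illustrate the pre-selection: tesseract names its language data with three-letter codes (eng, deu, fra, ...) rather than the two-letter codes found in locale names, so some mapping is needed. The sketch below is only an assumption about how that could look; the table is deliberately incomplete, and tesseract_lang_for_locale is a made-up helper that takes the list of installed languages as an argument instead of querying tesseract itself.

```python
# Partial, hand-written mapping from locale language codes to
# tesseract language names. Extend as needed.
TESSERACT_LANGS = {
    "en": "eng",
    "de": "deu",
    "fr": "fra",
    "es": "spa",
    "it": "ita",
    "pt": "por",
    "nl": "nld",
    "he": "heb",
}


def tesseract_lang_for_locale(locale_name, installed, fallback="eng"):
    """Pick the tesseract language to pre-select for a locale like 'de_DE.UTF-8'.

    Returns the fallback when the locale's language is unknown or its
    tesseract data is not among the installed languages.
    """
    # Strip encoding (".UTF-8"), territory ("_DE") and modifier ("@euro").
    language = locale_name.split(".")[0].split("_")[0].split("@")[0].lower()
    code = TESSERACT_LANGS.get(language)
    return code if code in installed else fallback


print(tesseract_lang_for_locale("de_DE.UTF-8", ["eng", "deu"]))  # deu
```

The combobox would list the installed languages and use the returned value as its initial selection.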

Then, when scanning a text document, the necessary steps for producing a searchable PDF are about as follows:
- Preprocess the image by running unpaper on it.
- Run tesseract on the image (for the language selected in the settings dialog), and tell it to produce an hOCR file (instructions -- in German, but easy enough to grasp: http://adnanvatandas.wordpress.com/2010/10/28/update-tesseract-3/ )
- Run hocr2pdf to add info from the hOCR file to the PDF.

Note that I don't know if these tools must really be executed, or if there are libraries that are shipped with them and that can be invoked instead.
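
If the tools are simply executed, the three steps could be chained roughly as follows. This is a sketch under assumptions, not a tested implementation: the function names are invented, the scan is assumed to be a PNM file (the format unpaper works on), tesseract is assumed to have its "hocr" config available, and hocr2pdf is assumed to read the hOCR data on standard input. Depending on the tesseract version the hOCR output ends in .hocr or .html, so both are checked.

```python
import os
import subprocess


def searchable_pdf_commands(scan_pnm, pdf_path, lang="eng"):
    """Return the unpaper, tesseract and hocr2pdf command lines, in order."""
    stem = os.path.splitext(scan_pnm)[0]
    cleaned = stem + "-clean.pnm"
    hocr_base = stem + "-ocr"
    return [
        ["unpaper", scan_pnm, cleaned],                         # preprocess
        ["tesseract", cleaned, hocr_base, "-l", lang, "hocr"],  # OCR to hOCR
        ["hocr2pdf", "-i", cleaned, "-o", pdf_path],            # hOCR on stdin
    ]


def make_searchable_pdf(scan_pnm, pdf_path, lang="eng"):
    """Run the three steps and leave the searchable PDF at pdf_path."""
    unpaper_cmd, tesseract_cmd, hocr2pdf_cmd = searchable_pdf_commands(
        scan_pnm, pdf_path, lang)
    subprocess.run(unpaper_cmd, check=True)
    subprocess.run(tesseract_cmd, check=True)

    # The hOCR file extension depends on the tesseract version.
    hocr_base = tesseract_cmd[2]
    candidates = [hocr_base + ".hocr", hocr_base + ".html"]
    existing = [path for path in candidates if os.path.exists(path)]
    if not existing:
        raise FileNotFoundError("tesseract produced no hOCR output")

    with open(existing[0], "rb") as hocr:
        subprocess.run(hocr2pdf_cmd, stdin=hocr, check=True)
```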

I don't have much experience with Vala, so I'm afraid I can't implement this, but I hope this draft is still somewhat helpful.

Robert Ancell (robert-ancell) wrote on 2012-05-07:

#23

Bernhard, that seems like the right experience to me.

Kip Warner (kip) on 2014-04-03

Changed in simple-scan:
assignee:	nobody → Kip Warner (kip)

Kip Warner (kip) on 2014-04-04

Changed in simple-scan:
status:	Triaged → In Progress

Kip Warner (kip) wrote on 2014-04-04:

#24

You'll all be happy to know I am making good headway with implementing this feature, taking into account everything described above. Screen shots very soon.

Bob Jonkman (bjonkman) wrote on 2014-04-07:

#25

I've been using YAGF for OCR (mostly with PDFs that have scanned images of MS-Word docs. sigh.)

http://symmetrica.net/cuneiform-linux/yagf-en.html

So, Milestone 1 in the bug report to launch something like YAGF would suit my purposes just fine!

Martin Pitt (pitti) wrote on 2014-04-07:

#26

example CLI script to scan and use tesseract (2.5 KiB, text/x-sh)

A friend of mine uses the attached script to post-process a couple of PDFs into ones with OCR'ed text (via tesseract). I haven't used it myself, but it might serve as inspiration.

silverballer47 (silverballer47) wrote on 2014-04-07:

#27

In my lab, this would definitely be very useful as it is next-to-impossible to search for relevant text in scientific papers and journals that have been scanned in. Much appreciated, Kip!

Robert Ancell (robert-ancell) wrote on 2017-05-03:

#28

Migrated bug to GNOME https://bugzilla.gnome.org/show_bug.cgi?id=782107

Dave Chiluk (chiluk) on 2017-05-07

Changed in simple-scan:
assignee:	Kip Warner (kip) → nobody

Remote bug watches

gnome-bugs #782107
[RESOLVED OBSOLETE]
