Version 2.00 was released with international support

Bug #130848 reported by Pierre Slamich
Affects              Status         Importance   Assigned to   Milestone
tesseract (Debian)   Fix Released   Unknown
tesseract (Ubuntu)   Fix Released   Wishlist     Unassigned

Bug Description

http://code.google.com/p/tesseract-ocr/

Version 2.00 is now available and contains the following new features:

    * Support for English, French, Italian, German, Spanish, Dutch
    * Scripts to test accuracy against the original 1995 tests run by UNLV (see TestingTesseract)
    * Ability to train in other languages and scripts (see TrainingTesseract)

Tags: upgrade
Changed in tesseract:
importance: Undecided → Wishlist
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

REVU is down, so I am attaching the files here.

Here is the main program and the English data files. As soon as it is clear that the packaging is OK, I'll add the other 5 languages.

Changed in tesseract:
status: Unknown → New
Barry deFreese (bddebian) wrote :

At a quick glance the packaging seems OK. Does it really need a different source package for every language, or can each language's binary be built from a single source package?

Thanks.

Changed in tesseract:
status: New → Incomplete
Jeffrey Ratcliffe (jeffreyratcliffe) wrote : Re: [Bug 130848] Re: Version 2.00 was released with international support

On 16/08/07, Barry deFreese <email address hidden> wrote:
> At a quick glance the packaging seems OK. Does it really need a
> different source package for every language or can each language's binary
> be built from the single source package?

The tesseract-ocr project supplies the languages separately from the
main program. At the moment, there are only six languages. I imagine
future releases will have many more, and the language files are not
negligibly small.

Most people will only need one or two languages, and it seems
unnecessary to get them to download bags of stuff they don't need.

Barry deFreese (bddebian) wrote :

Jeffrey,

Agreed. What I am asking is if it really needs different source packages. Multiple binary packages (.debs) might make sense but different source packages?

By the way, REVU is back up. Thanks.

Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

Barry,

On 06/09/07, Barry deFreese <email address hidden> wrote:
> Agreed. What I am asking is if it really needs different source
> packages. Multiple binary packages (.debs) might make sense but
> different source packages?

Upstream releases the languages as different source packages. For
instance, they have just added another two languages to the original
six without changing the engine source. The engine source comes with no
languages and is useless as-is without one.

> By the way, REVU is back up. Thanks.

OK. I'll try and upload it this evening.

André Barmasse (barmassus) wrote :

Hi folks

To get around the compiling and language data copying process, which is not complicated but is unnecessary, I've built a deb package with all the available languages. I have tested it on Ubuntu Gutsy Gibbon and it installs everything just fine. I hope the excellent new version 2.0 will soon make its way into the repositories. Until then, feel free to download my package at

http://www.barmasse.org/gaestebereich/downloads/tesseract_2.01-1_i386.deb

When recognizing a TIF file, don't forget to select the correct recognition language by appending the -l option, with one of the language codes, to the command line. Like this:

tesseract inputimage.tif outputtext.txt -l eng/fra/ita/deu/deu-f/spa/nld/por
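
As a minimal sketch of the command above, assuming the slashes separate alternatives and a single code is passed per run, the following shell snippet checks a code against that list and prints the matching command instead of running it. The names SUPPORTED_LANGS and tesseract_cmd are made up for illustration; the file names are the placeholders from the command above.

```shell
# Language codes copied from the list in the command above.
SUPPORTED_LANGS="eng fra ita deu deu-f spa nld por"

# Hypothetical helper: print the tesseract command for one language code,
# or fail with a message if the code is not in the list.
tesseract_cmd() {
    image=$1
    output=$2
    lang=$3
    for code in $SUPPORTED_LANGS; do
        if [ "$code" = "$lang" ]; then
            printf 'tesseract %s %s -l %s\n' "$image" "$output" "$lang"
            return 0
        fi
    done
    printf 'unknown language code: %s\n' "$lang" >&2
    return 1
}

# prints: tesseract inputimage.tif outputtext.txt -l fra
tesseract_cmd inputimage.tif outputtext.txt fra
```

Printing the command rather than executing it keeps the sketch safe to try on a machine where tesseract or the language data is not installed yet.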

I hope everything works fine for you as it did for me. Tesseract rules!!

André Barmasse (barmassus) wrote :

Oops, sorry! Double entry. I just wanted to delete the stray quotation mark in the tesseract command line. Can someone delete my first entry? Thanks.

Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

On 12/10/2007, André Barmasse <email address hidden> wrote:
> To get around the not complicated, but unnecessary compiling and
> language data copying process I've built a deb packet with all the
> available languages. I have tested it with Ubuntu Gutsy Gibbon and it

I think this is a bad plan. The individual languages are around 1 MB
each. At the moment, there are 8, but this is likely to grow.

Whilst REVU was down, I attempted to get my packaging sponsored in
Debian. Whilst I am still waiting for a sponsor, I did go through a
couple of iterations of people finding (minor) issues and my fixing
them, including splitting off the (normally) unnecessary libraries
into a -deb package:

The package can be found on mentors.debian.net:
- URL: http://mentors.debian.net/debian/pool/main/t/tesseract
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract/tesseract_2.01-1.dsc

The languages packages are:

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-deu
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-deu/tesseract-deu_2.00-1.dsc

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-deu-f
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-deu-f/tesseract-deu-f_2.01-1.dsc

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-eng
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-eng/tesseract-eng_2.00-1.dsc

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-fra
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-fra/tesseract-fra_2.00-1.dsc

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-ita
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-ita/tesseract-ita_2.00-1.dsc

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-nld
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-nld/tesseract-nld_2.00-1.dsc

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-spa
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-spa/tesseract-spa_2.00-1.dsc

- URL: http://mentors.debian.net/debian/pool/main/t/tesseract-por
- Source repository: deb-src http://mentors.debian.net/debian unstable
main contrib non-free
- dget http://mentors.debian.net/debian/pool/main/t/tesseract-por/tesseract-por_2.01-1.dsc
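
The dget lines above follow one pattern, so as a rough sketch they could be generated in a loop. The helper name dsc_url and the BASE variable are made up for illustration, and the version numbers are simply the ones listed above (2.01-1 for deu-f and por, 2.00-1 for the rest), not something the script looks up.

```shell
# Pool URL taken from the listing above.
BASE=http://mentors.debian.net/debian/pool/main/t

# Hypothetical helper: print the .dsc URL for one language package.
# Any code other than deu-f or por falls through to 2.00-1, which is
# only correct for the languages listed above.
dsc_url() {
    lang=$1
    case $lang in
        deu-f|por) version=2.01-1 ;;
        *)         version=2.00-1 ;;
    esac
    printf '%s/tesseract-%s/tesseract-%s_%s.dsc\n' "$BASE" "$lang" "$lang" "$version"
}

# Print one dget command per language package, without running it.
for lang in deu deu-f eng fra ita nld spa por; do
    echo "dget $(dsc_url "$lang")"
done
```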

Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

On 12/10/2007, Jeffrey Ratcliffe <email address hidden> wrote:
> them, including splitting off the (normally) unnecessary libraries
> into a -deb package:

I meant, of course, a -dev package.

Launchpad Janitor (janitor) wrote :

[Expired for tesseract (Ubuntu) because there has been no activity for 60 days.]

Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

On 25/10/2007, Launchpad Janitor <email address hidden> wrote:
> [Expired for tesseract (Ubuntu) because there has been no activity for
> 60 days.]

I have attempted 6 times
(http://groups.google.co.uk/group/linux.debian.devel.mentors/browse_thread/thread/1c1f8baf77e212fc/341687dc97e00416?lnk=raot#341687dc97e00416)
to get this uploaded to Debian, without success. I will therefore
repackage this with an Ubuntu version number in the next couple of days
and upload it to REVU. Perhaps I will have more luck there.

Regards

Jeff

Changed in tesseract:
assignee: nobody → jeffreyratcliffe
status: Invalid → In Progress
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

This has now been accepted into Debian as 2.01-1. Please sync.

Changed in tesseract:
assignee: jeffreyratcliffe → nobody
Changed in tesseract:
status: New → Fix Released
André Barmasse (barmassus) wrote :

Hi Jeffrey

I just stopped by and read your comment on my download link. Of course you are right that there is no point in making a deb package that I am not updating. The idea of making one came to me when I urgently needed a good OCR program for Ubuntu Gutsy Gibbon, and there was only the old and rather crappy 1.X version of tesseract in the repositories. Tesseract 2.X is the best and most user-friendly one there is at the moment.

Yesterday I installed Ubuntu Hardy Heron Alpha 1 and was rather surprised to see that tesseract-ocr 2.01-3 is now in the repositories. Sadly, the necessary tesseract-ocr-languages are not available yet, at least at this early stage of the Hardy Heron release. But anyway, I am looking forward to using tesseract in the future, and if I can help your development with a small donation, please let me know, Jeffrey! Great work!

PS: Let me know if you would rather see me remove the deb package.

Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

On 06/12/2007, André Barmasse <email address hidden> wrote:
> Yesterday I installed the Ubuntu Hardy Heron Alpha 1 and was rather
> surprised to see that tesseract-ocr 2.01-3 is now in the repositories.
> Sadly enough without the necessary tesseract-ocr-languages available -
> at least for this early stage of Ubuntu Hardy Heron Ubuntu release. But

The main package was an upgrade, but I imagine that the language packs
take longer to get through QA as they are regarded as new packages.

Regards

Jeff

Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

The languages have now been synced.

Changed in tesseract:
status: In Progress → Fix Released
André Barmasse (barmassus) wrote :

Hi Jeffrey

Great work syncing the languages! Everything works fine on Hardy Heron!
Therefore I have removed the compiled deb package from my website.
