ocrmypdf program and man page disagree about options

Bug #1687308 reported by david braun
This bug affects 1 person
Affects: ocrmypdf (Ubuntu)
Status: Fix Released
Importance: Low
Assigned to: Unassigned

Bug Description

The man page for ocrmypdf claims there is a "--just-print" option, but the program rejects it. The man page also claims that "-n" does the same thing. It doesn't: the option is accepted, but nothing obvious happens.

ProblemType: Bug
DistroRelease: Ubuntu 17.04
Package: ocrmypdf 4.3.5-2
ProcVersionSignature: Ubuntu 4.10.0-20.22-generic 4.10.8
Uname: Linux 4.10.0-20-generic x86_64
ApportVersion: 2.20.4-0ubuntu4
Architecture: amd64
CurrentDesktop: Unity:Unity7
Date: Sun Apr 30 13:55:46 2017
EcryptfsInUse: Yes
InstallationDate: Installed on 2015-05-31 (699 days ago)
InstallationMedia: Ubuntu 14.04.2 LTS "Trusty Tahr" - Release amd64 (20150218.1)
PackageArchitecture: all
ProcEnviron:
 LANGUAGE=en_US
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: ocrmypdf
UpgradeStatus: Upgraded to zesty on 2017-04-28 (1 days ago)

Revision history for this message
david braun (braunster) wrote :
tags: added: manpage
Revision history for this message
Andreas Moog (ampelbein) wrote :

The option is "--just_print" (underscore instead of dash). And it seems to be working correctly for me, see http://paste.ubuntu.com/24656479/

What exactly is the error message you are getting?

Changed in ocrmypdf (Ubuntu):
importance: Undecided → Low
status: New → Incomplete
Revision history for this message
david braun (braunster) wrote :

Sorry for the misspelling! So when I try the correct option I get

    $ ocrmypdf --just_print input.pdf output.pdf
    Traceback (most recent call last):
      File "/usr/bin/ocrmypdf", line 11, in <module>
        load_entry_point('ocrmypdf==4.3.5', 'console_scripts', 'ocrmypdf')()
      File "/usr/lib/python3/dist-packages/ocrmypdf/__main__.py", line 1521, in run_pipeline
        pdfa_info = file_claims_pdfa(options.output_file)
      File "/usr/lib/python3/dist-packages/ocrmypdf/pdfa.py", line 131, in file_claims_pdfa
        pdf = pypdf.PdfFileReader(filename)
      File "/usr/lib/python3/dist-packages/PyPDF2/pdf.py", line 1081, in __init__
        fileobj = open(stream, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: 'output.pdf'
    $

which I don't understand at all - output.pdf is the output file - it shouldn't exist! Or if it does it should either be overwritten, a warning printed, or a confirmation requested.

The -n option does the same.

However - if I use "output.pdf" as the output file AND it exists AND it is a PDF file (but not a PDF/A) I get

    $ file *.pdf
    input.pdf: PDF document, version 1.7
    output.pdf: PDF document, version 1.7
    $ ocrmypdf --just_print input.pdf output.pdf
    WARNING - Output file is okay but is not PDF/A (seems to be No XMP metadata)

(note: Both input.pdf and output.pdf are the same)

BUT - if output.pdf is a PDF/A-2B file I get

    $ file *.pdf
    input.pdf: PDF document, version 1.7
    output.pdf: PDF document, version 1.5
    $ ocrmypdf --just_print input.pdf output.pdf
       INFO - Output file is a PDF/A-2B (as expected)

None of which is what I expected from the man page description

       -n, --just_print
              Don't actually run any commands; just print the pipeline.

Something isn't right!

BTW: the -n and --just_print options aren't listed in the SYNOPSIS section of the man page

Revision history for this message
Andreas Moog (ampelbein) wrote :

I agree that there is a bug with the -n option. From what I understand, it should only simulate the commands, not actually execute them. But the final check on the output.pdf seems to be unconditionally called, even if -n is used. That's why you get an error.
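
The unconditional final check described above can be illustrated with a minimal sketch. This is not ocrmypdf's actual code: `run_pipeline` here is a simplified stand-in, and `tasks`, `validate_output`, and `check_exists` are hypothetical names chosen for the example. It only shows the shape of the guard that would keep a dry run from opening an output file that was never written.

```python
from types import SimpleNamespace


def run_pipeline(options, tasks, validate_output):
    """Run each task (or only name it in a dry run), then check the output."""
    log = []
    for name, task in tasks:
        if options.just_print:
            log.append("would run: " + name)  # dry run: report, do not execute
        else:
            task()
            log.append("ran: " + name)
    # Hypothetical guard: a dry run never writes the output file, so the
    # final check is skipped instead of being called unconditionally.
    if not options.just_print:
        log.append(validate_output(options.output_file))
    return log


def check_exists(path):
    # Stand-in for the PDF/A check; like the real one, it opens the file.
    with open(path, "rb"):
        return "checked: " + path


dry = SimpleNamespace(just_print=True, output_file="no-such-output.pdf")
print(run_pipeline(dry, [("ocr", lambda: None)], check_exists))
```

Without the `if not options.just_print` guard, the last call would raise FileNotFoundError for the missing output file, which is the same failure shown in the traceback earlier in this report.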

You have to use the verbose option to see more of what ocrmypdf does. Like in my example, -n --verbose 2 will tell you what tasks would be run.

What do you expect -n to be doing?

Revision history for this message
James R Barlow (jbarlow83) wrote :

In upstream I removed both of these arguments. I suggest patching them out
of Ubuntu as well.

Revision history for this message
david braun (braunster) wrote :


That's unfortunate! Any reason why you removed the options?

Revision history for this message
James R Barlow (jbarlow83) wrote :

The code makes decisions at runtime based on the input file, so an argument
to skip executing all intermediates doesn't give an accurate picture of
what will happen. There is a --flowchart argument that produces an SVG file
showing the processing path which helps development a lot, but it's
probably not helpful to anyone else.

What sort of use did you have for it?

Revision history for this message
david braun (braunster) wrote :

Sorry for the delay.
I'm trying to translate the text in the attached image into English. I have
loaded the Tesseract RUS language, and executing
$ ocrmypdf -l rus --image-dpi 64 111684498_large_2.jpg 111684498_large_2.pdf
completes with the following messages
   INFO - Input file is not a PDF, checking if it is an image...
   INFO - Input file is an image
   INFO - Input image has no ICC profile, assuming sRGB
   INFO - Image seems valid. Try converting to PDF...
   INFO - Successfully converted to PDF, processing...
WARNING - 1: [tesseract] unsure about page orientation
   INFO - Output file is a PDF/A-2B (as expected)
But Google Translate produces garbage.
I was hoping to see what was being done by ocrmypdf to see if I could
figure out what might be the cause.

BTW - I chose the DPI randomly - how significant is this parameter?

Revision history for this message
James R Barlow (jbarlow83) wrote :

That scan is quite low resolution so it is hard to say how well any OCR
will work. I'd expect better than garbage, but a lot of errors.

The DPI is quite significant for checking whether a group of pixels is
noise or a glyph. It implies the minimum font size. 72 or 96 is a good
guess for screenshots (or 200 for a retina screen).
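
To make the link between DPI and font size concrete, here is a small worked example. It relies only on the definition of a point as 1/72 inch; the function name and the 12-pixel glyph height are assumptions for illustration, and nothing here reflects how Tesseract actually sets its noise threshold.

```python
def implied_point_size(pixel_height, dpi):
    """Point size implied for a glyph of pixel_height pixels at the declared DPI."""
    inches = pixel_height / dpi  # physical height the DPI value implies
    return inches * 72.0         # 1 point = 1/72 inch


# The same 12-pixel-tall mark reads as a different font size at each DPI.
for dpi in (64, 96, 300):
    print(dpi, implied_point_size(12, dpi))
```

The same 12-pixel mark is ordinary body text (13.5 pt) if the image is declared as 64 DPI, but under 3 pt if it is declared as 300 DPI, where it is far more likely to be treated as noise.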

One possibility is that ocrmypdf fails to encode Cyrillic under the current
settings and available system fonts. If you have problems with all Cyrillic
images (even high quality scans), you could try adding
--pdf-renderer=tesseract --output-type=pdf. That seems to work better for
non-Latin languages.

If you want to install the latest version instead of the Ubuntu version,
you could use the --sidecar argument to see what text is being found to
discern if the issue is PDF encoding or the image itself.

Aside: The "just print" feature would not have been helpful here even if it
worked.

Revision history for this message
david braun (braunster) wrote :

Now that you point it out I see what you mean. I looked at the image with
GIMP and see that the resolution is ... not so great. I tried to sharpen
the text up but, not being too skilled with GIMP, didn't succeed.

Thanks for your help. I'll try to get a better image and try again.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for ocrmypdf (Ubuntu) because there has been no activity for 60 days.]

Changed in ocrmypdf (Ubuntu):
status: Incomplete → Expired
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ocrmypdf - 5.4-1

---------------
ocrmypdf (5.4-1) unstable; urgency=medium

  * New upstream release.
  * Drop Testsuite: field.
    See Lintian tag unnecessary-testsuite-autopkgtest-header.
  * Bump standards version to 4.1.1 (no changes required).

 -- Sean Whitton <email address hidden> Sat, 14 Oct 2017 10:46:45 -0700

Changed in ocrmypdf (Ubuntu):
status: Expired → Fix Released