Load PDF with more than 1000 pages

Bug #1527935 reported by Tim Ritberg on 2015-12-19
This bug affects 1 person
Affects: gscan2pdf (Ubuntu)
Importance: Undecided
Assigned to: Unassigned

Bug Description

gscan2pdf does not load more than 1000 pages of a PDF.

I tried to load a PDF with 1200 Pages. Only 1000 were loaded.

I don't have a PDF with 1000 pages to test. Please start gscan2pdf from the command line with the --log option, reproduce the problem, quit, and post the log file:

gscan2pdf --log=log

Tim Ritberg (xpert-reactos) wrote:

Here are the first and the last log lines:
INFO - 1526 pages
....
INFO - Added /tmp/gscan2pdf-R5U6/FHcsY_awxQ.png at page 998 with resolution 72
INFO - New page filename /tmp/gscan2pdf-R5U6/6N4XIcC8s3.png, format Portable Network Graphics
INFO - New page filename x-999.ppm, format Portable pixmap format (color)
INFO - Added /tmp/gscan2pdf-R5U6/H4IStYd_IF.png at page 999 with resolution 72
INFO - New page filename /tmp/gscan2pdf-R5U6/YOwFGtUKz0.png, format Portable Network Graphics
INFO - Added /tmp/gscan2pdf-R5U6/IiVBoSdJcP.png at page 1000 with resolution 72
DEBUG - import_files finished /media/user/BigDocument.pdf
DEBUG - Started setting page_number_start from 1 to 1001
DEBUG - Finished setting page_number_start from 1 to 1001

The loading dialog offered to load pages 1 to 1526. I pressed OK.

Tim Ritberg (xpert-reactos) wrote:

I found another strange thing with this document: I tried to load the first 10 pages, but the first 30 pages were loaded:

DEBUG - import_files queued /media/user/BigDocument.pdf
INFO - Getting info for /media/user/BigDocument.pdf
INFO - Format: 'PDF document, version 1.5'
INFO - pdfinfo "/media/user/BigDocument.pdf"
INFO - Creator: LuraDocument PDF Compressor Server 5.7.69.47
Producer: LuraDocument PDF v2.47
CreationDate: Thu Aug 23 16:08:35 2012
ModDate: Thu Aug 23 16:08:35 2012
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1526
Encrypted: no
Page size: 1177.56 x 1487.52 pts
Page rot: 0
File size: 118923732 bytes
Optimized: yes
PDF version: 1.5

INFO - 1526 pages

DEBUG - import_files queued /media/user/BigDocument.pdf
INFO - pdfimages -f 1 -l 10 "/media/user/BigDocument.pdf" x
DEBUG - import_files started /media/user/BigDocument.pdf

INFO - New page filename x-000.ppm, format Portable pixmap format (color)
INFO - New page filename /tmp/gscan2pdf-9L29/S_ptSkIwvO.png, format Portable Network Graphics
INFO - New page filename x-001.ppm, format Portable pixmap format (color)
INFO - Added /tmp/gscan2pdf-9L29/RaNwlrqSnG.png at page 1 with resolution 72
INFO - New page filename /tmp/gscan2pdf-9L29/XJdVAR63iK.png, format Portable Network Graphics
INFO - New page filename x-002.pbm, format Portable bitmap format (black and white)
INFO - Added /tmp/gscan2pdf-9L29/fQOr8JWFUj.png at page 2 with resolution 72
...
DEBUG - import_files finished /media/user/BigDocument.pdf
DEBUG - Started setting page_number_start from 1 to 31
DEBUG - Finished setting page_number_start from 1 to 31

gscan2pdf uses pdfimages to extract the images from the PDF. It looks to me as if pdfimages can only write 1000 images per call. I assume the workaround would be to extract the images in batches of 1000.
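A minimal sketch of that batching idea, assuming pdfimages were really limited to 1000 images per invocation (the helper `page_batches` is hypothetical illustration, not gscan2pdf code):

```python
def page_batches(first, last, batch_size=1000):
    """Yield (first, last) page ranges of at most batch_size pages
    covering the inclusive range [first, last]."""
    start = first
    while start <= last:
        end = min(start + batch_size - 1, last)
        yield (start, end)
        start = end + 1

# Each range would become one invocation of the form seen in the log:
#   pdfimages -f <first> -l <last> "/media/user/BigDocument.pdf" x
for f, l in page_batches(1, 1526):
    print(f'pdfimages -f {f} -l {l} "/media/user/BigDocument.pdf" x')
```

For a 1526-page document this yields two ranges, (1, 1000) and (1001, 1526). As the later comments show, the real limit turned out not to be in pdfimages itself, so this workaround was not ultimately needed.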

Please confirm that, having imported pages 1-1000 in the first step, you can then import pages 1001-1526 in a second step.

If this works, I will:

a. raise a bug against pdfimages
b. code the workaround into gscan2pdf so that it does this internally.

If there are 30 images in the first 10 pages, then I would expect this behaviour.

Just out of interest, how many files are produced by the following command, and what are they named?

pdfimages -f 1 -l 1526 "/media/user/BigDocument.pdf" x

Tim Ritberg (xpert-reactos) wrote:

Take a look at this:
https://sourceforge.net/p/gscan2pdf/mailman/message/32904652/
-Feature request 82 (Scanning documents that are 1000 pages or more)
It seems there was/is a limit of 1000.

The import with two steps didn't work. Now I have 2000 images loaded.

On 20 December 2015 at 22:53, Tim Ritberg <email address hidden> wrote:
> -Feature request 82 (Scanning documents that are 1000 pages or more)
> Seems to be, that there was/is a limit to 1000.

I wrote that. It was about scanning and has nothing to do with importing a PDF.

> The import with two steps didn't work. Now I have 2000 Images loaded.

What about

pdfimages -f 1 -l 1526 "/media/user/BigDocument.pdf" x

?

Tim Ritberg (xpert-reactos) wrote:

The result is 1526 pbm-files and 3052 ppm-files.

Evidently every page has the image itself, and two masks.

How are the images named/numbered?

Tim Ritberg (xpert-reactos) wrote:

x-000.ppm - x-4576.ppm
x-002.pbm - x-4577.pbm

Tim Ritberg (xpert-reactos) wrote:

BTW these two mask images are shown in the overview on the left.

On 21 December 2015 at 09:51, Tim Ritberg <email address hidden> wrote:
> x-000.ppm - x-4576.ppm
> x-002.pbm - x-4577.pbm

Ah. OK. In that case I should be able to fix the problem. Just have to
create a test case first.
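The reported counts and numbering are consistent with three images per page, which also explains the earlier "30 images for 10 pages" observation; a quick back-of-the-envelope check:

```python
# 1526 pbm files plus 3052 ppm files share one running counter,
# x-000 .. x-4577, as reported above.
pbm = 1526
ppm = 3052
total = pbm + ppm            # 4578 files in all
per_page = total // 1526     # 3 images per page: the page itself plus two masks

assert total == 4578
assert per_page == 3
# Importing "pages 1-10" therefore produced 3 * 10 images:
print(per_page * 10)  # 30
```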

I've just fixed this in the upcoming version. However, it exposed a fundamental problem with the program architecture. Each page is stored as a temporary file. Perl does not close the file handles for temporary files until they go out of scope, so after around 1000 pages my machine runs out of file handles.

The medium term solution is to store the data in an SQLite database, rather than lots of temporary files, but that will require major surgery.
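A minimal sketch of that direction, assuming a simple single-table schema (hypothetical illustration, not gscan2pdf's actual design): one database connection replaces thousands of open temporary files.

```python
import sqlite3

# Keep page image data as blobs in one SQLite database instead of one
# temporary file (and open file handle) per page.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE pages (number INTEGER PRIMARY KEY,"
    " resolution REAL, data BLOB)"
)

# Storing 1500+ pages uses a single connection, not 1500+ file handles.
for n in range(1, 1527):
    con.execute("INSERT INTO pages VALUES (?, ?, ?)", (n, 72.0, b"\x89PNG"))
con.commit()

(count,) = con.execute("SELECT COUNT(*) FROM pages").fetchone()
print(count)  # 1526
```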
