Load PDF with more than 1000 pages

Bug #1527935 reported by Tim Ritberg
This bug affects 2 people
Affects: gscan2pdf (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

gscan2pdf does not load more than 1000 pages of a PDF.

I tried to load a PDF with 1200 pages. Only 1000 were loaded.

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

I don't have a PDF with 1000 pages to test. Please start gscan2pdf from the command line with the --log option, reproduce the problem, quit, and post the log file:

gscan2pdf --log=log

Revision history for this message
Tim Ritberg (xpert-reactos) wrote :

Here are the first and last log lines:
INFO - 1526 pages
....
INFO - Added /tmp/gscan2pdf-R5U6/FHcsY_awxQ.png at page 998 with resolution 72
INFO - New page filename /tmp/gscan2pdf-R5U6/6N4XIcC8s3.png, format Portable Network Graphics
INFO - New page filename x-999.ppm, format Portable pixmap format (color)
INFO - Added /tmp/gscan2pdf-R5U6/H4IStYd_IF.png at page 999 with resolution 72
INFO - New page filename /tmp/gscan2pdf-R5U6/YOwFGtUKz0.png, format Portable Network Graphics
INFO - Added /tmp/gscan2pdf-R5U6/IiVBoSdJcP.png at page 1000 with resolution 72
DEBUG - import_files finished /media/user/BigDocument.pdf
DEBUG - Started setting page_number_start from 1 to 1001
DEBUG - Finished setting page_number_start from 1 to 1001

The loading dialog asked me to load 1 to 1526 pages. I pressed ok.

Revision history for this message
Tim Ritberg (xpert-reactos) wrote :

I found another strange thing with this document. I tried to load the first 10 pages, but the first 30 pages were loaded:

DEBUG - import_files queued /media/user/BigDocument.pdf
INFO - Getting info for/media/user/BigDocument.pdf
INFO - Format: 'PDF document, version 1.5'
INFO - pdfinfo "/media/user/BigDocument.pdf"
INFO - Creator: LuraDocument PDF Compressor Server 5.7.69.47
Producer: LuraDocument PDF v2.47
CreationDate: Thu Aug 23 16:08:35 2012
ModDate: Thu Aug 23 16:08:35 2012
Tagged: no
UserProperties: no
Suspects: no
Form: none
JavaScript: no
Pages: 1526
Encrypted: no
Page size: 1177.56 x 1487.52 pts
Page rot: 0
File size: 118923732 bytes
Optimized: yes
PDF version: 1.5

INFO - 1526 pages

DEBUG - import_files queued /media/user/BigDocument.pdf
INFO - pdfimages -f 1 -l 10 "/media/user/BigDocument.pdf" x
DEBUG - import_files started /media/user/BigDocument.pdf

INFO - New page filename x-000.ppm, format Portable pixmap format (color)
INFO - New page filename /tmp/gscan2pdf-9L29/S_ptSkIwvO.png, format Portable Network Graphics
INFO - New page filename x-001.ppm, format Portable pixmap format (color)
INFO - Added /tmp/gscan2pdf-9L29/RaNwlrqSnG.png at page 1 with resolution 72
INFO - New page filename /tmp/gscan2pdf-9L29/XJdVAR63iK.png, format Portable Network Graphics
INFO - New page filename x-002.pbm, format Portable bitmap format (black and white)
INFO - Added /tmp/gscan2pdf-9L29/fQOr8JWFUj.png at page 2 with resolution 72
...
DEBUG - import_files finished /media/user/BigDocument.pdf
DEBUG - Started setting page_number_start from 1 to 31
DEBUG - Finished setting page_number_start from 1 to 31

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

gscan2pdf uses pdfimages to extract the images from the PDF. It looks to me as though pdfimages can only write 1000 images per call. I assume that the workaround would be to extract the images in batches of 1000 (a rough sketch of that follows below).

Please confirm that, having imported pages 1-1000 in a first step, you can then import pages 1001-1526 in a second step.

If this works, I will:

a. raise a bug against pdfimages
b. code the workaround into gscan2pdf so that it does this internally.
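
For illustration only, a rough Perl sketch of the batching workaround described above (this is not gscan2pdf code; the PDF path, page count and per-batch output roots are just placeholders):

use strict;
use warnings;

# Hypothetical sketch: call pdfimages once per block of 1000 pages
# instead of once for the whole document.
my $pdf    = '/media/user/BigDocument.pdf';
my $npages = 1526;
my $batch  = 1000;

for ( my $first = 1 ; $first <= $npages ; $first += $batch ) {
    my $last = $first + $batch - 1;
    $last = $npages if $last > $npages;

    # A distinct output root per batch keeps one batch's numbering
    # from colliding with the next.
    my $root = sprintf 'batch%04d', $first;
    system( 'pdfimages', '-f', $first, '-l', $last, $pdf, $root ) == 0
        or die "pdfimages failed for pages $first-$last: $?";
}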

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

If there are 30 images in the first 10 pages, then I would expect this behaviour.

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

Just out of interest, how many files are produced (and what are they named) by

pdfimages -f 1 -l 1526 "/media/user/BigDocument.pdf" x

?

Revision history for this message
Tim Ritberg (xpert-reactos) wrote :

Take a look at this:
https://sourceforge.net/p/gscan2pdf/mailman/message/32904652/
-Feature request 82 (Scanning documents that are 1000 pages or more)
It seems that there was/is a limit of 1000.

The import with two steps didn't work. Now I have 2000 images loaded.

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote : Re: [Bug 1527935] Re: Load PDF with more than 1000 pages

On 20 December 2015 at 22:53, Tim Ritberg <email address hidden> wrote:
> -Feature request 82 (Scanning documents that are 1000 pages or more)
> It seems that there was/is a limit of 1000.

I wrote that. It was about scanning and has nothing to do with importing a PDF.

> The import with two steps didn't work. Now I have 2000 images loaded.

What about

pdfimages -f 1 -l 1526 "/media/user/BigDocument.pdf" x

?

Revision history for this message
Tim Ritberg (xpert-reactos) wrote :

The result is 1526 pbm files and 3052 ppm files.

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

Evidently every page has the image itself, and two masks.

How are the images named/numbered?

Revision history for this message
Tim Ritberg (xpert-reactos) wrote :

x-000.ppm - x-4576.ppm
x-002.pbm - x-4577.pbm

Revision history for this message
Tim Ritberg (xpert-reactos) wrote :

BTW these two mask images are shown in the overview at the left.

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

On 21 December 2015 at 09:51, Tim Ritberg <email address hidden> wrote:
> x-000.ppm - x-4576.ppm
> x-002.pbm - x-4577.pbm

Ah. OK. In that case I should be able to fix the problem. Just have to
create a test case first.
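
To make the naming issue concrete: pdfimages pads the image index to three digits (x-000) but keeps counting past x-999, so an importer that expected exactly three digits would stop after image 1000, which would match the behaviour in the first log. Below is a hypothetical Perl sketch of collecting and numerically sorting that output; it is only a guess at the kind of filename handling involved, not the actual gscan2pdf fix.

use strict;
use warnings;

# Match any number of digits and sort numerically rather than
# lexically, so that x-1000.ppm follows x-999.ppm and not x-100.ppm.
my @images =
    sort { ( $a =~ /-(\d+)\./ )[0] <=> ( $b =~ /-(\d+)\./ )[0] }
    grep { /^x-\d+\.(?:p[bp]m)$/ } glob 'x-*';

for my $file (@images) {
    my ($index) = $file =~ /-(\d+)\./;
    printf "image %d: %s\n", $index, $file;
}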

Revision history for this message
Jeffrey Ratcliffe (jeffreyratcliffe) wrote :

I've just fixed this in the upcoming version. However, it exposed a fundamental problem with the program architecture. Each page is stored as a temporary file. Perl does not close the file handles for temporary files until they are out of scope, so after around 1000 pages my machine runs out of file handles.

The medium-term solution is to store the data in an SQLite database, rather than in lots of temporary files, but that will require major surgery.
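
As a minimal illustration of the file-handle problem (not gscan2pdf code; the page loop and file contents are made up): File::Temp keeps the handle of each temporary file open for the lifetime of the object, so holding a thousand page objects means roughly a thousand open handles. Closing each handle straight after writing and keeping only the filename is one short-term way to stay under the per-process limit.

use strict;
use warnings;
use File::Temp qw(tempfile tempdir);

my $dir = tempdir( CLEANUP => 1 );

my @pages;
for my $n ( 1 .. 1000 ) {
    # tempfile() returns an open handle; kept in a long-lived
    # structure, it stays open until it goes out of scope.
    my ( $fh, $filename ) =
        tempfile( DIR => $dir, SUFFIX => '.png', UNLINK => 0 );
    print {$fh} "page $n image data\n";

    # Close immediately and keep only the filename, so we never
    # accumulate ~1000 open file handles at once.
    close $fh or die "close failed: $!";
    push @pages, $filename;
}

printf "%d page files created in %s\n", scalar @pages, $dir;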

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in gscan2pdf (Ubuntu):
status: New → Confirmed

Revision history for this message
larrybradley (larry-w-bradley) wrote :

My "fix"is to load and ocr 1/2 of the file and save it, then scan the remainder and save it, then piece together the two using PDF Arranger.
