Generated PDF documents too large

Bug #534122 reported by Robert Ancell on 2010-03-08
This bug affects 5 people
Affects      Status        Importance  Assigned to  Milestone
Simple Scan  Fix Released  Medium      Unassigned

Bug Description

The scanned images in the generated PDF documents are stored as GZIP-compressed RGB data. They should really be in DCT (JPEG) format; however, this is not currently supported in Cairo. (I heard at LCA2010 that this was coming in the next release, but I can't find a bug or link to point to.)

An adequate solution until Cairo supports this is to write the PDFs using another method (they are very simple, just one page per image). I haven't been able to find an adequate C-accessible PDF library, nor am I confident that I understand the PDF specification [1] well enough to just write the files with printf().

[1] http://www.adobe.com/devnet/pdf/pdf_reference.html
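
For illustration only, here is a minimal sketch of the "just write it by hand" approach described above: one page per image, with the JPEG bytes embedded unchanged behind the DCTDecode filter. It is written in Python rather than C for brevity and is not what Simple Scan does. The function name jpeg_pages_to_pdf and its parameters are hypothetical; the sketch assumes the caller already knows each image's pixel size and that the JPEGs are 8-bit RGB.

```python
def jpeg_pages_to_pdf(pages, dpi=300):
    """Build a PDF with one page per JPEG and return it as bytes.

    pages: list of (jpeg_bytes, width_px, height_px) tuples. The caller
    supplies the pixel size; this sketch does not parse the JPEG header and
    assumes 8-bit RGB data (a grayscale scan would need /DeviceGray).
    """
    def stream(dictionary, data):
        head = f"<< {dictionary} /Length {len(data)} >>\nstream\n".encode("ascii")
        return head + data + b"\nendstream"

    # Object 1 is the catalog, object 2 the page tree. Each page then uses
    # three consecutive objects: page dictionary, content stream, image.
    page_ids = [3 + 3 * i for i in range(len(pages))]
    kids = " ".join(f"{pid} 0 R" for pid in page_ids)
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        f"<< /Type /Pages /Count {len(pages)} /Kids [{kids}] >>".encode("ascii"),
    ]
    for pid, (jpeg, width, height) in zip(page_ids, pages):
        # Page size in points (1/72 inch) derived from pixels and scan dpi.
        w_pt = width * 72.0 / dpi
        h_pt = height * 72.0 / dpi
        objects.append(
            (f"<< /Type /Page /Parent 2 0 R "
             f"/MediaBox [0 0 {w_pt:.2f} {h_pt:.2f}] "
             f"/Contents {pid + 1} 0 R "
             f"/Resources << /XObject << /Im0 {pid + 2} 0 R >> >> >>"
             ).encode("ascii"))
        # Scale the unit square to the page size, then paint the image.
        content = f"q {w_pt:.2f} 0 0 {h_pt:.2f} 0 0 cm /Im0 Do Q"
        objects.append(stream("", content.encode("ascii")))
        # The JPEG data is stored as-is; the viewer decodes it via DCTDecode.
        objects.append(stream(
            f"/Type /XObject /Subtype /Image /Width {width} /Height {height} "
            f"/ColorSpace /DeviceRGB /BitsPerComponent 8 /Filter /DCTDecode",
            jpeg))

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += f"{number} 0 obj\n".encode("ascii") + body + b"\nendobj\n"

    xref_pos = len(out)
    out += f"xref\n0 {len(objects) + 1}\n".encode("ascii")
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += f"{offset:010d} 00000 n \n".encode("ascii")
    out += (f"trailer\n<< /Size {len(objects) + 1} /Root 1 0 R >>\n"
            f"startxref\n{xref_pos}\n%%EOF\n").encode("ascii")
    return bytes(out)
```

Because the JPEG data is copied verbatim, the resulting file should be only slightly larger than the sum of the input JPEGs.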

Changed in simple-scan:
status: New → Triaged
importance: Undecided → Medium
summary: - PDF images not compressed
+ Generated PDF documents too large
Colin Macdonald (cbm755) wrote :

Can you depend on ImageMagick and call system("convert temp.jpg my_output.pdf")?

That gives a pdf file only fractionally larger than the input jpg.

There are probably C bindings for imagemagick.
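
As a rough sketch of the suggestion above, the same "convert" call can be made without going through a shell, which avoids quoting problems with file names. This is Python for brevity, not the C that Simple Scan would need. The helper names build_convert_command and images_to_pdf are hypothetical, and the sketch assumes that passing several input images to convert yields one PDF page per image.

```python
import shutil
import subprocess


def build_convert_command(image_paths, pdf_path):
    """Return the argument list for: convert <images...> <output.pdf>."""
    return ["convert", *image_paths, pdf_path]


def images_to_pdf(image_paths, pdf_path):
    """Run ImageMagick if it is installed; return False if it is not."""
    if shutil.which("convert") is None:
        return False  # caller can fall back to another PDF writer
    # check=True raises CalledProcessError if convert exits with an error.
    subprocess.run(build_convert_command(image_paths, pdf_path), check=True)
    return True
```

For example, images_to_pdf(["temp.jpg"], "my_output.pdf") runs the same command as the system() call quoted above.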

Cairo supports jpeg in pdf since 1.9.2, using the cairo_surface_set_mime_data API:

http://cairographics.org/news/cairo-1.9.2/

But there's still no 1.10.0 stable version yet.

Robert Ancell (robert-ancell) wrote :

Implemented by calling ImageMagick "convert" from Simple Scan if it is installed. It's a bit of a hack but what it does is:
1. Use Cairo if "convert" is not in the path
2. Save all pages as JPEG and Deflate compressed TIFF images (this was required as text is not well compressed in JPEG)
3. Build a PDF using convert and the smallest of each page
4. Delete all the temporary files
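
Steps 2 to 4 above can be sketched roughly as follows. This is an illustration in Python, not the actual Simple Scan code; the function names and the idea of passing one list of candidate files per page are assumptions made for the example.

```python
import os


def smallest_per_page(page_candidates):
    """Pick the smallest file for each page.

    page_candidates: one list of candidate paths per page, for example
    [["page1.jpg", "page1.tif"], ["page2.jpg", "page2.tif"]].
    """
    return [min(paths, key=os.path.getsize) for paths in page_candidates]


def remove_temporary_files(page_candidates):
    """Delete every candidate file once the PDF has been built."""
    for paths in page_candidates:
        for path in paths:
            os.remove(path)
```

The list returned by smallest_per_page would then be handed to "convert" in page order, and remove_temporary_files called afterwards.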

The saving is a bit slow and blocks the UI while doing it but it seems to work ok.

Changed in simple-scan:
status: Triaged → Fix Committed
Changed in simple-scan:
status: Fix Committed → Fix Released
gbowden (gregbowden2000) wrote :

Hi, since Simple Scan started using ImageMagick to create the PDF files, I have seen the file sizes become much larger.

Using the text setting at 150 dpi, files created with ImageMagick are about 1 MB, while ones created with Cairo are 185-200 KB.

That is for a two-page, text-only scan.

Marduk Bolaños (mardukbp) wrote :

As of Simple Scan 3.16.1.1 the size of color PDF files (at 300 dpi) is, in my experience, larger than 6 MB. I am talking about business letters with a colored logo. Interestingly, if I post-process the PDF with Ghostscript:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output.pdf input.pdf

the file size is reduced to 10% of the original. According to pdfimages -list output.pdf, the only difference between the two files is the compression ratio. For the original it is 27% and for the reduced PDF it is 2.8%.
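
The two compression ratios are consistent with the reported size reduction. A quick check, assuming the image dimensions and color depth are the same in both files (so the ratio of the two percentages equals the ratio of the compressed image sizes); the function name is made up for the example:

```python
def relative_image_size(ratio_before_pct, ratio_after_pct):
    """Compressed image size after reprocessing, as a fraction of before."""
    return ratio_after_pct / ratio_before_pct


# 2.8% versus 27% of the raw image size
print(f"{relative_image_size(27.0, 2.8):.1%}")  # prints 10.4%
```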

I do not know which method Simple Scan uses for generating PDF files, but it is obviously sub-optimal. Hopefully the developers will be able to improve it.
