xsane PDF file sizes could be optimized

Bug #75384 reported by Tommy Trussell on 2006-12-11
This bug affects 12 people
Affects: xsane (Ubuntu) | Importance: Wishlist | Assigned to: Unassigned

Bug Description

I've been using xsane for years. I was very happy to see version 0.991 in Ubuntu Edgy 6.10 with the PDF and multi-page output options enabled. These features work, but the PDF files could be smaller.

For example, when I scanned a one-page document with xsane and converted it to PDF in two different ways, each conversion produced a noticeably smaller file than xsane's own PDF output (approximately 8% smaller).

Example: I scanned a US Letter size document three different ways in xsane (file sizes in bytes):

101769 -- test_0001.pdf -- scanned to pdf
1016760 -- test_0002.pnm -- scanned to pnm
93843 -- test_0002x.pdf -- converted using my procedure
101107 -- test_0003.ps -- scanned to ps
94081 -- test_0003x.pdf -- converted using ps2pdf

As you can see, xsane saved a single-page pdf file 101769 bytes long, but the same document scanned as a pnm or a ps file then converted produced a 93843 byte or 94081 byte file, respectively.
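
The savings quoted above can be checked with a little arithmetic. A minimal sketch, assuming awk is available; pct_saved is a hypothetical helper name, not part of xsane:

```shell
#!/bin/sh
# pct_saved: hypothetical helper that prints how much smaller (in percent)
# a converted file is compared with xsane's direct PDF output.
pct_saved() {
    # $1 = size of xsane's PDF in bytes, $2 = size of the converted PDF in bytes
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", (a - b) * 100 / a }'
}

pct_saved 101769 93843   # pnm -> ps -> pdf route
pct_saved 101769 94081   # ps -> pdf route
```

With the byte counts listed above this prints 7.8 and 7.6, which matches the "approximately 8%" figure.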

When scanning a multi-page document this difference in size adds up.

Here's the procedure I have used for years to get optimally-sized pdf files:

1) scan all pages to pnm lineart at 300 dpi (tiff also works but you must wait for xsane to convert each file)

2) convert -density 300x300 file*.pnm temp.ps

3) ps2pdf -dPDFSETTINGS=/printer -sPAPERSIZE=letter temp.ps file.pdf

4) rm temp.ps

This procedure depends upon imagemagick's convert command and ghostscript's ps2pdf command. It requires specifying the density in convert and setting the page size in the ps2pdf. (If you don't add these refinements, the resulting PDF image may be the wrong density or may have the wrong size bounding box. I suspect these issues may depend upon bugs/features of specific versions of imagemagick and ghostscript.)
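
Steps 2 to 4 could be wrapped in a small script along these lines. This is only a sketch under the assumptions stated in the report (300 dpi lineart pages, US Letter paper); the names pnm2pdf, run and DRY_RUN are made up for illustration:

```shell
#!/bin/sh
# run: hypothetical helper. Prints the command when DRY_RUN is set,
# otherwise executes it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# pnm2pdf: hypothetical wrapper around steps 2-4 of the procedure.
# Usage: pnm2pdf file.pdf page1.pnm page2.pnm ...
pnm2pdf() {
    out="$1"; shift               # first argument: the PDF to create
    tmp="${out%.pdf}.tmp.ps"      # intermediate PostScript file
    run convert -density 300x300 "$@" "$tmp" &&
    run ps2pdf -dPDFSETTINGS=/printer -sPAPERSIZE=letter "$tmp" "$out" &&
    run rm "$tmp"
}

# Example (needs imagemagick and ghostscript installed):
#   pnm2pdf file.pdf file*.pnm
```

Setting DRY_RUN=1 first shows the three commands without running them, which is a cheap way to confirm the density and paper size before converting a large batch.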

Revision history for this message
Hew (hew) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. I have reproduced this issue by using xsane to produce both a pdf (1.4MB) and ps (1.8MB) scan, and then converting the ps to a pdf (103.8KB). Upon visual inspection, it is apparent that this large difference in file size is due to image compression (there are clear visual artifacts on the low quality version), and not because of poor optimization. I am therefore marking this bug as invalid. Please don't hesitate to submit bug reports in the future. Thanks again!

Changed in xsane:
status: New → Invalid
Revision history for this message
Antonio J. de Oliveira (ajoliveira) wrote :

Hi! :-)

I am returning to this because I think the bug does exist. My secretary has been complaining about it, and I lose time checking why she has hit a wall. If there is a fix, it is not obvious to me.
I scanned a color document at 150 dpi, full color, with xsane 0.994 on Jaunty (tried both the 32-bit and 64-bit versions).

jpg 518k
pdf 4.9M (!)
jpg converted to pdf with gm convert (Graphics Magick, default options) 414k.

Note that convert (Image Magick) with default options produces a 6.5MB file, which is even worse.

The pdf created directly by xsane is more than 10 times larger than the one converted with gm, and it looks the same, even on closer inspection. If this is an invalid bug, I am puzzled... OK, maybe not a bug, but a nagging annoyance: color scans need intermediate passes to come out at a sensible size, and multi-page color pdf scans become impractical, so everything must be scanned to jpg and then batch-converted.

If you need that, the command

gm convert *.jpg -adjoin output.pdf

is quite clean and efficient; just scan your pages as 001.jpg, 002.jpg...nnn.jpg
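
The zero-padded names matter because the shell expands *.jpg in sorted order, so 010.jpg lands after 002.jpg. A small sketch of the batch step, assuming GraphicsMagick is installed; jpgs2pdf and DRY_RUN are hypothetical names used only for illustration:

```shell
#!/bin/sh
# jpgs2pdf: hypothetical wrapper that joins the numbered .jpg pages in the
# current directory into one PDF, relying on sorted glob expansion for
# the page order.
jpgs2pdf() {
    out="$1"
    set -- [0-9]*.jpg             # pages in sorted order: 001.jpg 002.jpg ...
    [ -e "$1" ] || { echo "no numbered .jpg pages found" >&2; return 1; }
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} gm convert "$@" -adjoin "$out"
}

# Example:
#   jpgs2pdf output.pdf
```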

Please advise

Greetings from sunny Portugal

Revision history for this message
Antonio J. de Oliveira (ajoliveira) wrote :

Re-opened it; I think that in Jaunty it is even worse than described.

Changed in xsane (Ubuntu):
status: Invalid → New
Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

Hi-- thanks for your support of my ancient bug. Now that there are two of us who want this, maybe we can petition for it as a wishlist item, at least.

I completely understand why they closed it -- what you and I are doing with imagemagick is to use a "worse" or "lossier" compression scheme to compress the file to be smaller than xsane can possibly make using any of its options. The person closing the bug saw that it produced artifacts in the resulting image, which is to be expected.

Since I almost exclusively scan black and white documents, I have been satisfied with my technique for scanning the document as black and white 300dpi multi-page .ps files, then using ps2pdf to convert the single multi-page file to pdf. This produces an acceptably small pdf with extremely acceptable quality for my particular circumstance.

IF, however, you are scanning color documents, you probably want much more control over the settings. As the person who closed the bug implied, compression artifacts might be unacceptable for some circumstances. Also if your documents use solid color, such as blue or red ink on black and white forms, scanning to a format with a limited palette BEFORE compressing would create a small file and eliminate the compression artifacts. SO it's not necessarily a simple issue.

I was hoping for some more options, maybe a way for xsane to call the imagemagick convert command during the save process so it would be possible to add some additional tweaking, while still preserving the ease of use xsane offers.

It would be REALLY nice to configure and save more options for scanning, such as compression algorithm and level, pdf display name, and maybe the PDF page size and resolution, if they need separate control.

NOTE: xsane can save multi-page .ps files, and in GNOME you can add ps2pdf as a right-click command in Nautilus (the GNOME file manager) via right click --> Open With --> Custom command... . SO I have not missed this requested feature, EXCEPT when I am dealing with a file that doesn't work with the defaults in ps2pdf. (For example, a Legal or Tabloid size document would get truncated by ps2pdf unless I pass some more arguments. For those situations I usually just scan each image to a file, open and place the images in an OpenOffice document, and save the pdf from there.)
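
For the larger page sizes, the extra argument is most likely the same -sPAPERSIZE option used in the original procedure. A hedged sketch: ps2pdf_sized and DRY_RUN are hypothetical names, and the paper names (legal, 11x17) are assumed to match the list supported by the installed Ghostscript, which is worth checking first:

```shell
#!/bin/sh
# ps2pdf_sized: hypothetical wrapper that passes an explicit paper size
# to ps2pdf so that pages larger than the default are not truncated.
# Usage: ps2pdf_sized PAPER input.ps [output.pdf]
ps2pdf_sized() {
    paper="$1"
    src="$2"
    dst="${3:-${src%.ps}.pdf}"    # default output name: input with a .pdf suffix
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} ps2pdf -sPAPERSIZE="$paper" "$src" "$dst"
}

# Examples:
#   ps2pdf_sized legal scan.ps    # US Legal
#   ps2pdf_sized 11x17 scan.ps    # Tabloid-sized page
```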

Revision history for this message
Antonio J. de Oliveira (ajoliveira) wrote :

Hi

The worst part was that I could not see any difference between a scanned jpg image, a pdf converted with Graphics Magick (not Image Magick, note) and a pdf scanned natively by xsane. No artifacts, nothing. That was the reason I re-opened this bug. I can send over some examples, but the developers can easily duplicate this. It seems that the engine behind the pdf conversion is causing the trouble. If we were given a choice of engine (I suspect the engine is not part of the basic package), maybe the problem could be cleanly bypassed.

Cheers

Antonio

Revision history for this message
Antonio J. de Oliveira (ajoliveira) wrote :

Hi Tommy

Thanks for the ps multipage feature tip. Today we used that together with gm convert (Graphics Magick) to create a splendid pdf. I installed 'context' so as to try pstopdf, but still, the final size is almost doubled in relation to the size of the file created with gm convert, and the resolution is clearly similar.
Well, we do have some shortcuts, the highway is up to the product developers.

Cheers

Antonio

Revision history for this message
Tommy Trussell (tommy-trussell) wrote :

Thank you for the clarification -- I had forgotten about GraphicsMagick (a "fork" of ImageMagick). GraphicsMagick should be superior for a stable solution, and I should redo my tests using it.

I have not dug into the code of xsane. I suspect a command-line or script option could be "slapped in" relatively easily, but a useful graphical user interface would likely be much harder. I haven't looked for a list or forum for xsane issues, but that might be a good place to start.

Oh, and of course it always helps to be completely familiar with the existing software -- there are details and helpful hints at http://www.xsane.org/

P.S.: There are other front-ends to SANE, but xsane has always been very stable and predictable. I haven't looked at any other options lately, but one of them is called quiteinsane and I see it's still available in Ubuntu. I'm mentioning it because it's always possible another project has the features we are looking for.

Revision history for this message
Antonio J. de Oliveira (ajoliveira) wrote :

Got it, and I have installed quiteinsane. I was aware of its existence but, like you, when something performs its daily duties properly, why change? Well, I am going to give it a try and dig a little more.

Revision history for this message
Micah Gersten (micahg) wrote :

Seems like a reasonable request.

Changed in xsane (Ubuntu):
importance: Undecided → Wishlist
Changed in xsane (Ubuntu):
status: New → Confirmed
Revision history for this message
Antonio J. de Oliveira (ajoliveira) wrote :

Hi

Good. When something can possibly be handled by a minor script change, why go through major changes? Let me know if I may be of help.

Cheers

Antonio

Revision history for this message
Timmo Henseler (timmo-henseler) wrote :

Hi guys,

Following your discussion, I think I am in a different (lower) league, but as an end user trying to get results similar to those I get from the HP scanning software under Windows, I very much recognise your issues, which have troubled me since I started using Ubuntu last year. Today I hope I made some progress:

- I had gscan2pdf installed for a long time already but somehow it never satisfied my needs. Today however I got quite an acceptable color result (knowing a little bit more about jpg-compression than before). Perhaps worth a try and let me know your experience.

- QuiteInsane as far as I can see is a plugin for GIMP but has no pdf-capability (I only scan to pdf).

XSANE could be better but gscan2pdf is simpler and does what I need.

cheers,
Timmo

Revision history for this message
Boris Burtin (boris-burtin) wrote :

I'd also like to see improved support of image compression when generating PDF. When you're scanning multi-page documents as opposed to art-work, small file size is more important than optimal image quality.

Revision history for this message
Dylan Justice (dsjstc) wrote :

Confirmed in Lucid.

Revision history for this message
Tom Louwrier (tom-louwrier) wrote :

Same here.
I use an HP C7280, and almost all my scans are (1) multi-page and (2) output to pdf.

A standard A4 business letter, scanned in grayscale to pdf at 200 dpi, comes to about 1.2 MB per page. That's quite ridiculous really, and it makes mailing those docs awkward. Not all my contacts have broadband internet and gigabyte-size mailboxes.
I did not find any options in the setup to control the image size/compression when producing pdf's, just the basic scan resolution setting.

Also it seems a bit strange that I can't scan at a lower resolution than 192 dpi, which was possible earlier.

Running 10.04 amd64 (and definitely not going back to XP or Vista to use the HP original software!)

cheers
Tom

Revision history for this message
Tessa Lau (tlau) wrote :

I also suffer from this problem. A ten-page scan of a printed document results in a 43MB PDF.

What I found works well is to apt-get install libtiff-tools, then use xsane to save to TIFF and tiff2pdf to convert to PDF. The result is only 5MB and appears to have similar quality to the original file.
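
That workflow might look roughly like the sketch below. It assumes each page was saved as its own TIFF and uses tiffcp (also shipped in libtiff-tools) to join them first, which is an added assumption and not something stated in the comment; tiffs2pdf, run and DRY_RUN are hypothetical names:

```shell
#!/bin/sh
# run: hypothetical helper. Prints the command when DRY_RUN is set,
# otherwise executes it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then echo "$*"; else "$@"; fi
}

# tiffs2pdf: hypothetical wrapper that joins single-page TIFF scans into
# one multi-page TIFF and converts that file to PDF.
# Usage: tiffs2pdf scan.pdf page1.tif page2.tif ...
tiffs2pdf() {
    out="$1"; shift
    tmp="${out%.pdf}.all.tif"     # intermediate multi-page TIFF
    run tiffcp "$@" "$tmp" &&
    run tiff2pdf -o "$out" "$tmp" &&
    run rm "$tmp"
}
```

If xsane already wrote a single multi-page TIFF, the tiffcp step is unnecessary and tiff2pdf -o can be run on that file directly.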

Moreover, the PDF generated via the original process results in thousands of errors like these when viewed in GNOME Document Viewer:

Error (588813): Illegal character <56> in hex string
Error (588814): Illegal character <a6> in hex string
Error (588815): Illegal character <af> in hex string
Error (588816): Illegal character <5c> in hex string

Revision history for this message
boblinux (robert-grasso) wrote :

May I add my own contribution? I am running Maverick, x86_64. I just scanned a single sheet with a few numbers typed on it, some of them in color: a plain, almost empty document. I scanned it at 150 dpi, full color, and the result was a 2.8 MB pdf file. Then I applied a trick I discovered by chance a few days ago, when I needed to merge scanned pdfs; today I used it to shrink my single-page pdf:

gs -q -sPAPERSIZE=a4 -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=out.pdf in.pdf

Result: out.pdf is 187 KB! Note that this command is not meant as a conversion: it is normally used to MERGE pdfs.
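
Since the same gs command serves for both shrinking and merging, it can be wrapped so that it accepts one or several input files. A minimal sketch using exactly the options quoted above; gs_rewrite and DRY_RUN are hypothetical names:

```shell
#!/bin/sh
# gs_rewrite: hypothetical wrapper that rewrites one or more PDFs through
# Ghostscript's pdfwrite device. One input shrinks it; several inputs are
# merged in the order given.
# Usage: gs_rewrite out.pdf in1.pdf [in2.pdf ...]
gs_rewrite() {
    out="$1"; shift
    # With DRY_RUN set, the command is printed instead of executed.
    ${DRY_RUN:+echo} gs -q -sPAPERSIZE=a4 -dNOPAUSE -dBATCH \
        -sDEVICE=pdfwrite -sOutputFile="$out" "$@"
}

# Examples:
#   gs_rewrite out.pdf in.pdf
#   gs_rewrite merged.pdf page1.pdf page2.pdf
```

Comparing `wc -c < in.pdf` with `wc -c < out.pdf` afterwards shows how much was saved.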

Additionally, when opening the pdf produced by xsane with evince from a terminal, I get 2399 lines such as :

Error (358497): Illegal character <d9> in hex string
Error (358498): Illegal character <16> in hex string
Error (358499): Illegal character <d1> in hex string
Error (358500): Illegal character <9f> in hex string
Error (358501): Illegal character <5b> in hex string
Error (358502): Illegal character <fe> in hex string

When I open the one generated by gs, zero errors are generated. By the way, xsane pdfs have yielded such errors in evince for years, but I guess very few people open them from a terminal ...
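
To compare two files without scrolling through thousands of lines, the messages can be counted. A sketch that assumes the viewer's terminal output has been captured (for example with 2>&1 into a file or a pipe); count_hex_errors is a hypothetical helper name:

```shell
#!/bin/sh
# count_hex_errors: hypothetical helper that reads viewer output on stdin
# and prints how many "Illegal character ... in hex string" lines it holds.
count_hex_errors() {
    grep -c 'Illegal character <[0-9a-f]*> in hex string'
}

# Example (close the viewer window to get the count):
#   evince scan.pdf 2>&1 | count_hex_errors
```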

Revision history for this message
A S (zephyr707) wrote :

was this ever resolved?

I'm evaluating scanning programs for linux and xsane comes highly recommended, but a typical color scan of a test document I'm using generated a huge 17440k pdf vs. a 824k from simple-scan and visually I am not seeing any difference. The grayscale scan was 5596k and seems to suffer from some kind of aliasing issue around letters that are on a diagonal, which doesn't happen with simple-scan and results in a much smaller (b&w though) 280k document, albeit with its own issues since it is b&w vs gray. All scans were done at 300dpi.

I've been trying to figure out why xsane generates such massive pdfs and ended up here, is the bug still relevant? Seems very old.

Revision history for this message
A S (zephyr707) wrote :

ok I take it back, there are definitely some jpeg artifacts in the simple-scan color scan when I zoom in over 500%, but the file size difference still seems quite dramatic.

still cannot produce a grayscale image that does not have this weird stepping/aliasing artifact and now that I've zoomed in very closely it seems to adversely affect the diagonal letters, but is actually present in all letters/images
