Comment 5 for bug 1778988

Bob Swanson (wwi) wrote : Re: [Bug 1778988] Re: After PDF Files created by cups-pdf, cannot extract text from them

I tested on my current system, and the results are as described in my
previous comment. Previously, I was unable to determine the CUPS version.
The following is the output of "dpkg-query -l", filtered for "cups" only:

ii  cups               2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - PPD/driver support, web interface
ii  cups-browsed       1.27.4-1          amd64  OpenPrinting CUPS Filters - cups-browsed
ii  cups-bsd           2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - BSD commands
ii  cups-client        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - client programs (SysV)
ii  cups-common        2.3.1-9ubuntu1.1  all    Common UNIX Printing System(tm) - common files
ii  cups-core-drivers  2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - driverless printing
ii  cups-daemon        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - daemon

Dated 2/14/2021 on my computer.

So, per your message, I am not running CUPS 3.0, but rather
2.3.1 as packaged by Ubuntu.
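
For anyone who wants to repeat the version check, here is a rough Python
sketch of how the major version could be read out of a "dpkg-query -l" style
listing. The function names and the shortened SAMPLE text are made up for
illustration; only the "ii <name> <version>" line shape is taken from the
output above.

```python
import re


def cups_versions(listing):
    """Map package name -> version for installed ("ii") lines of a listing."""
    versions = {}
    for line in listing.splitlines():
        fields = line.split()
        # Installed packages start with the status flag "ii", followed by
        # the package name and the version. Wrapped description lines do
        # not start with "ii" and are skipped.
        if len(fields) >= 3 and fields[0] == "ii":
            versions[fields[1]] = fields[2]
    return versions


def upstream_major(version):
    """Return the leading major number of a version like '2.3.1-9ubuntu1.1'."""
    # Drop an optional Debian epoch ("1:") before reading the first number.
    upstream = version.split(":", 1)[-1]
    match = re.match(r"(\d+)", upstream)
    if match is None:
        raise ValueError("no leading number in %r" % version)
    return int(match.group(1))


# Shortened sample in the same shape as the listing above (illustrative only).
SAMPLE = """\
ii cups 2.3.1-9ubuntu1.1 amd64 Common UNIX Printing System(tm)
ii cups-browsed 1.27.4-1 amd64 OpenPrinting CUPS Filters
"""
```

Calling upstream_major(cups_versions(SAMPLE)["cups"]) on the sample gives the
major version that is compared against 3 above.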

On 2/14/21, Martin-Éric Racine <email address hidden> wrote:
> Thanks. Let's close it.
>
> ** Changed in: cups-pdf (Ubuntu)
> Status: New => Invalid
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/1778988
>
> Title:
> After PDF Files created by cups-pdf, cannot extract text from them
>
> Status in cups-pdf package in Ubuntu:
> Invalid
>
> Bug description:
> PDF Creation Problem
>
> Bob Swanson
> <email address hidden>
> 26 June 2018
>
> This file is part of the test package:
>
> http://swansongrp.com/misc/testcase.zip
>
>
> I have been able to demonstrate PDF
> printing issues with LibreOffice and web
> browsers. (For contrast, I have also
> used the "wkhtmltopdf" command-line utility
> output.)
>
>
> USING LIBREOFFICE
> -----------------
>
> (Base file: mytest.odt)
>
> This problem was originally associated
> with a LibreOffice file containing mixed
> font usages. When printed with "cups-pdf",
> most of the displayed text could not be
> selected in "evince", and extracted text was
> garbage. (The PDFBox Java code could not extract
> reasonable text from the PDF file.) See:
>
> https://issues.apache.org/jira/browse/PDFBOX-4250
>
> (working environment described in that
> bug report)
>
> It is much easier to demonstrate this
> problem without using PDFBox Java.
>
> To demonstrate, view any of the resulting PDF
> files with "evince". When viewing a PDF
> test file, simply press CTRL-A to select all
> text. Then "paste" the selected area into a text
> editor (Gedit or VIM, for instance) to see the
> resulting plain text. (Okular did no better
> than evince)
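
The same check can be roughly scripted instead of done by hand in a viewer.
The Python sketch below is only an illustration and was not part of the
original test package. It assumes the "pdftotext" tool from poppler-utils is
installed, and the "readable_ratio" heuristic is an invented helper. It only
catches output made of control or symbol characters. Garbage that happens to
consist of letters still needs to be compared against the expected text by
eye.

```python
import subprocess
import unicodedata


def extract_text(pdf_path):
    """Return the text layer of a PDF, as printed by pdftotext to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def readable_ratio(text):
    """Fraction of non-space characters that are letters, digits or punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Unicode general categories: L* letters, N* numbers, P* punctuation.
    ok = sum(1 for c in chars if unicodedata.category(c)[0] in "LNP")
    return ok / len(chars)
```

For example, readable_ratio(extract_text("mytest3_cups_pdf.pdf")) could be
compared with the value for mytest_libreoffice_direct.pdf. An empty or very
low result points at the missing or broken text layer described here.
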
>
> The following results occur:
>
> o) With original testcase: mytest3_cups_pdf.pdf,
> created from LibreOffice using cups-pdf
>
> Only one line highlights, but its text is correct.
> This particular line uses a "standard" PDF font. The
> other lines are not highlighted, and are not placed on
> clipboard. This was the test that failed in PDFBox Java
> text extraction.
>
> Evince shows the many embedded fonts.
>
>
> o) With original testcase: mytest_libreoffice_direct.pdf
> created from LibreOffice using its "built in" PDF
> creation option
>
> ALL lines highlight, and when pasted, all text is
> present.
>
> Evince shows the many embedded fonts.
>
>
> USING CHROMIUM BROWSER
> ----------------------
>
> (Base file: mytest.html)
>
> I created several lines using different fonts,
> as an HTML file. Viewed in Chromium browser,
> then printed.
>
> o) File: mytest_html_cups_pdf.pdf,
> was printed from Chromium using the "cups-pdf"
> "printer". All lines appear in the PDF, and
> can be selected. But when pasted all resulting text
> is garbage. Only one font embedded: "No name".
>
> o) File: mytest_html_save_as_pdf.pdf,
> was printed from Chromium using the "save as file"
> option. All lines appear in the PDF, and can be
> selected. All text (including text added by
> the PDF creator) is present.
>
> Evince shows the many embedded fonts.
>
> (In the HTML cases, fonts used are no doubt
> those already installed on my Ubuntu system.
> The HTML code asked for fonts that may not
> be present, and probably were substituted.)
>
>
> USING BRAVE BROWSER
> -------------------
>
> (Base file: mytest.html)
>
> Same testcase as for Chromium browser. Viewed
> in Brave browser, then printed.
>
> o) File: mytest_html_brave_cups_pdf.pdf,
> was printed from Brave using the "cups-pdf"
> "printer". All lines appear in the PDF. However,
> when all selected, every character is highlighted
> EXCEPT the initial "T" on the first line. When
> pasted, all resulting text is garbage. Only one
> font embedded: "No name".
>
> o) File: mytest_html_brave_save_as_pdf.pdf,
> was printed from Brave using the "save as file"
> option. All lines appear in the PDF, and can be
> selected. All text is present. (No text added
> by Brave).
>
> Evince shows the many embedded fonts.
>
> (Same notes may apply regarding fonts installed
> on Ubuntu system)
>
>
> USING WKHTMLTOPDF COMMAND
> -------------------------
>
> (Base file: mytest.html)
>
> Same testcase as for browsers. I'm using
> this example to show that multiple font
> test output can be created in different ways.
>
> Command:
>
> wkhtmltopdf mytest.html mytest_wk.pdf
>
> o) File: mytest_wk.pdf,
> All lines appear in the PDF, and fonts
> are sometimes quite different than those shown in
> the browsers (they may actually be more correct).
>
> All content can be highlighted and can be pasted
> as text. Several text lines, however, contain
> additional whitespace (tabs).
>
> Evince shows the many embedded fonts.
>
>
> NOTES
> -----
>
> The "creator" name embedded in the metadata for
> these PDF files varies considerably, and it is
> unclear to me whether the same engine is being
> used by these various packages. It is clear, at
> least, that cups-pdf is using Ghostscript for
> PDF creation.
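
One way to compare the "creator" metadata across the test files is to parse
the output of "pdfinfo" (from poppler-utils), which prints "Key: value" lines
such as Creator and Producer. The Python sketch below is an assumption-laden
illustration: the function name is invented, and the SAMPLE values are
made-up placeholders, not values read from the actual test files.

```python
def parse_pdfinfo(output):
    """Turn "Key: value" lines, as printed by pdfinfo, into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values that contain colons
        # (for example a creation time) stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Placeholder text in pdfinfo's layout; the values are hypothetical.
SAMPLE = """\
Creator:        ExampleCreator
Producer:       ExampleProducer 1.0
CreationDate:   Wed Jun 27 15:36:32 2018
"""
```

Running the real tool on each PDF and comparing the Creator and Producer
entries side by side would show which files were produced by the same engine.
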
>
> ProblemType: Bug
> DistroRelease: Ubuntu 16.04
> Package: cups-pdf 2.6.1-21
> ProcVersionSignature: Ubuntu 4.13.0-45.50~16.04.1-generic 4.13.16
> Uname: Linux 4.13.0-45-generic x86_64
> ApportVersion: 2.20.1-0ubuntu2.18
> Architecture: amd64
> CurrentDesktop: Unity
> Date: Wed Jun 27 15:36:32 2018
> InstallationDate: Installed on 2017-05-16 (406 days ago)
> InstallationMedia: Ubuntu 16.04.2 LTS "Xenial Xerus" - Release amd64
> (20170215.2)
> SourcePackage: cups-pdf
> UpgradeStatus: No upgrade log present (probably fresh install)