Ubuntu
cups-pdf package

After PDF Files created by cups-pdf, cannot extract text from them

Bug #1778988 reported by Bob Swanson on 2018-06-27

This bug affects 1 person

Affects            Status   Importance  Assigned to  Milestone
cups-pdf (Ubuntu)  Invalid  Undecided   Unassigned

Bug Description

PDF Creation Problem

Bob Swanson
<email address hidden>
26 June 2018

This file is part of the test package:

http://swansongrp.com/misc/testcase.zip

I have been able to demonstrate PDF
printing issues with LibreOffice and web
browsers. (For contrast, I have also
used the "wkhtmltopdf" command-line utility
output.)

USING LIBREOFFICE
-----------------

(Base file: mytest.odt)

This problem was originally associated
with a LibreOffice file containing mixed
font usages. When printed with "cups-pdf",
most of the displayed text could not be
selected in "evince" and extracted text was
garbage. (The PDFBox Java code could not extract
reasonable text from the PDF file.) See:

https://issues.apache.org/jira/browse/PDFBOX-4250

(working environment described in that
bug report)

It is much easier to demonstrate this
problem without using PDFBox Java.

To demonstrate, view any of the resulting PDF
files with "evince". When viewing a PDF
test file, simply press CTRL-A to select all
text. Then "paste" the selected area into a text
editor (Gedit or VIM, for instance) to see the
resulting plain text. (Okular did no better
than evince)
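
The select-all-and-paste check can also be approximated from a terminal. The following is only a minimal sketch, assuming the "pdftotext" tool from the poppler-utils package is installed (Evince and Okular both render PDFs through poppler, so its output should be close to what they place on the clipboard). The helper name "has_text" and the phrase passed to it are hypothetical; pick a phrase that is known to be in the source document.

```shell
# Report whether the text layer of a PDF contains an expected phrase.
# Assumption: pdftotext is provided by poppler-utils.
# Usage: has_text FILE.pdf "expected phrase"
has_text() {
    pdf=$1
    phrase=$2
    # The "-" argument sends extracted text to standard output
    # instead of writing a .txt file next to the PDF.
    if pdftotext "$pdf" - 2>/dev/null | grep -qF -- "$phrase"; then
        echo "OK: $pdf"
    else
        echo "FAIL: $pdf"
    fi
}

# Example calls (the phrase is a placeholder, not taken from the test files):
#   has_text mytest3_cups_pdf.pdf "some phrase from mytest.odt"
#   has_text mytest_libreoffice_direct.pdf "some phrase from mytest.odt"
```

A "FAIL" here only says that the phrase was not found in the extracted text; it does not say why.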

The following results occur:

o) With original testcase: mytest3_cups_pdf.pdf,
created from LibreOffice using cups-pdf

Only one line highlights, but its text is correct.
This particular line uses a "standard" PDF font. The
other lines are not highlighted, and are not placed on
clipboard. This was the test that failed in PDFBox Java
text extraction.

Evince shows the many embedded fonts.

o) With original testcase: mytest_libreoffice_direct.pdf
created from LibreOffice using its "built in" PDF
creation option

ALL lines highlight, and when pasted, all text is
present.

Evince shows the many embedded fonts.

USING CHROMIUM BROWSER
----------------------

(Base file: mytest.html)

I created several lines using different fonts,
as an HTML file. Viewed in Chromium browser,
then printed.

o) File: mytest_html_cups_pdf.pdf,
was printed from Chromium using the "cups-pdf"
"printer". All lines appear in the PDF, and
can be selected. But when pasted all resulting text
is garbage. Only one font embedded: "No name".

o) File: mytest_html_save_as_pdf.pdf,
was printed from Chromium using the "save as file"
option. All lines appear in the PDF, and can be
selected. All text (including text added by
the PDF creator) is present.

Evince shows the many embedded fonts.

(In the HTML cases, fonts used are no doubt
those already installed on my Ubuntu system.
The HTML code asked for fonts that may not
be present, and probably were substituted.)

USING BRAVE BROWSER
-------------------

(Base file: mytest.html)

Same testcase as for Chromium browser. Viewed
in Brave browser, then printed.

o) File: mytest_html_brave_cups_pdf.pdf,
was printed from Brave using the "cups-pdf"
"printer". All lines appear in the PDF. However,
when all text is selected, every character is highlighted
EXCEPT the initial "T" on the first line. When
pasted, all resulting text is garbage. Only one
font embedded: "No name".

o) File: mytest_html_brave_save_as_pdf.pdf,
was printed from Brave using the "save as file"
option. All lines appear in the PDF, and can be
selected. All text is present. (No text added
by Brave).

Evince shows the many embedded fonts.

(Same notes may apply regarding fonts installed
on Ubuntu system)

USING WKHTMLTOPDF COMMAND
-------------------------

(Base file: mytest.html)

Same testcase as for browsers. I'm using
this example to show that multiple font
test output can be created in different ways.

Command:

wkhtmltopdf mytest.html mytest_wk.pdf

o) File: mytest_wk.pdf,
All lines appear in the PDF, and fonts
are sometimes quite different than those shown in
the browsers (they may actually be more correct).

All content can be highlighted and can be pasted
as text. Several text lines, however, contain
additional whitespace (tabs).

Evince shows the many embedded fonts.

NOTES
-----

The "creator" name embedded in the metadata for
these PDF files varies considerably, and it is
unclear to me whether the same engine is being
used by these various packages. It is clear, at
least, that cups-pdf is using Ghostscript for
PDF creation.
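
The creator metadata and the font list that Evince shows can also be read from a terminal. This is a rough sketch, assuming the "pdfinfo" and "pdffonts" tools from poppler-utils are installed; the helper name "pdf_origin" is hypothetical. In the "pdffonts" table, the "emb" column reports whether a font is embedded and the "uni" column reports whether it carries a ToUnicode map. A missing map is one common reason for unusable copied text, but whether that is what happened in these test files is not established here.

```shell
# Show which software produced a PDF, then list its fonts.
# Assumption: pdfinfo and pdffonts are provided by poppler-utils.
# Usage: pdf_origin FILE.pdf
pdf_origin() {
    # Keep only the two metadata lines that name the generating software.
    pdfinfo "$1" | grep -E '^(Creator|Producer):'
    # One row per font: name, type, encoding, emb, sub, uni, object ID.
    pdffonts "$1"
}

# Example calls, using file names from this report:
#   pdf_origin mytest_html_cups_pdf.pdf
#   pdf_origin mytest_html_save_as_pdf.pdf
```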

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: cups-pdf 2.6.1-21
ProcVersionSignature: Ubuntu 4.13.0-45.50~16.04.1-generic 4.13.16
Uname: Linux 4.13.0-45-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Jun 27 15:36:32 2018
InstallationDate: Installed on 2017-05-16 (406 days ago)
InstallationMedia: Ubuntu 16.04.2 LTS "Xenial Xerus" - Release amd64 (20170215.2)
SourcePackage: cups-pdf
UpgradeStatus: No upgrade log present (probably fresh install)

Bob Swanson (wwi) wrote on 2018-06-27:

ZIP file with README and several testcases (331.0 KiB, application/zip)
Dependencies.txt (7.8 KiB, text/plain; charset="utf-8")
ProcCpuinfoMinimal.txt (1.2 KiB, text/plain; charset="utf-8")
ProcEnviron.txt (103 bytes, text/plain; charset="utf-8")

Martin-Éric Racine (q-funk) wrote on 2020-12-17:

Does this issue still affect cups-pdf 3.0.1 packages?

Bob Swanson (wwi) wrote on 2020-12-17:

I have revisited this issue, using Ubuntu 20. I cannot figure out how to get the version of CUPS, but the test environment I now have includes:

Ghostscript 9.50
Evince 3.36.7
Cairo 1.16.0
wkhtmltopdf 0.12.5
LibreOffice 6.4.6.2
Okular 1.9.3

I reran all tests indicated in this report, today 17 December 2020, with
the system and products at the levels indicated.

With one exception, all tests now work. The exception is wkhtmltopdf,
which is NOT addressed in this original bug report. If anyone wishes to
issue a bug report against that product, it will obviously be a
separate matter.

- - - - -

I installed "cups-pdf" on my Ubuntu system.

Using LibreOffice, I reloaded the test ODT file. I created:

1) PDF output using the cups-pdf (aka "PDF") printer. In this
   environment all worked correctly. That is, all text was
   selectable in the viewers (Okular, Evince), and all selected
   text pasted correctly into VIM.

2) LibreOffice direct export of PDF. All worked correctly.

Using Brave (Chromium) browser, I loaded the HTML test file.

1) Printing from Brave using cups-pdf seemed to work. The
   first time I failed to notice that the output was
   landscape. Of course, this information was not useable.
   I changed to Portrait, and output was correct. All text
   selected in PDF viewer(s) and all pasted correctly in
   VIM.

2) From Brave, I selected the "save as PDF". All output was
   correct.

When the HTML was tested with wkhtmltopdf, the output appeared
odd. A check of the PDF file showed that that tool had
replaced one font with Dingbats. The text did not appear
correctly in the PDF viewers, but when copied and pasted
into VIM, the original roman text was present and correct.
The selection of Dingbats seems odd. HOWEVER, this bug report
is not intended to be a report on the wkhtmltopdf command.

I consider this issue to be closed.

Martin-Éric Racine (q-funk) wrote on 2021-02-14:

Thanks. Let's close it.

Changed in cups-pdf (Ubuntu):
status:	New → Invalid

Bob Swanson (wwi) wrote on 2021-02-14: Re: [Bug 1778988] Re: After PDF Files created by cups-pdf, cannot extract text from them

I tested on my current system, and the results match my previous comment.
Previously, I was unable to determine the version of
CUPS. The following is the output of "dpkg-query -l", filtered for "cups" only:

ii  cups               2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - PPD/driver support, web interface
ii  cups-browsed       1.27.4-1          amd64  OpenPrinting CUPS Filters - cups-browsed
ii  cups-bsd           2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - BSD commands
ii  cups-client        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - client programs (SysV)
ii  cups-common        2.3.1-9ubuntu1.1  all    Common UNIX Printing System(tm) - common files
ii  cups-core-drivers  2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - driverless printing
ii  cups-daemon        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - daemon

Dated 2/14/2021 on my computer.

So, per your message, I am not running CUPS 3.0, but rather
2.3.1 as packaged by Ubuntu.
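
The maintainer's question was about the cups-pdf package, whose version number is separate from that of CUPS itself (the original report lists "Package: cups-pdf 2.6.1-21"). As a minimal sketch, assuming a Debian-based system where "dpkg-query" is available, a single package can be queried by name. The helper name "pkg_version" is hypothetical.

```shell
# Print "name version" for one package, or a note if dpkg does not know it.
# Assumption: dpkg-query is available (Debian/Ubuntu systems).
# Usage: pkg_version PACKAGE
pkg_version() {
    # -W shows the named package; -f sets the output format.
    # The single quotes keep the shell from expanding ${Package} and ${Version}.
    dpkg-query -W -f='${Package} ${Version}\n' "$1" 2>/dev/null \
        || echo "$1: not known to dpkg"
}

# Example calls:
#   pkg_version cups-pdf
#   pkg_version cups
```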
