After PDF Files created by cups-pdf, cannot extract text from them

Bug #1778988 reported by Bob Swanson
This bug affects 1 person
Affects: cups-pdf (Ubuntu)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

PDF Creation Problem

Bob Swanson
<email address hidden>
26 June 2018

This file is part of the test package:

http://swansongrp.com/misc/testcase.zip

I have been able to demonstrate PDF
printing issues with LibreOffice and web
browsers. (For contrast, I have also
used the "wkhtmltopdf" command-line utility
output.)

USING LIBREOFFICE
-----------------

(Base file: mytest.odt)

This problem was originally associated
with a LibreOffice file containing mixed
font usages. When printed with "cups-pdf",
most of the displayed text could not be
selected in "evince", and extracted text was
garbage. (The PDFBox Java code could not extract
reasonable text from the PDF file.) See:

https://issues.apache.org/jira/browse/PDFBOX-4250

(working environment described in that
bug report)

It is much easier to demonstrate this
problem without using PDFBox Java.

To demonstrate, view any of the resulting PDF
files with "evince". When viewing a PDF
test file, simply press CTRL-A to select all
text. Then "paste" the selected area into a text
editor (Gedit or VIM, for instance) to see the
resulting plain text. (Okular did no better
than evince)
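
The same check can be approximated without a viewer. The following is a minimal sketch, not part of the original test package. It assumes the "pdftotext" tool from poppler-utils is installed, and the helper names extract_text and readable_ratio are hypothetical. Since the test files contain plain roman text, a low ratio of ASCII letters and digits suggests garbage output.

```python
import subprocess


def extract_text(pdf_path: str) -> str:
    """Run pdftotext (poppler-utils, assumed installed) and return the text."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the extracted text to stdout
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def readable_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are ASCII letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        # Nothing was extracted at all, which is also a failure.
        return 0.0
    good = sum(1 for c in chars if c.isascii() and c.isalnum())
    return good / len(chars)
```

For example, readable_ratio(extract_text("mytest3_cups_pdf.pdf")) could be compared with the value for mytest_libreoffice_direct.pdf. What counts as "too low" is a judgment call and is not defined by this report.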

The following results occur:

o) With original testcase: mytest3_cups_pdf.pdf,
created from LibreOffice using cups-pdf

Only one line highlights, but its text is correct.
This particular line uses a "standard" PDF font. The
other lines are not highlighted, and are not placed on
clipboard. This was the test that failed in PDFBox Java
text extraction.

Evince shows the many embedded fonts.

o) With original testcase: mytest_libreoffice_direct.pdf
created from LibreOffice using its "built in" PDF
creation option

ALL lines highlight, and when pasted, all text is
present.

Evince shows the many embedded fonts.

USING CHROMIUM BROWSER
----------------------

(Base file: mytest.html)

I created an HTML file containing several lines
in different fonts, viewed it in the Chromium
browser, then printed it.

o) File: mytest_html_cups_pdf.pdf,
was printed from Chromium using the "cups-pdf"
"printer". All lines appear in the PDF, and
can be selected. But when pasted all resulting text
is garbage. Only one font embedded: "No name".

o) File: mytest_html_save_as_pdf.pdf,
was printed from Chromium using the "save as file"
option. All lines appear in the PDF, and can be
selected. All text (including text added by
the PDF creator) is present.

Evince shows the many embedded fonts.

(In the HTML cases, fonts used are no doubt
those already installed on my Ubuntu system.
The HTML code asked for fonts that may not
be present, and probably were substituted.)

USING BRAVE BROWSER
-------------------

(Base file: mytest.html)

Same testcase as for Chromium browser. Viewed
in Brave browser, then printed.

o) File: mytest_html_brave_cups_pdf.pdf,
was printed from Brave using the "cups-pdf"
"printer". All lines appear in the PDF. However,
when all text is selected, every character is highlighted
EXCEPT the initial "T" on the first line. When
pasted, all resulting text is garbage. Only one
font embedded: "No name".

o) File: mytest_html_brave_save_as_pdf.pdf,
was printed from Brave using the "save as file"
option. All lines appear in the PDF, and can be
selected. All text is present. (No text added
by Brave).

Evince shows the many embedded fonts.

(Same notes may apply regarding fonts installed
on Ubuntu system)

USING WKHTMLTOPDF COMMAND
-------------------------

(Base file: mytest.html)

Same testcase as for browsers. I'm using
this example to show that multiple font
test output can be created in different ways.

Command:

wkhtmltopdf mytest.html mytest_wk.pdf

o) File: mytest_wk.pdf,
All lines appear in the PDF, and fonts
are sometimes quite different than those shown in
the browsers (they may actually be more correct).

All content can be highlighted and can be pasted
as text. Several text lines, however, contain
additional whitespace (tabs).

Evince shows the many embedded fonts.

NOTES
-----

The "creator" name embedded in the metadata for
these PDF files varies considerably, and it is
unclear to me whether the same engine is being
used by these various packages. It is clear, at
least, that cups-pdf is using Ghostscript for
PDF creation.
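
To compare the "creator" metadata across the test files, something like the following could be used. This is a rough sketch under stated assumptions: it only finds Creator and Producer values stored as plain literal strings in the raw file, and it will miss values that are hex-encoded, UTF-16, or kept only in compressed streams or XMP metadata. The function name info_fields is hypothetical. The "pdfinfo" tool from poppler-utils is likely a more reliable way to read the same fields.

```python
import re

# Matches "/Creator (...)" or "/Producer (...)" written as a literal string.
# Assumption: the value contains no nested or escaped parentheses.
_INFO_RE = re.compile(rb"/(Creator|Producer)\s*\(([^()]*)\)")


def info_fields(pdf_bytes: bytes) -> dict:
    """Return the Creator/Producer values found in the raw PDF bytes."""
    fields = {}
    for key, value in _INFO_RE.findall(pdf_bytes):
        # Later occurrences overwrite earlier ones (e.g. incremental updates).
        fields[key.decode("ascii")] = value.decode("latin-1")
    return fields
```

Usage would be along the lines of info_fields(open("mytest_wk.pdf", "rb").read()). An empty result means only that this simple scan found nothing, not that the file has no metadata.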

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: cups-pdf 2.6.1-21
ProcVersionSignature: Ubuntu 4.13.0-45.50~16.04.1-generic 4.13.16
Uname: Linux 4.13.0-45-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Jun 27 15:36:32 2018
InstallationDate: Installed on 2017-05-16 (406 days ago)
InstallationMedia: Ubuntu 16.04.2 LTS "Xenial Xerus" - Release amd64 (20170215.2)
SourcePackage: cups-pdf
UpgradeStatus: No upgrade log present (probably fresh install)

Martin-Éric Racine (q-funk) wrote :

Does this issue still affect cups-pdf 3.0.1 packages?

Bob Swanson (wwi) wrote :

I have revisited this issue, using Ubuntu 20. I cannot figure out how to get the version of CUPS, but the test environment I now have includes:

Ghostscript 9.50
Evince 3.36.7
Cairo 1.16.0
wkhtmltopdf 0.12.5
LibreOffice 6.4.6.2
Okular 1.9.3

I reran all tests indicated in this report, today 17 December 2020, with
the system and products at the levels indicated.

With one exception, all tests now work. The exception is wkhtmltopdf,
which is NOT addressed in this original bug report. If anyone wishes to
issue a bug report against that product, it will obviously be a
separate matter.

- - - - -

I installed "cups-pdf" on my Ubuntu system.

Using LibreOffice, I reloaded the test ODT file. I created:

1) PDF output using the cups-pdf (aka "PDF") printer. In this
   environment all worked correctly. That is, all text was
   selectable in the viewers (Okular, Evince), and all selected
   text pasted correctly into VIM.

2) LibreOffice direct export of PDF. All worked correctly.

Using Brave (Chromium) browser, I loaded the HTML test file.

1) Printing from Brave using cups-pdf seemed to work. The
   first time I failed to notice that the output was
   landscape. Of course, this information was not usable.
   I changed to Portrait, and output was correct. All text
   selected in PDF viewer(s) and all pasted correctly in
   VIM.

2) From Brave, I selected the "save as PDF". All output was
   correct.

When the HTML was tested with wkhtmltopdf, the output appeared
odd. A check of the PDF file showed that that tool had
replaced one font with Dingbats. The text did not appear
correctly in the PDF viewers, but when copied and pasted
into VIM, the original roman text was present and correct.
The selection of Dingbats seems odd. HOWEVER, this bug report
is not intended to be a report on the wkhtmltopdf command.

I consider this issue to be closed.

Martin-Éric Racine (q-funk) wrote :

Thanks. Let's close it.

Changed in cups-pdf (Ubuntu):
status: New → Invalid
Bob Swanson (wwi) wrote : Re: [Bug 1778988] Re: After PDF Files created by cups-pdf, cannot extract text from them

I tested on my current system, and the results are as described in my
previous comment. Previously, I was unable to determine the version of
CUPS. Following is output of "dpkg-query -l" filtered for "cups" only:

ii  cups               2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - PPD/driver support, web interface
ii  cups-browsed       1.27.4-1          amd64  OpenPrinting CUPS Filters - cups-browsed
ii  cups-bsd           2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - BSD commands
ii  cups-client        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - client programs (SysV)
ii  cups-common        2.3.1-9ubuntu1.1  all    Common UNIX Printing System(tm) - common files
ii  cups-core-drivers  2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - driverless printing
ii  cups-daemon        2.3.1-9ubuntu1.1  amd64  Common UNIX Printing System(tm) - daemon

Dated 2/14/2021 on my computer.

So, per your message, I am not running CUPS 3.0, but rather
2.3.1 as packaged by Ubuntu.
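
The listing above covers the cups packages; the cups-pdf package is versioned separately (this report was filed against cups-pdf 2.6.1-21). A minimal sketch for collecting both in one step follows. It assumes a Debian-based system with dpkg-query available, and the function names parse_versions and installed_versions are hypothetical.

```python
import subprocess


def parse_versions(listing: str) -> dict:
    """Map package name to version from lines of the form 'name version'."""
    versions = {}
    for line in listing.splitlines():
        parts = line.split()
        # Packages that are known but not installed print an empty version,
        # so those lines have only one field and are skipped.
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def installed_versions(pattern: str = "cups*") -> dict:
    """Ask dpkg-query for every package whose name matches the pattern."""
    out = subprocess.run(
        ["dpkg-query", "-W", "--showformat=${Package} ${Version}\\n", pattern],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_versions(out)
```

Calling installed_versions().get("cups-pdf") would then return the installed cups-pdf version, or None if the package is not installed.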

On 2/14/21, Martin-Éric Racine <email address hidden> wrote:
> Thanks. Let's close it.
