Cannot extract text from PDF files created by cups-pdf
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
cups-pdf (Ubuntu) | Invalid | Undecided | Unassigned |
Bug Description
PDF Creation Problem
Bob Swanson
<email address hidden>
26 June 2018
This file is part of the test package:
http://
I have been able to demonstrate PDF
printing issues with LibreOffice and web
browsers. (For contrast, I have also
used the "wkhtmltopdf" command-line utility
output.)
USING LIBREOFFICE
-----------------
(Base file: mytest.odt)
This problem was originally associated
with a LibreOffice file containing mixed
font usages. When printed with "cups-pdf",
most of the displayed text could not be
selected in "evince" and extracted text was
garbage. (The PDFBox Java code could not extract
reasonable text from the PDF file.) See:
https:/
(working environment described in that
bug report)
It is much easier to demonstrate this
problem without using PDFBox Java.
To demonstrate, view any of the resulting PDF
files with "evince". When viewing a PDF
test file, simply press CTRL-A to select all
text. Then "paste" the selected area into a text
editor (Gedit or VIM, for instance) to see the
resulting plain text. (Okular did no better
than evince.)
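
The manual check above can also be scripted. What follows is only a
minimal sketch, assuming the "pdftotext" utility from poppler-utils is
installed (evince and Okular both render through Poppler, so results
should be similar, though I have not verified that here). The
looks_like_garbage() heuristic and its 0.5 threshold are illustrative
assumptions, not part of any tool.

```python
import shutil
import subprocess
import sys


def extract_text(pdf_path):
    """Return the text that pdftotext extracts from pdf_path.

    Assumes pdftotext (poppler-utils) is on PATH.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        check=True,
        stdout=subprocess.PIPE,
    )
    return result.stdout.decode("utf-8", errors="replace")


def looks_like_garbage(text, threshold=0.5):
    """Crude, assumed heuristic: flag text when fewer than `threshold`
    of its non-whitespace characters are ASCII letters or digits."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True  # nothing was extracted at all
    readable = sum(1 for c in chars if c.isascii() and c.isalnum())
    return readable / len(chars) < threshold


if __name__ == "__main__":
    # Usage: python3 check_pdf_text.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        verdict = "GARBAGE?" if looks_like_garbage(extract_text(path)) else "ok"
        print(f"{path}: {verdict}")
```

The heuristic will miss garbage that happens to consist of ordinary
letters, and it will wrongly flag legitimate non-ASCII text, so the
extracted output should still be read by eye.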
The following results occur:
o) With original testcase: mytest3_
created from LibreOffice using cups-pdf
Only one line highlights, but its text is correct.
This particular line uses a "standard" PDF font. The
other lines are not highlighted, and are not placed on
the clipboard. This was the test that failed in PDFBox Java
text extraction.
Evince shows the many embedded fonts.
o) With original testcase: mytest_
created from LibreOffice using its "built in" PDF
creation option
ALL lines highlight, and when pasted, all text is
present.
Evince shows the many embedded fonts.
USING CHROMIUM BROWSER
----------------------
(Base file: mytest.html)
I created several lines using different fonts,
as an HTML file. Viewed in Chromium browser,
then printed.
o) File: mytest_
was printed from Chromium using the "cups-pdf"
"printer". All lines appear in the PDF, and
can be selected. But when pasted all resulting text
is garbage. Only one font embedded: "No name".
o) File: mytest_
was printed from Chromium using the "save as file"
option. All lines appear in the PDF, and can be
selected. All text (including text added by
the PDF creator) is present.
Evince shows the many embedded fonts.
(In the HTML cases, fonts used are no doubt
those already installed on my Ubuntu system.
The HTML code asked for fonts that may not
be present, and probably were substituted.)
USING BRAVE BROWSER
-------------------
(Base file: mytest.html)
Same testcase as for Chromium browser. Viewed
in Brave browser, then printed.
o) File: mytest_
was printed from Brave using the "cups-pdf"
"printer". All lines appear in the PDF. However,
when all text is selected, every character is highlighted
EXCEPT the initial "T" on the first line. When
pasted, all resulting text is garbage. Only one
font embedded: "No name".
o) File: mytest_
was printed from Brave using the "save as file"
option. All lines appear in the PDF, and can be
selected. All text is present. (No text added
by Brave).
Evince shows the many embedded fonts.
(Same notes may apply regarding fonts installed
on Ubuntu system)
USING WKHTMLTOPDF COMMAND
-------------------------
(Base file: mytest.html)
Same testcase as for browsers. I'm using
this example to show that multiple font
test output can be created in different ways.
Command:
wkhtmltopdf mytest.html mytest_wk.pdf
o) File: mytest_wk.pdf,
All lines appear in the PDF, and fonts
are sometimes quite different than those shown in
the browsers (they may actually be more correct).
All content can be highlighted and can be pasted
as text. Several text lines, however, contain
additional whitespace (tabs).
Evince shows the many embedded fonts.
NOTES
-----
The "creator" name embedded in the metadata for
these PDF files varies considerably, and it is
unclear to me whether the same engine is being
used by these various packages. It is clear, at
least, that cups-pdf is using Ghostscript for
PDF creation.
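
To compare the "creator" metadata across the test files without opening
each one in a viewer, something like the following could be used. It is
a sketch that assumes "pdfinfo" from poppler-utils is installed; the
helper names are hypothetical. pdfinfo prints one "Key: value" line per
field, including Creator and Producer.

```python
import subprocess
import sys


def parse_pdfinfo(output):
    """Turn pdfinfo's 'Key:   value' lines into a dict."""
    info = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as dates
        # that contain colons stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def creator_and_producer(pdf_path):
    """Run pdfinfo (assumed to be on PATH) and return (Creator, Producer)."""
    result = subprocess.run(
        ["pdfinfo", pdf_path],
        check=True,
        stdout=subprocess.PIPE,
    )
    info = parse_pdfinfo(result.stdout.decode("utf-8", errors="replace"))
    return info.get("Creator", ""), info.get("Producer", "")


if __name__ == "__main__":
    # Usage: python3 pdf_creator.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        creator, producer = creator_and_producer(path)
        print(f"{path}: Creator={creator!r} Producer={producer!r}")
```

Running "pdffonts" (also from poppler-utils) on the same files lists
the embedded fonts, and its "uni" column reports whether each font
carries a ToUnicode map. That may be a useful thing to compare between
the good and bad files, though nothing in this report confirms it is
the cause.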
ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: cups-pdf 2.6.1-21
ProcVersionSign
Uname: Linux 4.13.0-45-generic x86_64
ApportVersion: 2.20.1-0ubuntu2.18
Architecture: amd64
CurrentDesktop: Unity
Date: Wed Jun 27 15:36:32 2018
InstallationDate: Installed on 2017-05-16 (406 days ago)
InstallationMedia: Ubuntu 16.04.2 LTS "Xenial Xerus" - Release amd64 (20170215.2)
SourcePackage: cups-pdf
UpgradeStatus: No upgrade log present (probably fresh install)
Does this issue still affect cups-pdf 3.0.1 packages?