Convert epub to pdf, pdf appearance looks correct, but some of the copied text is incorrect

Bug #1857886 reported by moka on 2019-12-30
This bug affects 1 person
Affects: calibre
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned

Bug Description

* The calibre version (get this by looking at the bottom of the main calibre screen)
I've tried both 4.6.0 and 4.7.0

* The operating system you are running calibre on (Windows, OS X, Linux)
I tried Windows 10, but I think other Windows versions have the same problem.

* My issue
When an EPUB file containing Chinese text is converted to PDF, some of the text comes out garbled when copied. I suspect a problem in how the PDF output handles CMap/CID data. I would therefore like several parameters added to the PDF output options so that output problems like this can be debugged. I would rather have a larger document than incorrect text.

The following code is taken from "src\calibre\ebooks\pdf\html_writer.py"

if opts.pdf_merge_fonts:
    merge_fonts(pdf_doc)

if opts.pdf_dedup_type3_fonts:
    num_removed = dedup_type3_fonts(pdf_doc)
    if num_removed:
        log('Removed', num_removed, 'duplicated Type3 glyphs')

if opts.pdf_remove_unused_fonts:
    num_removed = remove_unused_fonts(pdf_doc)
    if num_removed:
        log('Removed', num_removed, 'unused fonts')

if opts.pdf_dedup_images:
    num_removed = pdf_doc.dedup_images()
    if num_removed:
        log('Removed', num_removed, 'duplicate images')
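
To illustrate why a PDF can look right and still copy wrong, here is a toy model in Python. It is not calibre's code and not a real PDF parser; the data structures, function names and the specific merge mistake are all assumptions made for illustration. The idea is that the page stores glyph IDs, the font program decides what is drawn, and a separate ToUnicode map decides what copy/paste returns. If a font-merging pass renumbers glyphs but fails to renumber the ToUnicode entries the same way, rendering stays correct while copied text breaks.

```python
# Toy model only: hypothetical data, not calibre's implementation.

def render(glyph_ids, shapes):
    """What the reader sees: shapes looked up by glyph ID."""
    return "".join(shapes[g] for g in glyph_ids)


def copy_text(glyph_ids, to_unicode):
    """What copy/paste yields: characters looked up in the ToUnicode map."""
    return "".join(to_unicode.get(g, "\ufffd") for g in glyph_ids)


def merge_subsets(a, b, remap_to_unicode):
    """Merge font subset b into a, renumbering b's glyph IDs.

    With remap_to_unicode=False the ToUnicode entries of b keep their old
    IDs, which simulates the assumed bug.
    """
    offset = max(a["shapes"])
    shapes = dict(a["shapes"])
    to_unicode = dict(a["to_unicode"])
    renumber = {}
    for gid, shape in b["shapes"].items():
        renumber[gid] = gid + offset
        shapes[gid + offset] = shape
    for gid, char in b["to_unicode"].items():
        key = renumber[gid] if remap_to_unicode else gid
        to_unicode[key] = char
    return shapes, to_unicode, renumber


# Two subsets of one font, each numbering its glyphs from 1.
subset_a = {"shapes": {1: "大", 2: "风"}, "to_unicode": {1: "大", 2: "风"}}
subset_b = {"shapes": {1: "天", 2: "手"}, "to_unicode": {1: "天", 2: "手"}}

text_a = [1, 2]  # glyph IDs drawn with subset A
text_b = [1, 2]  # glyph IDs drawn with subset B

shapes, bad_map, renumber = merge_subsets(subset_a, subset_b, remap_to_unicode=False)
_, good_map, _ = merge_subsets(subset_a, subset_b, remap_to_unicode=True)
page = text_a + [renumber[g] for g in text_b]

print("rendered:       ", render(page, shapes))
print("copied (buggy): ", copy_text(page, bad_map))
print("copied (fixed): ", copy_text(page, good_map))
```

In this model the rendered string is identical in both cases; only the copied text differs, which matches the symptom described in this report. Whether the real cause is of this kind is not established here.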

* If you are reporting a conversion problem, attach the input file and the output file and describe exactly what the problem is.

The attachment shows the PDF reader on the left, where the appearance is fine, and on the right the text selected on page 4 and pasted into Notepad. All the text marked in red in the figure is wrong.

Below is my command line and output:

D:\software\calibre\calibre-4.7.0\Calibre Portable\Calibre>ebook-convert.exe D:\software\calibre\epub\test1.epub D:\software\calibre\epub\test1.pdf --base-font-size=14 --pdf-sans-family=微软雅黑 --pdf-serif-family=微软雅黑
Conversion options changed from defaults:
  base_font_size: 14.0
  pdf_serif_family: u'\u5fae\u8f6f\u96c5\u9ed1'
  pdf_sans_family: u'\u5fae\u8f6f\u96c5\u9ed1'
1% Converting input to HTML...
InputFormatPlugin: EPUB Input running
on D:\software\calibre\epub\test1.epub
Found HTML cover titlepage.xhtml
Parsing all content...
34% Running transforms on e-book...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 13.20000pt
Removing fake margins...
Cleaning up manifest...
Trimming unused files from manifest...
Creating PDF Output...
67% Running PDF Output plugin
D:\software\calibre\calibre-4.7.0\Calibre Portable\Calibre\\app\pylib.zip\dateutil\parser\_parser.py:1177: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
The cover image has an id != "cover". Renaming to work around bug in Nook Color
68% Parsed all content for markup transformation
70% Completed markup transformation
90% Rendered all HTML as PDF
91% Added links to PDF content
100% Updated metadata in PDF
PDF output written to D:\software\calibre\epub\test1.pdf
Output saved to D:\software\calibre\epub\test1.pdf

moka (mokacao) wrote :
Kovid Goyal (kovid) wrote :

Embed the fonts you are using in the epub file and attach that, so I can
reproduce.

 status incomplete

Changed in calibre:
status: New → Incomplete
moka (mokacao) wrote :

The attachment is the Microsoft YaHei font I used.

Kovid Goyal (kovid) wrote :

That looks like a bug in whatever PDF viewing software you are using. The PDF you attached renders fine in both Acrobat Reader and Okular on my system.

Changed in calibre:
status: Incomplete → Invalid
Kovid Goyal (kovid) wrote :

This is the text as copied using okular on my system: 6月26日星期日 大风天 亲眼看 我男朋友 着他新欢的
手,在新光天地里 喷香水的那 刻,

Kovid Goyal (kovid) wrote :

Never mind, I think I see the issue

Changed in calibre:
status: Invalid → New

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

 status fixreleased

Changed in calibre:
status: New → Fix Released
moka (mokacao) wrote :

Thank you for your quick response to my question.

I've verified it on branch master. The copied text is completely correct.

I will run more tests on ePub documents. If I find any problems, I will report them here.
Thank you again.

Xavier Berger (xsiberger) wrote :

I have a similar problem. I convert my Kindle book to PDF to use it on my iPad with MarginNote for studying. The highlighted text is converted automatically into flash cards and a mind map, but some letters are replaced with "?" (question mark). The same happens when I copy/paste text from the converted PDF in the SumatraPDF viewer to Notepad++. In my case all "P" and "Q" are replaced with "?". I am not sure whether it is a viewer problem or something in the PDF. Acrobat Reader does not have a problem when I copy/paste text.

I am not sure whether I am missing a setting. I tried every option for embedding the fonts, but have had no luck so far. It also seems to be somewhat random, depending on which font I embedded: sometimes it is the letter "P", "J", or "V" that gets replaced with "?".
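
For anyone trying to pin down which letters are affected, a small script can compare the source text with the text pasted from the viewer. This is a minimal sketch using only Python's standard library; the function name and the sample strings are made up for illustration, and it assumes each bad character is replaced one-for-one by "?".

```python
from collections import Counter
from difflib import SequenceMatcher


def replaced_by_question_mark(expected, copied):
    """Count source characters that appear as '?' in the copied text."""
    counts = Counter()
    matcher = SequenceMatcher(None, expected, copied, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only look at replaced spans of equal length, so characters
        # can be paired up one-for-one.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            continue
        for src, dst in zip(expected[i1:i2], copied[j1:j2]):
            if dst == "?" and src != "?":
                counts[src] += 1
    return counts


# Hypothetical sample: text from the book vs. text pasted from the PDF viewer.
expected = "Please Quote the Price"
copied = "?lease ?uote the ?rice"

result = replaced_by_question_mark(expected, copied)
for char, n in sorted(result.items()):
    print(f"{char!r} replaced {n} time(s)")
```

Running this against the output of two different viewers for the same PDF would help show whether the bad mapping is in the file or in the viewer.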
