calibre

Convert epub to pdf, pdf appearance looks correct, but some of the copied text is incorrect

Bug #1857886 reported by moka on 2019-12-30

Affects   Status         Importance   Assigned to   Milestone
calibre   Fix Released   Undecided    Unassigned

Bug Description

* The calibre version (get this by looking at the bottom of the main calibre screen)
I've tried both 4.6.0 and 4.7.0.

* The operating system you are running calibre on (Windows, OS X, Linux)
I tried Windows 10, but I think other Windows versions have the same problem.

* My issue
When an EPUB file containing Chinese text is converted to PDF, the pages look correct, but some of the text is garbled when it is copied out of the PDF. I suspect a problem in how the PDF output handles CMap/CID data. I would like several parameters added to the PDF output options so that individual post-processing steps can be switched off while debugging the output. I would rather have a larger document than incorrect text.

The following code is taken from "src\calibre\ebooks\pdf\html_writer.py"

if opts.pdf_merge_fonts:
    merge_fonts(pdf_doc)

if opts.pdf_dedup_type3_fonts:
    num_removed = dedup_type3_fonts(pdf_doc)
    if num_removed:
        log('Removed', num_removed, 'duplicated Type3 glyphs')

if opts.pdf_remove_unused_fonts:
    num_removed = remove_unused_fonts(pdf_doc)
    if num_removed:
        log('Removed', num_removed, 'unused fonts')

if opts.pdf_dedup_images:
    num_removed = pdf_doc.dedup_images()
    if num_removed:
        log('Removed', num_removed, 'duplicate images')
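
A minimal sketch of how such debugging switches could be wired up, assuming each post-processing step is a callable that takes the PDF document. The helper `run_optional_passes` and its `passes` argument are hypothetical and not part of calibre; only the flag names follow the snippet above. A missing flag is treated as enabled, on the assumption that the steps should still run by default.

```python
def run_optional_passes(pdf_doc, opts, passes, log=print):
    """Run each post-processing pass only when its flag on `opts` is truthy.

    `passes` is a list of (flag_name, callable) pairs. This is a hypothetical
    helper for illustration, not calibre's API. A flag that is absent from
    `opts` counts as enabled, so default behaviour is "run everything".
    Returns the names of the flags whose passes actually ran.
    """
    ran = []
    for flag, func in passes:
        if getattr(opts, flag, True):
            func(pdf_doc)
            ran.append(flag)
        else:
            log('Skipped', flag)
    return ran
```

Switching off one step at a time and copying the text again after each conversion would show which step, if any, damages the text mapping.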

* If you are reporting a conversion problem, attach the input file and the output file and describe exactly what the problem is.

In the attached screenshot, the PDF reader is on the left, and the appearance there is fine. On the right is the text selected on page 4 and pasted into Notepad. All the text marked in red in the figure is wrong.
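
The comparison in the screenshot was done by eye. A rough sketch of doing the same check in code, assuming the expected text (from the EPUB) and the text copied from the PDF are both available as strings. `find_garbled_spans` is a hypothetical helper name; it uses only Python's standard `difflib`.

```python
import difflib


def find_garbled_spans(expected, copied):
    """Return (expected_fragment, copied_fragment) pairs where the texts differ.

    An empty fragment on one side means text was inserted or dropped there.
    """
    # autojunk=False keeps the matcher from treating frequent characters
    # as junk in long texts.
    matcher = difflib.SequenceMatcher(None, expected, copied, autojunk=False)
    return [
        (expected[i1:i2], copied[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != 'equal'
    ]
```

Each returned pair shows what the source contained and what the PDF yielded at that position, which makes it easier to see whether the same characters are consistently mapped to the wrong code points.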

Below is my command line and output:

D:\software\calibre\calibre-4.7.0\Calibre Portable\Calibre>ebook-convert.exe D:\software\calibre\epub\test1.epub D:\software\calibre\epub\test1.pdf --base-font-size=14 --pdf-sans-family=微软雅黑 --pdf-serif-family=微软雅黑
Conversion options changed from defaults:
  base_font_size: 14.0
  pdf_serif_family: u'\u5fae\u8f6f\u96c5\u9ed1'
  pdf_sans_family: u'\u5fae\u8f6f\u96c5\u9ed1'
1% 将输入转换为HTML中...
InputFormatPlugin: EPUB Input running
on D:\software\calibre\epub\test1.epub
Found HTML cover titlepage.xhtml
Parsing all content...
34% 正在对电子书进行转换...
Merging user specified metadata...
Detecting structure...
Flattening CSS and remapping font sizes...
Source base font size is 13.20000pt
Removing fake margins...
Cleaning up manifest...
Trimming unused files from manifest...
Creating PDF Output...
67% 正在运行 PDF Output 插件
D:\software\calibre\calibre-4.7.0\Calibre Portable\Calibre\\app\pylib.zip\dateutil\parser\_parser.py:1177: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
The cover image has an id != "cover". Renaming to work around bug in Nook Color
68% Parsed all content for markup transformation
70% Completed markup transformation
90% Rendered all HTML as PDF
91% Added links to PDF content
100% Updated metadata in PDF
PDF output written to D:\software\calibre\epub\test1.pdf
输出保存到 D:\software\calibre\epub\test1.pdf