calibre

ebook-convert bug to and from word (docx)

Bug #1829246 reported by klaus schallhorn on 2019-05-15

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
calibre	New	Undecided	Unassigned

Bug Description

I'm working on parsing HTML created by Word and other Word-compatible processors. To build a large corpus of Word documents I converted, among other things, epubs I have here to docx, in order to then convert the generated docx files back into HTML using various routes.

When I use ebook-convert to convert an epub into a docx and back, links that contain a SPAN in the clickable part are processed incorrectly. Instead of a single HTML link I get three, all pointing to the URI of the original link. This is easier shown than described:

Original copy:

[p]This demonstrates what kind of [a href="http://www.example.com"]oddities linktext [span class="stdspamp"]&[/span] ampersands[/a] can produce.[/p]

After using ebook-convert to convert the epub _to_ docx, and then the resulting docx _from_ Word format to htmlz, with the options below, I get:

[p id="calibre_link-2" class="block_2"][span class="text_1"]This demonstrates what kind of [/span]
[a href="http://www.example.com" class="text_2"]oddities linktext [/a]
[a href="http://www.example.com" class="text_3"]&[/a]
[a href="http://www.example.com" class="text_2"] ampersands[/a][span class="text_1"] can produce.[/span][/p]

(I've inserted linefeeds for clarity only.)

Options I passed to ebook-convert for epub to docx:

ebook-convert INFILE Outfile.docx --docx-no-toc --unsmarten-punctuation --preserve-cover-aspect-ratio

And for docx to htmlz:

ebook-convert INFILE.docx Outfile.htmlz --docx-inline-subsup
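
For anyone who wants to repeat the round trip on more than one file, the two commands above can be wrapped in a small script. This is only a sketch: the function names and the output naming scheme (same stem as the epub) are my own assumptions, the flags are exactly the ones quoted above, and ebook-convert is assumed to be on the PATH.

```python
import subprocess
from pathlib import Path


def round_trip_commands(epub, workdir="."):
    """Build the two ebook-convert command lines quoted in this report.

    The output names (epub stem + .docx / .htmlz) are an assumption made
    for illustration; only the flags come from the report itself.
    """
    work = Path(workdir)
    stem = Path(epub).stem
    docx = work / (stem + ".docx")
    htmlz = work / (stem + ".htmlz")
    to_docx = [
        "ebook-convert", str(epub), str(docx),
        "--docx-no-toc", "--unsmarten-punctuation",
        "--preserve-cover-aspect-ratio",
    ]
    from_docx = [
        "ebook-convert", str(docx), str(htmlz),
        "--docx-inline-subsup",
    ]
    return to_docx, from_docx


def run_round_trip(epub, workdir="."):
    """Run epub -> docx -> htmlz; raises CalledProcessError on failure."""
    for cmd in round_trip_commands(epub, workdir):
        subprocess.run(cmd, check=True)
```

Calling run_round_trip("book.epub", "out") would run both conversions in order, provided the "out" directory already exists.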

The link is handled/exported correctly when I export from the calibre-generated docx using Word, LibreOffice or other compatible programs, so the error seems to occur when converting docx to htmlz.
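
Until this is fixed, the split links can be stitched back together in a post-processing step. The following is a rough sketch of that idea and not how calibre works internally: it assumes the real htmlz output uses ordinary angle-bracket tags (I wrote them with square brackets above), that each split anchor contains plain text only, and that attributes appear as href followed by an optional class. The function names are made up for illustration.

```python
import re

# One simple anchor: <a href="..." class="...">text</a>, class optional.
# Assumption: no nested tags inside the anchor, as in the output shown above.
ANCHOR = re.compile(
    r'<a href="(?P<href>[^"]*)"(?: class="(?P<cls>[^"]*)")?>(?P<text>[^<]*)</a>'
)


def _as_span(match):
    """Keep the per-fragment class by turning the fragment into a span."""
    if match["cls"]:
        return '<span class="{}">{}</span>'.format(match["cls"], match["text"])
    return match["text"]


def merge_adjacent_links(html):
    """Merge runs of directly adjacent anchors that share the same href."""
    matches = list(ANCHOR.finditer(html))
    out, pos, i = [], 0, 0
    while i < len(matches):
        run = [matches[i]]
        # Extend the run while the next anchor starts exactly where the
        # previous one ends and points at the same URI.
        while (i + 1 < len(matches)
               and matches[i + 1].start() == run[-1].end()
               and matches[i + 1]["href"] == run[0]["href"]):
            i += 1
            run.append(matches[i])
        out.append(html[pos:run[0].start()])
        if len(run) == 1:
            out.append(run[0].group(0))  # lone anchor: leave untouched
        else:
            inner = "".join(_as_span(m) for m in run)
            out.append('<a href="{}">{}</a>'.format(run[0]["href"], inner))
        pos = run[-1].end()
        i += 1
    out.append(html[pos:])
    return "".join(out)
```

Anchors separated by any text or whitespace are left alone, so two genuine links to the same URI in one sentence are not merged. Because I inserted linefeeds between the anchors in the example above, that text would have to be joined back into one line before the sketch recognises the three anchors as adjacent.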

I enclose the original epub, the docx and htmlz created by ebook-convert.