ebook-convert bug to and from word (docx)
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
calibre | New | Undecided | Unassigned |
Bug Description
I'm working on parsing HTML created by Word and other Word-compatible processors. To get a large body of Word documents, I converted (among other things) epubs I have here to docx, in order to then convert the generated docx files back into HTML using various routes.
When I use ebook-convert to convert an epub into a docx and back, links that contain a SPAN in the clickable part are processed incorrectly. Instead of a single HTML link I get three, all pointing to the URI of the original link. This is easier shown than described:
Original copy:
[p]This demonstrates what kind of [a href="http://
After using ebook-convert to convert _to_ docx, and then converting the resulting docx back _from_ Word format with the options below, I get:
[p id="calibre_link-2" class="
[a href="http://
[a href="http://
[a href="http://
(I've inserted linefeeds for clarity only.)
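To check a larger set of converted files for this symptom, something like the following Python sketch might help. It is not part of calibre; the names `AdjacentLinkFinder` and `find_split_links` are made up for illustration. It assumes the output uses ordinary angle-bracket HTML and that a split link shows up as consecutive `a` elements with the same `href`, separated by nothing but whitespace.

```python
from html.parser import HTMLParser


class AdjacentLinkFinder(HTMLParser):
    """Collect runs of consecutive <a> elements that share one href."""

    def __init__(self):
        super().__init__()
        self.runs = []         # finished runs as (href, link_count), count > 1
        self._href = None      # href of the run currently being tracked
        self._count = 0        # number of links seen in the current run
        self._in_link = False  # True while inside an <a> element

    def _flush(self):
        # Record the current run if it holds more than one link, then reset.
        if self._count > 1:
            self.runs.append((self._href, self._count))
        self._href, self._count = None, 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href != self._href:
                # A different target starts a new run.
                self._flush()
                self._href = href
            if href is not None:
                self._count += 1
            self._in_link = True
        elif not self._in_link:
            # Any other element between two links breaks the run.
            self._flush()

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False
        elif not self._in_link:
            self._flush()

    def handle_data(self, data):
        # Whitespace between links (e.g. linefeeds) does not break a run;
        # visible text between them does.
        if not self._in_link and data.strip():
            self._flush()


def find_split_links(html):
    """Return (href, count) for every run of adjacent same-target links."""
    parser = AdjacentLinkFinder()
    parser.feed(html)
    parser.close()
    parser._flush()  # record a run that reaches the end of the input
    return parser.runs
```

Feeding it the HTML extracted from an htmlz file returns one `(href, count)` pair per suspicious run. Treating whitespace-only text as "still adjacent" is a judgment call; it can flag two deliberately separate links to the same URI that are only separated by a space.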
Options I passed to ebook-convert for the epub to docx step:
ebook-convert INFILE Outfile.docx --docx-no-toc --unsmarten-
And for the docx to htmlz step:
ebook-convert INFILE.docx Outfile.htmlz --docx-
The link is exported correctly when I export from the calibre-generated docx using Word, LibreOffice or other compatible programs. It therefore seems that the error occurs in the docx to htmlz conversion.
I enclose the original epub, the docx and htmlz created by ebook-convert.