EPUB to DOCX Conversation

Bug #1455502 reported by Armin Geller on 2015-05-15
This bug affects 1 person
Affects: calibre
Importance: Undecided
Assigned to: Unassigned

Bug Description

Relating to our conversation at MR:

http://www.mobileread.com/forums/showthread.php?p=3102382#post3102382

I just ran some tests with a cookbook (a hard example). I noticed a few things:

a) Inline ToC links are "external" links, where I need to use [ctrl+enter] to jump to the target text.
b) There is a mix of headlines that are sometimes read correctly and sometimes recognized as paragraphs.
c) Pictures are losing their original aspect ratio. This seems to depend on the output profile too.
d) I have page breaks within recipes. It seems there is some hotchpotch in translating the styles.

Here is the EPUB. Password will come via Mail at MR due to copyright.

Best regards,
Armin

Kovid Goyal (kovid) wrote :

That file is rather large and I really don't feel like reading through it to find the issues. Can you point to some specific places in the file that show them? Feel free to delete the file, as I have downloaded it.

The only issue I could spot scrolling quickly through the first few pages was that some centered images were being rendered partially off the screen, because they were rendered as inline images in a centered block instead of as floating images. This will be fixed in the next release.

Armin Geller (armingeller) wrote :

Hi Kovid,

Yes, it is. Sorry for that. Here are some issues:

If you look at the DOCX file (please run a conversion first), you will find the cover on page one, followed by a kind of inline TOC for the EPUB TOC on pages 2 and 3. But if you look at the left side of the Word window, only part of the TOC listing is identified as headline entries. Everything above "Honigsenf" and below "Honig-Balsamico-Dressing mit Sommersalat" is missing. Please see picture TOC_1_docx.JPG for the TOC and headlines. It looks like you are losing some of the h1, h2 ... during the conversion because of the horrible <div> constructions in the file (please take a look at the EPUB file Kap_001.xhtml, where [ <h2 class="uservice" id="ID443000530">Küchenhelfer für großes Saucenglück</h2> ] ends up as a paragraph style in the DOCX).

As I also noticed, there are a lot of headlines that are not declared correctly (e.g. [ <div class="ueberschrift_service1" id="ID443000460">Die wunderbare Welt der Würzsaucen</div> ] in file Kap_001.xhtml). I guess these need to be corrected manually in a second step with an XPath adjustment, as I have no idea how you could identify them as headline entries in general (maybe it is possible to use the real TOC from the EPUB if there is a discrepancy between the TOC elements in the text and the real TOC). Please see picture "TOC_2_and_aspect_ratio.JPG".
In this picture you will also see the aspect ratio problem with one of the pictures, [<img alt="IMG" src="../Images/8338-4430-0-006-01.jpg"/>] or [<img alt="IMG" src="../Images/8338-4430-0-006-07.jpg"/>] in file Kap_001.xhtml. You will find more in other files, as in the next issue. It looks like this happens mainly with pictures in portrait format.
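
To illustrate the kind of manual adjustment meant for the mis-declared headlines, here is a minimal sketch that lists <div> elements whose class name starts with "ueberschrift". The function name, the class prefix and the sample markup are assumptions for illustration only; this is not how calibre itself detects headings.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Small stand-in for Kap_001.xhtml, built from the two snippets quoted above.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<div class="ueberschrift_service1" id="ID443000460">Die wunderbare Welt der Würzsaucen</div>
<h2 class="uservice" id="ID443000530">Küchenhelfer für großes Saucenglück</h2>
<div class="text">Normal paragraph text</div>
</body></html>"""


def find_pseudo_headings(xhtml, class_prefix="ueberschrift"):
    """Return (id, text) for every <div> that has a class starting with class_prefix."""
    root = ET.fromstring(xhtml)
    hits = []
    for div in root.iter("{%s}div" % XHTML_NS):
        classes = div.get("class", "").split()
        if any(c.startswith(class_prefix) for c in classes):
            hits.append((div.get("id"), "".join(div.itertext()).strip()))
    return hits


for div_id, text in find_pseudo_headings(SAMPLE):
    print(div_id, text)
```

Ignoring namespaces, the same selection could be written in XPath 1.0 roughly as //div[starts-with(@class, 'ueberschrift')].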

Page breaks within recipes. You will find this in file Kap_002.xhtml, in the recipe called "Honigsenf". This recipe was split into 4 pages in Word: one each for the picture and part of the preparation, followed by a page with the ingredients and, as the last page, the cooking instructions. Please see Recipe_Honigsenf_DOCX.JPG and Recipe_Honigsenf_EPUB.JPG.

For the conversion I switched off all automatics. Please see pictures "Conv_Setup1.JPG" to "Conv_Setup7.JPG". The output format was A4.

In addition, I ran a test with the new calibre v2.29. There is a new issue with pictures: a picture is now placed inside the text flow in an irritating way. The text starts on the left, is then cut off where the picture is placed, and the rest of the line continues on the right. Please see pictures "Recipe_Feigensenf_DOCX.JPG" and "Recipe_Feigensenf_EPUB.JPG".

Best regards,
Armin

Kovid Goyal (kovid) wrote :

Sorry, I just realized that I deleted my copy of your file accidentally
during a cleanup, so can you attach it again?

Armin Geller (armingeller) wrote :

Here it is. The PW is the same as before in my PM via MobileRead. Let me know if you need it.
Best regards,
Armin

Kovid Goyal (kovid) wrote :

Regarding headings, those are not taken from the ToC, as there is no
guarantee that toc items point to actual headings in the file. Instead,
they are taken from the html tags (heading tags like <h1-6> are
converted to word heading styles).
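
As a rough illustration of the rule described above (not calibre's actual code), a tag-based mapping might look like the following sketch. The function name is hypothetical, and "Heading 1" to "Heading 6" are assumed to be the target names because they are Word's built-in English heading style names.

```python
import re

# Matches exactly h1 .. h6, case-insensitively.
HEADING_RE = re.compile(r"^h([1-6])$", re.IGNORECASE)


def word_style_for_tag(tag):
    """Map an HTML tag name to a Word paragraph style name."""
    match = HEADING_RE.match(tag)
    if match:
        return "Heading %s" % match.group(1)
    # Anything else, including a <div> that merely looks like a heading,
    # falls back to a normal paragraph style.
    return "Normal"


print(word_style_for_tag("h2"))
print(word_style_for_tag("div"))
```

Under such a rule, a <div class="ueberschrift_service1"> can never become a Word heading, regardless of what the ToC points at.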

Regarding the floating images you see in 2.29, I cannot reproduce that.
When I run the conversion, the resulting images for Feigensenf etc. are
all by themselves on a single page in Word 2007.

The other issues have all been fixed.

 status fixreleased

Changed in calibre:
status: New → Fix Released
Armin Geller (armingeller) wrote :

Thanks for the update and for looking into this. :)
Since today is a holiday here in Germany, I had some time to look at this a bit deeper this morning.

Regarding the floating pictures, I guess I found the issue. There is a change in the layout settings for pictures from 2.28 to 2.29 if I compare the two DOCX versions.

Please see attached file. They show all picture layout settings for both versions.
Compared with the old version, the layout switches from absolute position to an alignment, and text wrapping switches from "in line with text" to "square" (I guess that is what it is called in English). There is also something wrong with the sizing. Sometimes the aspect ratio is lost because different scaling is used for height and width.
I am not sure, but maybe this happens if there is no defined setup for the picture template and Word tries to guess its own value for the missing data.
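
To make the sizing point concrete: the aspect ratio survives only if the same scale factor is applied to both height and width. The sketch below shows that idea with a hypothetical helper and made-up dimensions; it is not taken from calibre or Word.

```python
def fit_preserving_aspect(width, height, max_width, max_height):
    """Scale (width, height) to fit inside the given box with ONE common factor.

    Images that already fit are left at their original size (no upscaling).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = min(max_width / width, max_height / height, 1.0)
    return width * scale, height * scale


# A portrait picture placed into a smaller page area (made-up numbers).
new_w, new_h = fit_preserving_aspect(600, 900, 400, 500)
print(new_w / new_h, 600 / 900)  # both ratios should agree
```

Using separate factors for width and height, as the layout settings in the attached file suggest, is what distorts portrait pictures.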

All the differences I saw with Word 2013. System is Windows 8.1 64bit.

I will test the issues again with the new calibre version and come back with feedback.

Best regards,
Armin

Armin Geller (armingeller) wrote :

Hi Kovid,
a lot of the issues are gone, except the one with the placement of the pictures. The layout is now set to wrap text around pictures. This is maybe not the best setting. The "top and bottom" setup may be better as a standard output layout. I attach the converted DOCX so you can see the output as it is today. I used the same PW again.

One thought that came to me: it would be great to have an entry in the output dialog for this kind of problem, so that the user is a bit more flexible if no rule comes with the CSS. Something like a selection between in line with text, tight, through, square, and top and bottom.
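
As a sketch of what such an option could map to: in the DOCX markup, a picture is placed either in a wp:inline container or in a wp:anchor container with a wrap child element. The option names and the function below are hypothetical (calibre has no such setting as far as this report shows); only the wp: element names come from the DOCX drawing markup.

```python
# Hypothetical option value -> wrap element inside <wp:anchor>.
WRAP_ELEMENTS = {
    "square": "wp:wrapSquare",
    "tight": "wp:wrapTight",
    "through": "wp:wrapThrough",
    "top_and_bottom": "wp:wrapTopAndBottom",
}


def container_and_wrap(mode):
    """Return (container element, wrap element or None) for a wrap mode."""
    if mode == "inline":
        # "In line with text" uses <wp:inline>, which has no wrap child.
        return ("wp:inline", None)
    try:
        return ("wp:anchor", WRAP_ELEMENTS[mode])
    except KeyError:
        raise ValueError("unknown wrap mode: %r" % (mode,))


print(container_and_wrap("inline"))
print(container_and_wrap("top_and_bottom"))
```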

Thanks again for the corrections.

Best regards,
Armin

Ukie (ukiews) wrote :

I second armingeller, the images should be "in-line" by default.
