break line/italic - wrong conversion

Bug #1205637 reported by Tomasz B
This bug affects 1 person
Affects: calibre
Status: Fix Released
Importance: Undecided
Assigned to: Lee
Milestone: (none)

Bug Description

Calibre 0.9.41 and earlier versions
There is a problem converting HTML files that contain italic text with BR tags inside: all text in the document AFTER this point is converted as italic.
Important: when you have only one "section" with <i> text <br> </i> text, it's OK, but when you have more than 2, the problem appears. Please take a look at the PNG file (attachment).

It's a good idea to convert from a docx file or a filtered HTML file from Microsoft Word, but that program closes the italic tag after the line-break tag. If you want to/can correct this Word bug... ;)

I noticed it before (conversion to AZW3), but it happened only occasionally and I could correct it in the HTML code. Unfortunately, I have to convert a book with different styles, so it's easiest to do it with Word in a docx file.

A sample HTML file listing is below the signature.

Calibre is the best - thank You.
Best regards,

Tomasz Ceglinski

------------------ test html file ----------------------------

<html><head><title>a-test</title></head>
<body lang=PL link=blue vlink=purple>

<h2>chapter</h2>

<p><i>italic. <br>
</i>normal after break line<br>
<br>
</p>

<p>normal text normal text normal text normal text normal text normal text</p>

<h2>chapter</h2>

<p><i>italic. <br>
</i>normal after break line<br>
<br>
</p>

<p>normal text normal text normal text normal text normal text normal text</p>

<h2>chapter</h2>

<p><i>italic. <br>
</i>normal after break line<br>
<br>
</p>

<p>normal text normal text normal text normal text normal text normal text</p>

</body>
</html>

--------------------------------

Revision history for this message
Tomasz B (c-tomasz) wrote :
Kovid Goyal (kovid) wrote :

I cannot reproduce this, converting your test.html to azw3 gives correct behavior for italics with calibre 0.9.41, see attached file.

Changed in calibre:
status: New → Invalid
Tomasz B (c-tomasz) wrote :

> I cannot reproduce this

I found THIS bug.

The problem is in heuristic processing.
It appears when you enable this section and check "delete blank lines between paragraphs". The bug also appears with other options in this section checked, so the problem may be compound.

I can reproduce this bug even on a clean portable version by checking these options.

Best regards
Tomasz Ceglinski

Kovid Goyal (kovid) wrote : Re: calibre bug 1205637

Changing the component for this bug.

 assignee user-none
 assignee ldolse
 tag preprocessing
 status triaged

Changed in calibre:
assignee: nobody → Lee (ldolse)
status: Invalid → Triaged
Tomasz B (c-tomasz) wrote :

Thank You very much ;)

Greetings from Poland.

Marc Na (ub40) wrote :

The expression seems to be https://github.com/kovidgoyal/calibre/blob/v3.2.1/src/calibre/ebooks/conversion/utils.py#L440

The problem is that <br/> matches <b[^>]*> (which was meant to capture <b>).

test case:

import re

# The pattern is meant to strip empty inline elements such as <i></i>.
html = "<p>We use <i>italics with a self-closing br<br/></i> element.</p>"
html = re.sub(
    r"\s*<(font|[ibu]|em|strong)[^>]*>\s*(<(font|[ibu]|em|strong)[^>]*>\s*</(font|[ibu]|em|strong)>\s*){0,2}\s*</(font|[ibu]|em|strong)>",
    " ", html)
print(html)

Result:

<p>We use <i>italics with a self-closing br  element.</p>

(Match = "<br/></i>")

test case #2:

import re

# <span/> is not in the tag list, so nothing is removed here.
html = "<p>We use <i>italics with a self-closing span<span/></i> element.</p>"
html = re.sub(
    r"\s*<(font|[ibu]|em|strong)[^>]*>\s*(<(font|[ibu]|em|strong)[^>]*>\s*</(font|[ibu]|em|strong)>\s*){0,2}\s*</(font|[ibu]|em|strong)>",
    " ", html)
print(html)

Result:

<p>We use <i>italics with a self-closing span<span/></i> element.</p>

(No match)
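
The same expression can be run against a paragraph from the reporter's test file, where Word puts the closing </i> after the <br>. The sketch below does that, and also tries one possible tightening: a \b word boundary after the tag name, so that "b" no longer matches the start of "br". This is only an illustration and not necessarily the change that was committed to calibre. The helper empty_inline_pattern and the names ORIGINAL and TIGHTENED are made up for this example, and the groups are written as non-capturing, which does not change what re.sub removes.

```python
import re

# Tag names copied from the expression quoted in the test cases above.
TAGS = r"(?:font|[ibu]|em|strong)"


def empty_inline_pattern(tag_names):
    """Build the 'empty inline element' pattern around a tag-name alternation.

    Hypothetical helper for this example; it is not a calibre function.
    """
    return re.compile(
        r"\s*<" + tag_names + r"[^>]*>\s*"
        r"(?:<" + tag_names + r"[^>]*>\s*</" + tag_names + r">\s*){0,2}"
        r"\s*</" + tag_names + r">"
    )


ORIGINAL = empty_inline_pattern(TAGS)
# Assumed tightening: \b requires the tag name to end right after i, b, u, ...
TIGHTENED = empty_inline_pattern(TAGS + r"\b")

# One paragraph from the reporter's test file.
sample = "<p><i>italic. <br>\n</i>normal after break line<br>\n<br>\n</p>"

broken = ORIGINAL.sub(" ", sample)   # "<br>\n</i>" is treated as an empty <b> element
fixed = TIGHTENED.sub(" ", sample)   # nothing in the sample matches any more

print(repr(broken))  # the closing </i> is gone, so the italics never end
print(repr(fixed))   # identical to the input
```

With the original pattern the </i> is swallowed together with the <br>, which leaves the <i> unclosed and explains why all following text turns italic. With the word boundary the sample passes through unchanged, while a genuinely empty element such as <b></b> is still removed.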

Kovid Goyal (kovid) wrote : Fixed in master

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

 status fixreleased

Changed in calibre:
status: Triaged → Fix Released