Heuristics breaking id="_anchor_"

Bug #986298 reported by Kovid Goyal
Affects: calibre
Status: Fix Released
Importance: Undecided
Assigned to: Lee
Milestone: (none)

Bug Description

Heuristics causes a tag like this:

<p id="_anchor_">xxx</p>

to become:

<p id="&lt;i&gt;anchor&lt;/i&gt;" class="calibre1">xxx</p>

Although id="_anchor_" is not valid HTML, Word apparently generates it, so it's worth fixing. See http://www.mobileread.com/forums/showthread.php?t=175915

Of course, given the way heuristics works, this may be difficult. I've attached a test case that reproduces the bug with

ebook-convert heu.html oeb --enable-heuristics && cat oeb/heu.html

Kovid Goyal (kovid) wrote :
Changed in calibre:
assignee: nobody → Lee (ldolse)
status: New → Triaged
Lee (ldolse) wrote :

Huh - I hadn't seen it do that before, but agree it's definitely something that merits fixing. I'll dig into it.

Lee (ldolse) wrote :

Copying in John. This is caused by the italicize feature, which implements logic similar to the Markdown/Textile processors. I'm not sure there is an easy way to optimize those patterns to avoid this case.
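
For illustration, a minimal sketch of how a Markdown-style underscore pattern can damage an attribute when it is run over raw markup. The pattern and the function name `naive_italicize` are assumptions made for this example, not calibre's actual heuristics code:

```python
import re

# Hypothetical stand-in for one italicize pattern: an underscore-delimited
# run that follows whitespace, '>' or an opening quote character.
NAIVE_UNDERSCORE = re.compile(r'(?<=[\s>"“\'‘])_(?P<words>[^_]+)_')

def naive_italicize(html):
    """Apply the pattern to raw markup with no awareness of tags."""
    return NAIVE_UNDERSCORE.sub(r'<i>\g<words></i>', html)

# The opening quote of the attribute value satisfies the lookbehind,
# so the id is rewritten just like ordinary text would be.
print(naive_italicize('<p id="_anchor_">xxx</p>'))
print(naive_italicize('<p>some _word_ here</p>'))
```

In this sketch the injected `<i>` tags land inside the attribute value unescaped; the `&lt;i&gt;` form in the report would then presumably come from escaping at serialization time.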

Lee (ldolse) wrote :

I've submitted a potential fix to my branch. I've attempted to work around the problem entirely by searching a text-only version of the document for matches and then replacing those matches in the HTML version of the document. This should generally eliminate any future variants of this issue. The only exception would be if a user had an anchor that happened to use exactly the same text as an italicize pattern match.
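
A rough sketch of that idea, assuming a simple regex-based tag stripper; the names `TAG`, `UNDERSCORE` and `italicize_via_text` are hypothetical and the code in the branch may well differ:

```python
import re

TAG = re.compile(r'<[^>]*>')
# Same hypothetical underscore pattern as used for illustration elsewhere.
UNDERSCORE = re.compile(r'(?<=[\s>"“\'‘])_(?P<words>[^_]+)_')

def italicize_via_text(html):
    """Find matches in a text-only view, then replace them in the markup."""
    # Replace tags with spaces so attribute values are never searched and
    # words on either side of a tag stay separated.
    text = TAG.sub(' ', html)
    for match in UNDERSCORE.finditer(text):
        literal = match.group(0)
        # Literal replacement in the HTML: this is where an anchor with the
        # exact same text as a match would still be affected.
        html = html.replace(literal, '<i>%s</i>' % match.group('words'))
    return html

print(italicize_via_text('<p id="_anchor_">an _emphasised_ word</p>'))
print(italicize_via_text('<p id="_anchor_">see _anchor_</p>'))
```

The first call leaves the id alone because `_anchor_` never appears in the text-only view. The second call shows the exception described above: the body text matches, so the literal replacement also touches the identical id.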

I also re-ordered the patterns so that the longest precede the shortest; the previous order was causing some patterns not to match properly.
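
To show why the order matters, here is a small sketch with two made-up delimiter patterns (assumptions for this example only, not the actual pattern list):

```python
import re

# Hypothetical patterns: a double-tilde and a single-tilde delimiter.
DOUBLE_TILDE = re.compile(r'~~(?P<words>[^~]+)~~')
SINGLE_TILDE = re.compile(r'~(?P<words>[^~]+)~')

def apply_in_order(patterns, text):
    """Run each substitution in turn, in the order given."""
    for pattern in patterns:
        text = pattern.sub(r'<i>\g<words></i>', text)
    return text

sample = 'a ~~word~~ b'
# Shortest first: the single-tilde pattern consumes the inner delimiters
# and leaves stray tildes that the longer pattern can no longer match.
shortest_first = apply_in_order([SINGLE_TILDE, DOUBLE_TILDE], sample)
# Longest first: the double-tilde pattern matches the whole run cleanly.
longest_first = apply_in_order([DOUBLE_TILDE, SINGLE_TILDE], sample)
print(shortest_first)
print(longest_first)
```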

As this particular function was originally written by John, I defer to him on what the final solution should be.

Lee (ldolse) wrote :

Updated the original file to include all the possible patterns for testing.

Kovid Goyal (kovid) wrote : Re: calibre bug 986298

Would wrapping the regex in >[^<]*original pattern[^<]*< do the trick?

Lee (ldolse) wrote :

I couldn't get your suggestion working, but I was able to get it working by adding a negative lookahead to the end:
(?![^<]*?>)

e.g.:
(?<=[\s>"“\'‘])_(?P<words>[^_]+)_(?![^<]*?>)

My concern is that John worked through many regex variations of these patterns trying to quash problems like this, and I'm not sure whether this one will introduce some other unforeseen issue. It may have already been tried.
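
A quick sketch of what the lookahead variant does on the test case. The regex is the one quoted above; the wrapper function `italicize_outside_tags` is a hypothetical name used only for this example:

```python
import re

# Pattern as quoted above: the trailing negative lookahead rejects a match
# if a '>' can be reached before any '<', i.e. if we appear to be in a tag.
GUARDED_UNDERSCORE = re.compile(
    r'(?<=[\s>"“\'‘])_(?P<words>[^_]+)_(?![^<]*?>)')

def italicize_outside_tags(html):
    return GUARDED_UNDERSCORE.sub(r'<i>\g<words></i>', html)

print(italicize_outside_tags('<p id="_anchor_">xxx</p>'))
print(italicize_outside_tags('<p>some _word_ here</p>'))
print(italicize_outside_tags('<p>x _word_ > y</p>'))
```

The third call hints at one kind of unforeseen issue: an unescaped `>` in body text after the match looks like the end of a tag to the lookahead, so that match is skipped.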

Kovid Goyal (kovid) wrote :

I'm OK with your original solution; I agree that playing with regexes is potentially fragile. I'll go ahead and merge.

Kovid Goyal (kovid) wrote : Fixed in lp:calibre

Fixed in branch lp:calibre. The fix will be in the next release. calibre is usually released every Friday.

 status fixreleased

Changed in calibre:
status: Triaged → Fix Released