Unwrapping fails on non-latin stripts

Bug #1753533 reported by zefciu
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
calibre
Fix Released
Undecided
Unassigned

Bug Description

Calibre 2.55
Ubuntu

Attached a PDF file in Georgian. Steps to reproduce the problem:
* Import the PDF file.
* Open conversion dialog.
* Select conversion to Mobi
* Enable the heuristic conversions

Effect:
We get a Mobi file with next to none unwrapping

Now I found a line in calibre code where the alphanumeric regex is defined. In calibre/ebooks/conversion/utils.py

        lookahead = "(?<=.{"+unicode(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity

When I put ა-ჰ next to other alphabetic characters in this regex, the result was satisfactory. So this seems to be a culprit and unwrapping would probably fail with other non-latin scripts. I wonder what is the reason for hardcoding what is alphabetic instead of relying on Unicode data. Couldn't we just use \w with unicode mode?

Tags: conversion
Revision history for this message
zefciu (zefciu) wrote :
Revision history for this message
zefciu (zefciu) wrote :

Satisfactory effect when added ა-ჰ to the regex

Revision history for this message
zefciu (zefciu) wrote :

Almost no unwrapping with current code

Revision history for this message
Kovid Goyal (kovid) wrote :

Not my code, but if I had to guess, it would be because \w overmatches, leading to false positives. From eyeballing that regex, it does not include digits, for example, which \w would match.

Revision history for this message
zefciu (zefciu) wrote :

In that case, there's [^\W\d]

>>> pat = re.compile('^[^\W\d]+$')
>>> pat.match('Cześć123')
>>> pat.match('Cześć')
<_sre.SRE_Match object at 0x7fe132ae22a0>
>>> pat.match('გამარჯობა')
<_sre.SRE_Match object at 0x7fe132ae2308>

Revision history for this message
Kovid Goyal (kovid) wrote :

I never said the only over match was for numbers, just that numbers was one obvious example from eyeballing the regex. You'd have to ask the person that wrote the regex why they chose to do it that way, unfortunately they aren't around any more and this part of the code is largely unmaintained.

Revision history for this message
Kovid Goyal (kovid) wrote :

Oh and by the way [^\W\d] is not correct. IIRC \w includes numerals ina wide range of scripts, not just arabic numerals, while \d only matches Arabic numerals.

Revision history for this message
zefciu (zefciu) wrote :

No. If we use \d with the Unicode flag it will correctly recognize numeric characters that are not Arabic numerals

>>> pat = re.compile('^[^\W\d]+$', re.UNICODE)
>>> pat.match('१२३')
>>> pat = re.compile('^[^\W]+$', re.UNICODE)
>>> pat.match('१२३')
<_sre.SRE_Match object at 0x7f16dc7392a0>

Revision history for this message
Kovid Goyal (kovid) wrote : Re: calibre bug 1753533

Yes, I forgot that \d is also unicode respecting. But, as I said, the
point was not really about numbers. I think the safest change is to
simply add more ranges of alphabets as and when needed.

Revision history for this message
Kovid Goyal (kovid) wrote : Fixed in master

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

 status fixreleased

Changed in calibre:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.