Unwrapping fails on non-latin stripts

Bug #1753533 reported by zefciu on 2018-03-05
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
calibre
Undecided
Unassigned

Bug Description

Calibre 2.55
Ubuntu

Attached a PDF file in Georgian. Steps to reproduce the problem:
* Import the PDF file.
* Open conversion dialog.
* Select conversion to Mobi
* Enable the heuristic conversions

Effect:
We get a Mobi file with next to none unwrapping

Now I found a line in calibre code where the alphanumeric regex is defined. In calibre/ebooks/conversion/utils.py

        lookahead = "(?<=.{"+unicode(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity

When I put ა-ჰ next to other alphabetic characters in this regex, the result was satisfactory. So this seems to be a culprit and unwrapping would probably fail with other non-latin scripts. I wonder what is the reason for hardcoding what is alphabetic instead of relying on Unicode data. Couldn't we just use \w with unicode mode?

zefciu (zefciu) wrote :
zefciu (zefciu) wrote :

Satisfactory effect when added ა-ჰ to the regex

zefciu (zefciu) wrote :

Almost no unwrapping with current code

Kovid Goyal (kovid) wrote :

Not my code, but if I had to guess, it would be because \w overmatches, leading to false positives. From eyeballing that regex, it does not include digits, for example, which \w would match.

zefciu (zefciu) wrote :

In that case, there's [^\W\d]

>>> pat = re.compile('^[^\W\d]+$')
>>> pat.match('Cześć123')
>>> pat.match('Cześć')
<_sre.SRE_Match object at 0x7fe132ae22a0>
>>> pat.match('გამარჯობა')
<_sre.SRE_Match object at 0x7fe132ae2308>

Kovid Goyal (kovid) wrote :

I never said the only over match was for numbers, just that numbers was one obvious example from eyeballing the regex. You'd have to ask the person that wrote the regex why they chose to do it that way, unfortunately they aren't around any more and this part of the code is largely unmaintained.

Kovid Goyal (kovid) wrote :

Oh and by the way [^\W\d] is not correct. IIRC \w includes numerals ina wide range of scripts, not just arabic numerals, while \d only matches Arabic numerals.

zefciu (zefciu) wrote :

No. If we use \d with the Unicode flag it will correctly recognize numeric characters that are not Arabic numerals

>>> pat = re.compile('^[^\W\d]+$', re.UNICODE)
>>> pat.match('१२३')
>>> pat = re.compile('^[^\W]+$', re.UNICODE)
>>> pat.match('१२३')
<_sre.SRE_Match object at 0x7f16dc7392a0>

Yes, I forgot that \d is also unicode respecting. But, as I said, the
point was not really about numbers. I think the safest change is to
simply add more ranges of alphabets as and when needed.

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

 status fixreleased

Changed in calibre:
status: New → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers