calibre

Unwrapping fails on non-Latin scripts

Bug #1753533 reported by zefciu on 2018-03-05

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
calibre	Fix Released	Undecided	Unassigned

Bug Description

Calibre 2.55
Ubuntu

Attached is a PDF file in Georgian. Steps to reproduce the problem:
* Import the PDF file.
* Open the conversion dialog.
* Select conversion to Mobi.
* Enable the heuristic conversions.

Effect:
We get a Mobi file with next to no unwrapping.

I found the line in the calibre code where the alphanumeric regex is defined, in calibre/ebooks/conversion/utils.py:

lookahead = "(?<=.{"+unicode(length)+u"}([a-zäëïöüàèìòùáćéíĺóŕńśúýâêîôûçąężıãõñæøþðßěľščťžňďřů,:)\IA\u00DF]|(?<!\&\w{4});))" # (?<!\&\w{4});) is a semicolon not part of an entity

When I put ა-ჰ next to the other alphabetic characters in this regex, the result was satisfactory. So this seems to be the culprit, and unwrapping would probably fail with other non-Latin scripts as well. I wonder what the reason is for hardcoding what counts as alphabetic instead of relying on Unicode data. Couldn't we just use \w in Unicode mode?
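To illustrate the observation above, here is a minimal sketch in Python 3. The two patterns are abbreviated stand-ins for the character class, not the full calibre lookahead, and the helper name `ends_like_wrapped_line` is made up for this example.

```python
import re

# Simplified stand-ins for the end-of-line character class; this is NOT the
# full calibre lookahead, only an abbreviated version of its letter list.
LATIN_ONLY = re.compile(r"[a-zäëïöüàèìòùáćéíóńśúýçąężßěščžů,:)]$")
WITH_GEORGIAN = re.compile(r"[a-zäëïöüàèìòùáćéíóńśúýçąężßěščžůა-ჰ,:)]$")


def ends_like_wrapped_line(pattern, line):
    """Return True if the line's last character is in the pattern's class."""
    return pattern.search(line) is not None


print(ends_like_wrapped_line(LATIN_ONLY, "Cześć"))         # True
print(ends_like_wrapped_line(LATIN_ONLY, "გამარჯობა"))      # False
print(ends_like_wrapped_line(WITH_GEORGIAN, "გამარჯობა"))   # True
```

With the Latin-only class, a line ending in a Georgian letter never looks like a line that stops mid-sentence, which would be consistent with the missing unwrapping described above.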

zefciu (zefciu) wrote on 2018-03-05:

#1

gzaze_erti.pdf (3.6 MiB, application/pdf)

zefciu (zefciu) wrote on 2018-03-05:

#2

gzaze_erti_good.mobi (1.4 MiB, application/octet-stream)

Satisfactory result after adding ა-ჰ to the regex

zefciu (zefciu) wrote on 2018-03-05:

#3

gzaze_erti_bad.mobi (1.5 MiB, application/octet-stream)

Almost no unwrapping with current code

Kovid Goyal (kovid) wrote on 2018-03-05:

#4

Not my code, but if I had to guess, it would be because \w overmatches, leading to false positives. From eyeballing that regex, it does not include digits, for example, which \w would match.

zefciu (zefciu) wrote on 2018-03-06:

#5

In that case, there's [^\W\d]

>>> pat = re.compile('^[^\W\d]+$')
>>> pat.match('Cześć123')
>>> pat.match('Cześć')
<_sre.SRE_Match object at 0x7fe132ae22a0>
>>> pat.match('გამარჯობა')
<_sre.SRE_Match object at 0x7fe132ae2308>
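
One detail the session above does not exercise: in Python's re module, \w also matches the underscore, so [^\W\d] still accepts "_". A minimal Python 3 sketch of a letters-only check, using the common [^\W\d_] variant (the helper name `is_letters_only` is made up for this example, and this is not what calibre uses):

```python
import re

# [^\W\d_] reads as "a word character that is neither a digit nor '_'".
# In Python 3, str patterns use Unicode matching by default.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def is_letters_only(text):
    """Return True if every character in text is a Unicode letter."""
    return LETTERS_ONLY.fullmatch(text) is not None


print(is_letters_only("Cześć"))       # True
print(is_letters_only("გამარჯობა"))    # True
print(is_letters_only("Cześć123"))    # False
print(is_letters_only("snake_case"))  # False

# For comparison, the class without the underscore exclusion accepts "_".
print(re.fullmatch(r"[^\W\d]+", "snake_case") is not None)  # True
```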

Kovid Goyal (kovid) wrote on 2018-03-06:

#6

I never said the only overmatch was for numbers, just that numbers were one obvious example from eyeballing the regex. You'd have to ask the person who wrote the regex why they chose to do it that way; unfortunately, they aren't around any more and this part of the code is largely unmaintained.

Kovid Goyal (kovid) wrote on 2018-03-06:

#7

Oh, and by the way, [^\W\d] is not correct. IIRC \w includes numerals in a wide range of scripts, not just Arabic numerals, while \d only matches Arabic numerals.

zefciu (zefciu) wrote on 2018-03-06:

#8

No. If we use \d with the Unicode flag, it will correctly recognize numeric characters that are not Arabic numerals:

>>> pat = re.compile('^[^\W\d]+$', re.UNICODE)
>>> pat.match('१२३')
>>> pat = re.compile('^[^\W]+$', re.UNICODE)
>>> pat.match('१२३')
<_sre.SRE_Match object at 0x7f16dc7392a0>

Kovid Goyal (kovid) wrote on 2018-03-06: Re: calibre bug 1753533

#9

Yes, I forgot that \d is also Unicode-aware. But, as I said, the
point was not really about numbers. I think the safest change is to
simply add more ranges of alphabets as and when needed.
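
A rough sketch of what "adding ranges as and when needed" could look like, kept as a small table so new scripts are easy to append. This is an illustration only and does not claim to show the actual fix; the names `EXTRA_ALPHABET_RANGES` and `build_line_end_class` are hypothetical, the base class is abbreviated, and only the Georgian range comes from this report.

```python
import re

# Hypothetical table of extra alphabet ranges. Only the Georgian entry is
# taken from this report; Greek and Cyrillic are illustrative additions.
EXTRA_ALPHABET_RANGES = {
    "Georgian": "\u10d0-\u10f0",  # ა-ჰ
    "Greek": "\u03b1-\u03c9",     # α-ω
    "Cyrillic": "\u0430-\u044f",  # а-я
}


def build_line_end_class(base="a-z,:)"):
    """Build a character class from an abbreviated base plus the extra ranges."""
    return "[" + base + "".join(EXTRA_ALPHABET_RANGES.values()) + "]"


# Matches when the last character of a line belongs to the class.
LINE_END = re.compile(build_line_end_class() + "$")

print(LINE_END.search("გამარჯობა") is not None)  # True
print(LINE_END.search("привет") is not None)     # True
print(LINE_END.search("123") is not None)        # False
```

Unlike \w, an explicit table keeps digits and other characters out unless someone adds them deliberately, at the cost of having to extend it for each new script.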

Kovid Goyal (kovid) wrote on 2018-03-08: Fixed in master

#10

Fixed in branch master. The fix will be in the next release. calibre is usually released every alternate Friday.

status fixreleased

Changed in calibre:
status:	New → Fix Released
