Unwrapping fails on non-Latin scripts
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
calibre | Fix Released | Undecided | Unassigned |
Bug Description
Calibre 2.55
Ubuntu
A PDF file in Georgian is attached. Steps to reproduce the problem:
* Import the PDF file.
* Open conversion dialog.
* Select conversion to Mobi.
* Enable the heuristic conversions.
Effect:
We get a Mobi file with almost no unwrapping.
Now I found a line in calibre code where the alphanumeric regex is defined. In calibre/
lookahead = "(?<=.{
When I put ა-ჰ next to the other alphabetic characters in this regex, the result was satisfactory. So this seems to be the culprit, and unwrapping would probably fail with other non-Latin scripts as well. I wonder what the reason is for hardcoding which characters are alphabetic instead of relying on Unicode data. Couldn't we just use \w in Unicode mode?
Satisfactory result after adding ა-ჰ to the regex
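
To illustrate the question about \w, here is a minimal sketch in Python 3. It is not calibre's actual code: the `unwrap` function, the three character classes, and the sample strings are all made up for illustration, and the real heuristic applies further conditions beyond the adjacent letters. It only shows how a hardcoded Latin class skips Georgian text, while a Unicode-aware class (here `[^\W\d_]`, i.e. \w minus digits and underscore) handles both.

```python
import re


def unwrap(text: str, letter: str) -> str:
    """Replace a newline with a space when it sits between two letters.

    `letter` is a regex character class describing what counts as a letter.
    Hypothetical helper; a real unwrapping heuristic is more involved.
    """
    pattern = re.compile(r"(?<=%s)\n(?=%s)" % (letter, letter), re.UNICODE)
    return pattern.sub(" ", text)


# Hypothetical stand-in for a hardcoded, Latin-only class.
LATIN_ONLY = "[a-z]"
# The workaround from the report: add the Georgian range by hand.
WITH_GEORGIAN = "[a-zა-ჰ]"
# Rely on Unicode data instead: any word character except digits and "_".
ANY_LETTER = r"[^\W\d_]"

english = "this is\na test"
georgian = "ეს არის\nტესტი"

print(repr(unwrap(english, LATIN_ONLY)))     # line break joined
print(repr(unwrap(georgian, LATIN_ONLY)))    # line break left in place
print(repr(unwrap(georgian, WITH_GEORGIAN)))  # joined after the manual fix
print(repr(unwrap(georgian, ANY_LETTER)))    # joined with no script-specific range
```

In Python 3, str patterns are matched in Unicode mode by default, so the `re.UNICODE` flag is redundant there; it is spelled out because Python 2 requires it for \w to match non-ASCII letters.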