Automatic link creation and CamelCase don't work with non-latin characters

Reported by Robert Zelnik on 2010-02-07
This bug affects 9 people
Affects: Zim
Importance: Medium
Assigned to: Unassigned

Bug Description

Automatic link creation doesn't work when the link contains accented characters like "á", "é", "í"...
This affects several ways of creating links: links starting with ":" or "+", CamelCase links...

Examples:

CamelCase <- creates link
CámélCase <- doesn't create link

+link <- creates link
+línk <- doesn't create link

:link <- creates link
:línk <- doesn't create link

ZIM version: 0.43 Linux

Robert Zelnik (rzelnik) on 2010-02-07
description: updated
Oliver Joos (oliver-joos) wrote :

I can confirm this bug for german "Umlauts": ä ö ü
Link creation by menu item is not affected - it works as expected.

I use zim 0.43 with Ubuntu 9.10 fully updated.

Raphaël HUCK (raphael-huck) wrote :

Same problem with zim 0.43 under Debian unstable

Fixed the cases for page links, see rev201.

For CamelCase the problem is that the Python regex engine has no character class for Unicode uppercase and lowercase letters. Supposedly these letters are now taken from the locale, but if you run under an English locale that will not work.

The following command will show you which letters are included under your locale.

   $ python -c 'import string; print string.lowercase'

If there is another way in Python to get a list of lowercase and uppercase characters (or to test whether a character is uppercase) that is Unicode compatible, I can fix it.
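One possible answer to the question above, as a minimal sketch: the standard `unicodedata` module reports the Unicode general category of a character independently of the locale, where `Lu` means uppercase letter and `Ll` means lowercase letter. The code is written for Python 3, the helper names are hypothetical, and it is not taken from zim.

```python
import unicodedata


def is_upper_letter(ch):
    """True if ch is a Unicode uppercase letter (general category Lu)."""
    return unicodedata.category(ch) == "Lu"


def is_lower_letter(ch):
    """True if ch is a Unicode lowercase letter (general category Ll)."""
    return unicodedata.category(ch) == "Ll"


# Print the category and both test results for a few sample characters.
for ch in "AáÁжЖ1":
    print(ch, unicodedata.category(ch), is_upper_letter(ch), is_lower_letter(ch))
```

Because the category comes from the Unicode database bundled with Python, the result should not depend on the locale zim is started under.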

Changed in zim:
status: New → Confirmed
importance: Undecided → Medium
Robert Zelnik (rzelnik) wrote :

I am not familiar with Python and Unicode, but I know that Drupal's Pathauto module solves this case with a manually created list of accented characters (and their replacements without accents). It looks like this:

; global transliteration
[default]
À = "A"
Á = "A"
Â = "A"
Ã = "A"
Ä = "Ae"
Å = "A"
Æ = "A"
Ā = "A"
Ą = "A"
Ă = "A"
Ç = "C"
Ć = "C"
Č = "C"
Ĉ = "C"
Ċ = "C"
... etc.

@Robert: This is not practical because you would need to maintain a list of all unicode scripts, distinguishing upper case versus lower case.

Robert Zelnik (rzelnik) wrote :

@Jaap: I am not sure if I understand you - if not, please correct me.
I don't think so, because the list of special characters is constant, so we can just copy it from Drupal's code and incorporate it into Zim's code.
Am I right?

@Robert: assuming the list is complete yes, but the drupal list is not
sufficient since it does not tell us if letters are capitals or not
(of course we could derive that again from the "translation" being
capital or not but the list was not made with that intent).

I found a more official list here
http://www.unicode.org/Public/5.1.0/ucd/UCD.html which includes
information about letters being capitals or not. Will try to compile
that into a big regular expression. Not sure about the performance
though if we need to check each word with such a self-made regex.
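The idea of compiling the Unicode data into one large regular expression could look roughly like this. It is a sketch for Python 3, not the code that went into zim: it takes the categories from the standard `unicodedata` module instead of parsing the UCD files, and `char_class`, `UPPER`, `LOWER` and `CAMELCASE` are hypothetical names. The "upper, lower, upper" shape of the pattern is an assumption about what should count as CamelCase.

```python
import re
import sys
import unicodedata


def char_class(category):
    """Return the body of a regex character class that covers every code
    point whose Unicode general category equals `category`.
    Consecutive code points are merged into ranges to keep it short."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        if unicodedata.category(chr(cp)) == category:
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))

    parts = []
    for lo, hi in ranges:
        # \UXXXXXXXX escapes avoid any clash with regex metacharacters.
        if lo == hi:
            parts.append("\\U%08x" % lo)
        else:
            parts.append("\\U%08x-\\U%08x" % (lo, hi))
    return "".join(parts)


UPPER = char_class("Lu")
LOWER = char_class("Ll")

# Assumed CamelCase shape: uppercase, then lowercase, then uppercase again.
CAMELCASE = re.compile(r"\b[%s]+[%s]+[%s]+\w*\b" % (UPPER, LOWER, UPPER))

for text in ("see CámélCase here", "ВикиСтраница", "Camelcase"):
    match = CAMELCASE.search(text)
    print(text, "->", match.group() if match else None)
```

Building the two classes walks over every code point once, so it would make sense to do it a single time at import. The cost of matching each word against the resulting pattern is not measured here.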

BTW it would be good to investigate how this worked in older versions. I noticed this bug after upgrading to Zim 0.43, but I don't remember exactly which version I had before the upgrade; it was probably 0.28 or 0.29. In that older version automatic link creation also worked well with accented characters.

@Robert Zelnik: older versions of zim were written in Perl. The Perl
regex engine has a special class for uppercase letters which is
Unicode aware. Unfortunately this feature is missing in the Python
regex engine, so we need a workaround.

I have a similar problem with the Russian language. I use Ubuntu 9.10 and Zim 0.44.

Changed in zim:
status: Confirmed → In Progress
ras (ras82x) wrote :

> If there is an other way in python to get a list
> of lowercase and uppercase chars (or test a
> char for being uppercase) that is unicode
> compatible, I can fix it.
There is ponyguruma (http://sandbox.pocoo.org/),
python wrapper to the oniguruma regular expression
engine, that can handle unicode properties.

summary: - Automatic link creation doesn't work with accented characters
+ Automatic link creation and CamelCase don't work with non-latin
+ characters
Nick (nick222-yandex) wrote :

I confirm the problem with the Russian language (where ALL symbols are non-Latin)!

For comparison, TomBoy works well with Russian symbols.

P.S.: In TomBoy I can also rename a link and its page after auto-creation, from "CamelCase" to a "normal" word, without breaking any links.

Jiří Janoušek (fenryxo) wrote :

I have been doing some experiments, and the Python regex engine seems to support Unicode if both arguments are unicode strings and the re.U flag is given (example 3).

$ python
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
>>> import re
>>> print re.search("\w+", "aaaáÁá...").group() #1
aaa
>>> print re.search(u"\w+", u"aaaáÁá...").group() #2
aaa
>>> print re.search(u"\w+", u"aaaáÁá...", re.U).group() #3
aaaáÁá
>>> print re.search("\w+", "aaaáÁá...", re.U).group() #4
aaa
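For readers on Python 3, a rough equivalent of the session above. This is only a sketch and not part of the original experiment: in Python 3 every `str` is Unicode and `\w` is Unicode-aware by default, so the `re.U` flag is no longer needed, while `re.ASCII` restores the narrow behaviour of examples 1, 2 and 4.

```python
import re

text = "aaaáÁá..."

# Default for str patterns in Python 3: \w matches Unicode word characters.
unicode_match = re.search(r"\w+", text).group()

# re.ASCII limits \w to [a-zA-Z0-9_].
ascii_match = re.search(r"\w+", text, re.ASCII).group()

print(unicode_match)
print(ascii_match)
```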

2011/5/31 Jiří Janoušek <email address hidden>

> I have been doing some experiments and Python regex engine seems to
> support unicode if unicode arguments and re.U flag are provided (example
> 3).
>

Yes it does for \w, however there is no way to match uppercase versus lower
case (unlike e.g. the perl regex engine which supports matching unicode
classes).

I have recently been thinking that it can work if we use the string methods
to determine which characters are uppercase and which are not, and find
CamelCase that way by looking for a pattern of "upper lower upper" while
scanning character by character.

-- Jaap
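The character-by-character approach described above could be sketched as follows. This is an illustration under assumptions and not the fix that was later committed: it is written for Python 3, where `str.isupper()` and `str.islower()` are Unicode-aware, and `is_camelcase` and `find_camelcase` are hypothetical names.

```python
def is_camelcase(word):
    """Return True if word contains an 'upper, lower, upper' run of letters.

    States: 0 = waiting for an uppercase letter,
            1 = uppercase seen, waiting for a lowercase letter,
            2 = uppercase and lowercase seen, waiting for uppercase again.
    Any character that is neither upper nor lower resets the scan.
    """
    state = 0
    for ch in word:
        if state == 0:
            if ch.isupper():
                state = 1
        elif state == 1:
            if ch.islower():
                state = 2
            elif not ch.isupper():
                state = 0
        else:  # state == 2
            if ch.isupper():
                return True
            if not ch.islower():
                state = 0
    return False


def find_camelcase(text):
    """Return the whitespace-separated words of text that look like CamelCase."""
    return [word for word in text.split() if is_camelcase(word)]


print(find_camelcase("CamelCase CámélCase ВикиСтраница Camelcase línk"))
```

The scan needs no extra dependency, which matches the concern raised in this thread about adding a regex library for one small feature.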

Jiří Janoušek (fenryxo) wrote :

On Tue, May 31, 2011 at 21:59, Jaap Karssenberg
<email address hidden> wrote:
> 2011/5/31 Jiří Janoušek <email address hidden>
>
>> I have been doing some experiments and Python regex engine seems to
>> support unicode if unicode arguments and re.U flag are provided (example
>> 3).
>>
>
> Yes it does for \w, however there is no way to match uppercase versus lower
> case (unlike e.g. the perl regex engine which supports matching unicode
> classes).

I see, I missed the point before.

> I have recently been thinking that it can work if we use the string methods
> to determine which characters are uppercase and which are not and find
> camelcase that way looking for an pattern of "upper lower upper" by
> searching character by character.

There are also alternative regex libraries with support for Unicode
classes [1], but your solution may work well and doesn't require another
dependency (for one small feature).

[1] http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties/

Speranskiy (sprnza) wrote :

The bug still exists! It's impossible to create Cyrillic links in Zim. Please fix the problem! Windows 7 and Ubuntu 12.04 (from the standard repo).

On Thu, Jan 10, 2013 at 7:44 AM, Сперанский <email address hidden> wrote:
> Bug still exist! It's impossible to create cyrillic links in Zim.
> Please fix the problem! Windows 7 and Ubuntu 12.04 (from standart repo).

Yes, it still exists. I think solutions are outlined in the comments
above, but nobody is working on this at the moment. So if you feel like
looking into it, please go ahead.

Btw. you can create cyrillic links using <Ctrl>L

Regards,

Jaap

1.John@seznam.cz (neozvuck) wrote :

would this work?

>>> unicode.islower(u'ěščřžýáíéúů')
True
>>> unicode.islower(u'ĚŠČŘŽÝÁÍÉÚŮ')
False
>>> unicode.isupper(u'ĚŠČŘŽÝÁÍÉÚŮ')
True
>>> unicode.isupper(u'ěščřžýáíéúů')
False
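A Python 3 version of the check above, plus a few edge cases that matter when these methods are applied to whole words rather than single characters. This is a hedged sketch, not part of the original comment: in Python 3 the methods live on `str`, and characters without case (digits, punctuation) are ignored by both tests as long as the string contains at least one cased letter.

```python
lower = "ěščřžýáíéúů"
upper = "ĚŠČŘŽÝÁÍÉÚŮ"

print(lower.islower(), lower.isupper())
print(upper.isupper(), upper.islower())

# Uncased characters are skipped when a cased letter is present ...
print("Á1".isupper())
# ... but a string with no cased letter at all is neither upper nor lower.
print("123".isupper(), "123".islower())
# A mixed-case word is neither, so CamelCase detection has to look at
# individual characters instead of the whole word.
print("Áb".isupper(), "Áb".islower())
```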

1.John@seznam.cz (neozvuck) wrote :

fix

Applied fix in rev674. It may need some testing by users with non-Latin input methods.

Changed in zim:
status: In Progress → Fix Committed