Zim

Automatic link creation and CamelCase don't work with non-latin characters

Bug #518323 reported by Robert Zelnik
This bug affects 10 people
Affects: Zim
Status: Fix Committed
Importance: Medium
Assigned to: Unassigned

Bug Description

Automatic link creation doesn't work while using accented characters like "á", "é", "í"... inside the link.
This affects many ways of link creation like links starting with ":", "+", CamelCase links...

Examples:

CamelCase <- creates link
CámélCase <- doesn't create link

+link <- creates link
+línk <- doesn't create link

:link <- creates link
:línk <- doesn't create link

ZIM version: 0.43 Linux

Tags: 2min
Robert Zelnik (rzelnik)
description: updated
Oliver Joos (oliver-joos) wrote :

I can confirm this bug for German umlauts: ä ö ü
Link creation by menu item is not affected - it works as expected.

I use zim 0.43 with Ubuntu 9.10 fully updated.

rhk (rhk) wrote :

Same problem with zim 0.43 under Debian unstable

Jaap Karssenberg (jaap.karssenberg) wrote :

Fixed the cases for page links, see rev201.

For CamelCase the problem is that the Python regex engine has no character class for Unicode uppercase and lowercase letters. Supposedly these are now taken from the locale, so if you run under an English locale it will not work.

The following command will show you which letters are included under your locale.

   $ python -c 'import string; print string.lowercase'

If there is another way in Python to get a list of lowercase and uppercase characters (or to test whether a character is uppercase) that is Unicode compatible, I can fix it.
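
One Unicode-aware option, sketched here under the assumption of Python 3 (where every str is Unicode), is the standard unicodedata module, which also exists in Python 2 for unicode objects. It reports the general category of a character: "Lu" for uppercase letters and "Ll" for lowercase letters. The helper names is_upper and is_lower are hypothetical and not part of Zim.

```python
import unicodedata


def is_upper(ch):
    """Return True if ch is a Unicode uppercase letter (category 'Lu')."""
    return unicodedata.category(ch) == "Lu"


def is_lower(ch):
    """Return True if ch is a Unicode lowercase letter (category 'Ll')."""
    return unicodedata.category(ch) == "Ll"


# Accented Latin and Cyrillic letters are classified without any locale setup.
print(is_upper("Á"), is_lower("á"), is_upper("Ж"), is_lower("ж"))
```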

Changed in zim:
status: New → Confirmed
importance: Undecided → Medium
Robert Zelnik (rzelnik) wrote :

I am not familiar with Python and Unicode, but I know that Drupal's Pathauto module solves this case with a manually created list of accented characters (and their unaccented replacements). It looks like this:

; global transliteration
[default]
À = "A"
Á = "A"
Â = "A"
Ã = "A"
Ä = "Ae"
Å = "A"
Æ = "A"
Ā = "A"
Ą = "A"
Ă = "A"
Ç = "C"
Ć = "C"
Č = "C"
Ĉ = "C"
Ċ = "C"
... etc.

Jaap Karssenberg (jaap.karssenberg) wrote :

@Robert: This is not practical because you would need to maintain a list of all unicode scripts, distinguishing upper case versus lower case.

Robert Zelnik (rzelnik) wrote :

@Jaap: I am not sure if I understand you - if not, please correct me.
I don't think so, because the list of special characters is constant, so we can just copy it from Drupal's code and incorporate it into Zim's code.
Am I right?

Jaap Karssenberg (jaap.karssenberg) wrote : Re: [Bug 518323] Re: Automatic link creation doesn't work with accented characters

@Robert: assuming the list is complete yes, but the drupal list is not
sufficient since it does not tell us if letters are capitals or not
(of course we could derive that again from the "translation" being
capital or not but the list was not made with that intent).

I found a more official list here
http://www.unicode.org/Public/5.1.0/ucd/UCD.html which includes
information about letters being capitals or not. Will try to compile
that into a big regular expression. Not sure about the performance
though if we need to check each word with such a self-made regex.
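
A minimal sketch of that idea, assuming Python 3 and using the bundled unicodedata tables instead of parsing the UCD files directly. The char_class helper, the CAMEL pattern, and the definition of CamelCase used here (one uppercase letter, one or more lowercase letters, another uppercase letter, then any word characters) are assumptions for illustration, not Zim's actual rules.

```python
import re
import sys
import unicodedata


def char_class(category):
    """Build a regex character class with every code point in a category."""
    chars = [
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)) == category
    ]
    # Letters never include regex metacharacters, so no escaping is needed.
    return "[" + "".join(chars) + "]"


UPPER = char_class("Lu")  # all uppercase letters
LOWER = char_class("Ll")  # all lowercase letters

# Assumed CamelCase shape: upper, lower+, upper, then any word characters.
CAMEL = re.compile(f"{UPPER}{LOWER}+{UPPER}\\w*")

for word in ["CamelCase", "CámélCase", "ВаняИванов", "Camelcase"]:
    print(word, bool(CAMEL.fullmatch(word)))
```

Building the classes walks every code point once, so it is probably best done once at start-up; the matching cost of such large classes would still need to be measured.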

Robert Zelnik (rzelnik) wrote : Re: Automatic link creation doesn't work with accented characters

BTW it would be good to investigate how this worked in older versions. I noticed this bug after upgrading to Zim 0.43, but I don't remember exactly which version I had before the upgrade; it was probably 0.28 or 0.29. In that older version, automatic link creation also worked well with accented characters.

Jaap Karssenberg (jaap.karssenberg) wrote : Re: [Bug 518323] Re: Automatic link creation doesn't work with accented characters

@Robert Zelnik: older versions of zim were written in perl. The perl
regex engine has a special class for upper case letters which is
unicode aware. Unfortunately this feature is missing in the python
regex engine, so we need a workaround.

Vladimir Krasikov (9864332-gmail) wrote : Re: Automatic link creation doesn't work with accented characters

I have similar problems with the Russian language. I use Ubuntu 9.10 and Zim 0.44.

Changed in zim:
status: Confirmed → In Progress
ras (ras82x) wrote :

> If there is an other way in python to get a list
> of lowercase and uppercase chars (or test a
> char for being uppercase) that is unicode
> compatible, I can fix it.
There is ponyguruma (http://sandbox.pocoo.org/),
python wrapper to the oniguruma regular expression
engine, that can handle unicode properties.

summary: - Automatic link creation doesn't work with accented characters
+ Automatic link creation and CamelCase don't work with non-latin
+ characters
Nick (nick222-yandex) wrote :

I can confirm the problems with the Russian language (where ALL symbols are non-Latin)!

For example:
TomBoy works well with Russian symbols.

P.S.: In TomBoy I can also rename a link and its page after auto-creation, from "CamelCase" to a "normal" word (without breaking any links).

Jiří Janoušek (fenryxo) wrote :

I have been doing some experiments, and the Python regex engine seems to support Unicode if unicode arguments and the re.U flag are provided (example 3).

$ python
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
>>> import re
>>> print re.search("\w+", "aaaáÁá...").group() #1
aaa
>>> print re.search(u"\w+", u"aaaáÁá...").group() #2
aaa
>>> print re.search(u"\w+", u"aaaáÁá...", re.U).group() #3
aaaáÁá
>>> print re.search("\w+", "aaaáÁá...", re.U).group() #4
aaa
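
For comparison, a sketch of the same experiment under Python 3 (an assumption; the session above is Python 2.7). There, str patterns match Unicode by default, and the re.ASCII flag restores the ASCII-only behaviour.

```python
import re

text = "aaaáÁá..."

# Python 3: \w matches Unicode word characters by default for str patterns.
unicode_match = re.search(r"\w+", text).group()

# re.ASCII limits \w to [a-zA-Z0-9_], like examples 1, 2 and 4 above.
ascii_match = re.search(r"\w+", text, re.ASCII).group()

print(unicode_match)  # aaaáÁá
print(ascii_match)    # aaa
```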

Jaap Karssenberg (jaap.karssenberg) wrote : Re: [Bug 518323] Re: Automatic link creation and CamelCase don't work with non-latin characters

2011/5/31 Jiří Janoušek <email address hidden>

> I have been doing some experiments and Python regex engine seems to
> support unicode if unicode arguments and re.U flag are provided (example
> 3).
>

Yes it does for \w, however there is no way to match uppercase versus lower
case (unlike e.g. the perl regex engine which supports matching unicode
classes).

I have recently been thinking that it can work if we use the string methods
to determine which characters are uppercase and which are not and find
camelcase that way looking for a pattern of "upper lower upper" by
searching character by character.
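
A minimal sketch of that character-by-character search, assuming Python 3 string methods. The function name is_camelcase and the exact rule (letters only, a leading uppercase letter, and a later uppercase letter that follows at least one lowercase letter) are hypothetical; Zim's real rule may differ.

```python
def is_camelcase(word):
    """Return True if word shows an 'upper lower upper' pattern.

    Assumed rule for illustration: the word consists of letters only,
    starts with an uppercase letter, and contains another uppercase
    letter after at least one lowercase letter.
    """
    if not word or not word.isalpha() or not word[0].isupper():
        return False
    seen_lower = False
    for ch in word[1:]:
        if ch.islower():
            seen_lower = True
        elif ch.isupper() and seen_lower:
            return True
    return False


for word in ["CamelCase", "CámélCase", "ВаняИванов", "Camelcase", "camelCase"]:
    print(word, is_camelcase(word))
```

Because str.isupper() and str.islower() consult the Unicode database, this needs no locale setup and no extra dependency.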

-- Jaap

Jiří Janoušek (fenryxo) wrote :

On Tue, May 31, 2011 at 21:59, Jaap Karssenberg
<email address hidden> wrote:
> 2011/5/31 Jiří Janoušek <email address hidden>
>
>> I have been doing some experiments and Python regex engine seems to
>> support unicode if unicode arguments and re.U flag are provided (example
>> 3).
>>
>
> Yes it does for \w, however there is no way to match uppercase versus lower
> case (unlike e.g. the perl regex engine which supports matching unicode
> classes).

I see, I missed the point before.

> I have recently been thinking that it can work if we use the string methods
> to determine which characters are uppercase and which are not and find
> camelcase that way looking for an pattern of "upper lower upper" by
> searching character by character.

There are also alternative regex libraries with unicode classes
support [1], but your solution may work well and doesn't require another
dependency (for one small feature).

[1] http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties/


Speranskiy (sprnza) wrote :

The bug still exists! It's impossible to create Cyrillic links in Zim. Please fix the problem! Windows 7 and Ubuntu 12.04 (from the standard repo).

Jaap Karssenberg (jaap.karssenberg) wrote :

On Thu, Jan 10, 2013 at 7:44 AM, Сперанский <email address hidden> wrote:
> Bug still exist! It's impossible to create cyrillic links in Zim.
> Please fix the problem! Windows 7 and Ubuntu 12.04 (from standart repo).

Yes, it still exists. I think solutions are outlined in the comments
above, but nobody is working on this at the moment. So if you feel like
looking into it, please go ahead.

Btw. you can create cyrillic links using <Ctrl>L

Regards,

Jaap

1.John@seznam.cz (neozvuck) wrote :

would this work?

>>> unicode.islower(u'ěščřžýáíéúů')
True
>>> unicode.islower(u'ĚŠČŘŽÝÁÍÉÚŮ')
False
>>> unicode.isupper(u'ĚŠČŘŽÝÁÍÉÚŮ')
True
>>> unicode.isupper(u'ěščřžýáíéúů')
False
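
The same checks in Python 3, as a sketch (an assumption; the session above uses Python 2's unicode type). It also shows one caveat: characters from scripts without case, such as CJK, are neither upper nor lower, so a case-based CamelCase test cannot trigger for them.

```python
lower = "ěščřžýáíéúů"
upper = "ĚŠČŘŽÝÁÍÉÚŮ"
cjk = "日本語"  # a script without upper/lower case

# In Python 3 every str is Unicode, so the methods live on str itself.
print(lower.islower(), lower.isupper())  # True False
print(upper.isupper(), upper.islower())  # True False

# With no cased characters at all, both methods return False.
print(cjk.isupper(), cjk.islower())      # False False
```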

1.John@seznam.cz (neozvuck) wrote :

fix

Jaap Karssenberg (jaap.karssenberg) wrote :

Applied fix in rev674. May need some testing from users with non-Latin input methods.

Changed in zim:
status: In Progress → Fix Committed
Alvenhar (alvenhar) wrote :

The Python unicode.islower() and unicode.isupper() functions should work fine, I used them successfully in a very similar context in another project. However, it still doesn't work in the currently released version (0.60), so I suppose the fix has not been committed yet? It's been a year :)
I would generally suggest using the unicode class for all text processing as long as the code is still in Python 2 (Python 3 has native Unicode support and makes life with international text so much easier!!)

Jaap Karssenberg (jaap.karssenberg) wrote :

Fix is committed in the dev branch. Status will change to "released" when
it is in a released version.

Yes, the latest release is a year old. I feel a bit guilty about that but will
not waste your time with all my excuses. Working on getting all fixes
released this spring.

Regards,

Jaap

On Mon, Apr 21, 2014 at 1:45 PM, Alvenhar <email address hidden> wrote:

> The Python unicode.islower() and unicode.isupper() functions should work
> fine, I used them successfully in a very similar context in another
> project. However, it still doesn't work in the currently released version
> (0.60), so I suppose the fix has not been committed yet? It's been a year :)
> I would generally suggest to use the unicode class for all text processing
> as long as the code is still in Python 2 (Python 3 has native unicode
> support and makes life with international text so much easier!!)

Jaap Karssenberg (jaap.karssenberg) wrote :

Fixed in release 0.61

Changed in zim:
status: Fix Committed → Fix Released
CrabMan (cocacooler) wrote :

Automatic link creation still does not work for Russian characters in zim 0.62. This bug should be reopened.

Jaap Karssenberg (jaap.karssenberg) wrote :

See bug #1417677 for examples

Changed in zim:
status: Fix Released → Confirmed
tags: added: 2min
Speranskiy (sprnza) wrote :

The same goes for the +CamelCase syntax:
ВаняИванов: the link is created
+ВаняИванов: skipped

Plus:
If I type the address manually, the link is created correctly:
[[http://www.example.com|ВаняИванов]]
If I copy and paste the address, the link is highlighted in blue, but the highlighting continues over the rest of the text while I type the link's label, and I get
%D0%92%D0%B0%D0%BD%D1%8F%D0%98%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2
instead of ВаняИванов
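
The string above is the percent-encoded UTF-8 form of the label. A small sketch with Python 3's urllib.parse (an assumption; it only shows the relationship between the two strings, not how Zim handles pasted text):

```python
from urllib.parse import quote, unquote

encoded = "%D0%92%D0%B0%D0%BD%D1%8F%D0%98%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2"

# unquote() decodes the %XX escapes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)  # ВаняИванов

# quote() goes the other way and reproduces the escaped form.
print(quote(decoded) == encoded)  # True
```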

Jaap Karssenberg (jaap.karssenberg) wrote :

Fix in 7e58f2a0784b1439b84f8523e180ac8029738142

Changed in zim:
status: Confirmed → Fix Committed