Zim

Automatic link creation and CamelCase don't work with non-latin characters

Bug #518323 reported by Robert Zelnik on 2010-02-07

This bug affects 10 people

Affects	Status	Importance	Assigned to	Milestone
Zim	Fix Committed	Medium	Unassigned

Bug Description

Automatic link creation doesn't work when the link contains accented characters such as "á", "é", "í"...
This affects several ways of creating links: links starting with ":" or "+", CamelCase links...

Examples:

CamelCase <- creates link
CámélCase <- doesn't create link

+link <- creates link
+línk <- doesn't create link

:link <- creates link
:línk <- doesn't create link

ZIM version: 0.43 Linux

Robert Zelnik (rzelnik) on 2010-02-07

description:

updated

Oliver Joos (oliver-joos) wrote on 2010-02-08:

I can confirm this bug for German umlauts: ä ö ü
Link creation by menu item is not affected - it works as expected.

I use zim 0.43 with Ubuntu 9.10 fully updated.

rhk (rhk) wrote on 2010-02-08:

Same problem with zim 0.43 under Debian unstable

Jaap Karssenberg (jaap.karssenberg) wrote on 2010-02-12:

Fixed the cases for page links, see rev201.

For CamelCase the problem is that the Python regex engine does not have a class for Unicode uppercase and lowercase letters. Supposedly they are now taken from the locale, but if you run under an English locale it will not work.

The following command will show you which letters are included under your locale.

$ python -c 'import string; print string.lowercase'

If there is another way in Python to get a list of lowercase and uppercase chars (or to test a char for being uppercase) that is Unicode compatible, I can fix it.

Changed in zim:
status:	New → Confirmed
importance:	Undecided → Medium

Robert Zelnik (rzelnik) wrote on 2010-02-13:

I am not familiar with Python and Unicode, but I know that in Drupal's Pathauto module this case is solved by a manually created list of accented characters (and their replacements without accents). It looks like this:

; global transliteration
[default]
À = "A"
Á = "A"
Â = "A"
Ã = "A"
Ä = "Ae"
Å = "A"
Æ = "A"
Ā = "A"
Ą = "A"
Ă = "A"
Ç = "C"
Ć = "C"
Č = "C"
Ĉ = "C"
Ċ = "C"
... etc.

Jaap Karssenberg (jaap.karssenberg) wrote on 2010-02-16:

@Robert: This is not practical because you would need to maintain a list covering all Unicode scripts, distinguishing upper case from lower case.

Robert Zelnik (rzelnik) wrote on 2010-02-17:

@Jaap: I am not sure I understand you, so please correct me if I am wrong.
I don't think that is a problem: the list of special characters is constant, so we could just copy it from Drupal's code and incorporate it into Zim's code.
Am I right?

Jaap Karssenberg (jaap.karssenberg) wrote on 2010-02-17: Re: [Bug 518323] Re: Automatic link creation doesn't work with accented characters

@Robert: assuming the list is complete, yes, but the Drupal list is not
sufficient since it does not tell us whether letters are capitals or not
(of course we could derive that from the "translation" being
capital or not, but the list was not made with that intent).

I found a more official list here
http://www.unicode.org/Public/5.1.0/ucd/UCD.html which includes
information about letters being capitals or not. Will try to compile
that into a big regular expression. Not sure about the performance
though if we need to check each word with such a self-made regex.
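
As an illustration of the "big regular expression" idea, here is a minimal Python 3 sketch. It is not the code from any Zim revision. It assumes the standard unicodedata module as the source of case information instead of parsing the UCD files by hand, and the names char_class and CAMELCASE_RE are made up for this example. Titlecase letters (category Lt) are ignored.

```python
import re
import sys
import unicodedata


def char_class(category):
    """Build a regex character class for one Unicode general category."""
    parts = []
    start = None
    # One extra iteration past the last code point flushes a trailing range.
    for cp in range(sys.maxunicode + 2):
        inside = cp <= sys.maxunicode and unicodedata.category(chr(cp)) == category
        if inside and start is None:
            start = cp
        elif not inside and start is not None:
            end = cp - 1
            parts.append(chr(start) if start == end else chr(start) + "-" + chr(end))
            start = None
    return "[" + "".join(parts) + "]"


UPPER = char_class("Lu")  # uppercase letters
LOWER = char_class("Ll")  # lowercase letters

# Hypothetical CamelCase pattern: upper(s), lower(s), an upper, then any word chars.
CAMELCASE_RE = re.compile(r"\b" + UPPER + "+" + LOWER + "+" + UPPER + r"\w*")

for word in ("CamelCase", "CámélCase", "ВаняИванов", "Camelcase"):
    print(word, bool(CAMELCASE_RE.search(word)))
```

Building the two classes scans every code point once, so in practice they would be computed a single time and cached. Whether matching with such a large class is fast enough is exactly the open performance question raised above.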

Robert Zelnik (rzelnik) wrote on 2010-02-19: Re: Automatic link creation doesn't work with accented characters

BTW it would be good to investigate how this worked in older versions. I noticed this bug after upgrading to Zim 0.43, but I don't remember exactly which version I had before the upgrade; it was probably 0.28 or 0.29. In that older version automatic link creation also worked well with accented characters.

Jaap Karssenberg (jaap.karssenberg) wrote on 2010-02-19: Re: [Bug 518323] Re: Automatic link creation doesn't work with accented characters

@Robert Zelnik: older versions of Zim were written in Perl. The Perl
regex engine has a special class for upper case letters which is
Unicode aware. Unfortunately this feature is missing in the Python
regex engine, so we need a workaround.

Vladimir Krasikov (9864332-gmail) wrote on 2010-03-02: Re: Automatic link creation doesn't work with accented characters

#10

I have a similar problem with the Russian language. I use Ubuntu 9.10 and Zim 0.44.

Jaap Karssenberg (jaap.karssenberg) on 2010-05-04

Changed in zim:
status:	Confirmed → In Progress

ras (ras82x) wrote on 2010-11-01:

#11

> If there is an other way in python to get a list
> of lowercase and uppercase chars (or test a
> char for being uppercase) that is unicode
> compatible, I can fix it.
There is ponyguruma (http://sandbox.pocoo.org/),
a Python wrapper for the Oniguruma regular expression
engine, which can handle Unicode properties.

Jaap Karssenberg (jaap.karssenberg) on 2011-05-04

summary:

- Automatic link creation doesn't work with accented characters
+ Automatic link creation and CamelCase don't work with non-latin
+ characters

Nick (nick222-yandex) wrote on 2011-05-05:

#12

I can confirm the problems with the Russian language (where ALL characters are non-Latin)!

For comparison:
TomBoy works fine with Russian characters.

P.S.: In TomBoy I can also rename a link and its page after auto-creation, from "CamelCase" to a "normal" word, without breaking any links.

Jiří Janoušek (fenryxo) wrote on 2011-05-31:

#13

I have been doing some experiments, and the Python regex engine seems to support Unicode if unicode arguments and the re.U flag are provided (example 3).

$ python
Python 2.7.1+ (r271:86832, Apr 11 2011, 18:05:24)
>>> import re
>>> print re.search("\w+", "aaaáÁá...").group() #1
aaa
>>> print re.search(u"\w+", u"aaaáÁá...").group() #2
aaa
>>> print re.search(u"\w+", u"aaaáÁá...", re.U).group() #3
aaaáÁá
>>> print re.search("\w+", "aaaáÁá...", re.U).group() #4
aaa
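
For readers on Python 3, the same experiment looks different. The sketch below assumes Python 3, where str patterns are Unicode-aware by default and the re.ASCII flag restores the old ASCII-only meaning of \w. The variable names are only illustrative.

```python
import re

text = "aaaáÁá..."

# Python 3: \w matches Unicode word characters by default for str patterns.
unicode_match = re.search(r"\w+", text).group()

# re.ASCII limits \w to [a-zA-Z0-9_], like examples 1, 2 and 4 above.
ascii_match = re.search(r"\w+", text, re.ASCII).group()

print(unicode_match)
print(ascii_match)
```

Note that this only concerns \w. It does not add a way to tell uppercase from lowercase letters inside a pattern.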

Jaap Karssenberg (jaap.karssenberg) wrote on 2011-05-31: Re: [Bug 518323] Re: Automatic link creation and CamelCase don't work with non-latin characters

#14

2011/5/31 Jiří Janoušek <email address hidden>

> I have been doing some experiments and Python regex engine seems to
> support unicode if unicode arguments and re.U flag are provided (example
> 3).
>

Yes, it does for \w. However, there is no way to match uppercase versus
lowercase (unlike e.g. the Perl regex engine, which supports matching Unicode
classes).

I have recently been thinking that it can work if we use the string methods
to determine which characters are uppercase and which are not, and find
CamelCase that way by looking for a pattern of "upper lower upper" while
searching character by character.
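
A minimal sketch of that character-by-character idea, written for Python 3 (in Python 2 the same methods exist on unicode objects). This is an illustration only, not the code that was later committed to Zim. The function name is_camelcase and the rule that a word must start with an uppercase letter are assumptions made for this example.

```python
def is_camelcase(word):
    """Return True if word follows an "upper ... lower ... upper" pattern.

    Relies on str.isupper() / str.islower(), which are Unicode-aware,
    so no regex character classes are needed.
    """
    if not word or not word[0].isupper():
        return False
    seen_lower = False
    for char in word[1:]:
        if char.islower():
            seen_lower = True
        elif char.isupper() and seen_lower:
            # An uppercase letter after at least one lowercase letter.
            return True
    return False


for word in ("CamelCase", "CámélCase", "ВаняИванов", "Camelcase", "camelCase"):
    print(word, is_camelcase(word))
```

The scan is linear in the length of the word and needs no extra dependency.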

-- Jaap

Jiří Janoušek (fenryxo) wrote on 2011-05-31:

#15

On Tue, May 31, 2011 at 21:59, Jaap Karssenberg
<email address hidden> wrote:
> 2011/5/31 Jiří Janoušek <email address hidden>
>
>> I have been doing some experiments and Python regex engine seems to
>> support unicode if unicode arguments and re.U flag are provided (example
>> 3).
>>
>
> Yes, it does for \w. However, there is no way to match uppercase versus
> lowercase (unlike e.g. the Perl regex engine, which supports matching Unicode
> classes).

I see, I missed the point before.

> I have recently been thinking that it can work if we use the string methods
> to determine which characters are uppercase and which are not, and find
> CamelCase that way by looking for a pattern of "upper lower upper" while
> searching character by character.

There are also alternative regex libraries with support for Unicode
classes [1], but your solution may work well and doesn't require another
dependency (for one small feature).

[1] http://stackoverflow.com/questions/1832893/python-regex-matching-unicode-properties/

Speranskiy (sprnza) wrote on 2013-01-10:

#16

The bug still exists! It's impossible to create Cyrillic links in Zim. Please fix the problem! Windows 7 and Ubuntu 12.04 (from the standard repo).

Jaap Karssenberg (jaap.karssenberg) wrote on 2013-01-10:

#17

On Thu, Jan 10, 2013 at 7:44 AM, Сперанский <email address hidden> wrote:
> The bug still exists! It's impossible to create Cyrillic links in Zim.
> Please fix the problem! Windows 7 and Ubuntu 12.04 (from the standard repo).

Yes, it still exists. I think solutions are outlined in the comments
above, but nobody is working on this at the moment. So if you feel like
looking into it, please go ahead.

Btw. you can create cyrillic links using <Ctrl>L

Regards,

Jaap

1.John@seznam.cz (neozvuck) wrote on 2013-05-06:

#18

Would this work?

>>> unicode.islower(u'ěščřžýáíéúů')
True
>>> unicode.islower(u'ĚŠČŘŽÝÁÍÉÚŮ')
False
>>> unicode.isupper(u'ĚŠČŘŽÝÁÍÉÚŮ')
True
>>> unicode.isupper(u'ěščřžýáíéúů')
False

1.John@seznam.cz (neozvuck) wrote on 2013-05-12:

#19

patched (210.2 KiB, text/x-python)

fix

Jaap Karssenberg (jaap.karssenberg) wrote on 2013-08-18:

#20

Applied the fix in rev674. It may need some testing by users with non-Latin input methods.

Changed in zim:
status:	In Progress → Fix Committed

Alvenhar (alvenhar) wrote on 2014-04-21:

#21

The Python unicode.islower() and unicode.isupper() functions should work fine; I used them successfully in a very similar context in another project. However, it still doesn't work in the currently released version (0.60), so I suppose the fix has not been committed yet? It's been a year :)
I would generally suggest using the unicode class for all text processing as long as the code is still in Python 2 (Python 3 has native Unicode support and makes life with international text so much easier!!)

Jaap Karssenberg (jaap.karssenberg) wrote on 2014-04-29:

#22

Fix is committed in the dev branch. Status will change to "released" when
it is in a released version.

Yes, the latest release is a year old. I feel a bit guilty about that but will
not waste your time with all my excuses. Working on getting all fixes
released this spring.

Regards,

Jaap

Jaap Karssenberg (jaap.karssenberg) wrote on 2014-08-19:

#23

Fixed in release 0.61

Changed in zim:
status:	Fix Committed → Fix Released

CrabMan (cocacooler) wrote on 2015-02-03:

#24

Automatic link creation still does not work for Russian characters in zim 0.62. This bug should be reopened.

Jaap Karssenberg (jaap.karssenberg) wrote on 2016-02-22:

#25

See bug #1417677 for examples

Changed in zim:
status:	Fix Released → Confirmed
tags:	added: 2min

Speranskiy (sprnza) wrote on 2016-06-29:

#26

The same goes for the +CamelCase syntax:
ВаняИванов: the link is created
+ВаняИванов: skipped

In addition:
If I create a link by typing the address manually, the link is created fine:
[[http://www.example.com|ВаняИванов]]
If I copy and paste the address, the link is highlighted in blue, the rest of the text keeps getting highlighted as well while I type the link's label, and I get
%D0%92%D0%B0%D0%BD%D1%8F%D0%98%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2
instead of ВаняИванов
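
For reference, the percent-encoded string above is the URL encoding of the UTF-8 bytes of the label. A small Python 3 sketch using only the standard library shows the round trip. It says nothing about where in Zim the encoding happens, and the variable names are only illustrative.

```python
from urllib.parse import quote, unquote

encoded = "%D0%92%D0%B0%D0%BD%D1%8F%D0%98%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2"

# unquote() decodes percent escapes as UTF-8 by default.
decoded = unquote(encoded)
print(decoded)

# quote() goes the other way and reproduces the escaped form.
print(quote(decoded) == encoded)
```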