When we try to use Chinese characters in a tag, we get an error:

Bug #724819 reported by Edvard
Affects  Status   Importance  Assigned to  Milestone
KARL3    Invalid  Low         Unassigned

Bug Description

The tag is as follows (UTF-8):

國語

Error is as follows:

"Adding tag failed: Value contains characters that are not allowed in a tag."

Changed in karl3:
assignee: nobody → Chris Rossi (chris-archimedeanco)
importance: Undecided → Low
milestone: none → m56
Changed in karl3:
status: New → In Progress
Chris Rossi (chris-archimedeanco) wrote :

I've looked into this and I think I could fix this if this check were being done server side. Validation for tags is done using this regular expression:

"^[a-zA-Z0-9\-\._]+$"

This could be changed to this, which is equivalent for ascii input:

"^[\w\d\-\._]+$"

The difference is that in Python if the re.UNICODE flag is supplied, the \w and \d metacharacters match any "word" or "digit" characters as defined in Unicode, and not just ascii a-zA-Z and 0-9 respectively.
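A minimal sketch of the difference described above, written for Python 3 (an assumption; the thread predates it). In Python 3, str patterns are already Unicode-aware, so re.UNICODE is redundant but harmless, and re.ASCII restores the ASCII-only meaning of \w and \d. The names ASCII_TAG, UNICODE_TAG, ASCII_FLAG_TAG and accepts are illustrative only and do not come from the KARL code.

```python
import re

# The validation pattern quoted in the report: ASCII letters, digits, "-", "." and "_".
ASCII_TAG = re.compile(r"^[a-zA-Z0-9\-\._]+$")

# The proposed replacement: \w and \d follow the Unicode definitions of
# "word" and "digit" characters when the pattern is Unicode-aware.
UNICODE_TAG = re.compile(r"^[\w\d\-\._]+$", re.UNICODE)

# The same pattern restricted to ASCII, for comparison.
ASCII_FLAG_TAG = re.compile(r"^[\w\d\-\._]+$", re.ASCII)


def accepts(pattern, tag):
    """Return True if the compiled pattern accepts the tag."""
    return pattern.match(tag) is not None


if __name__ == "__main__":
    for tag in ("release-1.0_beta", "國語", "two words"):
        print(
            repr(tag),
            accepts(ASCII_TAG, tag),
            accepts(UNICODE_TAG, tag),
            accepts(ASCII_FLAG_TAG, tag),
        )
```

For pure-ASCII input the three patterns should agree; only the Unicode-aware one accepts the tag from the report.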

While there does seem to be an implementation of this logic server side, I can't tell exactly when, if ever, it is called. In most cases the equivalent client side logic is called. This logic uses the same regular expression for validation, but per the ECMAScript standard, \w and \d are not Unicode-aware and match only their ASCII equivalents. (Some cursory googling reveals that some browsers break the standard here and implement these as Unicode-aware, but behavior is not uniform. Chromium appears to follow the standard.)

While Googling, I did find this JS library, which claims to provide Unicode regular expressions:

http://xregexp.com/plugins/

Potentially, also, we could use a server call to do the validation and consolidate all tag validation into a single place in Python code.
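One way the "single place in Python code" idea could look is sketched below. This is only an illustration: validate_tag and validate_tag_json are hypothetical names, no particular web framework is assumed, and the pattern is the ASCII one quoted earlier in this comment. The client would send the candidate tag to whatever view wraps validate_tag_json and display the returned error.

```python
import json
import re

# Single source of truth for what a tag may contain (pattern from the report).
# fullmatch() is used below, so no ^...$ anchors are needed here.
_TAG_RE = re.compile(r"[a-zA-Z0-9\-\._]+")

# Error text as quoted in the bug description.
_ERROR = "Value contains characters that are not allowed in a tag."


def validate_tag(value):
    """Validate one tag and return a JSON-serializable result dict."""
    if _TAG_RE.fullmatch(value):
        return {"ok": True, "tag": value}
    return {"ok": False, "error": _ERROR}


def validate_tag_json(value):
    """Hypothetical body of an AJAX response: the result encoded as JSON."""
    return json.dumps(validate_tag(value))
```

With this layout, changing the tag policy would mean editing only the one pattern; the client would no longer need its own copy of the regular expression.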

Since the solution for this is going to involve the client side code, this might be a good candidate for handing to Balazs.

Changed in karl3:
status: In Progress → Confirmed
Changed in karl3:
milestone: m61 → m62
Paul Everitt (paul-agendaless) wrote :

Balazs, could you read Chris's last comment and see what you think?

Changed in karl3:
assignee: Chris Rossi (chris-archimedeanco) → Balazs Ree (ree)
Changed in karl3:
milestone: m62 → m63
Balazs Ree (ree)
Changed in karl3:
status: Confirmed → In Progress
Balazs Ree (ree) wrote :

First, a policy question. I may be wrong here, but the original policy I was aware of was that tags contain nothing but ASCII. This excludes not only Chinese but any non-ASCII Unicode characters. Other examples currently prohibited are the Hungarian accented letters: áéíóöőúüű

Then, the issue of filtering the accepted tags on the server. Although it would be possible to actually do this on the server, because we make a tag search at the same time when adding, even on pages (so ajax is involved anyway) I would be fo

For the implementation, I agree with Chris that the sanest solution is to use xregexp. I tested this on the client and it works, satisfying the following requirements:

- from the ASCII range 0-127, only the alphanumeric characters and the characters _ - . are accepted.

- from anything outside ASCII (any other Unicode character), only word characters are accepted.
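The two requirements above can be sketched as a per-character check. This is a Python illustration of the rule only, not the xregexp code that was tested on the client; char_allowed and tag_allowed are hypothetical names, and Python's Unicode-aware \w stands in for "word character".

```python
import re
import string

# Rule 1: within ASCII (code points 0-127), only these characters are accepted.
ASCII_ALLOWED = set(string.ascii_letters + string.digits + "_-.")

# Rule 2: outside ASCII, accept only Unicode "word" characters.
# For str patterns in Python 3, \w is Unicode-aware by default.
WORD_CHAR = re.compile(r"\w")


def char_allowed(ch):
    """Apply the ASCII rule or the non-ASCII rule to a single character."""
    if ord(ch) < 128:
        return ch in ASCII_ALLOWED
    return WORD_CHAR.fullmatch(ch) is not None


def tag_allowed(tag):
    """A tag is accepted if it is non-empty and every character is allowed."""
    return bool(tag) and all(char_allowed(ch) for ch in tag)
```

Under these assumptions the Chinese tag from the report and the Hungarian letters mentioned earlier would both pass, while spaces and ASCII punctuation other than _ - . would still be rejected.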

Once Paul confirms that this complies with the policy, I can do the remaining fixup on the server.

Paul Everitt (paul-agendaless) wrote :

Correct, the policy is that tags are pure ASCII. Stuff that requires no quoting to display in a URL. So the only thing we need to do is ensure they fail gracefully.

--Paul
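A short sketch of what the policy stated above implies, assuming Python 3's urllib.parse. Every character the ASCII pattern admits is left untouched by percent-encoding, while a non-ASCII tag is not. needs_url_quoting and add_tag are hypothetical helpers, and "failing gracefully" is modelled here as returning the error text from the report instead of raising.

```python
import re
from urllib.parse import quote

# ASCII-only tag pattern from the report (anchors replaced by fullmatch below).
ASCII_TAG = re.compile(r"[a-zA-Z0-9\-\._]+")

# Error text as quoted in the bug description.
ERROR = "Adding tag failed: Value contains characters that are not allowed in a tag."


def needs_url_quoting(tag):
    """True if percent-encoding would change the tag when placed in a URL."""
    return quote(tag, safe="") != tag


def add_tag(existing, tag):
    """Return (tags, error): the tag list is unchanged when the tag is rejected."""
    if not ASCII_TAG.fullmatch(tag):
        return existing, ERROR
    return existing + [tag], None
```

A tag accepted by add_tag never needs quoting, which is the property the policy is after; a rejected tag leaves the existing list as it was and yields a message the UI can show.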

(In reply to Balazs Ree's comment of Jul 5, 2011, 7:51 AM, shown above.)

Balazs Ree (ree) wrote :

I believe they currently do fail gracefully. The error message says that those characters are not allowed, so all is fine.

Paul Everitt (paul-agendaless) wrote :

OSF's decision early on was to only have ASCII tags.

Changed in karl3:
assignee: Balazs Ree (ree) → nobody
status: In Progress → Invalid