URL linkification not Unicode aware

Bug #78898 reported by Stuart Bishop
This bug affects 4 people
Affects: Launchpad itself
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

As can be seen with https://launchpad.net/launchpad/+bug/78780 (where the
last character of the example URL is not part of the link), Launchpad thinks
URLs only contain ASCII characters.

See also bug 394908 for another example where linkification didn't work, this time in team descriptions.

James Henstridge (jamesh) wrote :

URLs _are_ composed of only ASCII characters.

The link in Matsubara's bug report is an IRI (internationalised resource identifier), which the regexps we use for detecting links don't catch.

I am not sure what characters are allowed in an IRI exactly.

Stuart Bishop (stub) wrote :

In this case, the purpose is to mark up the user's input. It doesn't matter what characters are technically allowed; if it looks like a URL, we should mark it up like a URL, except for trailing punctuation.

E.g. I might want to add a bug report:

    When I go to a URL like http://☣.net/, it works except Firefox rewrites the URL to the ASCII form in the URL bar and it looks fugly. I'm not sure if this is a bug or an anti-phishing feature.

(As an aside, I don't think Launchpad should use the technical definition of a URL circa 1993, but the real-world definition. The only time it needs to encode a URL as US-ASCII is when generating HTTP headers. Mail readers and web browsers do the right thing and correctly encode Unicode URLs embedded in HTML to US-ASCII for transport over HTTP, so there is no reason to display uglified URLs in our HTML output. But this is generally irrelevant, as we have standardized on ASCII URL components everywhere except for user-supplied external URLs.)
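
To make the "encode to US-ASCII only for transport" step concrete, here is a minimal sketch of what a client might do with a Unicode URL before putting it on the wire. It is not Launchpad code: the function name `iri_to_ascii_uri` and the example host `bücher.example` are assumptions for illustration, it only handles http-style URLs with a host, and it relies on Python's built-in `idna` codec (IDNA 2003) plus UTF-8 percent-encoding.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def iri_to_ascii_uri(iri):
    """Return an ASCII-only form of an http-style Unicode URL (illustrative helper)."""
    parts = urlsplit(iri)
    if parts.hostname is None:
        # No authority component (e.g. mailto:); out of scope for this sketch.
        raise ValueError("expected a URL with a host")
    # Host labels are converted to Punycode by the built-in codec.
    netloc = parts.hostname.encode("idna").decode("ascii")
    if parts.port is not None:
        netloc += ":%d" % parts.port
    # Path, query and fragment are percent-encoded as UTF-8 bytes;
    # "%" is kept safe so already-encoded input is not double-encoded.
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    fragment = quote(parts.fragment, safe="%")
    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_ascii_uri("http://bücher.example/päth?q=ü"))
```

The displayed text can stay in its Unicode form; only the value handed to the HTTP layer needs this conversion.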

James Henstridge (jamesh) wrote :

Sure. I am not suggesting that we ignore such links. We just need to work out what the regexps would need to look like to find IRIs in text.

Stuart Bishop (stub) wrote : Re: [Bug 78898] Re: URL linkification not Unicode aware

James Henstridge wrote:
> Sure. I am not suggesting that we ignore such links. We just need to
> work out what the regexps would need to look like to find IRIs in text.

I would suggest:

(?ux)((?:telnet:|mailto:|\w+://)[^\s]*[^\s.,()\[\]{}+_=\-\*!'"`;:?<>&|]+)

(which deliberately doesn't match ? and & at the end of strings, which will
be either harmless or what you want in almost all cases).
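
A quick way to sanity-check the suggested expression is to run it over sample text. The sketch below uses the pattern exactly as written above under Python 3; the helper name `find_links` and the sample strings are only for illustration and are not part of Launchpad.

```python
import re

# The suggested pattern, unchanged. (?u) makes \w and \s Unicode aware
# (already the default for str patterns in Python 3); (?x) is verbose mode.
LINK_RE = re.compile(
    r"""(?ux)((?:telnet:|mailto:|\w+://)[^\s]*[^\s.,()\[\]{}+_=\-\*!'"`;:?<>&|]+)"""
)


def find_links(text):
    """Return every substring the pattern would turn into a link."""
    return [match.group(1) for match in LINK_RE.finditer(text)]


# The trailing comma stays outside the link; the non-ASCII host is kept.
print(find_links("When I go to a URL like http://☣.net/, it works"))
# A trailing & is dropped from the match, as described above.
print(find_links("try http://example.com/?q=1& now"))
```

The final character class is what trims trailing punctuation: the greedy `[^\s]*` backtracks until the last matched character is not one of the excluded symbols.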

Do we have any other well known protocols that use just : instead of the
more common :// ? I would hesitate to use just \w+: as the protocol match as
it would give too many false positives.

Do we care about special URLs like about: and blank: ?

--
Stuart Bishop <email address hidden> http://www.canonical.com/
Canonical Ltd. http://www.ubuntu.com/

Matthew Paul Thomas (mpt) wrote :

jabber:<email address hidden> has no "//".
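
One way to accommodate schemes like this without falling back to a bare `\w+:` match is to build the pattern from an explicit list of colon-only schemes. This is a sketch under that assumption; the names `COLON_ONLY_SCHEMES` and `build_link_pattern`, and the sample address, are hypothetical.

```python
import re

# Hypothetical allowlist of schemes written with ":" rather than "://".
COLON_ONLY_SCHEMES = ("telnet", "mailto", "jabber")


def build_link_pattern(colon_only_schemes=COLON_ONLY_SCHEMES):
    """Compile a link pattern that accepts the listed schemes or any \\w+:// scheme."""
    explicit = "|".join(re.escape(scheme) + ":" for scheme in colon_only_schemes)
    return re.compile(
        r"((?:" + explicit + r"|\w+://)"
        r"[^\s]*"
        # Last character must not be whitespace or trailing punctuation.
        r"""[^\s.,()\[\]{}+_=\-\*!'"`;:?<>&|]+)"""
    )


pattern = build_link_pattern()
# "Note:" is not linkified, because only listed schemes may use a bare colon.
print(pattern.findall("Note: ping jabber:user@example.org, or see http://☣.net/."))
```

Keeping the list explicit means ordinary prose such as "Note: ..." is left alone, at the cost of having to add each new colon-only scheme by hand.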

affects: malone → launchpad-foundations
description: updated
Curtis Hovey (sinzui)
tags: added: tales
Curtis Hovey (sinzui)
affects: launchpad-foundations → launchpad-web
Curtis Hovey (sinzui)
tags: added: bugjam2010