URL linkification not Unicode aware

Bug #78898 reported by Stuart Bishop on 2007-01-12

This bug affects 4 people

Affects            Status    Importance   Assigned to   Milestone
Launchpad itself   Triaged   Low          Unassigned

Bug Description

As can be seen with https://launchpad.net/launchpad/+bug/78780 (where the
last character of the example URL is not part of the link), Launchpad thinks
URLs only contain ASCII characters.

See also bug 394908 for another example where linkification did not work (in team descriptions).

James Henstridge (jamesh) wrote on 2007-01-15:

URLs _are_ composed of only ASCII characters.

The link in Matsubara's bug report is an IRI (internationalised resource identifier), which the regexps we use for detecting links don't catch.

I am not sure what characters are allowed in an IRI exactly.

Stuart Bishop (stub) wrote on 2007-01-16:

In this case, the purpose is to mark up the user's input. It doesn't matter what characters are technically allowed; if it looks like a URL, we should mark it up like a URL, excluding any trailing punctuation.

e.g. I might want to add a bug report:

When I go to a URL like http://☣.net/, it works except Firefox rewrites the URL to the ASCII form in the URL bar and it looks fugly. I'm not sure if this is a bug or an anti-phishing feature.

(As an aside, I don't think Launchpad should use the technical definition of a URL circa 1993, but the real world definition. The only time it needs to encode a URL as US-ASCII is when generating HTTP headers. Mail readers and web browsers do the right thing and correctly encode Unicode URLs embedded in HTML to US-ASCII for transport over HTTP, so there is no reason to display uglified URLs in our HTML output. But this is generally irrelevant, as we have standardized on ASCII URL components everywhere except for user-entered external URLs.)

James Henstridge (jamesh) wrote on 2007-01-17:

Sure. I am not suggesting that we ignore such links. We just need to work out what the regexps would need to look like to find IRIs in text.

Stuart Bishop (stub) wrote on 2007-01-18:

James Henstridge wrote:
> Sure. I am not suggesting that we ignore such links. We just need to
> work out what the regexps would need to look like to find IRIs in text.

I would suggest:

(?ux)((?:telnet:|mailto:|\w+://)[^\s]*[^\s.,()\[\]{}+_=\-\*!'"`;:?<>&|]+)

(which deliberately doesn't match ? and & at the end of strings, which will
be either harmless or what you want in almost all cases).
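
To see how the suggested pattern behaves, here is a minimal sketch assuming Python 3's `re` module. The pattern is copied unchanged from above; the helper name `find_links` and the sample strings are hypothetical and are not taken from Launchpad's code.

```python
import re

# Pattern quoted from the suggestion above, unchanged. In Python 3 the (?u)
# flag is redundant for str patterns (\w and \s already match Unicode), and
# (?x) only ignores whitespace that appears outside character classes.
LINK_RE = re.compile(
    r'''(?ux)((?:telnet:|mailto:|\w+://)[^\s]*[^\s.,()\[\]{}+_=\-\*!'"`;:?<>&|]+)'''
)


def find_links(text):
    """Return every substring of text that the pattern treats as a link."""
    return [match.group(1) for match in LINK_RE.finditer(text)]


if __name__ == "__main__":
    # Hypothetical sample inputs, chosen to exercise trailing punctuation.
    samples = [
        "When I go to a URL like http://☣.net/, it works",
        "(see http://example.com/a?b=1&c=2).",
        "Is it http://example.com/search?",
    ]
    for sample in samples:
        print(find_links(sample))
```

With these inputs the non-ASCII host is kept inside the link and the trailing comma, closing parenthesis, full stop, and question mark are left outside it, while `?` and `&` in the middle of a query string still match.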

Do we have any other well known protocols that use just : instead of the
more common :// ? I would hesitate to use just \w+: as the protocol match as
it would give too many false positives.

Do we care about special URLs like about: and blank: ?

Matthew Paul Thomas (mpt) wrote on 2007-01-21:

jabber:<email address hidden> has no "//".
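
One way to cover schemes such as jabber: without falling back to a bare \w+: match is an explicit allowlist of colon-only schemes. The following is only a sketch under that assumption: the names `COLON_ONLY_SCHEMES`, `TRAILING_EXCLUDED`, and `build_link_re` are hypothetical, and the set of trailing characters is copied from the pattern suggested earlier in this thread.

```python
import re

# Hypothetical allowlist of schemes that are written with ":" and no "//".
COLON_ONLY_SCHEMES = ("telnet", "mailto", "jabber")

# Characters that may not end a link, copied from the pattern suggested
# earlier in the thread.
TRAILING_EXCLUDED = r'''\s.,()\[\]{}+_=\-\*!'"`;:?<>&|'''


def build_link_re(colon_only=COLON_ONLY_SCHEMES):
    """Compile a link pattern that accepts the given colon-only schemes
    in addition to anything of the form scheme://."""
    schemes = "|".join(re.escape(name) + ":" for name in colon_only)
    return re.compile(r"((?:%s|\w+://)[^\s]*[^%s]+)" % (schemes, TRAILING_EXCLUDED))


def find_links(text, link_re=None):
    """Return every substring of text that the pattern treats as a link."""
    link_re = link_re or build_link_re()
    return [match.group(1) for match in link_re.finditer(text)]


if __name__ == "__main__":
    print(find_links("Contact me at jabber:user@example.com."))
    print(find_links("note: see about:blank for details"))
```

Because only listed schemes match without "//", ordinary prose such as "note: see" is not linkified, and about: stays unlinked unless someone adds it to the allowlist on purpose.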

Diogo Matsubara (matsubara) on 2009-07-15
affects: malone → launchpad-foundations
description: updated

Curtis Hovey (sinzui) on 2010-11-13
tags: added: tales

Curtis Hovey (sinzui) on 2010-11-15
affects: launchpad-foundations → launchpad-web

Curtis Hovey (sinzui) on 2010-12-13
tags: added: bugjam2010
