URL's highlighted and opened include trailing punctuation

Bug #810122 reported by John S. Gruber
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Lernid
In Progress
Low
Mohammad AbuShady

Bug Description

For example with "http://upstart.ubuntu.com, " the trailing comma is considered part of the URL and shouldn't be. "Unable to load page" is produced in the browser. Removing the trailing comma produces a good URL.

Related branches

Changed in lernid:
status: New → Confirmed
importance: Undecided → Low
Revision history for this message
Mohammad AbuShady (coalwater) wrote :

Any hints of which files I should look into ? Or where the error appears? I still didn't play with lernid much so i don't know.

Revision history for this message
John S. Gruber (jsjgruber) wrote :

You might want to start with lernid/widgets/Ircwidget.py line 126 where a regular expression is used to spot URL's in text that goes by. I assume ".", ",", and ")", etc. would produce similar problems.

If you decide to work on this bug report you should probably mark in "In Progress" with your self as "Assigned to".

It might be a good idea if you send me an e-mail, too before working on it. I'd love to help you fix this.

Revision history for this message
Mohammad AbuShady (coalwater) wrote :

I'm having problems running lernid from source, first if i run it without sudo i get segmentation error, if i do it fails saying it cant access property 'connect' from type None.
Any thing i should know?

Changed in lernid:
status: Confirmed → In Progress
assignee: nobody → Coalwater (coalwater)
Revision history for this message
Mohammad AbuShady (coalwater) wrote :

The problem wasn't in the highlighting function, it was in the regex that detects the url, i changed the algorithm, the old one was looking for a space to stop seeking, i changed it to by asking it to seek till it finds an invalid character, and so far i'm just using commas, but before i propose a merge i would like to add all invalid characters to that check, so if any body has any invalid characters i should put in the regex check please tell me
The regex i changed:
http://bazaar.launchpad.net/~coalwater/lernid/fix-810122/revision/228#lernid/widgets/IrcWidget.py

summary: - URL's highlighted and opened include trailing comma
+ URL's highlighted and opened include trailing punctuation
Revision history for this message
John S. Gruber (jsjgruber) wrote :

Coalwater:
Before going too far, it may be wise to read the appropriate URI and URL RFC's. It appears to me that there is no fool-proof way to do this for all characters the instructor might be adding as punctuation rather than part of the URL, but I don't know that requiring URL's to be encapsulated in < > will work, it's hard to type.

Since whatever we do will have to be approximate, I wonder if looking for whitespace, and then backing out what seems to be punctuation from the end of the resulting URL would work. While an embedded "." is essential (as in http://ubuntu.com) a trailing one is probably punctuation. We will be risking that trailing "." is really part of the URL and not intended as punctuation, but it's much more probable that a trailing "." will be punctuation. A trailing close parenthesis ")" or "]" should also probably be eliminated, too, as should combinations, such as ")." "]." ".)" ".]" ")," ]," .

I'm interested in the thoughts of all others on this.

Revision history for this message
Mohammad AbuShady (coalwater) wrote :

What about trying to use an already made regex, like i found this link by looking around a bit
http://alanstorm.com/url_regex_explained

Revision history for this message
Mohammad AbuShady (coalwater) wrote :

funny thing it had a url to the owner of the regex, and it has your name.. http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Revision history for this message
Elizabeth K. Joseph (lyz) wrote :

I agree with John, since commas actually are valid I agree that it should only look for a trailing one with a space afterwords (so ", " etc).

Even this isn't fool-proof, but I think the likelihood of having a valid URL ending with a special character seems unlikely.

Revision history for this message
John S. Gruber (jsjgruber) wrote :

@Coalwater,

Believe me, I'm no 'daring fireball' (on either count).

I see that the URL is located for highlighting in the place you are looking, and it is also located for automatically changing the browser page in Browser.py . Also note that the browser will do the appropriate thing if the instructor posts a link of the form http://www.google.com. (note the trailing "."). "http://www.google.com," doesn't work. So there are actually two things that need to work, tagging the text for color and to make URL's clickable, and the automatic browser changing in Browser.py (which already handles trailing periods).

Revision history for this message
Mohammad AbuShady (coalwater) wrote :

http://bazaar.launchpad.net/~coalwater/lernid/fix-810122/revision/229#lernid/widgets/IrcWidget.py
so this version only ignores trailing commas and dots, cause from the conversation i understood that this would be sufficient right?
and i didn't change the browser widget because i thought it's unnecessary because it ignores only trailing dots and it works fine with links that are passed without the dot too, but if someone thinks i should i could do it.

Revision history for this message
John S. Gruber (jsjgruber) wrote :

I think it would be best to ignore trailing punctuation from both Ircwidget.py and Browser.py and I still like the approach I mentioned in https://bugs.launchpad.net/lernid/+bug/810122/comments/5. IIRC, there is some code that now removes trailing periods that could be used as a model for what could be done to remove commas, too.

Revision history for this message
John S. Gruber (jsjgruber) wrote :

coalwater, do you need any help on this one?

Revision history for this message
John S. Gruber (jsjgruber) wrote :

Another example just happened:
<http://myapps.developer.ubuntu.com>.

The period was omitted but the matching > was kept, meaning it went to theh wrong place.

Revision history for this message
Mohammad AbuShady (coalwater) wrote :

How's this one ?
http://www.rubular.com/r/6snbljICDQ
note that the one not beginning with 'http' doesn't work, test it with different cases, and if it's good I'll push it

Revision history for this message
John S. Gruber (jsjgruber) wrote :

coalwater, I tried about everything I could think of and it worked. Please propose it if you are ready. I've been importing your other branches.

Thanks.

Revision history for this message
Mohammad AbuShady (coalwater) wrote :

It's pushed, but i have a question first, currently it doesn't recognize those patterns ( example.com | www.example.com ) should i at least add support for the www one ?

Revision history for this message
John S. Gruber (jsjgruber) wrote : Re: [Bug 810122] Re: URL's highlighted and opened include trailing punctuation

I don't think we should trigger without a complete url. Instructors should
be able to mention a site without starting a load, IMHO.
On Apr 1, 2012 2:01 PM, "Mohammad AbuShady" <email address hidden> wrote:

> It's pushed, but i have a question first, currently it doesn't recognize
> those patterns ( example.com | www.example.com ) should i at least add
> support for the www one ?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/810122
>
> Title:
> URL's highlighted and opened include trailing punctuation
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/lernid/+bug/810122/+subscriptions
>

Revision history for this message
Mohammad AbuShady (coalwater) wrote :

I've done the merge proposal, you could test the branch, I've tested it with the same cases that were in the rubular test they all worked.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.