URL's highlighted and opened include trailing punctuation

Reported by John S. Gruber on 2011-07-13
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Lernid
Low
Mohammad AbuShady

Bug Description

For example with "http://upstart.ubuntu.com, " the trailing comma is considered part of the URL and shouldn't be. "Unable to load page" is produced in the browser. Removing the trailing comma produces a good URL.

Related branches

lp:~coalwater/lernid/fix-810122
Ready for review for merging into lp:lernid
John S. Gruber: Needs Fixing on 2012-04-19
Mohammad AbuShady (community): Resubmit (not sure :d) on 2012-04-18
Lernid Development Team: Pending requested 2012-04-18
Changed in lernid:
status: New → Confirmed
importance: Undecided → Low
Mohammad AbuShady (coalwater) wrote :

Any hints of which files I should look into ? Or where the error appears? I still didn't play with lernid much so i don't know.

John S. Gruber (jsjgruber) wrote :

You might want to start with lernid/widgets/Ircwidget.py line 126 where a regular expression is used to spot URL's in text that goes by. I assume ".", ",", and ")", etc. would produce similar problems.

If you decide to work on this bug report you should probably mark in "In Progress" with your self as "Assigned to".

It might be a good idea if you send me an e-mail, too before working on it. I'd love to help you fix this.

Mohammad AbuShady (coalwater) wrote :

I'm having problems running lernid from source, first if i run it without sudo i get segmentation error, if i do it fails saying it cant access property 'connect' from type None.
Any thing i should know?

Changed in lernid:
status: Confirmed → In Progress
assignee: nobody → Coalwater (coalwater)
Mohammad AbuShady (coalwater) wrote :

The problem wasn't in the highlighting function, it was in the regex that detects the url, i changed the algorithm, the old one was looking for a space to stop seeking, i changed it to by asking it to seek till it finds an invalid character, and so far i'm just using commas, but before i propose a merge i would like to add all invalid characters to that check, so if any body has any invalid characters i should put in the regex check please tell me
The regex i changed:
http://bazaar.launchpad.net/~coalwater/lernid/fix-810122/revision/228#lernid/widgets/IrcWidget.py

summary: - URL's highlighted and opened include trailing comma
+ URL's highlighted and opened include trailing punctuation
John S. Gruber (jsjgruber) wrote :

Coalwater:
Before going too far, it may be wise to read the appropriate URI and URL RFC's. It appears to me that there is no fool-proof way to do this for all characters the instructor might be adding as punctuation rather than part of the URL, but I don't know that requiring URL's to be encapsulated in < > will work, it's hard to type.

Since whatever we do will have to be approximate, I wonder if looking for whitespace, and then backing out what seems to be punctuation from the end of the resulting URL would work. While an embedded "." is essential (as in http://ubuntu.com) a trailing one is probably punctuation. We will be risking that trailing "." is really part of the URL and not intended as punctuation, but it's much more probable that a trailing "." will be punctuation. A trailing close parenthesis ")" or "]" should also probably be eliminated, too, as should combinations, such as ")." "]." ".)" ".]" ")," ]," .

I'm interested in the thoughts of all others on this.

Mohammad AbuShady (coalwater) wrote :

What about trying to use an already made regex, like i found this link by looking around a bit
http://alanstorm.com/url_regex_explained

Mohammad AbuShady (coalwater) wrote :

funny thing it had a url to the owner of the regex, and it has your name.. http://daringfireball.net/2010/07/improved_regex_for_matching_urls

Elizabeth Krumbach Joseph (lyz) wrote :

I agree with John, since commas actually are valid I agree that it should only look for a trailing one with a space afterwords (so ", " etc).

Even this isn't fool-proof, but I think the likelihood of having a valid URL ending with a special character seems unlikely.

John S. Gruber (jsjgruber) wrote :

@Coalwater,

Believe me, I'm no 'daring fireball' (on either count).

I see that the URL is located for highlighting in the place you are looking, and it is also located for automatically changing the browser page in Browser.py . Also note that the browser will do the appropriate thing if the instructor posts a link of the form http://www.google.com. (note the trailing "."). "http://www.google.com," doesn't work. So there are actually two things that need to work, tagging the text for color and to make URL's clickable, and the automatic browser changing in Browser.py (which already handles trailing periods).

Mohammad AbuShady (coalwater) wrote :

http://bazaar.launchpad.net/~coalwater/lernid/fix-810122/revision/229#lernid/widgets/IrcWidget.py
so this version only ignores trailing commas and dots, cause from the conversation i understood that this would be sufficient right?
and i didn't change the browser widget because i thought it's unnecessary because it ignores only trailing dots and it works fine with links that are passed without the dot too, but if someone thinks i should i could do it.

John S. Gruber (jsjgruber) wrote :

I think it would be best to ignore trailing punctuation from both Ircwidget.py and Browser.py and I still like the approach I mentioned in https://bugs.launchpad.net/lernid/+bug/810122/comments/5. IIRC, there is some code that now removes trailing periods that could be used as a model for what could be done to remove commas, too.

John S. Gruber (jsjgruber) wrote :

coalwater, do you need any help on this one?

John S. Gruber (jsjgruber) wrote :

Another example just happened:
<http://myapps.developer.ubuntu.com>.

The period was omitted but the matching > was kept, meaning it went to theh wrong place.

Mohammad AbuShady (coalwater) wrote :

How's this one ?
http://www.rubular.com/r/6snbljICDQ
note that the one not beginning with 'http' doesn't work, test it with different cases, and if it's good I'll push it

John S. Gruber (jsjgruber) wrote :

coalwater, I tried about everything I could think of and it worked. Please propose it if you are ready. I've been importing your other branches.

Thanks.

Mohammad AbuShady (coalwater) wrote :

It's pushed, but i have a question first, currently it doesn't recognize those patterns ( example.com | www.example.com ) should i at least add support for the www one ?

I don't think we should trigger without a complete url. Instructors should
be able to mention a site without starting a load, IMHO.
On Apr 1, 2012 2:01 PM, "Mohammad AbuShady" <email address hidden> wrote:

> It's pushed, but i have a question first, currently it doesn't recognize
> those patterns ( example.com | www.example.com ) should i at least add
> support for the www one ?
>
> --
> You received this bug notification because you are subscribed to the bug
> report.
> https://bugs.launchpad.net/bugs/810122
>
> Title:
> URL's highlighted and opened include trailing punctuation
>
> To manage notifications about this bug go to:
> https://bugs.launchpad.net/lernid/+bug/810122/+subscriptions
>

Mohammad AbuShady (coalwater) wrote :

I've done the merge proposal, you could test the branch, I've tested it with the same cases that were in the rubular test they all worked.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers