Open link functionality not entirely correct

Bug #1514578 reported by Michael Heuberger
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Terminator
Fix Released
Low
Stephen Boddy

Bug Description

The "open " link functionality in Terminator isn't entirely correct. Probably the regex expression to extract URLs?

For example I have these log lines in the bash

{"level":20,"msg":"Querying videomail data with key 11e5-866e-47887d40-9a32-6fc15908d5b4","time":"2015-11-09T11:10:24.240Z","v":0}
{"level":30,"msg":"(i) User at 121.75.34.60 is reading \"something\" (https://videomail.io/videomail/something-902076667689)","time":"2015-11-09T11:10:24.245Z","v":0}

When I hover the mouse over the https link on the second line, then it opens exactly this URL:
https://videomail.io/videomail/something-902076667689)","time":"2015-11-09T11:10:24.245Z","v":0}

But I expect it to open that URL
https://videomail.io/videomail/something-902076667689

Brackets are not a valid part of URLs and should be excluded. Can you correct this in the regex?

Thanks
Michael

Related branches

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Actually brackets are valid. See:
http://tools.ietf.org/html/rfc3986#section-2.2
http://tools.ietf.org/html/rfc3986#section-3.3

Specifically the sub-delimiter definition:
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

And the pchar (path characters) definition:
pchar = unreserved / pct-encoded / sub-delims / ":" / "@"

As far as I've understood, these sub-delimiters are valid in a URL as they are, and optionally can be used as a delimiter on a schema by schema basis, in which case an actual character not being used as a delimiter would be encoded.

Ironically the earlier RFC (http://tools.ietf.org/html/rfc1738) superseded by the one above states it more clearly!

  "Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL."

Curiously gnome-terminal does seems to have removed these from their pattern matching from a quick test, but unless you can point to an authoritative source that clearly contradicts my conclusion, the regex will stay as is.

Perhaps you have the ability to change the log format to use a different bracket? Both square and curly brackets work in Terminator.

I'm marking this as opinion for now, so it can remain open to discuss.

Changed in terminator:
status: New → Opinion
importance: Undecided → Low
Revision history for this message
Michael Heuberger (michael.heuberger) wrote :

Good point + research Stephen, thanks. Yeah, I think I ll have to change the log format on my side.

But tell me first, what about the double quote sign " ? It is not mentioned in the RFC, yet your regex includes it.

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Good spot. That code was added a long, long time ago by the original author, Chris. It does look like that double-quote character snuck in under the radar. I'll elevate this back to an "In Progress" assigned to me, and remove it when I get the chance.

Changed in terminator:
status: Opinion → In Progress
assignee: nobody → Stephen Boddy (stephen-j-boddy)
Revision history for this message
Michael Heuberger (michael.heuberger) wrote :

thx, please do

Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

gnome-terminal regex suffers from plenty of problems, a complete rewrite is in progress: https://bugzilla.gnome.org/show_bug.cgi?id=756038.

Note that for recognizing URLs, sometimes the intent is to exclude characters that are valid otherwise. E.g. the trailing dot in the previous paragraph could be part of the URL, but noone should assume so.

Revision history for this message
Michael Heuberger (michael.heuberger) wrote :

great, hope all goes well with the rewrite

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

@Michael: The rewrite mentioned if for gnome-***terminal***, not Terminator. I'm not sure if/how the Terminator project will change it's regex's. I do however still need to remove that double-quote. My excuse for not having done it yet is that Egmont has been running me ragged ;-)

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Curious, I don't entirely follow the regex, but even with the double-quote, this works:
    echo 'under "http://www.w3.org/Addressing/", but '

but this doesn't:
    echo 'under "http://www.w3.org/Addressing/"fff, but '

I'll remove it anyway, as it shouldn't be there. Oh and FYI the RFC also says that URL's should either be wrapped in whitespace, double-quotes ("), or angle brackets (<>). See:
    http://tools.ietf.org/html/rfc3986#appendix-C

Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

... and what to do with https://en.wikipedia.org/wiki/" ?

It's not an exact science, it's heuristics and there's no single way of doing it right or wrong. Each solution will have corner cases where it breaks.

That RFC was written 10 years ago and practice might have change a lot since then. E.g. nowadays everyone assumes that it's safe to place a trailing dot or comma right after the URL without a whitespace. That's a frequent real-life case. "http://www.w3.org/Addressing/"fff is a rare corner case that I wouldn't care about too much.

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Closing this down for now. Maybe a future task will be to port what gnome-terminal is doing now. Would need to understand if it could support URL plugins.

Changed in terminator:
status: In Progress → Fix Committed
Changed in terminator:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.