(gtk3) Cleanup regex lboundry/rboundry and posix/gnu

Bug #1521509 reported by Egmont Koblinger
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Terminator
Fix Released
Undecided
MattRose

Bug Description

Bringing over from bug 1030562 comments 59-71.

There's a "FIXME FOR GTK3" in the source: There's a complicated boolean "try_posix_regexp" being platform dependant, and then based on this some weird values are assigned to lboundry/rboundry, which are later overwritten to be empty because that's how it seemed to work for me.

Probably this whole thing can be thrown out.

Or at least it should be fixed – I changed these for a reason, they don't work in their current form.

It's really ugly to guess the engine behind GLib based on the platform, you just can't guarantee to get it right. Regexps that aren't compatible across engines should hence be avoided.

The goal I believe is to make sure the URL is an entire "word", so that http://example.com/ matches but foohttp://example.com/ does not. It's really a minor goal and is hardly a problem if we accept a "false" match here (not the "foo" bits of course, but the "http://example.com/" that's immediately preceded by letters). I think it's better to have such false matches than a platform dependent code.

Another tricky but often occurring case is that the string "See http://example.com/asdf.html." should match up to "asdf.html" but exclude the trailing dot – not that it cannot be part of a URL, but it's not typical, it's way more typical for it to end a sentence.

Actually: now it looks to me that the "posix" definition of l/rboundry works for me (at least on my GLib) too, so we don't need the try_posix_regexp variable and the branch, just this "[[:<:]]" and "[[:>:]]" definition of l/rboundry and we're hopefully good.

See http://www.rexegg.com/regex-boundaries.html#leftright

Also see bug 1514578 and the gnome-terminal regex rewrite bug referred from that. Note that gnome-terminal doesn't care if the URL is prefixed by alphanumerical stuff and still recognizes the URL – hmmmm, I might think about refusing them as per terminator-gtk2.

Action plan for now: Remove the boolean, go for the 'posix'-kind of definition of l/rboundry and see if it breaks.

Related branches

description: updated
Revision history for this message
MattRose (mattrose) wrote :

I tried the action plan. The "posix" definition for l/r boundry breaks how I would think url detection should work.

Got rid of l/r boundry and any kind of try_posix_regexes references.

According to my reading of Glib documentation, this will use gnulib and/or PCRE regexes if they have been compiled into Glib, if not they will fall back to using posix extended regexes.

All the code that recognizes basic urls is written with posix-compatible regexes. There are some plugins that use PCRE style \b regexes, but those would never have worked.

I'll throw a patch in tonight. I've tested it and it works great.

Changed in terminator:
assignee: nobody → MattRose (mattrose)
Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

> breaks how I would think url detection should work

Could you please give a concrete example, just for the record?

Revision history for this message
MattRose (mattrose) wrote :

well, with the lboundry set to either regex, this:

http://launchpad.net is NOT recognized as a url.
<http://launchpad.net> is recognized as a url, but the url matcher doesn't strip the <> characters when opening up the url in your browser, leading to your browser trying to open <http://launchpad.net> and failing.

We *could* alter this behaviour so that it matches a the posix [[:space:]] regex as l and r boundary, and then strip that off when passing the url on, but I think we just match foohttp://example.com/ and call it a day.

Revision history for this message
MattRose (mattrose) wrote :

This would have the bonus effect of matching "http://example.com" and <http://example.com> and passing it along properly to the receiver of the url.

Revision history for this message
MattRose (mattrose) wrote :

just read that link in the description. I'll try with the posix boundary again before I attach the patch, just to make sure I'm right. You'll see how my tests turned out by reading the patch :)

Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

> well, with the lboundry set to either regex, this:
> http://launchpad.net is NOT recognized as a url.

Really, a single URL (surrounded by whitespace) is not recognized? That's not what I get.

> <http://launchpad.net> is recognized as a url, but the url matcher doesn't strip the <> characters

Again, it's not the behavior I get. You might have broken something in your patch.

> We *could* alter this behaviour so that it matches a the posix [[:space:]] regex as l and r boundary

Please don't, as vte would still underline those characters.

If you really want to prevent matching the URL inside the string "foohttp://example.com", you'd need either those current l/rboundries (or something similar concept), or negative lookarounds (which is a totally awesome, lesser-known regex feature).

Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

> You might have broken something in your patch.

Or, sorry, suddenly forgot about this possibility: You might have a different glib (maybe compiled against a different regex engine) causing troubles. I don't know.

Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

Just occurred to me: gnome-terminal developer Christian once told me that glib's regex engine is PCRE.

Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

This is what I have now and works for me – does it work for you guys too?

Only the gtk3 branch is relevant.

Okay, actually I'm totally puzzled here. How come the GNU-style word boundaries work with terminator-gtk2 and not with terminator-gtk3, when Gtk is obviously irrelevant and Glib is the same underneath the two??? Could this maybe be a difference in vte's internals how it handles the regex string?

Anyway, I don't necessarily want to understand this, just fix for gtk3 :)

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Oh, and just to gently roll a potential live grenade into the middle of this, also bear in mind that plugins use the url matching stuff too. See the included manual for more details, but it includes launchpad links for bugs, or ppa's (i.e. lp:123456), apt links for sw (i.e. apt:terminator), Maven (no clue!), and there are third party plugins for Mozilla's bugzilla.

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Probably an ideal candidate for some automated tests and test driven development?

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Egmont: your patch gives the following when I try to start 'er up:

sboddy@pinpoint:~/Development/terminator/bugfix-egmontkoblinger$ ./terminator -g deletemeconfig -u
Traceback (most recent call last):
  File "./terminator", line 120, in <module>
    TERMINATOR.create_layout(OPTIONS.layout)
  File "/home/sboddy/Development/terminator/bugfix-egmontkoblinger/terminatorlib/terminator.py", line 281, in create_layout
    window, terminal = self.new_window()
  File "/home/sboddy/Development/terminator/bugfix-egmontkoblinger/terminatorlib/terminator.py", line 208, in new_window
    terminal = maker.make('Terminal')
  File "/home/sboddy/Development/terminator/bugfix-egmontkoblinger/terminatorlib/factory.py", line 94, in make
    output = func(**kwargs)
  File "/home/sboddy/Development/terminator/bugfix-egmontkoblinger/terminatorlib/factory.py", line 106, in make_terminal
    return(terminal.Terminal())
  File "/home/sboddy/Development/terminator/bugfix-egmontkoblinger/terminatorlib/terminal.py", line 145, in __init__
    self.update_url_matches()
  File "/home/sboddy/Development/terminator/bugfix-egmontkoblinger/terminatorlib/terminal.py", line 276, in update_url_matches
    reg = GLib.Regex.new(re, GLib.RegexCompileFlags.OPTIMIZE, 0)
gi._glib.GError: Error while compiling regular expression [[:<:]](news:|telnet:|nntp:|file:/|https?:|ftps?:|webcal:)//([-A-Za-z0-9]+(:[-A-Za-z0-9,?;.:/!%$^*&~"#']+)?@)?[-A-Za-z0-9.]+(:[0-9]+)?(/[-A-Za-z0-9_$.+!*(),;:@&=?/~#%'"]*[^]'.}>)
,\"])?[[:>:]]/? at char 3: unknown POSIX class name

With:
ii gir1.2-glib-2.0 1.40.0-1ubuntu0.2 i386 Introspection data for GLib, GObject, Gio and GModule
ii libglib2.0-0:i386 2.40.2-0ubuntu1 i386 GLib library of C routines

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

So setting the l/r boundaries to "" at least lets Terminator start. Seems like we're quantum entangled with POSIX/GNU/PCRE/Sesame Street regular expressions.

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

OK, So a version that works for me (on top of your patch Egmont) is

lboundary = "\\b"
rboundary = "\\b"

Boundaries (according to my quick read) mark where a transition happens between word chars and non-word chars, i.e.:

bob+@~#123

So a first boundary is marked between b and +, and a second is marked between # and 1.

regex "bob\b"

would match the first three chars of the above string, but not the following:

bobby+@~#123

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

Oh, and the double \ ("\\b") is because of Python string escape sequences. In plain text the regex is "\b".

Revision history for this message
Egmont Koblinger (egmont-gmail) wrote :

> it includes launchpad links for bugs, or ppa's (i.e. lp:123456)

Didn't know we had this. Awesome!

> Probably an ideal candidate for some automated tests and test driven
> development?

See the work-in-progress regex rewrite of g-t at https://bugzilla.gnome.org/show_bug.cgi?id=756038, it's full of tests.

Those, however, only check against whichever regex engine that particular glib uses.

> Egmont: your patch gives the following when I try to start 'er up

This sucks.

> lboundary = "\\b"

Hooray, this works for me too!

Revision history for this message
Stephen Boddy (stephen-j-boddy) wrote :

That "\b" actually came from the GLib manual page, and as GLib is now directly called to produce the regex, and the return value passed to the vte match_add_gregex method it looks like the posix stuff is indeed not used/usable.

@Matt, Can you confirm if Egmont's patch (with my tweak) works for you? If so, I'll commit it (along with removing an incorrect double-quote for lp:1514578

Revision history for this message
MattRose (mattrose) wrote :

Looks good.

ship it!

Revision history for this message
MattRose (mattrose) wrote :

that's with your tweak

Changed in terminator:
status: New → Fix Committed
Changed in terminator:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.