Bookmarklet is not URI safe for including on a page

Bug #782930 reported by David Coles
8
This bug affects 1 person
Affects Status Importance Assigned to Milestone
qrify
Fix Committed
Medium
Unassigned

Bug Description

The bookmarklet doesn't escape characters like `"` when generating the minified version. While fine for adding by hand, you can't put it on a page for drag/dropping. Should probably be URI encoded using Python's quote.

(As a slightly separate note maybe we should build a minified version and then use that to build a "qrify.url" file. The "qrify-min.js" isn't really a valid Javascript source file with the "javascript:" out front.)

Related branches

Revision history for this message
David Coles (dcoles) wrote :

Adding Matt since I'm kind of curious to hear his thoughts, since he wrote the bookmarklet.py script.

Revision history for this message
Matt Giuca (mgiuca) wrote :

Don't worry. I read all the bugs anyway.

Yes, I suppose this is true. Note that I originally wrote bookmarklet.py using urllib.quote (of course ;) I wrote it!) But then I noticed that I was escaping a lot of characters you weren't, such as quotes and things. Your version only escaped spaces, so I changed it to simply replace space with %20. I think I looked at some other bookmarklets and they also only escaped space.

But I would be more than happy to use urllib.quote, and I doubt it would cause any problems.

Revision history for this message
Matt Giuca (mgiuca) wrote :

So if you use urllib.parse.quote, you end up with a huge mess of basically everything percent-encoded (including parens). I think we should encode the minimum set of characters necessary, so here is a bit of an analysis.

There are two standards at work here: URI syntax (RFC 2396) and XML attribute value syntax (W3C XML 1.0 Fifth Edition). Note that commonly, people encode URIs and then slap them into a href="" attribute, assuming that a valid URI is "safe", but it actually isn't, because URIs may include ampersands ("&") and single quotes ("'"), which are illegal in XML attribute values (note that single quotes are only illegal in an attribute value delimited by single quotes, but that is still a possibility). Therefore, we should not allow URIs with bare ampersands or single quotes.

As for URI syntax, RFC 2396 divides the ASCII characters into three groups: reserved, unreserved, and excluded. These sets are, in full:

reserved: ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
unreserved: alpha | digit | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
excluded: all control characters (<U+0020 and U+007F) | " " | <"> | "#" | "%" | "\" | "<" | ">" | "[" | "]" | "^" | "`" | "|"

It's weird that "#" and "%" are "excluded" rather than "reserved", since they do appear in URIs. All of the other excluded characters are totally illegal.

Anyway, my reading of "reserved" is "they can appear in URIs but they *may* have special meaning, depending on the scheme and the part of the URI in question, and "unreserved" is "they may freely appear in URIs and have precisely the same meaning as their escaped versions". Since the syntax of our scheme ("javascript") is the JavaScript language, I would assume we don't need to worry about *any* of the reserved characters being present. In other words, if the string "http://" appears unquoted in a javascript URI, it shouldn't matter. Therefore, we should allow all reserved and unreserved characters to appear unescaped, and only escape the excluded characters, BUT with the above caveat that we also escape "&" and "'".

Since Python's urllib.parse.quote function by default escapes everything but alphanumeric characters and "_", "." and "-", we should supply the following safe set:
";/?:@=+$,!~*()"

Revision history for this message
Matt Giuca (mgiuca) wrote :

Fixed in trunk r18 by David Coles. I tweaked it in trunk r20.

Changed in qrify:
status: Triaged → Fix Committed
Revision history for this message
Matt Giuca (mgiuca) wrote :

So the above analysis was based on RFC 2936, which it turns out is actually the old URI spec, obsoleted by RFC 3986. 3986 changes the syntax quite a bit -- for one thing I was wondering above why "#" was "excluded". It turns out in the old document, "#" isn't actually part of the URI, it's separate. In 3986, "#" is a reserved character, same as "?". And the rest of the syntax is defined quite differently. There are also some unreserved characters that are now reserved. So I looked at the above "safe set" again, and it turns out the safe set we chose is still exactly right, but for different reasons.

The relevant syntax is:

   URI = scheme ":" hier-part [ "?" query ] [ "#" fragment ]
   hier-part = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-rootless = segment-nz *( "/" segment )
   path-empty = 0<pchar>
   segment = *pchar
   segment-nz = 1*pchar
   pchar = unreserved / pct-encoded / sub-delims / ":" / "@"
   query = *( pchar / "/" / "?" )
   fragment = *( pchar / "/" / "?" )
   unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
   sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

So basically we can fold a lot of cases into one: we consider everything after the "javascript" to be a path (with zero or more slashes), followed by an optional "?" query and an optional "#" fragment. The path part is allowed to contain any of the following characters unescaped (assuming we don't place any special emphasis on "/"):

ALPHA / DIGIT / "-" / "." / "_" / "~" / "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=" / ":" / "@" "/"

This notably excludes "?" and "#". But, assuming we aren't going to parse the query string specially (and it looks like browsers don't, for javascript: URIs), we are allowed to have exactly one "?". After which, we can have more characters in the above set, as well as "?". Which means we are effectively allowed as many "?" characters as we like.

Does the same apply for "#"? No. For some reason (perhaps oversight), "fragment" does not include "#" -- you aren't allowed to have a "#" in a URL after the first "#". So we shouldn't allow "#" to appear unescaped in the URI.

Also, as above, we remove "&" and "'" for other reasons -- they can't appear unescaped in an XML attribute.

Therefore, ignoring alphanumeric characters and "_", "." and "-" (always safe in Python-land), we want the following safe set:

;/?:@=+$,!~*()

That's the exact same set we already have.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.